Delay slot
In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction.[1] The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP architecture; this instruction will execute even if the preceding branch is taken. This makes the instruction execute out-of-order compared to its location in the original assembler language code. Modern processor designs generally do not use delay slots, and instead perform ever more complex forms of branch prediction. In these systems, the CPU immediately moves on to what it believes will be the correct side of the branch and thereby eliminates the need for the code to specify some unrelated instruction, which may not always be obvious at compile-time. If the assumption is wrong, and the other side of the branch has to be called, this can introduce a lengthy delay. This occurs rarely enough that the speed up of avoiding the delay slot is easily made up by the smaller number of wrong decisions. PipeliningA central processing unit generally performs instructions from the machine code using a four-step process; the instruction is first read from memory, then decoded to understand what needs to be performed, those actions are then executed, and finally, any results are written back to memory. In early designs, each of these stages was performed in series, so that instructions took some multiple of the machine's clock cycle to complete. For instance, in the Zilog Z80, the minimum number of clocks needed to complete an instruction was four, but could be as many as 23 clocks for some (rare) instructions.[2] At any given stage of the instruction's processing, only one part of the chip is involved. For instance, during the execution stage, typically only the arithmetic logic unit (ALU) is active, while other units, like those that interact with main memory or decode the instruction, are idle. One way to improve the overall performance of a computer is through the use of an instruction pipeline. This adds some additional circuitry to hold the intermediate states of the instruction as it flows through the units. While this does not improve the cycle timing of any single instruction, the idea is to allow a second instruction to use the other CPU sub-units when the previous instruction has moved on.[3] For instance, while one instruction is using the ALU, the next instruction from the program can be in the decoder, and a third can be fetched from memory. In this assembly line type arrangement, the total number of instructions processed at any time can be improved by up to the number of pipeline stages. In the Z80, for example, a four-stage pipeline could improve overall throughput by four times. However, due to the complexity of the instruction timing, this would not be easy to implement. The much simpler instruction set architecture (ISA) of the MOS 6502 allowed a two-stage pipeline to be included, which gave it performance that was about double that of the Z80 at any given clock speed.[4] Branching problemsA major issue with the implementation of pipelines in early systems was that instructions had widely varying cycle counts. For instance, the instruction to add two values would often be offered in multiple versions, or opcodes, which varied on where they read in the data. One version of However, there is one problem that comes up in pipeline systems that can slow performance. This occurs when the next instruction may change depending on the results of the last. In most systems, this happens when a branch occurs. For instance, consider the following pseudo-code: top: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register write the result to memory read a number from memory and store it in another register ... In this case, the program is linear and can be easily pipelined. As soon as the first Now consider what occurs when a branch is added: top: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register if the result in the 3rd register is greater than 1000, then go back to top: (if it is not) write the result to memory read a number from memory and store it in another register ... In this example the outcome of the comparison on line four will cause the "next instruction" to change; sometimes it will be the following Branch delay slotsOne strategy for dealing with this problem is to use a delay slot, which refers to the instruction slot after any instruction that needs more time to complete. In the examples above, the instruction that requires more time is the branch, which is by far the most common type of delay slot, and these are more commonly referred to as a branch delay slot. In early implementations, the instruction following the branch would be filled with a no-operation, or In the examples above, the read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register if the result in the 3rd register is greater than 1000, then go back to the top read a number from memory and store it in another register (if it is not) write the result to memory ... Now when the branch is executing, it goes ahead and performs the next instruction. By the time that instruction is read into the processor and starts to decode, the result of the comparison is ready and the processor can now decide which instruction to read next, the Finding an instruction to fill the slot can be difficult. The compilers generally have a limited "window" to examine and may not find a suitable instruction in that range of code. Moreover, the instruction cannot rely on any of the data within the branch; if an Another side effect is that special handling is needed when managing breakpoints on instructions as well as stepping while debugging within the branch delay slot. An interrupt is unable to occur during a branch delay slot and is deferred until after the branch delay slot.[5][6] Placing branch instruction in the branch delay slot is prohibited or deprecated.[7][8][9] The ideal number of branch delay slots in a particular pipeline implementation is dictated by the number of pipeline stages, the presence of register forwarding, what stage of the pipeline the branch conditions are computed, whether or not a branch target buffer (BTB) is used and many other factors. Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next. This inevitably requires that newer hardware implementations contain extra hardware to ensure that the architectural behaviour is followed despite no longer being relevant. ImplementationsBranch delay slots are found mainly in DSP architectures and older RISC architectures. MIPS, PA-RISC (delayed or non-delayed branch can be specified),[10] ETRAX CRIS, SuperH (unconditional branch instructions have one delay slot),[11] Am29000,[12] Intel i860 (unconditional branch instructions have one delay slot),[13] MC88000 (delayed or non-delayed branch can be specified),[14] and SPARC are RISC architectures that each have a single branch delay slot; PowerPC, ARM, Alpha, V850, and RISC-V do not have any. DSP architectures that each have a single branch delay slot include μPD77230[15] and the VS DSP. The SHARC DSP and MIPS-X use a double branch delay slot;[16] such a processor will execute a pair of instructions following a branch instruction before the branch takes effect. Both TMS320C3x[17] and TMS320C4x[8] use a triple branch delay slot. The TMS320C4x has both non-delayed and delayed branches.[8] The following example shows delayed branches in assembly language for the SHARC DSP including a pair after the RTS instruction. Registers R0 through R9 are cleared to zero in order by number (the register cleared after R6 is R7, not R9). No instruction executes more than once. R0 = 0; CALL fn (DB); /* call a function, below at label "fn" */ R1 = 0; /* first delay slot */ R2 = 0; /* second delay slot */ /***** discontinuity here (the CALL takes effect) *****/ R6 = 0; /* the CALL/RTS comes back here, not at "R1 = 0" */ JUMP end (DB); R7 = 0; /* first delay slot */ R8 = 0; /* second delay slot */ /***** discontinuity here (the JUMP takes effect) *****/ /* next 4 instructions are called from above, as function "fn" */ fn: R3 = 0; RTS (DB); /* return to caller, past the caller's delay slots */ R4 = 0; /* first delay slot */ R5 = 0; /* second delay slot */ /***** discontinuity here (the RTS takes effect) *****/ end: R9 = 0; Load delay slotA load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem. The following example is MIPS I assembly code, showing both a load delay slot and a branch delay slot. lw v0,4(v1) # load word from address v1+4 into v0
nop # wasted load delay slot
jr v0 # jump to the address specified by v0
nop # wasted branch delay slot
See alsoReferences
External links
|