|
From a simplified point of view, every processor executes an instruction in four phases: fetching the operation code, fetching the operand address, transferring the operand into a processor register, and executing the operation. In practice, execution boils down to transfers from memory into processor registers, after which the processor performs calculations on the register contents.
Instruction fetch
A K8 processor fetches instructions from the L1 instruction cache into a buffer in 16-byte blocks, one block per processor clock. Instructions are then picked out of the buffered blocks and sent to the x86 decoder channels. A fetch rate of 16 bytes per clock allows a group of three instructions of average length (no longer than 5 bytes each) to be dispatched for decoding on every clock. In certain situations, however, the average length of the instructions in a sequence can be greater than 5 bytes.
For example, a simple SSE2 instruction with register-register operands, such as MOVAPD XMM0, XMM1, is 4 bytes long. When indirect addressing with a base register and a displacement is used (for example, MOVAPD XMM0, [EAX+16]), the total instruction length grows to 6-8 bytes. In 64-bit mode, using the additional registers adds a 1-byte REX prefix to the instruction code, so SSE2 instructions can reach 7-9 bytes in a 64-bit environment. A vector SSE1 instruction (one that operates on four 32-bit values) is 1 byte shorter, but scalar SSE1 instructions (which operate on a single value) can also be 7-9 bytes long under the same conditions.
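The byte counts above can be reproduced by adding up the parts of an x86 encoding. The Python sketch below is only an illustration: the helper name `x86_length` is hypothetical, and the two memory forms chosen to produce 6 and 8 bytes (a SIB byte with an 8-bit displacement, and a 32-bit displacement without SIB) are assumptions rather than something the text specifies.

```python
def x86_length(prefix=0, rex=0, opcode=2, modrm=1, sib=0, disp=0):
    """Sum the byte counts of the parts of one x86 instruction encoding.

    Hypothetical helper for illustration only; it performs no real encoding.
    """
    return prefix + rex + opcode + modrm + sib + disp


# SSE2 register-register form, e.g. MOVAPD XMM0, XMM1:
# 1 mandatory prefix byte + 2 opcode bytes + 1 ModRM byte.
sse2_reg_reg = x86_length(prefix=1)

# Assumed memory forms that span the 6-8 byte range quoted in the text.
sse2_mem_short = x86_length(prefix=1, sib=1, disp=1)  # SIB + 8-bit displacement
sse2_mem_long = x86_length(prefix=1, disp=4)          # 32-bit displacement

# 64-bit mode with the additional registers: one REX byte on top.
sse2_mem64_short = x86_length(prefix=1, rex=1, sib=1, disp=1)
sse2_mem64_long = x86_length(prefix=1, rex=1, disp=4)

# A vector SSE1 instruction (e.g. MOVAPS) has no mandatory prefix,
# so it is 1 byte shorter than its SSE2 counterpart.
sse1_vector_reg_reg = x86_length(prefix=0)

print("SSE2 reg-reg:", sse2_reg_reg)
print("SSE2 memory:", sse2_mem_short, "to", sse2_mem_long)
print("SSE2 memory, 64-bit:", sse2_mem64_short, "to", sse2_mem64_long)
print("SSE1 vector reg-reg:", sse1_vector_reg_reg)
```

The point of the breakdown is that every optional part (prefix, REX, SIB, displacement) pushes the instruction further past the 5-byte average that a 16-byte fetch can comfortably feed to three decoders.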
In this situation, a fetch rate of 16 bytes per clock is no longer sufficient to sustain decoding of three x86 SSE instructions per clock. However, since the K8 has two 64-bit FPU blocks that are used together for SIMD operations, a single vector SSE instruction is decoded into two macro-operations, which take two cycles to execute in the 64-bit FPU blocks. With SSE instructions 7-9 bytes long and a fetch rate of 16 bytes per cycle, the maximum decoding rate is 1.5 of these instructions per cycle, i.e. 3 macro-operations per cycle. The limiting factor in the K8 is therefore not the fetch rate, but the width of the FPU blocks, which is only 64 bits.
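The fetch arithmetic can be checked with a few lines of Python. This is a rough model under stated assumptions: it simply divides the bytes fetched per clock by the average instruction length and caps the result at the number of decoders, ignoring alignment and instructions that straddle block boundaries. The function name `fetch_limited_rate` is hypothetical.

```python
def fetch_limited_rate(fetch_bytes, avg_len, decoders=3):
    """Instructions per clock, capped by fetch bandwidth and decoder count.

    Simplified model: ignores alignment and block-straddling instructions.
    """
    return min(decoders, fetch_bytes / avg_len)


# K8 figures from the text: 16 bytes fetched per clock, three decoders.
k8_rates = {length: fetch_limited_rate(16, length) for length in (4, 5, 7, 8, 9)}

for length, rate in k8_rates.items():
    print(f"{length}-byte instructions: {rate:.2f} per clock")
```

In this model, instructions of up to 5 bytes keep all three decoders busy, while 7-9-byte SSE instructions arrive at only about two per clock.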
The picture shows an example of how five long instructions can be laid out in a 32-byte block that can be selected in a single clock. If a sequence of such long instructions spans several neighbouring 16-byte blocks, fetching in 16-byte blocks cannot sustain a rate of 3 instructions per clock, and the K10 needs to sustain that rate. From this point of view, the 32-byte block fetch listed among the K10’s architectural innovations no longer seems excessive: it makes it possible to pick four 8-byte instructions from the cache per clock.
Since the FPU width in Phenom is 128 bits, vector SSE operations can be executed in a single cycle, just as on Intel’s Core2 processors. This is why the fetch rate was doubled. Instructions are fed primarily from the L1 cache, so the bus between the L1 cache and the cores had to be widened from 2x64 bits to 2x128 bits. A microarchitecture built this way could supply up to four 8-byte instructions per cycle for decoding, but the limiting factor is the number of decoders and FPU blocks, of which there are only three each. Nevertheless, the 32-byte fetch can indeed speed up instruction mixes that combine integer and floating-point operations.
Intel’s Core2 processors fetch instructions in 16-byte blocks, just like the K8, and can decode four instructions per clock only when the average instruction length is no more than 4 bytes. Otherwise, the decoder can process only 3 instructions per clock. Intel’s design engineers applied a small trick that partially relieves this bottleneck: while a short loop is executing, long SSE instructions are held in a special internal 64-byte “fetch buffer”, which stores up to four 16-byte instruction blocks and allows instructions to be fed to the decoder at a rate of 32 bytes per clock.
If the loop does not fit within these four blocks, it cannot be held in the buffer, and execution will be limited by the 16-byte fetch rate. The K10 has no problem of this kind, although it is questionable how often such situations arise in practice.
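Whether a loop fits in the four-block buffer depends on where it starts as well as on its size. The following Python sketch illustrates this under the assumption that the buffered blocks are aligned on 16-byte boundaries; the function names and the example addresses are hypothetical.

```python
BLOCK_SIZE = 16      # fetch block size in bytes (from the text)
BUFFER_BLOCKS = 4    # 64-byte buffer = four 16-byte blocks (from the text)


def blocks_spanned(start, length, block=BLOCK_SIZE):
    """Number of aligned blocks touched by `length` bytes starting at `start`."""
    first = start // block
    last = (start + length - 1) // block
    return last - first + 1


def loop_fits(start, length):
    """True if the loop body can be held in the four-block buffer.

    Assumes the buffer holds whole, 16-byte-aligned blocks.
    """
    return blocks_spanned(start, length) <= BUFFER_BLOCKS


# Hypothetical loops: (start offset, length in bytes).
for start, length in [(0, 64), (8, 56), (8, 60)]:
    print(f"start={start:2d} length={length:2d} "
          f"blocks={blocks_spanned(start, length)} fits={loop_fits(start, length)}")
```

Under this assumption, a 60-byte loop that begins in the middle of a block touches five blocks and falls back to the ordinary 16-byte fetch, even though it is smaller than the 64-byte buffer.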
|