Super Shuffle Engine
Penryn uses a 128-bit single-pass shuffle unit, which can perform a full-width shuffle (“mixing” the contents of a register) in one cycle. Compared with the Conroe/Merom architecture, this effectively doubles the speed of these operations in SSE code. A shuffle repositions data inside an XMM register and is available through the SSE/2/3/4 instruction sets. Instructions such as SHUFPS xmm, xmm, PSHUFB xmm, xmm and PSHUFB xmm, m128 were already available on earlier processors, but there they needed two cycles to complete. This means that, without recompilation or additional optimization, better performance can be expected in applications such as Photoshop, other image editing and processing programs, video compression and any software that makes intensive use of this kind of data manipulation. In general, this improvement roughly doubles the speed of packing, unpacking, shifting and rotating data elements of one, two or four bytes when SSE instructions are used.
The picture shows two 128-bit SSE registers holding some data. After applying a C++ macro that uses, for example, the SHUFPS instruction, the m3 register contains a mix of the data from the m1 and m2 registers.
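The lane selection that SHUFPS performs can be sketched in portable C++. This is a scalar model for illustration only, not the intrinsic itself: the names Vec4, shuffle_mask and shufps_model are hypothetical, and shuffle_mask is assumed to use the same bit layout as the _MM_SHUFFLE macro from <xmmintrin.h>.

```cpp
#include <array>
#include <cassert>

// Stand-in for one 128-bit XMM register holding four 32-bit floats.
using Vec4 = std::array<float, 4>;

// Packs four 2-bit lane selectors into one byte, in the same order as
// _MM_SHUFFLE(z, y, x, w): w picks result lane 0, z picks result lane 3.
constexpr unsigned shuffle_mask(unsigned z, unsigned y, unsigned x, unsigned w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Scalar model of SHUFPS: the two low result lanes are selected from `a`,
// the two high result lanes are selected from `b`.
Vec4 shufps_model(const Vec4& a, const Vec4& b, unsigned imm8) {
    assert(imm8 < 256);  // the selector is an 8-bit immediate
    return Vec4{{
        a[imm8 & 3],         // result lane 0 <- a[bits 1:0]
        a[(imm8 >> 2) & 3],  // result lane 1 <- a[bits 3:2]
        b[(imm8 >> 4) & 3],  // result lane 2 <- b[bits 5:4]
        b[(imm8 >> 6) & 3],  // result lane 3 <- b[bits 7:6]
    }};
}
```

With m1 = {1, 2, 3, 4} and m2 = {5, 6, 7, 8}, calling shufps_model(m1, m2, shuffle_mask(3, 2, 1, 0)) yields m3 = {1, 2, 7, 8}: the low half of m1 combined with the high half of m2. On real hardware the equivalent call would be _mm_shuffle_ps with the _MM_SHUFFLE macro.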
Fast Radix-16
The Penryn processor brings better performance for division operations. This improvement also comes without additional optimization or changes to existing code. The gains can be seen across a wide spectrum of applications intended for scientific calculations, 3D transformations and other mathematically intensive functions. The new technology, named “Fast Radix-16”, is used not only for floating-point division but also for integer division. The former “Radix-4” algorithm used in Intel’s processors computed two bits of the result per iteration. More precisely, in each cycle the divisor was shifted by two bits and a comparison was made.
Radix-16 makes it possible to compute four bits per iteration, which roughly halves the latency of division instructions. Since division is a demanding operation that takes many cycles, reducing the number of cycles it needs is certainly good news.
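The effect of retiring more bits per iteration can be sketched with a simple digit-by-digit divider. This is only an illustration under stated assumptions, not Intel's actual divider circuit: the names DivResult and digit_divide are hypothetical, and the inner loop that picks each quotient digit stands in for comparisons that real hardware would do in parallel.

```cpp
#include <cassert>
#include <cstdint>

struct DivResult {
    std::uint32_t quotient;
    std::uint32_t remainder;
    int iterations;  // number of digit steps taken
};

// Long division of a 32-bit value that retires `bits_per_step` quotient bits
// per iteration: 2 models a radix-4 divider, 4 models a radix-16 divider.
DivResult digit_divide(std::uint32_t dividend, std::uint32_t divisor,
                       int bits_per_step) {
    assert(divisor != 0);
    assert(bits_per_step > 0 && bits_per_step < 32 && 32 % bits_per_step == 0);

    const std::uint64_t digit_mask = (1ull << bits_per_step) - 1;
    std::uint64_t rem = 0;
    std::uint32_t quo = 0;
    int iterations = 0;

    for (int shift = 32 - bits_per_step; shift >= 0; shift -= bits_per_step) {
        // Bring down the next group of dividend bits.
        rem = (rem << bits_per_step) | ((dividend >> shift) & digit_mask);

        // Select the quotient digit: how many times the divisor fits.
        std::uint32_t digit = 0;
        while (rem >= divisor) {
            rem -= divisor;
            ++digit;
        }

        quo = (quo << bits_per_step) | digit;
        ++iterations;
    }
    return DivResult{quo, static_cast<std::uint32_t>(rem), iterations};
}
```

For a 32-bit dividend, digit_divide(1000, 7, 2) and digit_divide(1000, 7, 4) both return quotient 142 and remainder 6, but the first takes 16 iterations and the second only 8, which is the halving described above.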
Store Forwarding
This is one of the great advantages of Intel’s microarchitecture over its rivals. All applications, especially ones that use big data sets, do three basic things with data:
1. load it from memory
2. perform calculations on it (execute)
3. store it back to memory
SSE operates on 16 bytes of data at a time: we can load 16 bytes of packed data and run a calculation on them with one SSE instruction. There are two types of load and store instructions. The first type is the aligned loads and stores (MOVDQA, MOVAPD and MOVAPS), which work only with 16-byte aligned memory addresses. The second type is unaligned (MOVDQU, MOVUPD and MOVUPS), which works with both unaligned and aligned memory addresses. Aligned memory operations are faster than unaligned ones, but the memory we are working with is not always aligned, so the safe but slower unaligned instructions are used whenever alignment cannot be guaranteed.
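As a small illustration of the alignment rule, the check below tests whether an address sits on a 16-byte boundary and picks the matching kind of move. It is a sketch only: is_aligned16, MoveKind and pick_move are hypothetical names, and a compiler or library would normally make this choice for you.

```cpp
#include <cassert>
#include <cstdint>

// True when `p` lies on a 16-byte boundary, which is what the aligned
// instructions (MOVDQA, MOVAPD, MOVAPS) require.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

enum class MoveKind { Aligned, Unaligned };

// Chooses the fast aligned form when it is legal, otherwise falls back to
// the safe unaligned form (MOVDQU, MOVUPD, MOVUPS).
MoveKind pick_move(const void* p) {
    assert(p != nullptr);
    return is_aligned16(p) ? MoveKind::Aligned : MoveKind::Unaligned;
}
```

For a buffer declared with alignas(16), the start of the buffer and every 16th byte after it qualify for the aligned form, while any other offset needs the unaligned one.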
To speed up loads from memory and cache, Penryn does not have to wait for a store to finish writing to memory: even when an unaligned store crosses a 16-byte boundary and is still in the processor pipeline, Penryn can hold back the store and carry out the new load first. This leads to a significant speed-up for unoptimized, unaligned memory operations. Special buffers in the processor allow a load to complete even before the store is done, provided the following conditions are satisfied:
1. the load must be aligned with the store address
2. the loaded data must be the same size as, or smaller than, the stored data
3. the load must not span two separate stores
4. 128-bit stores and loads must be aligned on a 16-byte boundary
The picture shows different combinations of stores and loads, and for each one whether the load can bypass the pending store.
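The four conditions listed above can be written as a simple predicate over one pending store and one load. This is a sketch under one interpretation of the list, not a description of the actual hardware logic: MemOp and can_forward are hypothetical names, and condition 1 is assumed to mean that the load starts at the same address as the store.

```cpp
#include <cassert>
#include <cstdint>

// One memory operation: start address and size in bytes.
struct MemOp {
    std::uint64_t address;
    std::uint32_t size;
};

// Returns true when the load may take its data from the pending store
// according to the four conditions in the list (as interpreted here).
bool can_forward(const MemOp& store, const MemOp& load) {
    assert(store.size > 0 && load.size > 0);

    // 1. The load starts at the store's address (assumed interpretation).
    const bool same_start = load.address == store.address;

    // 2. The load is the same size as the store or smaller.
    const bool fits = load.size <= store.size;

    // 3. The load lies entirely inside this one store, so it never needs
    //    data from a second store.
    const bool contained =
        load.address >= store.address &&
        load.address + load.size <= store.address + store.size;

    // 4. Any 128-bit (16-byte) store or load must be 16-byte aligned.
    const bool wide = store.size == 16 || load.size == 16;
    const bool wide_ok =
        !wide || (store.address % 16 == 0 && load.address % 16 == 0);

    return same_start && fits && contained && wide_ok;
}
```

For example, an 8-byte load from address 0x1000 can be served by a 16-byte store to 0x1000, while an 8-byte load following a 4-byte store to the same address cannot, because the store does not hold all the data the load needs.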
There is also dynamic acceleration, which allows one core to be dynamically overclocked under single-threaded loads. This feature already existed on Santa Rosa and the 65nm Merom processor, but it is much improved on Penryn.