A few months ago, the American company Advanced Micro Devices (AMD), the second-largest manufacturer in the processor market, launched its new K10 microarchitecture in the form of a server processor codenamed Barcelona. Barcelona is the new Opteron, which brings many novelties in virtualization and multi-socket scaling, as well as many improvements in multithreaded software synchronization.
As far as desktop computers are concerned, the K10 architecture also brings an entirely new platform, Spider, which is meant to unify all vital system components under a single brand. Under the hood of Spider are an AMD Phenom X4 “quad-core” processor, an AMD-ATi chipset, and one or more AMD-ATi graphics processors, which can work in single or CrossFireX mode. The result is a complete multimedia/gaming machine entirely harmonized under AMD's roof. The processor itself is a “pure” AMD product and is largely based on the previous-generation K8 architecture, but with a plethora of improvements. The emphasis was clearly on better performance in multithreaded programs, i.e. programs that are able to use multiple cores effectively. Many would presume that doubling the number of cores naturally doubles performance. Yet it is not that simple. It is much like the problem of getting ten workers to do a single worker's job ten times faster. This is not easy: the job has to be divided into parts that can be performed in parallel, and putting the results back together also takes a certain amount of time. Therefore, the performance gain is never fully linear.
Phenom and the K10 family in general were designed to meet the requirements of increasingly popular multithreaded software.
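The ten-workers argument above is commonly formalized as Amdahl's law, which the article does not name explicitly. The following is a minimal sketch of that formula; the function name `amdahl_speedup` and the parallel fraction of 0.9 are illustrative assumptions, not measurements of any real program or processor.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law.

    parallel_fraction is the share of the work that can run in parallel
    (an assumed input, not something measured here).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always takes the same time; only the parallel part
    # is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # 0.9 is an assumed parallel fraction chosen purely for illustration.
    for cores in (1, 2, 4):
        print(cores, "cores:", round(amdahl_speedup(0.9, cores), 2))
```

Under this assumption, four cores give a speedup of roughly three rather than four, and the gap widens as the serial share of the work grows. The model also ignores the cost of combining partial results, so real scaling can be lower still.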
The new kid in town!
The launch of a new microarchitecture is always an exciting event. The K10 processor generation is a new architecture, even though it is based on the previous one. It is therefore not just a matter of increasing cache memory, speeding up the bus, shrinking and refining the manufacturing process, and similar cosmetic treatment. The entire success of this processor family depends on its foundation. Common practice shows that a good architecture (with good potential) can be scaled down well into lower product categories, which are more competitive on the wider market. A high-end processor that performs a large number of operations per cycle is very fast even at a relatively low clock. A processor with less cache and fewer transistors is usually easier to make in large quantities. It is fairly certain that the current generation of K8 processors will soon be fully replaced by newer and more efficient K10 derivatives in different price categories.
However, Phenom does have a few teething problems. The current K10 line-up is rather limited because of the complexity of the manufacturing process and insufficient experience with it, so one may get the impression that a “native”, i.e. truly monolithic, “quad-core” processor in 65 nm is too large a bite for the company from Sunnyvale. The processor die measures a full 286 square millimetres, which is twice as large as a Core 2 Duo die and more than double the size of an Athlon X2 (Brisbane). Some time ago, we explained the notions of a defective processor on a wafer and of yield, as well as the possibility of deriving a weaker processor from a high-end one. As we already stated, a photolithographic process is carried out on the wafer, which produces several hundred processors as the final product. The chips are then cut from the wafer and tested in special quality control laboratories.
For example, a single 300 mm diameter wafer can host ~260 K10 quad-core processors, most of which are fully working, while several are partially defective (but still usable for weaker models) and several are fully defective. The yield depends mostly on the maturity of the manufacturing technology currently in use.
Unlike Intel’s Core 2 Quad, the K10 is monolithic in design, which means that a full-blown quad-core processor is placed on a single piece of silicon. Intel, on the other hand, uses MCM (Multi Chip Module) packaging, which allows it to place two dual-core dies within the same LGA package and connect them over the Front Side Bus, i.e. the processor bus. Joining two dual-core processors into a single quad in this way has a drawback: communication between the two pairs of cores is limited, as all traffic between them goes through the chipset. For L2 cache data from the first dual-core die to reach the L2 cache of the second one, it has to travel to the chipset and back. In theory, and under certain types of server load, this can significantly decrease the overall performance of the system. Still, for most desktop users this is not as relevant, since scaling is rather good in the majority of multithreaded workloads thanks to a relatively low number of memory I/O operations. Thread affinity is set up in such a way that the first and second threads use the first two cores, and the third and fourth threads use the second pair of cores, which reduces the need to share data between the caches of the two dies. Thread scaling is not the strong side of the MCM configuration used by Intel, but even though this approach is not too imaginative technically, it still brings certain advantages, such as not having to manufacture a complicated monolithic quad-core die (compare the K10’s complicated manufacturing process).
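The manufacturing advantage of two small dies over one large die can be illustrated with a simple Poisson yield model. This is a rough sketch, not AMD's or Intel's actual yield data: the defect density of 0.5 defects per square centimetre is a purely hypothetical value, the 143 mm² dual-core die area is derived from the statement above that the 286 mm² K10 die is twice the size of a Core 2 Duo die, and the function names are illustrative. The model ignores wafer edge loss, packaging losses, and the salvaging of partially defective dies.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under a simple Poisson defect model."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def good_quads_per_cm2_monolithic(quad_area_mm2: float, d0: float) -> float:
    """Good quad-core chips per cm^2 of silicon when one die holds four cores."""
    return poisson_yield(quad_area_mm2, d0) / (quad_area_mm2 / 100.0)


def good_quads_per_cm2_mcm(dual_area_mm2: float, d0: float) -> float:
    """Good quad-core packages per cm^2 when two tested-good dual-core dies are paired."""
    good_duals_per_cm2 = poisson_yield(dual_area_mm2, d0) / (dual_area_mm2 / 100.0)
    return good_duals_per_cm2 / 2.0  # each package consumes two good dies


if __name__ == "__main__":
    D0 = 0.5  # assumed defect density in defects/cm^2, purely illustrative
    mono = good_quads_per_cm2_monolithic(286.0, D0)
    mcm = good_quads_per_cm2_mcm(143.0, D0)
    print("monolithic yield:", round(poisson_yield(286.0, D0), 3))
    print("dual-core die yield:", round(poisson_yield(143.0, D0), 3))
    print("MCM advantage in good quads per unit of silicon:", round(mcm / mono, 2))
```

Because defective small dies are discarded before pairing, the MCM approach wastes less silicon per defect in this model. The size of the advantage depends entirely on the assumed defect density and shrinks as the process matures.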
A microarchitectural revolution or an evolutionary move?
Although the K10 core may seem very similar to the old K8 core, the list of changes is fairly long. All the novelties are listed in the table below, and we will explain what each of them means for the ordinary user.
| K10 | K8 |
| --- | --- |
| 32-byte instruction fetch | 16-byte instruction fetch |
| Advanced branch prediction: bimodal and indirect | Standard bimodal |
| Sideband Stack Optimizer | Not present; stack operations occupy… |
| Out-of-order load execution (loads can be issued ahead of earlier memory accesses) | No optimizations for memory reads and writes |
| 2x128-bit instruction input | 2x64-bit instruction input |
| 128-bit floating-point SIMD SSE engine | 64-bit SIMD SSE engine |
| Up to 4x64-bit SSE and 2x128-bit extension per cycle | Up to 2x64-bit extension; 128-bit extensions execute in 2 cycles |
| New SSE4a SIMD instructions | SSE3 |
| DDR2 and DDR3 latency support | DDR2 |
| Unganged and ganged memory controller modes | Ganged only, 1x128-bit |
| HyperTransport 3.0, up to 8 GB/s bidirectional | HyperTransport, up to 4 GB/s bidirectional |
| Split power plane: independent frequency change for CPU cores and northbridge | Single power plane: CPU frequency change depends on load |
| Shared L3 cache | No L3 cache |
| Faster execution of many SIMD instructions, more FastPath instructions | Many SIMD instructions are VectorPath |