Here is Part 2 of the lengthy Cell article, describing and explaining the technology and features.
Figure 7 - Schmoo plot for the SPE
Figure 7 shows the schmoo plot for the SPE. The schmoo plot shows that the SPE can comfortably operate at a frequency of 4 GHz with Vdd of 1.1 V, consuming approximately 4 W. The schmoo plot also reveals that due to the careful segmentation of signal path lengths, the design is far from being wire delay limited. Frequency scaling relative to voltage continues past 1.3 V. This schmoo plot also contributes to the plausibility of the unconfirmed report that the CELL processor could operate at upwards of 5.6 GHz.
“Unknown” Functional Units: ATO and RTB
Oftentimes when a paper relating to a complex project is written collaboratively by a group of people, details are lost. Still, it was rather humorous that of the six design engineers and architects from the CELL processor project present at Tuesday evening’s chat session, no one could recall what the acronyms ATO and RTB stood for. ATO and RTB are functional blocks labeled in the floorplan of the SPE. However, neither the functionality of these blocks nor the meaning of the acronyms was noted on the floorplan, explained in the paper, or mentioned in the technical presentation. In an effort to cover all the corners, this author placed the question on a list of questions to be asked of the CELL project team members. Hilarity thus ensued as slightly embarrassed CELL project members stared blankly at each other in an attempt to recall the functionality or definition of the acronyms.
In all fairness, since the SPE was presented on Monday and the CELL processor itself was presented on Tuesday, CELL project members responsible for the SPE were not present for Tuesday evening’s chat sessions. As a result, the team members responsible for the overall CELL processor and internal system interconnects were asked to recall the meaning of acronyms of internal functional units within the SPE. Hence, the task was unnecessarily complicated by the absence of key personnel that would have been able to provide the answer faster than the CELL processor can rotate a million triangles by 12 degrees about the Z axis.
After some discussion (and more wine), it was determined that the ATO unit is most likely the Atomic (memory) unit responsible for coherency observation/interaction with dataflow on the EIB. Then, after the injection of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most likely stood for some sort of Register Translation Block whose precise functionality was unknown to those outside of the SPE. However, this theory would turn out to be incorrect.
Finally, on Wednesday, after a sufficient number of hydrocarbon bonds had been broken down into H-OH, a member of the CELL processor team tracked down the relevant information. He writes:
The R in RTB is an internal 1 character identifier that denotes that the RTB block is a unit in the SPE. The TB in RTB stands for "Test Block". It contains the ABIST (Array Built In Self Test) engines for the Local Store and other arrays in the SPE, as well as other test related control functions for the SPE.
Element Interconnect Bus
The Element Interconnect Bus (EIB) is the on chip interconnect that ties together all of the processing, memory, and I/O elements on the CELL processor. The EIB is implemented as a set of four concentric rings routed through portions of the SPE, where each ring is a 128 bit wide interconnect. To reduce coupling noise, the wires are arranged in groups of four and interleaved with ground and power shields. To further reduce coupling noise, the direction of data flow alternates between each adjacent ring pair. Data travels on the EIB through staged buffer/repeaters at the boundaries of each SPE. That is, data is driven by one set of staged buffers and latched by the buffers at the next stage every clock cycle. Data moving from one SPE through other SPE’s requires the use of repeaters in the intermediary SPE’s for the duration of the transfer. Independently of the buffer/repeater elements, separate data on/off ramps exist in the BIU of the SPE, so data targeted for the LS unit of a given SPE can be off-loaded at the BIU. Similarly, outgoing data can be placed onto the EIB by the BIU.
Figure 8 - Counter rotational rings of the EIB - 4 SPE’s shown
The design of the EIB is specifically geared toward the scalability of the CELL processor. That is, signal path lengths on the EIB do not change regardless of the number of SPE’s in a given CELL processor configuration. Since data travels no more than the width of one SPE between buffer stages, more SPE’s on a given CELL processor simply means that the data transport latency increases by the number of additional hops through those SPE’s. Data transfer through the EIB is controlled by the EIB controller, which works with the DMA engine and the channel controllers to reserve the buffer drivers for a certain number of cycles for each data transfer request. The data transfer algorithm works by reserving channel capacity for each data transfer, thus providing support for real time applications. Finally, the design and implementation of the EIB has a curious side effect in that it limits the current version of the CELL processor to expand only along the horizontal axis. Thus, the EIB enables the CELL processor to be highly configurable: SPE’s can be quickly and easily added or removed along the horizontal axis, and the maximum number of SPE’s is set by the maximum chip width allowed by the reticle size of the fabrication equipment.
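The hop-count argument can be sketched in a few lines. This is a minimal model and not the EIB controller's actual algorithm: it assumes the elements sit at integer positions on a closed ring, that a transfer may use a ring running in either direction, and that each hop costs one clock cycle, in line with the per-stage latching described for the buffer/repeaters. The function names and the position numbering are hypothetical.

```python
def ring_hops(src: int, dst: int, n_elements: int) -> dict:
    """Hop counts from src to dst in each direction around a closed ring."""
    if not (0 <= src < n_elements and 0 <= dst < n_elements):
        raise ValueError("positions must lie on the ring")
    return {
        "clockwise": (dst - src) % n_elements,
        "counterclockwise": (src - dst) % n_elements,
    }


def shortest_transfer(src: int, dst: int, n_elements: int) -> tuple:
    """Pick the direction with fewer hops (assumed: one cycle per hop)."""
    hops = ring_hops(src, dst, n_elements)
    direction = min(hops, key=hops.get)  # ties resolve to "clockwise"
    return direction, hops[direction]


def worst_case_hops(n_elements: int) -> int:
    """Largest shortest-path hop count from position 0 to any element."""
    return max(shortest_transfer(0, dst, n_elements)[1]
               for dst in range(n_elements))


if __name__ == "__main__":
    # Adding elements lengthens the worst-case path, not the wire per hop.
    for n in (4, 8, 12):
        print(f"{n} elements: worst case {worst_case_hops(n)} hops")
```

In this model the per-hop wire length never changes; only the number of hops grows as elements are added, which is the scalability property the text describes.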
The POWERPC Processing Element
Neither microarchitectural details nor the performance characteristics of the POWERPC Processing Element were disclosed by IBM during ISSCC 2005. However, what is known is that the PPE processor core is a new core that is fully compliant with the POWERPC instruction set, including the VMX instruction set extension. Additionally, the PPE core is described as a two issue, in-order, 64 bit processor that supports 2 way SMT. The L1 caches of the PPE are reported to be 32 KB each, and the unified L2 cache is 512 KB in size. Furthermore, the lineage of the PPE can be traced to a research project commissioned by IBM to examine high speed processor design with aggressive circuit implementations. The results of this research project were published by IBM first in the Journal of Solid State Circuits (JSSC) in 1998, then again at ISSCC 2000.
The paper published in JSSC in 1998 described a processor implementation that supported a subset of the POWERPC instruction set, and the paper published at ISSCC 2000 described a processor that supported the complete POWERPC instruction set and operated at 1 GHz on a 0.25µm process technology. The microarchitecture of the research processor was disclosed in some detail in the ISSCC 2000 paper. However, that processor was a single issue processor whose design goal was to reach a high operating frequency by limiting pipestage delay to 13 FO4, and power consumption limitations were not considered. For the PPE, several major changes in the design goals dictated changes in the microarchitecture from the research processor disclosed at ISSCC in 2000. Firstly, to further increase frequency, the per stage circuit delay design target was lowered from 13 FO4 to 11 FO4. Secondly, limiting power consumption and minimizing leakage current were added as high priority design goals for the PPE. Collectively, these changes limited the per stage logic depth, and the pipeline was lengthened as a result. The addition of SMT and the two issue design goal completed the metamorphosis of the research processor into the PPE. The result is a processing core that operates at a high frequency with relatively low power consumption, and perhaps relatively poorer scalar performance compared to the beefy POWER5 processor core.
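The link between per stage FO4 depth and clock frequency can be worked through numerically. The sketch below assumes the simple relation cycle time = FO4 depth x FO4 delay, which ignores clock skew and latch overhead. The FO4 delay it reports is derived from the 1 GHz, 13 FO4 figures quoted above and is not a published process number. The last line additionally assumes the PPE runs at the 4 GHz rating quoted for the CELL processor, which the article does not state for the PPE itself.

```python
def fo4_delay_ps(freq_ghz: float, fo4_per_stage: float) -> float:
    """FO4 delay in picoseconds implied by a clock rate and stage depth."""
    cycle_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return cycle_ps / fo4_per_stage


def frequency_ghz(fo4_delay: float, fo4_per_stage: float) -> float:
    """Clock rate in GHz for a given FO4 delay (ps) and stage depth."""
    return 1000.0 / (fo4_delay * fo4_per_stage)


# Research processor: 1 GHz at 13 FO4 per stage on 0.25 micron.
research_fo4 = fo4_delay_ps(1.0, 13)

# Same process, stage depth tightened from 13 FO4 to 11 FO4.
same_process_11 = frequency_ghz(research_fo4, 11)

# Assumption: an 11 FO4 design clocked at 4 GHz (not confirmed for the PPE).
implied_fo4_at_4ghz = fo4_delay_ps(4.0, 11)

if __name__ == "__main__":
    print(f"implied FO4 delay of research processor: {research_fo4:.1f} ps")
    print(f"11 FO4 on the same process: {same_process_11:.2f} GHz")
    print(f"FO4 delay needed for 4 GHz at 11 FO4: {implied_fo4_at_4ghz:.1f} ps")
```

Under these assumptions, shortening the stage from 13 FO4 to 11 FO4 buys only about an 18 percent frequency gain on a fixed process; the rest of the jump from 1 GHz has to come from the faster gates of the newer process.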
Rambus XDR Memory System
Figure 9 - The two channel XDR Memory System
To provide machine balance and support the peak rating of more than 256 SP GFlops (or 25~30 DP GFlops), the CELL processor requires an enormously capable memory system. For that purpose, two channels of Rambus XDR memory are used to obtain 25.6 GB/s of memory bandwidth.
In the XDR memory system, each channel can support a maximum of thirty-six devices connected to the same command and address bus. The data bus of each device connects to the memory controller through a set of bi-directional point-to-point connections. In the XDR memory system, address and command are sent on the address and command bus at a rate of 800 Mbits per second (Mbps), and the point-to-point interface operates at a data rate of 3.2 Gbps. Using two DRAM devices with 16 bit wide data busses, each channel of XDR memory can sustain a maximum bandwidth of 102.4 Gbps (2 x 16 x 3.2), or 12.8 GB/s. The CELL processor can thus achieve a maximum bandwidth of 25.6 GB/s with a 2 channel, 4 device configuration.
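The bandwidth arithmetic is easy to check. The sketch below multiplies out the figures given in the text (two 16 bit devices per channel, 3.2 Gbps per data pin, two channels) and converts bits to bytes; 102.4 Gbps divided by 8 is 12.8 GB/s per channel. The flops-per-byte ratio at the end is an illustrative measure of machine balance using the 256 SP GFlops rating and is not a figure from the presentation. All constant and function names are hypothetical.

```python
DATA_RATE_GBPS = 3.2      # per data pin, from the text
DEVICE_WIDTH_BITS = 16    # data bus width of each DRAM device
DEVICES_PER_CHANNEL = 2   # 4 devices spread over 2 channels
CHANNELS = 2
PEAK_SP_GFLOPS = 256.0    # single precision peak rating quoted in the text


def channel_bandwidth_gbps(devices: int, width_bits: int,
                           rate_gbps: float) -> float:
    """Peak channel bandwidth in gigabits per second."""
    return devices * width_bits * rate_gbps


def gbps_to_gbytes(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0


channel_gbps = channel_bandwidth_gbps(
    DEVICES_PER_CHANNEL, DEVICE_WIDTH_BITS, DATA_RATE_GBPS)
channel_gbytes = gbps_to_gbytes(channel_gbps)
total_gbytes = channel_gbytes * CHANNELS

# Illustrative machine balance: peak flops available per byte of bandwidth.
flops_per_byte = PEAK_SP_GFLOPS / total_gbytes

if __name__ == "__main__":
    print(f"per channel: {channel_gbps:.1f} Gbps = {channel_gbytes:.1f} GB/s")
    print(f"two channels: {total_gbytes:.1f} GB/s")
    print(f"peak SP flops per byte of memory traffic: {flops_per_byte:.1f}")
```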
The obvious advantage of the XDR memory system is the bandwidth that it provides to the CELL processor. However, in the configuration illustrated in Figure 9, the maximum of 4 DRAM devices means that the CELL processor is limited to 256 MB of memory, given that the highest capacity XDR DRAM device is currently 512 Mbits. Fortunately, XDR DRAM devices could in theory be reconfigured so that upwards of 36 XDR devices are connected to the same 36 bit wide channel, each providing a 1 bit wide data bus to the 36 bit wide point-to-point interconnect. In such a configuration, a two channel XDR memory system can support upwards of 16 Gbits (2 GB) of ECC protected memory with 256 Mbit DRAM devices, or 32 Gbits (4 GB) of ECC protected memory with 512 Mbit DRAM devices. As a result, the CELL processor could in theory address a much larger amount of memory if the price premium of XDR DRAM devices can be minimized. One intriguing note reported by Dave Bursky of Electronic Design Magazine is that the XDR memory system makes use of 72 pairs of differential signals for the data bus. The figure seventy-two implies that the CELL processor does indeed support ECC. Since ECC support is clearly not a requirement for a processor to be used in a game machine, the presence of ECC support, if confirmed, would clearly indicate IBM’s ambition to promote the use of CELL processors for serious computational applications outside of the application domain of the Sony Playstation.
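The capacity figures follow from device count and device density. The sketch below reproduces the 4 device case and then the 36 devices per channel case. It assumes the conventional ECC split of 64 data bits plus 8 check bits across the 72 bit wide pair of channels; that split is an assumption, since the text only gives the total of 72. Under it, the usable capacity works out to 16 Gbits (2 GB) with 256 Mbit devices and 32 Gbits (4 GB) with 512 Mbit devices. Function and constant names are hypothetical.

```python
BITS_PER_BYTE = 8


def capacity_mb(devices: int, device_mbit: int) -> int:
    """Total capacity in megabytes for a number of equal-density devices."""
    return devices * device_mbit // BITS_PER_BYTE


# Configuration from the figure: 2 channels x 2 devices of 512 Mbit each.
four_device_mb = capacity_mb(4, 512)

# x1 configuration: 36 devices per channel, 2 channels, 72 devices in total.
TOTAL_DEVICES = 72
DATA_DEVICES = 64   # assumption: 64 data bits + 8 ECC check bits

raw_256_mb = capacity_mb(TOTAL_DEVICES, 256)     # includes ECC storage
usable_256_mb = capacity_mb(DATA_DEVICES, 256)
usable_512_mb = capacity_mb(DATA_DEVICES, 512)

if __name__ == "__main__":
    print(f"4 x 512 Mbit devices: {four_device_mb} MB")
    print(f"72 x 256 Mbit devices: {raw_256_mb} MB raw, "
          f"{usable_256_mb} MB usable with ECC")
    print(f"72 x 512 Mbit devices: {usable_512_mb} MB usable with ECC")
```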
Incidentally, Toshiba is a manufacturer of XDR DRAM devices. Presumably it brought the XDR memory controller and memory system design expertise to the table, and could ramp up production of XDR DRAM devices as needed.
FlexIO System Interface
At ISSCC 2005, Rambus presented a paper on the FlexIO interface used on the CELL processor. However, the presentation was limited to describing the physical layer interconnect. Specifically, the difficulties of implementing the Redwood Rambus ASIC Cell on IBM’s 90nm SOI process were examined in some detail. While circuit level issues regarding the challenges of designing high speed I/O interfaces on an SOI based process are in their own right extremely intriguing topics, the focus of this article is geared toward the architectural implications of the high bandwidth interface. As a result, the circuit level details will not be covered here. Interested readers are encouraged to seek out details on Rambus’s Redwood technology separately.
What is known about the system interface of the CELL processor is that the FlexIO consists of 12 byte lanes. Each byte lane is a set of 8 bit wide, source synchronous, unidirectional, point-to-point interconnects. The FlexIO makes use of 96 differential signaling pairs, each running at a data rate of 6.4 Gb per second, and that data rate in turn translates to 6.4 GB/s per byte lane. The 12 byte lanes are asymmetric in configuration. That is, 7 byte lanes are outbound from the CELL processor, while 5 byte lanes are inbound to the CELL processor. The 12 byte lanes thus provide 44.8 GB/s of raw outbound bandwidth and 32 GB/s of raw inbound bandwidth, for a total I/O bandwidth of 76.8 GB/s. Furthermore, the byte lanes are arranged into two groups of ports: one group is dedicated to non-coherent off-chip traffic, while the other group is usable for coherent off-chip traffic. It seems clear that Sony itself is unlikely to make use of a coherent, multiple CELL processor configuration for Playstation 3. However, the fact that the PPE and the SPE’s can snoop traffic transported through the EIB, and that coherency traffic can be sent to other CELL processors via a coherent interface, means that the CELL processor can indeed be an interesting processor. If nothing else, the CELL processor should enable startups that propose to build FlexIO based coherency switches to garner immediate interest from venture capitalists.
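The FlexIO figures can be multiplied out the same way. The sketch below uses only numbers stated in the text (8 signal pairs per byte lane, 6.4 Gb/s per pair, 7 outbound and 5 inbound lanes) and reports raw bandwidth, with no allowance for protocol overhead. The constant and function names are hypothetical.

```python
PAIRS_PER_LANE = 8          # one differential pair per bit of a byte lane
RATE_GBPS_PER_PAIR = 6.4    # gigabits per second per signal pair
OUTBOUND_LANES = 7
INBOUND_LANES = 5


def lane_bandwidth_gbytes(pairs: int, rate_gbps: float) -> float:
    """Raw bandwidth of one byte lane in gigabytes per second."""
    return pairs * rate_gbps / 8.0


lane_gbytes = lane_bandwidth_gbytes(PAIRS_PER_LANE, RATE_GBPS_PER_PAIR)
total_pairs = (OUTBOUND_LANES + INBOUND_LANES) * PAIRS_PER_LANE
outbound_gbytes = OUTBOUND_LANES * lane_gbytes
inbound_gbytes = INBOUND_LANES * lane_gbytes
total_io_gbytes = outbound_gbytes + inbound_gbytes

if __name__ == "__main__":
    print(f"signal pairs: {total_pairs}")
    print(f"per byte lane: {lane_gbytes:.1f} GB/s")
    print(f"outbound: {outbound_gbytes:.1f} GB/s, "
          f"inbound: {inbound_gbytes:.1f} GB/s, "
          f"total: {total_io_gbytes:.1f} GB/s")
```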
Summary
The CELL processor presents an intriguing alternative in its pursuit of performance. It seems to be a foregone conclusion that the CELL processor will be an enormously successful product, and that millions of CELL processors will be sold as the processors that power the next generation Sony Playstation. However, IBM has designed some features into the CELL processor that clearly reveal its ambition in seeking new applications for the CELL processor. At ISSCC 2005, much fanfare was generated by the rating of 256 GFlops @ 4 GHz for the CELL processor. However, it is the little mentioned double precision capability and the as yet undisclosed system level coherency mechanism that appear to be the most intriguing aspects, and these could enable the CELL processor to find success not just inside the Playstation, but outside of it as well.
Source
Massive Attack Union Manifesto
Journal Entry #4 Part 2
Nfactor
PS: if you have any questions let me know, thx
Hopefully more info will come our way about the use of Collada/OpenGL 2.0 in apulet processing through the GPU and CPU interconnects. Also look to the GDC conference in March 2005 for information on the Shader Model 3.0 spec, raytracing, and global illumination technology being developed and implemented in the NextGen NV chipset for the GPU in Playstation 3. Have fun with the information, and I hope this article has helped your understanding of the Playstation 3 technology.