Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as 'units'.
The EIB is presently implemented as a circular ring comprising four 16B-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. Because the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96B per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed. The arbitration unit imposes additional constraints, which are discussed in the Bandwidth assessment section below.
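As a quick check of the arithmetic, the peak figure can be reproduced from the constants given above; the short C program below is purely illustrative and uses only numbers stated in this section.

    #include <stdio.h>

    int main(void)
    {
        const int rings              = 4;   /* unidirectional 16B channels           */
        const int per_ring_conc      = 3;   /* concurrent transactions per ring      */
        const int bytes_per_transfer = 16;  /* channel width                         */
        const int sys_clks_per_xfer  = 2;   /* the EIB runs at half the system clock */

        int concurrent = rings * per_ring_conc;                    /* 12 transactions */
        double peak = concurrent * bytes_per_transfer
                      / (double)sys_clks_per_xfer;                 /* 96 B per clock  */

        printf("peak instantaneous EIB bandwidth: %.0f B per system clock\n", peak);
        return 0;
    }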
IBM Senior Engineer David Krolak, EIB lead designer, explains the concurrency model:
A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it does not work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.[26]
Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (for simplicity, often regarded as 8B per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.
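Transfers are normally queued from code running on the SPU itself (or remotely through the SPU's memory-mapped registers). The fragment below is a minimal sketch of a local-store fetch assuming the Cell SDK's spu_mfcio.h channel interface; the buffer size, effective address and tag value are illustrative placeholders rather than figures from this article.

    #include <spu_mfcio.h>

    /* Minimal sketch: enqueue one DMA "get" from main memory into local store,
       then wait for it to complete.  Transfers must be 16-byte aligned and at
       most 16 KB each; the MFC queue lets many such transfers be outstanding. */
    volatile char buffer[16384] __attribute__((aligned(128)));

    void fetch_block(unsigned long long effective_address)
    {
        const unsigned int tag = 5;                    /* illustrative tag group (0-31) */

        mfc_get(buffer, effective_address, sizeof(buffer), tag, 0, 0);

        /* The SPU can keep computing here while the MFC moves the data. */

        mfc_write_tag_mask(1 << tag);                  /* select the tag group          */
        mfc_read_tag_status_all();                     /* block until it has completed  */
    }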
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
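In other words, a transfer always takes the shorter way around the twelve-stop ring, never more than six hops. The small helper below (with participants numbered 0–11 purely for illustration) states the rule in code.

    #include <stdio.h>

    #define EIB_PARTICIPANTS 12

    /* Hop count when travelling in the "forward" ring direction. */
    static int forward_hops(int src, int dst)
    {
        return (dst - src + EIB_PARTICIPANTS) % EIB_PARTICIPANTS;
    }

    /* Hop count actually used: the shorter of the two ring directions,
       which on a twelve-participant ring is never more than six.       */
    static int eib_hops(int src, int dst)
    {
        int fwd = forward_hops(src, dst);
        int bwd = EIB_PARTICIPANTS - fwd;
        return fwd < bwd ? fwd : bwd;
    }

    int main(void)
    {
        printf("hops from 1 to 9: %d\n", eib_hops(1, 9));   /* prints 4, not 8 */
        return 0;
    }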
Despite IBM's original desire to implement the EIB as a more powerful crossbar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns that allow the EIB to operate at high concurrency levels.
David Krolak explains:
Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.[26]
Bandwidth assessment
For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.
At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth scaled by processor frequency.[27]
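The 307.2 GB/s figure is straightforward to reproduce; the sketch below simply scales the per-channel rate by the maximum concurrency described earlier.

    #include <stdio.h>

    int main(void)
    {
        const double system_clock_ghz       = 3.2;
        const double eib_clock_ghz          = system_clock_ghz / 2.0;  /* EIB runs at half speed */
        const double bytes_per_transfer     = 16.0;
        const int    concurrent_transactions = 12;

        double per_channel    = bytes_per_transfer * eib_clock_ghz;      /*  25.6 GB/s */
        double abstract_total = per_channel * concurrent_transactions;   /* 307.2 GB/s */

        printf("per channel: %.1f GB/s, abstract EIB total: %.1f GB/s\n",
               per_channel, abstract_total);
        return 0;
    }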
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explains:
Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128Bx1.6 GHz = 204.8 GB/s.[22]
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
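Taken at face value, the snoop limit in the quote reduces to a single arithmetic step: one snooped address per bus cycle, up to 128B of data per address, at half the 3.2 GHz system clock. A one-line check:

    #include <stdio.h>

    int main(void)
    {
        const double bus_clock_ghz   = 3.2 / 2.0;   /* one snooped address per bus cycle */
        const double bytes_per_snoop = 128.0;       /* maximum payload per address       */

        printf("snoop-limited EIB bandwidth: %.1f GB/s\n",
               bytes_per_snoop * bus_clock_ghz);    /* 204.8 GB/s */
        return 0;
    }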
In practice, effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s of reads and writes concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined, and the two I/O controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
All things considered, the theoretical 204.8 GB/s figure most often cited is the best one to bear in mind. The IBM Systems Performance group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz, so this number is a fair reflection of practice as well.
Optical interconnect
Sony is currently developing an optical interconnection technology for use in the device-to-device and internal interfaces of various types of Cell-based digital consumer electronics and game systems.
Memory controller and I/O
Cell contains a dual-channel next-generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
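The 25.6 GB/s figure follows directly from the pin rate and the channel width; the check below uses only the numbers stated above.

    #include <stdio.h>

    int main(void)
    {
        const double gbit_per_pin     = 3.2;   /* XIO-XDR link rate per pin    */
        const int    channels         = 2;     /* dual-channel XDR interface   */
        const int    bits_per_channel = 32;    /* each channel is 32 bits wide */

        double gbytes_per_second = gbit_per_pin * channels * bits_per_channel / 8.0;
        printf("theoretical XDR memory bandwidth: %.1f GB/s\n", gbytes_per_second);  /* 25.6 */
        return 0;
    }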
The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit wide point-to-point path. Five of these lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
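The inbound/outbound split follows from the lane counts. For the stated totals to reconcile with the 2.6 GHz figure, each byte-wide lane must carry two transfers per clock (5.2 GB/s per lane); the sketch below makes that assumption explicit.

    #include <stdio.h>

    int main(void)
    {
        const double clock_ghz           = 2.6;
        const double transfers_per_clock = 2.0;  /* assumed, to match the quoted totals */
        const int    inbound_lanes       = 5;    /* byte-wide lanes into Cell           */
        const int    outbound_lanes      = 7;    /* byte-wide lanes out of Cell         */

        double per_lane = clock_ghz * transfers_per_clock;    /*  5.2 GB/s */
        double inbound  = per_lane * inbound_lanes;           /* 26.0 GB/s */
        double outbound = per_lane * outbound_lanes;          /* 36.4 GB/s */

        printf("FlexIO: %.1f GB/s in, %.1f GB/s out, %.1f GB/s total\n",
               inbound, outbound, inbound + outbound);        /* 62.4 GB/s total */
        return 0;
    }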
Possible applications
Main article: Cell microprocessor implementations
Blade Server
On 29 August 2007, IBM announced the BladeCenter QS21. Delivering a measured 1.05 giga floating point operations per second (GigaFLOPS) per watt, with a peak performance of approximately 460 GFLOPS, it is one of the most power-efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 tera floating point operations per second (TeraFLOPS), and over 25.8 TeraFLOPS in a standard 42U rack.
IBM Press Release
Console video games
Sony's PlayStation 3 video game console contains the first production application of the Cell processor, clocked at 3.2 GHz and with only seven of the eight SPEs operational, a measure that allows Sony to increase the manufacturing yield of the processor. Only six of the seven SPEs are accessible to developers, as one is reserved by the OS.[18]
Home cinema
Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard definition MPEG-2 streams simultaneously on a 1920×1080 screen.[28][29] This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
Supercomputing
IBM's planned supercomputer, IBM Roadrunner, will be a hybrid of general-purpose CISC processors and Cell processors. It is reported that this combination will produce the first computer to run at petaflop speeds. It will use an updated version of the Cell processor, manufactured using 65 nm technology, with enhanced SPUs that can handle double-precision calculations in the 128-bit registers, reaching 100 GFLOPS in double precision.[30][31]
Cluster computing
Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. The Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth.[32] Terrasoft Solutions sells 8-node and 32-node PS3 clusters with Yellow Dog Linux pre-installed, an implementation of Dongarra's research. As reported by Wired Magazine on October 17, 2007, an interesting application of the PlayStation 3 in a cluster configuration was implemented by astrophysicist Dr. Gaurav Khanna, who replaced time on supercomputers with a cluster of eight PlayStation 3s.[33]
PS3 helps break world record
With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@Home has now been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaflops, a level never before reached by a distributed computing network. Additionally, the collective efforts enabled PS3s alone to reach the petaflop mark on September 23, 2007. In comparison, the world's most powerful supercomputer, IBM's BlueGene/L, performs around 280.6 teraflops, which means Folding@Home's computing power is approximately four times BlueGene/L's (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home).
Mainframes
IBM announced on April 25, 2007, that it would begin integrating its Cell Broadband Engine Architecture microprocessors into the company's line of mainframes.[34]