A Processor Comparison of the Xbox 360 and PS3
When I first looked at the specifications of the Cell (PS3) and Xenos (Xbox 360) processors, it appeared that the Cell had a big advantage over the Xenos, provided both could be pushed to their maximum power. After looking in more detail, I have come to the conclusion that the Xenos will probably perform better than the Cell under almost all conditions.
Both processors are stripped-down, modified versions of the IBM PowerPC 970. Each core executes at less than half the speed of the 970 at the same clock frequency, because the 970 has multiple execution units and performs out-of-order execution (parallel processing), whereas the Cell and Xenos cores only have a single execution unit and perform in-order execution (sequential processing). The following link compares the performance of a PS3 at 3.2 GHz with a Power Mac G5 at 1.6 GHz, both running the Linux operating system.
http://www.geekpatrol.ca/2006/11/playstation-3-performance/
Linux runs on the Power Processor Element (PPE) of the Cell processor, so the results should be similar to those for one core of the Xenos processor, since all three Xenos cores are of the same PPE design. Both processors are clocked at 3.2 GHz.
The similarities between the two processors end there. The Xenos processor has 3 identical PPE cores, whereas the Cell processor has only 1 PPE core and 7 SPE cores.
Cell Processor
- One general purpose PPE core that is used for the OS and the game application.
- 512 MB total memory on 2 buses, which can be accessed directly only by the PPE core: 256 MB of processor main memory and 256 MB of memory used by the GPU.
- 512 KB L2 cache for the PPE.
- 32 KB L1 instruction cache and 32 KB L1 data cache for the PPE.
- 7 specialized SPE cores. One is used for the OS leaving 6 for the game application.
- 256 KB SRAM per SPE. There is no common memory between SPEs, and an SPE cannot access the PPE's main memory directly, but the PPE can access the SPEs' memory directly.
- Communication between SPE memories, or between SPE and PPE memory, is performed via the Element Interconnect Bus (EIB), either by accessing ports or via DMA.
- SPEs do not have branch prediction capability.
Xenos Processor
- 3 general purpose PPE cores that are used for the OS and the game application.
- 512 MB main memory that is shared by all three cores and the GPU.
- 1 MB of L2 cache that is shared by the 3 cores (about 341 KB per core on average).
- 32 KB L1 instruction cache and 32 KB data cache for each core.
- 2 Hardware threads per core.
Programming the 360
The OS does not use core 0 and uses only about 3% of the power of core 1 and 3% of the power of core 2. Therefore about 98% of the combined processor power of the three cores is available to the game application.
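The 98% figure follows directly from the per-core numbers. A minimal sketch of the arithmetic, using only the percentages quoted above (the function name is just an illustrative choice):

```python
# Fraction of each core consumed by the OS, as quoted above:
# core 0 untouched, about 3% on each of cores 1 and 2.
os_share_per_core = [0.00, 0.03, 0.03]

def available_fraction(os_shares):
    """Return the fraction of total CPU capacity left for the game."""
    total_capacity = len(os_shares)   # one unit of capacity per core
    used_by_os = sum(os_shares)
    return (total_capacity - used_by_os) / total_capacity

print(f"{available_fraction(os_share_per_core):.0%}")
```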
Programming the 360 is fairly easy and straightforward, since a large amount of shared main memory is available, a relatively large amount of shared L2 cache is available, and information can be passed quickly and easily between different threads (cores) of the application by just passing pointers.
Typically an application will initially be developed using only one thread of one core. Once the application is working, it can be segmented to use multiple cores and possibly multiple hardware threads on each core. The easiest segmentation would be to place the game control plus AI code on one core and the graphics rendering code on another core. As soon as the AI code completes its work for a frame, it queues the information for the graphics rendering core and immediately starts to process the next frame. The graphics rendering code will be executing code for the current frame while the AI is executing code for the next frame simultaneously.
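The AI/render split described above is a producer-consumer pipeline. The following is only a sketch of that structure in Python, not console code: the frame contents, the queue depth of 2, and all function names are assumptions made for illustration, and Python threads show the hand-off pattern rather than real multi-core speedup.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream

def ai_stage(out_q, frames):
    """Game control + AI: build the state for each frame, then move on."""
    for frame in range(frames):
        state = {"frame": frame,
                 "positions": [frame * 10 + car for car in range(3)]}
        out_q.put(state)   # hands over a reference; the data is not copied
    out_q.put(SENTINEL)

def render_stage(in_q, rendered):
    """Rendering: consume frame states in order as they become available."""
    while True:
        state = in_q.get()
        if state is SENTINEL:
            break
        rendered.append(state["frame"])

def run_pipeline(frames=5):
    # A small bound keeps the AI only a couple of frames ahead of rendering.
    q = queue.Queue(maxsize=2)
    rendered = []
    ai = threading.Thread(target=ai_stage, args=(q, frames))
    render = threading.Thread(target=render_stage, args=(q, rendered))
    ai.start()
    render.start()
    ai.join()
    render.join()
    return rendered

print(run_pipeline())
```

Because both threads live in one address space, the queue carries only a reference to each frame's state, which is the "just passing pointers" advantage mentioned earlier.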
Segmenting a program beyond that becomes more difficult. The developer first has to determine where the bottleneck is occurring. If it is in the AI code, he then has to determine whether the code can be processed in parallel (for example, in a racing program it may be possible for the main program to process the AI for 5 racing cars while another core processes the AI for the other 5 cars on the track at the same time). If the bottleneck is in the graphics rendering code, it may be possible for part of the rendering work to be done in parallel on another core.
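The racing example can be sketched as a simple data-parallel split. This is a hedged illustration only: the per-car AI rule (`update_car_ai`) is a made-up placeholder, and the split works only because each car's update is assumed to be independent of the others.

```python
from concurrent.futures import ThreadPoolExecutor

def update_car_ai(car):
    """Hypothetical per-car AI step: steer back toward the centre line (0.0)."""
    car_id, lateral_offset = car
    steering = -0.5 * lateral_offset
    return car_id, steering

def update_batch(cars):
    """Process one batch of cars sequentially, as a single core would."""
    return [update_car_ai(car) for car in cars]

def update_all(cars, workers=2):
    """Split the cars into `workers` batches and process the batches concurrently."""
    size = max(1, (len(cars) + workers - 1) // workers)   # ceiling division
    batches = [cars[i:i + size] for i in range(0, len(cars), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(update_batch, batches)          # keeps batch order
        return [item for batch in results for item in batch]

cars = [(i, float(i - 5)) for i in range(10)]   # 10 cars, 5 per worker
print(update_all(cars))
```

If one car's decision depended on another car's result from the same frame, the batches could not run independently, which is exactly the check the developer has to make first.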
When a program is segmented among all three cores, one of the cores may be active 100% of the time while the other two are active only a fraction of the time (10%, 20%, 50%, etc.). In this case more segmentation may be required for the code on the core that is active 100% of the time: a new hardware thread can be added to one of the less active cores so that it handles 2 processes at one time. Once all the available hardware threads are used and more segmentation is still required, software threads (although not as efficient as hardware threads) can be added until that core approaches 100% usage.
Once all three cores are executing near 100%, the maximum frame rate, sophistication, and detail capabilities will have been achieved. If the AI is issuing frames faster than the GPU can process them (maximum 60 fps at 720p or 30 fps at 1080i), more detail or sophistication can be added.
Programming the PS3
The PS3 is much more difficult to program than the 360. In a sense it is designed like the multiprocessor systems used by specialized customers such as NASA Ames Research Center. The concept is based on the principle that there is a very large amount of repetitive mathematical work that can be performed in a parallel or a segmented sequential fashion. For example, one core multiplies two arrays of 10000 numbers and passes the output array to another core, which performs divides on the individual elements of the array and passes it on to another core, which performs some other operation on the data, and so on. After the first core finishes its operation, it acquires more data and performs the same operation again.
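The segmented sequential style described above can be sketched as a chain of stages connected by queues. This is a structural illustration only: the divide-by-2 step, the block sizes, and every name here are assumptions, and Python threads stand in for cores without reproducing SPE behavior.

```python
import queue
import threading

DONE = object()  # end-of-stream marker passed down the pipeline

def stage(func, in_q, out_q):
    """Run one pipeline stage: apply func to each block and forward the result."""
    while True:
        block = in_q.get()
        if block is DONE:
            out_q.put(DONE)
            return
        out_q.put(func(block))

def multiply(block):
    """First stage: element-wise product of two arrays."""
    a, b = block
    return [x * y for x, y in zip(a, b)]

def divide(block):
    """Second stage: divide every element (by 2.0, chosen arbitrarily)."""
    return [x / 2.0 for x in block]

def run(blocks):
    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(multiply, q0, q1)),
        threading.Thread(target=stage, args=(divide, q1, q2)),
    ]
    for w in workers:
        w.start()
    for block in blocks:
        q0.put(block)      # the first stage picks up new data as soon as it is free
    q0.put(DONE)
    results = []
    while True:
        out = q2.get()
        if out is DONE:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results

blocks = [([1, 2, 3], [4, 5, 6]), ([2, 2], [3, 3])]
print(run(blocks))
```

While the second stage is dividing the first block, the first stage is already multiplying the next one, which is where the throughput of this design comes from.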
Like the 360, the application would initially be developed using the PPE core. Next you would think that the PS3 (just like the 360) would be able to segment the game control plus AI code onto one core and the graphics rendering code onto another core. However, that is not possible! Since the total application code may be about 100 MB and an SPE only has 256 KB of memory, only about 1/400 of the total code can fit in one SPE's memory. Also, since an SPE has no branch prediction capability, branching should be done as little as possible (although I believe that the compiler can insert code to cause pre-fetches, so branching may not be a big issue).
Therefore the developer has to find code that needs less than 256 KB (including the required data space) and that will execute in parallel.
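A rough sketch of the memory budget behind those numbers. The 100 MB figure is the estimate from the text; the 64 KB kernel size, 4-byte elements, and two data buffers in the usage line are purely illustrative assumptions.

```python
LOCAL_STORE = 256 * 1024        # bytes of SRAM per SPE, from the list above
APP_CODE = 100 * 1024 * 1024    # the rough 100 MB application estimate

def fraction_that_fits(code_bytes, store_bytes=LOCAL_STORE):
    """Fraction of the application code that fits in one SPE's memory."""
    return store_bytes / code_bytes

def max_elements(code_bytes, element_bytes=4, buffers=2, store_bytes=LOCAL_STORE):
    """Elements per buffer once the code and `buffers` data buffers share the store."""
    free = store_bytes - code_bytes
    if free <= 0:
        return 0
    return free // (buffers * element_bytes)

print(APP_CODE // LOCAL_STORE)      # how many SPE memories the code would fill
print(max_elements(64 * 1024))      # hypothetical 64 KB kernel, two float buffers
```

The second figure shows why the data space matters: every kilobyte of code placed in the SPE's memory comes straight out of the room left for the arrays it works on.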
Even if code can be found that can be segmented, data has to be passed back and forth between the PPE and the SPE via DMA, which is very slow compared with passing a pointer to the data as on the 360.
If we assume that enough segmentable code was found to use all 6 SPE cores assigned to the game application, the developer would next try to balance the load among the cores. As on the 360, some or all of the cores may have very low utilization. Adding more hardware threads is not possible, since each SPE has only one hardware thread. Adding software threads probably will not work due to the memory constraint. So the only option is an overlay scheme in which the PPE transfers new code to the SPE using DMA when the last overlay finishes processing. This is very time consuming, and code has to be found that does not overlap in the same time frame.
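To see why overlays are costly, here is a toy model of one SPE running overlays back to back. It is a sketch under stated assumptions: every size and timing in the usage example is invented for illustration and is not a measurement of real hardware.

```python
LOCAL_STORE = 256 * 1024  # bytes of SRAM per SPE

def run_overlays(overlays, store_bytes=LOCAL_STORE):
    """Simulate one SPE executing overlays one after another.

    Each overlay is (name, size_bytes, transfer_ms, compute_ms); the values
    are made-up inputs. Returns (total_ms, fraction_of_time_spent_transferring).
    """
    transfer_total = 0.0
    compute_total = 0.0
    for name, size, transfer_ms, compute_ms in overlays:
        if size > store_bytes:
            raise ValueError(f"overlay {name!r} does not fit in local store")
        transfer_total += transfer_ms   # the SPE idles while new code is loaded
        compute_total += compute_ms
    total = transfer_total + compute_total
    return total, (transfer_total / total if total else 0.0)

# Hypothetical workload: two overlays sharing a single SPE.
overlays = [("physics", 200 * 1024, 0.5, 2.0), ("audio", 128 * 1024, 0.5, 1.0)]
print(run_overlays(overlays))
```

In this model the transfer time is pure overhead, so the shorter each overlay's compute phase, the larger the share of the frame the SPE spends waiting for code.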
Future Generation Consoles
In a few years both Microsoft and Sony may want to release a next generation game console. When they do, they will usually want to maintain backward compatibility with their present console (including multi-core applications). You would think that they could get 3 times the processor power by using the same design but running the processor at 9.6 GHz. If that could be done, there wouldn't be any problem maintaining backward compatibility.
Degrading the internals of this generation's game console processors was done purposely by both Microsoft and Sony as a cost-saving measure. It was cheaper to increase the clock frequency and degrade the internals than it would have been to use a full featured PowerPC at 40% of the current clock frequency.
However, over the last several years it has become more and more difficult to increase the clock frequency, and performance has been improved primarily by redesigning the internals of the processor as well as by implementing dual core and occasionally quad core processors.
In the case of the 360, that should be a fairly easy and cost effective upgrade in several years. Since the Xenos processor is a fairly standard design, Microsoft would probably be able to purchase an off-the-shelf 3.2 GHz IBM 970 quad-core processor for a pretty reasonable price (only a dual-core processor currently exists at that frequency). In this case three cores should give about 2.5x the power of the current Xenos processor, plus an additional core for a total of over 3x the processor power. In the worst case, Microsoft could purchase an Intel based quad-core processor, as Apple did (single and dual core processors) for all its new systems (I expect that PowerPC prices got too high), and use emulation for all old game applications. The following link indicates that a single core Intel processor performs about the same overall as a single core PowerPC at the same clock frequency, even though emulation is being performed by the Rosetta translation layer. All new applications would then use Intel native compilers for better performance.
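The "over 3x" estimate can be checked with simple arithmetic. This sketch assumes, as the text does, that each full featured core is worth about 2.5 of the current cores and that the work scales evenly across cores; the function name is an illustrative choice.

```python
def relative_power(new_cores, per_core_speedup, old_cores=3):
    """Total power of the new chip relative to the current 3-core processor."""
    return new_cores * per_core_speedup / old_cores

print(relative_power(3, 2.5))             # three full featured cores
print(round(relative_power(4, 2.5), 2))   # plus the fourth core
```

The fourth core only helps applications that are segmented into at least four threads, so existing three-core games would see the first figure rather than the second.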
http://www.macworld.com/2006/03/firstlooks/minibenchmarks/index.php
If developers write game applications that use the SPEs, Sony will probably have a problem. Once applications depend on the SPEs, it would be difficult to change the processor design, although it may be possible (but extremely difficult) to emulate the SPEs and keep a decent level of performance. Upgrading all the SPEs and the PPE to be fully featured would probably not be cost effective. Sony may decide to keep the SPEs at their current speed and capabilities and add 4 fully featured PowerPC cores. New compilers would then probably not allow developers to use the SPEs for future development.
Conclusion
I seriously doubt that very many developers (apart from exclusive developers) will program the SPEs on the PS3. The complexity is enormous, the development time is long, and the potential for bugs is great.
I suspect that Gears of War is already using the multiprocessor capability but is still only using about 50% of the total power available.
It would be hard to imagine that a PS3 game application could perform better than a 360 game application without a great deal of development time.
Important updates
Read the following link for important updates to this document.
http://www.avsforum.com/avs-vb/showthread.php?p=9027534#post9027534
References
http://dpad.gotfrag.com/portal/story/35372/?spage=1
http://www.hardcoreware.net/reviews/review-348-1.htm
http://en.wikipedia.org/wiki/Synergistic_Processor_Element
http://en.wikipedia.org/wiki/Xenos
http://arstechnica.com/cpu/03q1/ppc970/ppc970-2.html