Ruhroh! Things might not be as rosy as Digital Foundry and Microsoft have made them out to be.
Looks like Scorpio may only have 32 ROPs in the GPU, the same number as in the PS4 Pro.
Quote from the article.
"What makes things especially interesting though is that Microsoft didn’t just switch out DDR3 for GDDR5, but they’re using a wider memory bus as well; expanding it by 50% to 384-bits wide. Not only does this even further expand the console’s memory bandwidth – now to a total of 326GB/sec, or 4.8x the XB1’s DDR3 – but it means we have an odd mismatch between the ROP backends and the memory bus. Briefly, the ROP backends and memory bus are typically balanced 1-to-1 in a GPU, so a single memory controller will feed 1 or two ROP partitions. However in this case, we have a 384-bit bus feeding 32 ROPs, which is not a compatible mapping.
What this means is that at some level, Microsoft is running an additional memory crossbar in the SoC, which would be very similar to what AMD did back in 2012 with the Radeon HD 7970. Because the console SoC needs to split its memory bandwidth between the CPU and the GPU, things aren’t as cut and dry here as they are with discrete GPUs. But, at a high level, what we saw from the 7970 is that the extra bandwidth + crossbar setup did not offer much of a benefit over a straight-connected, lower bandwidth configuration. Accordingly, AMD has never done it again in their dGPUs. So I think it will be very interesting to see if developers can consistently consume more than 218GB/sec or so of bandwidth using the GPU."
This could be a problem. I wonder why this wasn't brought up with Digital Foundry. I am sure Ron will provide 100 graphs as to why this is okay.
Scorpio's GPU actually has 44 CUs.
Hawaii XT's block diagram:
The RBs (ROPs) are located inside the CU cluster, aka the "Shader Engine".
PS4 doesn't have a fixed CPU/GPU memory bandwidth allocation. According to DF, PS4 delivers about R7 265-like performance. Zero-copy, the Fusion links and respecting CPU cache boundaries all reduce the CPU's main memory accesses.
A Tahiti GPU can have either a 256-bit or a 384-bit bus while the ROP count remains the same:
256-bit SKU: W8000
384-bit SKU: 7970
---------------
If the CPU writes 1 GB of data to main memory (triggering a memory data change event) with a cache coherence requirement, only that 1 GB of data needs to be updated. A 1 GB/s write stream is only ~3 percent of the 30 GB/s cache coherence capability. This is why CPU and GPU main memory bandwidth usage changes dynamically depending on workloads.
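To put that ratio in context, a trivial back-of-envelope check (a minimal Python sketch; the 1 GB/s figure is just the example above, not a measured number):

# Illustrative only: how much of XBO's 30 GB/s coherent bandwidth
# a 1 GB/s stream of coherent CPU writes would consume.
coherent_budget_gbs = 30.0  # XBO cache-coherent bandwidth, per this post
cpu_write_gbs = 1.0         # example CPU write traffic per second
print(f"{cpu_write_gbs / coherent_budget_gbs:.1%} of the coherent budget")
# -> 3.3% of the coherent budget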
The lowest common denominator for any CPU workload is the PS4, and XBO would be following its limits, i.e. the 30 GB/s cache-coherent bandwidth is nearly pointless with multiplatform games when PS4's CPU cache coherence capability is lower.
A GPU read from main memory doesn't trigger a cache coherence update.
A CPU read from main memory doesn't trigger a cache coherence update.
Modern CPU caches were designed to minimise main memory accesses until another DMA cache coherency client end point (e.g. another CPU, the GPU, a DSP, micro-controllers) needs the updated data, i.e. on a snoop request from a DMA cache coherency client end point (Note 1).
Read https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies for basics on cache writes.
When a system writes data to cache, it must at some point write that data to the backing store as well. The timing of this write is controlled by what is known as the write policy.
There are two basic writing approaches:
Write-through: write is done synchronously both to the cache and to the backing store.
Write-back (also called write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.
A write-back cache is more complex to implement, since it needs to track which of its locations have been written over, and mark them as dirty for later writing to the backing store. The data in these locations are written back to the backing store only when they are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to write the replaced data from the cache back to the store, and then one to retrieve the needed data.
From http://www.realworldtech.com/jaguar/6/
AMD Jaguar's L2 cache uses a write-back policy.
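To make the write-back behaviour above concrete, here is a toy sketch (a minimal model only; Jaguar's real L2 tracks sets, ways and coherence state, none of which is modelled here):

# Minimal write-back cache sketch: writes go to the cache and mark the
# line dirty; main memory is only updated lazily, on eviction.
class WriteBackCache:
    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing          # dict modelling main memory
        self.lines = {}                 # addr -> (value, dirty flag)

    def write(self, addr, value):
        self._make_room(addr)
        self.lines[addr] = (value, True)   # dirty: memory not touched yet

    def read(self, addr):
        if addr in self.lines:
            return self.lines[addr][0]     # hit: no main memory access
        self._make_room(addr)
        value = self.backing.get(addr)
        self.lines[addr] = (value, False)  # clean copy fetched from memory
        return value

    def _make_room(self, addr):
        if addr in self.lines or len(self.lines) < self.capacity:
            return
        victim, (value, dirty) = next(iter(self.lines.items()))
        if dirty:
            self.backing[victim] = value   # the "lazy write" to memory
        del self.lines[victim]

memory = {}
cache = WriteBackCache(capacity=2, backing=memory)
cache.write(0x100, 42)
print(memory)          # {} -- nothing written back yet
cache.write(0x200, 7)
cache.write(0x300, 9)  # evicts a dirty line
print(memory)          # {256: 42} -- write-back happened only on eviction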
For AMD64 CPU cache protocol, read https://en.wikipedia.org/wiki/MOESI_protocol
As discussed in AMD64 Architecture Programmer's Manual Vol 2 'System Programming',[1] each cache line is in one of five states:
Modified
This cache has the only valid copy of the cache line, and has made changes to that copy.
Owned
This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line. The introduction of owned state allows dirty sharing of data, i.e., a modified cache block can be moved around various caches without updating main memory. The cache line may be changed to the Modified state after invalidating all shared copies, or changed to the Shared state by writing the modifications back to main memory. Owned cache lines must respond to a snoop request with data (Note 1).
Exclusive
This cache has the only copy of the line, but the line is clean (unmodified).
Shared
This line is one of several copies in the system. This cache does not have permission to modify the copy. Other processors in the system may hold copies of the data in the Shared state, as well. Unlike the MESI protocol, a shared cache line may be dirty with respect to memory; if it is, some cache has a copy in the Owned state, and that cache is responsible for eventually updating main memory. If no cache holds the line in the Owned state, the memory copy is up to date. The cache line may not be written, but may be changed to the Exclusive or Modified state after invalidating all shared copies. (If the cache line was Owned before, the invalidate response will indicate this, and the state will become Modified, so the obligation to eventually write the data back to memory is not forgotten.) It may also be discarded (changed to the Invalid state) at any time. Shared cache lines may not respond to a snoop request with data.
Invalid
This block is not valid; it must be fetched to satisfy any attempted access.
This protocol, a more elaborate version of the simpler MESI protocol (but not in extended MESI - see Cache coherency), avoids the need to write a dirty cache line back to main memory when another processor tries to read it. Instead, the Owned state allows a processor to supply the modified data directly to the other processor. This is beneficial when the communication latency and bandwidth between two CPUs is significantly better than to main memory. An example would be multi-core CPUs with per-core L2 caches.
While MOESI can quickly share dirty cache lines from cache, it cannot quickly share clean lines from cache. If a cache line is clean with respect to memory and in the shared state, then any snoop request to that cache line will be filled from memory, rather than a cache.
If a processor wishes to write to an Owned cache line, it must notify the other processors that are sharing that cache line. Depending on the implementation it may simply tell them to invalidate their copies (moving its own copy to the Modified state), or it may tell them to update their copies with the new contents (leaving its own copy in the Owned state).
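Here is a toy Python sketch of the dirty-sharing transition described above (heavily simplified; a real coherence controller tracks every cache's state plus the interconnect traffic):

from enum import Enum

class LineState(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def snoop_read(state):
    # Another cache asks to read a line we hold.
    if state == LineState.MODIFIED:
        # Supply the dirty data cache-to-cache and become Owned;
        # main memory is NOT updated -- the whole point of MOESI.
        return LineState.OWNED
    if state == LineState.EXCLUSIVE:
        return LineState.SHARED  # clean line, now shared
    return state  # Owned/Shared stay put; Invalid has nothing to supply

def local_write(state):
    # Writing requires ownership: Owned/Shared must invalidate other
    # copies first, Invalid must fetch the line; all end up Modified.
    return LineState.MODIFIED

# One dirty-sharing episode: core A writes a line, core B then reads it.
a = local_write(LineState.EXCLUSIVE)  # A: Exclusive -> Modified
a = snoop_read(a)                     # B snoops; A supplies the data
print(a)                              # LineState.OWNED, memory never written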
You are assuming AMD engineers are rookies at modern CPU design.
Modern X86 CPUs can reduce their main memory access rates when programmed in a certain way, i.e. through software optimisations.
CELL SPU programming tricks for fitting within its 256KB local storage can be applied to modern X86 CPUs.
As an example:
Software like Swiftshader (an LLVM JIT based software Direct3D9c renderer) has a config file that lets the end user configure Swiftshader's CPU cache size limits.
Setting the correct CPU cache size maximises data processing within the CPU's caches, hence minimising spill-over to main memory, which reduces main memory accesses and speeds up Direct3D9c CPU rendering.
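The idea in sketch form (a hypothetical example; the cache budget and item size are made-up numbers, and Swiftshader's actual tiling is far more involved):

CACHE_BUDGET_BYTES = 512 * 1024   # hypothetical configured cache budget
ITEM_SIZE_BYTES = 4               # 4-byte elements

def process_in_blocks(data):
    # Walk the data in cache-sized blocks so the working set stays
    # resident; do all work on a block before moving to the next one,
    # instead of streaming the whole array through memory on every pass.
    block_len = CACHE_BUDGET_BYTES // ITEM_SIZE_BYTES
    for start in range(0, len(data), block_len):
        block = data[start:start + block_len]
        yield sum(block)  # stand-in for the real per-block work

print(sum(process_in_blocks(list(range(1_000_000)))))  # 499999500000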
XBO is mostly GPU ALU bound.
-------
For tips from Naughty Dog on minimising the CPU's access to main memory, read http://www.dualshockers.com/2014/03/11/naughty-dog-explains-ps4s-cpu-memory-and-more-in-detail-and-how-they-can-make-them-run-really-fast/
Keeping high performance data small helps thanks to this, as it can fit in the cache, which can be accessed extremely quickly. Having them small and contiguous in memory is even more beneficial.
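A rough illustration of that advice (in Python, so the absolute timings mean little -- CPython adds interpreter overhead on top of the cache effects -- but the layout principle is the same one Naughty Dog describes):

import array
import time

N = 1_000_000
contiguous = array.array("f", range(N))  # tightly packed floats

class Particle:                          # one heap object per element
    __slots__ = ("x",)
    def __init__(self, x):
        self.x = x

scattered = [Particle(float(i)) for i in range(N)]

t0 = time.perf_counter()
s1 = sum(contiguous)                     # linear walk over contiguous memory
t1 = time.perf_counter()
s2 = sum(p.x for p in scattered)         # pointer-chasing per element
t2 = time.perf_counter()
print(f"contiguous: {t1 - t0:.3f}s  scattered: {t2 - t1:.3f}s")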
--------------------
XBO's GPU MMU interfaces the GPU with the cache-coherent path (30 GB/s), the non-cache-coherent DDR3 path (68 GB/s) and ESRAM.
Great article. Lems would be wise to adjust their hype according to objective analyses like these and not based on M$ BS smoke and mirrors propaganda. They already overhyped Scorpio's hardware before knowing it and got owned, and now that they know it they're overhyping its capabilities and will get owned again. Besides, as correctly stated, without games Scorpio is not much more than an overpriced paperweight.
I'm not sure if the article is saying anything bad...
"Because the console SoC needs to split its memory bandwidth between the CPU and the GPU, things aren’t as cut and dry here as they are with discrete GPUs. But, at a high level,"
Basically he's answering his own speculation: if there is extra memory bandwidth, it's likely for the CPU so there isn't contention with the GPU on memory access. This is a giant non-issue.
@ronvalencia Not sure if that wall of graphs and text says anything meaningful. You should rethink how to approach communicating with human beings.
XBO GPU's MMU has access to the entire DRAM bandwidth.
@ronvalencia: Ron, I just ignore your posts, nothing personal, but your spamming of graphs and numbers strikes me as empty babble from someone who knows little about what he's talking about. You're usually pretty off in your predictions.
@EG101: Can you expand on this? What is specifically wrong with the analysis?
I'll quote someone else I got the information from.
Dehnus over at the UVGF forums.
"The ROP count isn't just a solid number, the clock-speed affects ROPs. So let's break this down.
32 at 911mhz = 218gb/s
32 at 1172 = Xgb/s
Theoretical maximum the ROP units can have in Memory bandwidth, it does not work that way, but since AnandTech assumes it does, we'll do it too. As there is overhead, and you'd never want to use your entire bandwidth just for the rasterizer output pipeline alone. The CPU also wants some of that juicy bandwidth you know . But let's just say for the sake of Anandtech it does fill it up completely, and we do know that the ROPs are affected by clock speed of the GPU.
the 32's we can stripe away as they are the same for either system. So we can just solve for X.
X = (1172/911)*218 = 280.456641 or 281gb/s
So if we keep the constants we come at a 281gb/s bandwidth for the Scorpio. Which is correct considering it is clocked about 28% faster.
Now if we subtract this value from the 326gb/s number.
326gb/s - 281gb/s = 45gb/s.
So the "crossbar" is 45gb/s and probably reserved for the CPU to access the memory without bothering the rasterizer operations pipeline. It can also probably be used by the GPU itself (As it goes through the GPU) for other tasks than graphical (for instance physics calculations. This actually is a smart design, as it would allow the CPU to access and manage it's memory via it's own dedicated controller/path while having plenty of bandwith for itself to keep it fed. 45gb/s is plenty even more than enough for a Jaguar. In fact AMD chips are notorious for performing better with higher clocked memory/more bandwith, unlike Intel where it matters less usually, on am AMD it can give you a nice boost. So if it is indeed a dedicated 45gb/s for the Jaguar + (as the power management stuff is something I really want to learn more about, it seems very interesting and novel and unlike the normal Jaguar/Puma chips), then we might have found the reason why it punches above it's weight.
So yes, by just going "OMG it's 218 just like the Pro", Anandtech doesn't do itself any favours and is actively spreading FUD. They of all people should know that ROP is more than just their number alone and that clock speed affects the ROP speed. That is assuming that MS put 32 ROPs on there (that isn't even certain yet). It could be more, and it could be less. I expect MS to cheap out on these things. , they've done that too much as of late. (One Drive and Windows phones are recent examples. ).
PS: With ROP's we are talking about fill rate, not bandwidth . But since AnandTech chooses to go this route, I found it funny to stick with it. The ROP fill rate should align with the memory controller's bandwidth. So you can't just say it has <x> bandwidth and since it is 32 in both they are the same speed. No with a higher clock speed the fill rate increases but with a higher clock speed also the Memory controller's speed increases. It just isn't as simple as "32 = 32", there are a lot more factors to compare."
AnandTech used the PS4 Pro's frequency to get Scorpio's fill rate instead of Scorpio's own clock rate.
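For what it's worth, Dehnus' arithmetic checks out under its own assumptions (that bandwidth-per-ROP scales linearly with GPU clock and that both chips have 32 ROPs):

ps4_pro_clock_mhz = 911
scorpio_clock_mhz = 1172
rop_bandwidth_at_911 = 218.0   # AnandTech's figure for 32 ROPs @ 911 MHz
scorpio_total_gbs = 326.0

scaled = rop_bandwidth_at_911 * (scorpio_clock_mhz / ps4_pro_clock_mhz)
print(f"clock-scaled ROP bandwidth: {scaled:.1f} GB/s")
# -> 280.5 GB/s, which Dehnus rounds to 281
print(f"left over for the CPU etc.: {scorpio_total_gbs - scaled:.1f} GB/s")
# -> 45.5 GB/s, rounded to 45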
XBO GPU's MMU has access to the entire DRAM bandwidth.
Ok... not sure if that follow-up point helps either...
Xbox Project Scorpio is 50% more powerful than its competitor, the PS4 Pro. Like DF said, there is absolutely nothing that the PS4 Pro can do that Project Scorpio won't do noticeably better, because it's just so much more powerful in every category.
You'll see games running at 1440p-1600p at 30fps to 60fps on PS4 Pro while Project Scorpio is running those same games on better settings at native 4K at 30fps or 60fps. Project Scorpio has 2 TFLOPS more, 4 GB more RAM, a more powerful CPU, and much better bandwidth. It's just a much more powerful system. It's like comparing a V6 to a V8. It's going to be able to do a lot more, a lot better.
The downside of Xbox Project Scorpio is that it will cost 50% more than its competitor. I will be shocked if it doesn't cost at least $599 minimum, and I will not buy it if it is.
Anandtech used PS4's GPU frequency when figuring Scorpio's GPU bandwidth. It could have been a simple mistake, but the timing and tone of their last paragraph make it seem like they were upset for some reason. I guess the fact that Digital Foundry released the specs and actually saw the HW running gives them less credibility than Anandtech for some odd reason.
Maybe Anandtech wanted the exclusive or maybe they got paid to spread misinformation........
@uitravioience: Ori and the Blind Forest, Gears of War 4, Halo, Sunset Overdrive, Dead Rising 3, Forza Horizon 3, Ryse: all great games, plus all the multiplatform games as well.
"So I think it will be very interesting to see if developers can consistently consume more than 218GB/sec or so of bandwidth using the GPU."
Is the person who wrote this an idiot? 8 gigs is roughly 240 to 250 GB/sec or thereabouts; the other 4 gigs are reserved for the OS.
So the full 326GB/sec is not available for games. I hope they thought about this before they wrote it. And how did they get this information?
The Pro has 32 ROPs, so why exactly is this a problem if the PlayStation Pro has it? Nowhere in DF's articles do they talk about this.
@uitravioience: Ori and the Blind Forest, Gears of War 4, Halo, Sunset Overdrive, Dead Rising 3, Forza Horizon 3, Ryse: all great games, plus all the multiplatform games as well.
All better games than what the PS4 has to offer.
You listed a 4/10 game. Desperate much? Lems, gamers don't have low standards like you.
@uitravioience: The Xbox has games, it has games coming out, and it has a new console which is more powerful than the PS4 Pro and will have games for it as well. Jelly much?
32 ROPs is not a problem. The problem is Anandtech (AAT) used PS4's frequency to calculate the max bandwidth instead of using Scorpio's GPU frequency. That is why AAT came up with 218 GB/s instead of the 281 GB/s Scorpio's GPU should be capable of, leaving 45 GB/s for the CPU and everything else.
Oh I see, thanks. The GPU CU clock speed is higher.
There is no reason to panic either when it's 8 gigs for games + 4 for the OS.