Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. Denial

    Denial Ancient Guru

    Messages:
    13,431
    Likes Received:
    2,927
    GPU:
    EVGA RTX 3080
    I'm not really sure what the point of this post is. Guru3D is a hardware enthusiast site and as such has some technical topics like this one. If you want to talk about playing games there is a section for that, it's called "Games, Gaming & Game-demos"

    This is the equivalent of going into a car enthusiast shop only to let them know the average customer doesn't give a crap about how a car works.
     
  2. Hootmon

    Hootmon Maha Guru

    Messages:
    1,232
    Likes Received:
    6
    GPU:
    XFX THICC III Ultra
    ^Yeah. I come here to get a better idea of how my toys actually work 'under the hood'.
     
  3. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,814
    Likes Received:
    707
    GPU:
    Inno3D RTX 3090
    That's what I thought too. Thanks for the input. As for Volta, I tend to agree with you, but from the very coarse estimations I've done after the revelations about the deep learning part, I would say that they have gone the Fiji route, creating a much bigger die with a more complicated SM that is "smarter".
     
  4. PrMinisterGR

    PrMinisterGR Ancient Guru

    You are in a website called "Guru3D". What do you expect exactly? Also benchmarking, testing and researching in general is very different from gloating. It's ok not to understand and ask about things, it's not ok to be a dick to others who do, just because you don't make the effort.
     

  5. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,031
    Likes Received:
    393
    GPU:
    RTX 3080
    You mean Volta is the Fiji route? It most likely isn't. From what I've heard and what was disclosed already, Volta seems to be to Pascal what Maxwell was to Kepler -- a relatively big architecture rebuild to accommodate newly emerged use cases while staying on the same production node. They'll most certainly be relatively bigger, both because 16FF will get cheaper and better and because Volta will probably be more complex than Pascal per ALU/SIMD. This also raises an interesting case of GP100 being pretty much as big as it can be already - and it's possible that the top Volta part will repeat the fate of the top Maxwell part - meaning no DP h/w. This time however it may not be strictly for gaming, as they may target it at deep learning markets as well.
     
  6. siriq

    siriq Master Guru

    Messages:
    790
    Likes Received:
    14
    GPU:
    Evga GTX 570 Classified
    It has already been disclosed that Volta will go the AMD way.
     
  7. PrMinisterGR

    PrMinisterGR Ancient Guru

    What I mean by "more like Fiji", I don't mean architecturally (we don't really know much about that). I meant that it looks like they are giving it more die space per SM. They are trading die space for performance on the same wattage, to increase perf/watt, like the Nano vs the 290x.
     
  8. siriq

    siriq Master Guru

    Kinda, yeah. Volta will be like Fermi 2.0, but done a different way, the AMD way. That has already been disclosed by several outsiders.
     
  9. Agent-A01

    Agent-A01 Ancient Guru

    Messages:
    11,404
    Likes Received:
    921
    GPU:
    ASUS 3080 Strix H20
    You pull a lot of crap out of your ass and when someone asks you always say google it.
     
  10. siriq

    siriq Master Guru

    Did I say that? :D Can you read those words in my comment? I think you're the one who pulls crap out of the dirty hole.
     

  11. PrMinisterGR

    PrMinisterGR Ancient Guru

    [IMG]
     
  12. siriq

    siriq Master Guru

    Harrison Ford and him in the chase? I forgot the name of the movie. Maybe I'm wrong, I've seen too many movies.
     
  13. narukun

    narukun Master Guru

    Messages:
    217
    Likes Received:
    24
    GPU:
    EVGA GTX 970 1561/7700
    I guess ngreedia washed out the async topic with the new architecture. I'm not buying Polaris or Pascal; both seem to have a very short life.
     
  14. dr_rus

    dr_rus Ancient Guru

    [IMG]

    Nano's perf/watt advantage over 290X was due to HBM and low clocks, nothing to do with die space or anything. It's very unlikely in general that going with more transistors instead of fewer will affect perf/watt positively. Volta's SM will probably be more complex than Pascal's, but this isn't why it can be more efficient in power usage; it's just an evolution of SM functionality.
     
  15. PrMinisterGR

    PrMinisterGR Ancient Guru

    The advantage due to HBM is calculated at roughly 15 W, here:

    [IMG]

    If we add those roughly 15 W to the average power draw while gaming, the R9 Nano would be in GTX 780 territory at around 230 W average power consumption, still much lower than the 290X's average of 286 W.

    [IMG]

    Saying that it's lower clocks that do it, reinforces even more what I said. You sacrifice die space (Fiji is 8.9Bn transistors, Hawaii is 6.2Bn), to gain performance at the same wattage. Even the difference in transistor count is percentage-wise close to the difference in TDP. AMD directly increased the number of CUs with Fiji, NVIDIA seems to be adding to the capabilities of the SM units with Volta. That's the only thing I see them having in "common".

    My guess is that you heard the news about the Xavier SoC. We can actually get a fairly rough estimation of Volta SM transistor counts out of it. NVIDIA says that the SoC has an 8-core ARM CPU, I/O, a 512-core Volta GPU and an 8k video decoding/encoding engine. Apart from the 8-core ARM CPU and the I/O, this is basically a Volta GPU. NVIDIA gives an estimate of 7 billion transistors for the whole package.

    We need to do some estimation here about the size of the I/O controller and the 8-core ARM CPU itself. Seeing how the Intel Core i7-6700k is at 1.75Bn transistors with the GPU included, with all the PCIe I/O and the rest, I would say that subtracting 2Bn transistors from the Xavier SoC to count them as the ARM CPU and I/O, is more than generous.

    That leaves us with 5Bn transistors for 512 CUDA cores.

    A 512-CUDA core GPU with Pascal would mean an 8 SM configuration, as Pascal has 64 CUDA cores per SM. The GTX 1080 with 2560 CUDA cores would be roughly standing at 25 billion transistors instead of 7.2, if they were the same kind of CUDA cores found in the Xavier SoC.
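
A rough sketch of that scaling in Python, using only the figures quoted in this post. The 2Bn deduction for CPU and I/O is the guess made above, not a disclosed number, and the variable and function names are just for illustration:

```python
# Back-of-envelope estimate; every input is a figure quoted in the post.
XAVIER_TOTAL_BN = 7.0     # NVIDIA's estimate for the whole Xavier SoC
CPU_IO_GUESS_BN = 2.0     # ASSUMPTION: the post's guess for 8-core CPU + I/O
XAVIER_CORES = 512
GTX1080_CORES = 2560
GTX1080_ACTUAL_BN = 7.2


def per_core_millions(total_bn, cores):
    """Transistors per CUDA core, in millions (lumps in all non-core logic)."""
    return total_bn * 1000.0 / cores


# What is left for the GPU once the guessed CPU + I/O share is removed.
gpu_budget_bn = XAVIER_TOTAL_BN - CPU_IO_GUESS_BN

# Scale the Xavier GPU budget up to a GTX 1080-sized core count.
scaled_1080_bn = gpu_budget_bn * GTX1080_CORES / XAVIER_CORES

xavier_per_core = per_core_millions(gpu_budget_bn, XAVIER_CORES)
pascal_per_core = per_core_millions(GTX1080_ACTUAL_BN, GTX1080_CORES)
ratio = xavier_per_core / pascal_per_core

print(f"GPU budget in Xavier: {gpu_budget_bn:.1f}Bn")
print(f"GTX 1080 at Xavier density: {scaled_1080_bn:.1f}Bn vs {GTX1080_ACTUAL_BN}Bn actual")
print(f"Per-core ratio, Xavier vs Pascal: {ratio:.2f}x")
```

The per-core figure lumps caches, schedulers and everything else in the GPU into the "core", so it is only as good as the 2Bn guess and says nothing about where the extra transistors actually sit.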

    Whatever die size Volta ends up with, it seems like NVIDIA is increasing CUDA core complexity almost exponentially. It is also a trend reversal, as CUDA core complexity dropped dramatically from Fermi to Kepler. Fermi had 480 CUDA cores and GPU logic packed in 3.2Bn transistors, Kepler had 1536 CUDA cores and GPU logic packed in 3.54Bn transistors. You can see how much less complex the CUDA cores became, since no matter the differences in the rest of the chip, Kepler CUDA cores appear to be at least 2x "simpler" than the Fermi CUDA cores.
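
The Fermi vs Kepler comparison is the same kind of division. A minimal sketch, again using only the transistor and core counts quoted above, and treating the whole chip as if it were nothing but CUDA cores (a simplifying assumption):

```python
# Whole-chip transistors divided by CUDA core count, figures as quoted in the post.
chips = {
    "Fermi": {"transistors_bn": 3.2, "cores": 480},
    "Kepler": {"transistors_bn": 3.54, "cores": 1536},
}


def millions_per_core(chip):
    """Transistors per core in millions, with all other GPU logic lumped in."""
    return chip["transistors_bn"] * 1000.0 / chip["cores"]


fermi = millions_per_core(chips["Fermi"])
kepler = millions_per_core(chips["Kepler"])
fermi_to_kepler = fermi / kepler

print(f"Fermi:  {fermi:.2f}M transistors per core")
print(f"Kepler: {kepler:.2f}M transistors per core")
print(f"Fermi / Kepler: {fermi_to_kepler:.2f}x")
```

With these inputs the ratio lands a little under 3x, which is consistent with the "at least 2x" claim once you allow for the non-core logic being different on the two chips.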

    Something similar, to an even larger degree, seems to be happening with Volta. Even if we discard half the transistor count of the new CUDA cores as deep-learning-only, this leaves us with at least 1.8x the complexity in the cores themselves. If we are a bit less generous with the ARM 8-core transistor count (as I believe we should be, since it cannot by itself be more complex than a 4-core Skylake+GT2 GPU), then we arrive at the ~2-2.5x ballpark of the difference between Kepler and Fermi.

    This is conjecture, of course, but the trend reversal toward much more complex compute units seems to be there.
     
    Last edited: Oct 17, 2016

  16. dr_rus

    dr_rus Ancient Guru

    This makes no sense whatsoever. You don't get performance on the same wattage by increasing the number of transistors. The more transistors you have, the more power is needed to drive them. Fiji is a generation ahead of Hawaii, and it has memory power saving techniques beyond HBM (and I'm pretty sure that HBM's advantage is more than 15W, otherwise AMD wouldn't even care about it, as the whole reason for HBM on Fiji is to win some power ceiling back).

    It's also rather misleading to use a Nano, which is based on binned chips and targeted at a specific niche with a specific performance requirement, for such comparisons, especially since it's tuned to lower its clocks instead of consuming more power when needed. Look at where the Fury cards are on your graph to see how they're "trading transistors for wattage".

    It's also completely baffling why NV would even need to make such a trade in Volta, as last time I checked it was AMD who was a generation if not two behind on power consumption. So until they actually have a reason to do anything like this, they may well not do it.


    This is conjecture, you're right. We can't make anything out of Xavier announcement since we don't know anything on Volta used in Xavier or even about how flexible Volta will be in its tuning to different markets. If you'd use GP100 a year ago to project the possible GeForce Pascal line-up you'd be wrong on pretty much every detail. I'm willing to bet that the above is exactly that - wrong on every detail on what GeForce Volta will actually be.

    Also you have to look at the SM level really to make any judgement of how SP's ("CUDA cores" is a bleh naming since it doesn't mean anything outside of CUDA) complexity changed over time, and since they didn't actually change their capabilities between, say, Fermi and Kepler, it's completely misleading to say that they got "simpler". They got smaller in transistors, yes, but that's about it.

    Volta really has only two options for what may or may not happen to its SMs: a) they'll all become FP64 with FP32x2, FP16x4 and Int8x8 - in which case your 512 SP figure for Xavier may actually mean 1024 for FP32. b) They won't change from Pascal at all - meaning that they'll be FP32 with FP16x2 and Int8x4 - in that case I don't expect them to change in size.

    You should also remember that the biggest die size contributor lately has been on-chip memory: registers, LDSes and caches. Without knowing what Volta will be here it's nearly impossible to make any kind of prediction on the complexity of the actual execution logic, especially since Volta may well have a specially designed automotive part in Xavier which will be as different from desktop Voltas as GP100 is different from GP10x chips. It's not without precedent really, with Tegra X1 having FP16x2 SPs a year before they came to GP100 with Pascal, for example.
     
  17. PrMinisterGR

    PrMinisterGR Ancient Guru

    I agree with you, except the point about HBM being there just for the power envelope. We don't know how Fiji would perform without it, and it also gave AMD the chance to expand their HBM patents, and have practical applications for the interposer. From their side, it's a good move on multiple fronts.

    The Nano is not really binned, it is simply clocking differently. Most Fury cards can be undervolted very safely. I just mentioned it as an example of how extra numbers of transistors, along with careful clocking can improve performance per watt on the same node. The Nano is the epitome of that, not the Fury X, even though they are based on the same GPU.

    AMD is a generation or two back in power consumption, and my guess is that NVIDIA would prefer that it stays so.

    I actually disagree about GP100. If you removed the FP64 units and used the ratio that NVIDIA gave for them (2:1), you end up with a Titan X with HBM 2.0. The GPU itself could be traced out after removing the FP64 parts. It wasn't precise, or a science, but it wasn't far off at all either.

    We would need a 1536 SP Fermi, or a 480 SP Kepler to really figure that out. Guessing that with less than half the transistors nothing was lost is also a guess. There were obvious trades made there. The closest I could find was the GTX 560 Ti vs the GT 640. But it's a bad comparison since there are differences in every other possible aspect of the card. Still, the Fermi card has almost triple the transistor count, which could account for the extra functional units. By all metrics, and by how NVIDIA packed these respective cards with their number of SPs, it seems that at a similar SP count, Fermi cards were expected to get much higher performance than a similar Kepler card. Hence the Fermi cards got more ROPs, memory bandwidth and the rest to go with a Fermi 384 SP configuration, vs a Kepler 384 SP configuration which seems to be the lowest of the low.

    What I want to say here is that if 384 Fermi SP and 384 Kepler SP had similar performance, they would be installed on similar performing cards. Instead of that, we see the Fermi parts used on a high-performance card and the Kepler parts used for the lowest of the low. The amount of hardware packed around these cores tells the story of what NVIDIA believes about them.

    For case A, it makes sense. Still, it's 5 billion transistors for 1024 SPs. That's again 1.5x the Pascal SP in transistor count. Case B would make no sense, because there is no way that an 8-core ARM and some I/O use 5 billion transistors. My bet is case A; case B makes no sense on the physical level of the SoC announced.

    Now we're on the same page. You're obviously right about the chance there is something special about the automotive part, but on the other hand the only way this becomes mainstream is if it's cheap to produce, so I don't expect anything specifically for cars in it. Isn't increasing caches, registers and LDSes effectively trading transistor count for performance? All these are part of the SPs and the SMs, right? So you might have a similar SM with 4x the cache sizes, and that would make it a "new" SM, right?
     
    Last edited: Oct 18, 2016
  18. Denial

    Denial Ancient Guru

    Idk about 5 billion transistors, but the cores on the Xavier unit are Denver based. Denver cores are huge - they were 3.2x larger than the ARM equivalent when they launched. Apple's A10 is ~3B transistors with the iGPU, so it may not be far-fetched that the Xavier CPU is up around 3-4B transistors.
     
  19. PrMinisterGR

    PrMinisterGR Ancient Guru

    Thanks for that insight. I wasn't able to find any separate CPU transistor counts for the Denver part of Tegra. My guess would be that the majority (like ~70%) would be for the GPU. Although it's just a guess.

    EDIT: I just found this. It's not Denver, but it's a very high performance ARM chip with "big" CPUs on it. As you can see, the CPUs themselves are a really small part of the package.

    [IMG]
     
    Last edited: Oct 18, 2016
  20. dr_rus

    dr_rus Ancient Guru

    Well, consider that Fiji is on par with GM200 which is using 384 bit GDDR5 and it's actually pretty seriously limited by HBM's top 4GB VRAM config. There's like no discernible advantage of HBM on Fiji beyond the power saving, even if you go into 4K it's hardly faster, usually unplayable and hitting VRAM limit. So if you think about it it just has to be more than 15W because going with HBM on Fiji makes no sense otherwise. I also believe that I've heard some number on that from AMD around Fiji's launch and it was considerably bigger but I don't remember it unfortunately.

    It really is binned, they've been specifically selecting GPUs with lowest leakage to be used in Nano cards. I believe this was known in general. Of course you can get a lot less power consumption when you downclock+undervolt but that's true for any chip out there.

    NV would prefer to make money. Sacrificing die area to be two generations ahead of AMD in power efficiency won't earn them anything. They can just as easily be one generation ahead or on par even if that would result in better sales for them.

    It was completely far off as you don't have to remove FP64 units, you have to turn them into FP32x2 instead, you have to remove HBM2, you have to cut down LDS per SM, you have to remove FP16x2 and add Int8x4. It's like a totally different chip compared to the rest of GP10x lineup. Same can easily be the case for Xavier.

    So GTX 560 Ti is a 2nd gen Fermi part based on the second chip of the line-up; why would you compare it to GT 640, which is first gen Kepler and the lowest Kepler part to boot? They also kinda launched at 2.5x different price points. Not good parts for a comparison, really.

    I also don't see any point in comparing numbers. Should we compare Hawaii's 512 bit bus to Fiji's 4096 bit and ask why Fiji isn't 8x faster? When these numbers are from two totally different tech parts there's no point in comparing them anyhow beyond the price/performance comparisons.

    This has nothing to do with beliefs. And actually GF114 and GK107 are pretty close in complexity if you consider that GK107 has half the TMUs and half the MCs/ROPs. The obvious difference is the clocks, where Fermi ran its execution units about 1.5x higher - but it also consumed 3x more power (!) if we do compare these cards. So it's really not as much about h/w packed around as it's about target clocks and power levels. Cue your Nano example in here.


    Again, you're using general transistor complexity without actually knowing what that complexity is comprised of. A GPU is not only SMs, and recently it wasn't even mostly SMs which were growing the die sizes. Do you actually think that Pascal SMs have gotten more complex than Maxwell's? There are almost no changes there, even the async/preemption is just an improvement on what was already there in Maxwell chips. This doesn't stop GP104 from being pretty close to GM200 in spite of having 2/3 of the VRAM interfaces and 5/6 of the SMs. There's a lot of stuff going on outside of SMs that grows the die complexity, and this stuff is arguably easier to differentiate between different markets.

    Why would an automotive part need to be cheap to produce? It's still cutting edge, it's concept stuff, and the ability to sell these chips with good margins is why NV moved them from mobile to automotive in the first place. 50% of the info on Xavier was specifically automotive but you don't expect anything specifically automotive from it? Wat?

    It's too early to make any guesswork on Volta. Right now all we know is that there's some Volta part in Xavier, it has 512 SPs and provides 20 TOPS in a 20W package. We can't even be sure what TOPS they are talking about. I mean, a 512 SP part would have to run at about 4.9GHz to be able to output 20 TOPS of even Int8 performance if we consider that each SP is able to perform 2 ops on each of 4 Int8 values per clock. This doesn't sound realistic, does it? GP104 running in Int8 mode at 2GHz would be able to reach 40 TOPS but it has FIVE TIMES more SPs than that Xavier Volta part. Even if we consider that these 512 SPs can be for FP64 and it's actually x8 that for Int8 we're still ~3.6 TOPS short of 20 at 2GHz -- again, hardly a realistic clock for an embedded SoC GPU.
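
For anyone who wants to check the TOPS numbers, here is a small sketch of the arithmetic. It assumes peak throughput is simply SPs x ops per SP per clock x clock, with 2 ops (a multiply-add) on each Int8 lane; the function names are made up for this example:

```python
def peak_tops(sp_count, ops_per_sp_per_clock, clock_ghz):
    """Peak tera-ops per second, assuming every SP is busy every clock."""
    return sp_count * ops_per_sp_per_clock * clock_ghz * 1e9 / 1e12


def required_clock_ghz(target_tops, sp_count, ops_per_sp_per_clock):
    """Clock in GHz needed to reach target_tops under the same assumption."""
    return target_tops * 1e12 / (sp_count * ops_per_sp_per_clock) / 1e9


OPS_INT8_X4 = 2 * 4   # 2 ops on each of 4 Int8 lanes per SP per clock
OPS_INT8_X8 = 2 * 8   # same, if each of the 512 SPs is FP64-wide (8 lanes)

# Clock a 512 SP part needs for 20 TOPS with Int8x4.
xavier_clock = required_clock_ghz(20, 512, OPS_INT8_X4)

# GP104 (2560 SPs) in Int8x4 mode at 2 GHz.
gp104_tops = peak_tops(2560, OPS_INT8_X4, 2.0)

# 512 FP64-wide SPs doing Int8x8 at 2 GHz.
xavier_x8_tops = peak_tops(512, OPS_INT8_X8, 2.0)

print(f"Clock needed for 20 TOPS on 512 SPs: {xavier_clock:.2f} GHz")
print(f"GP104 Int8 at 2 GHz: {gp104_tops:.2f} TOPS")
print(f"512 SPs with Int8x8 at 2 GHz: {xavier_x8_tops:.2f} TOPS")
```

Under this reading the Int8x8 case comes out at about 16.4 TOPS at 2 GHz, so roughly 3.6 TOPS short of the quoted 20.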

    So we really should just not speculate on Volta at this point as NV is just ****ing with us disclosing these pieces of information which are actually specifically engineered to confuse the hell out of everyone, especially AMD of course. It's just too early to have any kind of sensible talk on Volta capabilities and whether it will or won't become more complex in SMs or other parts and if it will then why.

    Denver1 cores are ~2.5x bigger than ARM's A15 design. It's hard to say anything on Denver2 in Xavier obviously but I doubt that it's become smaller.

    [IMG]
     
    Last edited: Oct 18, 2016
