
NVIDIA Next-Gen Ampere GPUs to arrive in 2020 - based on Samsung 7nm EUV process

Discussion in 'Frontpage news' started by Hilbert Hagedoorn, Jun 5, 2019.

  1. Denial

    Denial Ancient Guru

    Messages:
    12,156
    Likes Received:
    1,329
    GPU:
    EVGA 1080Ti
    I agree, but I've had this argument before with people on this forum, and I took your position - that changes outside of RT are what mostly inflate the die size. It wasn't really Googleable though; you had to do your own research and come to your own conclusion. I found most "tech review" sites actually conclude the opposite, which you can see in various articles on the 1660. I think that's mostly because everyone believes Tensor cores are used for RT, but as of right now not a single game uses them outside of DLSS. So theoretically they could build an RTX GPU sans Tensor/DLSS and save some space.
     
  2. Astyanax

    Astyanax Ancient Guru

    Messages:
    2,068
    Likes Received:
    496
    GPU:
    GTX 1080ti
    It's funny that people conclude this, since BFV's RT doesn't use Tensor cores at all. But yes, they can be used in both RT and non-RT cases without being specifically in Tensor/denoise mode, since they double as the FP16 units on RTX GPUs.
     
  3. Fox2232

    Fox2232 Ancient Guru

    Messages:
    9,365
    Likes Received:
    2,020
    GPU:
    -NDA +AW@240Hz
    Quite possible. Turing itself is pretty beefy:
    GTX 1080 Ti: 11.8B transistors; 3584 SP; 224 TMUs; 88 ROPs
    RTX 2080: 13.6B transistors; 2944 SP; 184 TMUs; 64 ROPs

    Turing paid a heavy price for its 2:1 ratio of FP16 to FP32 - a price AMD paid with Vega, as even Polaris had a 1:1 ratio.

    RTX 2070: 10.8B transistors; 2304 SP; 144 TMUs; 64 ROPs; 2.3 + 4MB cache => per SM: 0.300B transistors; 64 SP; 4 TMUs; 1.78 ROPs; 64kB + 114kB cache
    GTX 1660 Ti: 6.6B transistors; 1536 SP; 96 TMUs; 48 ROPs; 1.5 + 1.5MB cache => per SM: 0.275B transistors; 64 SP; 4 TMUs; 2.00 ROPs; 64kB + 64kB cache

    If we excuse the small L2 cache difference and the slightly different ROP ratio, the actual number of building blocks per SM is the same. (I know ROPs are not in the SM, but this is a kind of normalization.)

    From my point of view, removing the Tensor and RT cores saved around 8-9% of the transistor count for castrated Turing. In other words, I'd expect an "RTX 1660 Ti" (with its 192 Tensor + 24 RT cores back) to have only 7.2B transistors.
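    The per-SM normalization above can be written out as a quick back-of-the-envelope calculation. This is only a sketch of the post's arithmetic; the transistor counts and the 64-SPs-per-SM assumption come from the figures quoted above, not from any official die analysis:

```python
# Back-of-envelope per-SM transistor normalization (figures from the post).
# Assumes 64 shader processors (SPs) per Turing SM.
def transistors_per_sm(total_billions, sps):
    sms = sps // 64
    return total_billions / sms

tu106 = transistors_per_sm(10.8, 2304)   # RTX 2070 (with Tensor/RT cores)
tu116 = transistors_per_sm(6.6, 1536)    # GTX 1660 Ti (without them)

saving = (tu106 - tu116) / tu106         # share saved by dropping Tensor/RT
hypothetical_rtx_1660ti = (1536 // 64) * tu106  # 1660 Ti if it kept them

print(f"per SM: {tu106:.3f}B vs {tu116:.3f}B")            # 0.300B vs 0.275B
print(f"saved by cutting Tensor/RT: {saving:.1%}")        # ~8.3%
print(f"'RTX 1660 Ti': {hypothetical_rtx_1660ti:.1f}B")   # 7.2B
```

    That ~8.3% falls inside the 8-9% range claimed in the post.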
     
  4. DrKeo

    DrKeo Member

    Messages:
    35
    Likes Received:
    12
    GPU:
    Gigabyte G1 970GTX 4GB
    You can't do a direct comparison between the 1660 Ti and the RTX 2070, because the 1660 Ti has extra FP16 cores to compensate for the lack of Tensor cores. In addition, some of the big die size is due to the INT32 cores, which are new to Turing. So even without any Tensor or RT cores, Turing would still be big.
     

  5. Stefem

    Stefem Member

    Messages:
    40
    Likes Received:
    4
    GPU:
    16
    INT units aren't new, but the added logic for concurrent execution alongside FP operations requires die area. Tensor cores are also beasts that must be kept fed to be worthwhile; in some situations they can be much faster and more power-efficient than normal FP16 units, but they require more cache and internal/external bandwidth.
     
  6. Fox2232

    Fox2232 Ancient Guru

    Messages:
    9,365
    Likes Received:
    2,020
    GPU:
    -NDA +AW@240Hz
    Both have the same 2:1 FP16:FP32 ratio. It's not like they removed FP16 from RTX either. What you can see in the image is that the Tensor part kind of beefs up the FP16 area.
     
  7. Denial

    Denial Ancient Guru

    Messages:
    12,156
    Likes Received:
    1,329
    GPU:
    EVGA 1080Ti
    I'm pretty sure FP16 on RTX cards is handled entirely in the Tensor cores - "big Turing" has no dedicated FP16 cores like "small Turing" does, but they operate at the same ratio.
     
  8. tsunami231

    tsunami231 Ancient Guru

    Messages:
    9,363
    Likes Received:
    288
    GPU:
    EVGA 1070Ti Black
    I don't get it - not that I think I'll do an EVGA Step-Up again any time soon, if ever.
     
  9. mackintosh

    mackintosh Master Guru

    Messages:
    222
    Likes Received:
    42
    GPU:
    MSI GTX 1080 Armor
    I'm mostly on a 2-to-3-year upgrade cycle nowadays; 2020 suits me perfectly fine for my next "tock".
     
  10. angelgraves13

    angelgraves13 Maha Guru

    Messages:
    1,177
    Likes Received:
    247
    GPU:
    RTX 2080 Ti FE
    I'm pretty sure Ampere will be a new architecture. Double the RT cores, probably about 40% faster for the Ti model vs 2080 Ti and maybe 40% energy savings compared to Turing. It'll no doubt have a ton of new features that haven't been announced as well, I'd assume.
     
    Last edited: Jun 7, 2019

  11. Fox2232

    Fox2232 Ancient Guru

    Messages:
    9,365
    Likes Received:
    2,020
    GPU:
    -NDA +AW@240Hz
    You make sense; it's just that the area is rather large. Basically, Tensors do twice as much FP16 in the same area as normal shaders do FP32. Enabling the FP32 units to do 2x FP16 would save space but would require more complex scheduling.

    Maybe Ampere will save some transistors right there. Or maybe not. But nVidia likes their statistics - like the one that there are ~36 INT32 instructions per 100 FP32 instructions in the average game. Maybe they'll build something around the ratio they expect to be common in 2 years.

    AMD can manipulate their computational ratios per CU, and the TMU count is a variable thing now too, even if the variability is not the highest and I can imagine better.
    (But I am not absolutely sure, as the limitations in AMD's patents were written in magical terms like "plurality" while mentioning just 2 scenarios.)
    At minimum, AMD can now have a CU consisting of 1x S-SIMD16 + 1x S-SIMD4, each paired with 1 TMU.
    Can they make 1x S-SIMD16 + 3x S-SIMD4 and pair them with 4 TMUs? Likely yes. Would it be best? I am not really sure, since I do not know the capabilities/limitations of those S-SIMD4 units, but it would nicely decrease the transistors feeding each TMU, enabling more CUs => TMUs.

    Since "GCN compatibility" likely means 4x S-SIMD16 with 4 TMUs per CU, we'll have to wait to see what S-SIMD4 brings.

    But it would be lovely if AMD started to control that unnecessary transistor investment in FP64 per CU - reduced FP32 per CU a bit and boosted FP16. And if not per CU, then at least per TMU, as that's what matters in the end.
     
  12. Astyanax

    Astyanax Ancient Guru

    Messages:
    2,068
    Likes Received:
    496
    GPU:
    GTX 1080ti
    Yep, that's why they are so big: the more functions you add to a subset of silicon, the larger it gets.

    You know, nVidia could make a programmable decoder for video, but it would probably dwarf the SM area.

    It would also take away one of the core architecture improvements: being able to dual-execute FP16 and FP32 without having to switch contexts.
     
  13. XenthorX

    XenthorX Ancient Guru

    Messages:
    2,548
    Likes Received:
    532
    GPU:
    EVGA XCUltra 2080Ti
    In two years, we'll see how many games actually used ray tracing anyway.
    Imagine Nvidia promoting its new cards using BFV again, lol - that would be terrible for them.
    Edit: Or Battlefront 3?!
     
  14. Fox2232

    Fox2232 Ancient Guru

    Messages:
    9,365
    Likes Received:
    2,020
    GPU:
    -NDA +AW@240Hz
    [QUOTE="Astyanax, post: 5677544, member: 273678"]it would also remove from one of the core architecture improvements, being able to dual execute fp16 and fp32 without having to switch contexts.[/QUOTE]
    True. But save 20% area per CU => make more CUs => boost everything a CU contains (including TMUs).
    It would still be weaker in FP16 operations, but not as much as today - kind of like 1660 Ti vs 2060. In non-DXR workloads one definitely gets more from the 1660 Ti's implementation. I saw a video comparison, and while the 2060 was 15% faster on average, it ate 30% more power at the same clock.

    While I believe we need something like a 4x FP16-to-FP32 ratio and an 8x FP32-to-FP64 ratio, I do not think nVidia's implementation is efficient.

    They could have gone yet another way about it: keep the Tensor cores, which do 2x FP16 compared to FP32, and also enable the FP32 units to execute dual FP16 like AMD did.
    That would mean peak FP16 being 4 times FP32. I am sure FP16 is the future, but nVidia did not really bring the best option.
    Concurrency is good for games which rely heavily on an FP16/FP32 mix, but there are not that many of them. Giving people more TMUs and FP32 by saving area, or potentially 4x FP16, would be much better.

    I wonder what FP16:FP32:INT32 ratio suits Metro with max DXR, because there the 1660 Ti suffers badly against the 2060. Maybe it is so heavy on FP16 shaders + ray tracing that all those FP32 units just idle while FP16 does all the work.

    It kind of reminds me of pixel shaders and vertex shaders at the HW level and the later move to fully programmable shaders. For gaming, FP64 is no good, and fixed FP32 that can't be utilized efficiently for anything else is not a good idea either. An FP32/16 unit + FP16/INT8/INT4 (and whatever else Tensors can/will do) delivers much more flexible use.
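    The peak-ratio arithmetic in the post above (a Tensor path at 2x FP16, plus AMD-style packed 2x FP16 on the FP32 units) can be sketched like this. The numbers are relative throughput rates for illustration only, not real hardware figures:

```python
# Hypothetical peak-throughput ratios, relative to FP32 = 1.0 (not real specs).
fp32_rate = 1.0
tensor_fp16 = 2.0 * fp32_rate    # Turing's Tensor/FP16 path: 2x the FP32 rate
packed_fp16 = 2.0 * fp32_rate    # AMD-style 2xFP16 packed onto the FP32 units

# Running both paths at once would give the 4:1 FP16:FP32 peak argued above.
combined_ratio = (tensor_fp16 + packed_fp16) / fp32_rate
print(combined_ratio)   # 4.0
```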
     
  15. Astyanax

    Astyanax Ancient Guru

    Messages:
    2,068
    Likes Received:
    496
    GPU:
    GTX 1080ti
    Rapid Packed Math isn't what you think it is.

    Using 32-bit units for 16-bit tasks is less efficient than dedicating units to the task.
     
    Stefem likes this.

  16. Fox2232

    Fox2232 Ancient Guru

    Messages:
    9,365
    Likes Received:
    2,020
    GPU:
    -NDA +AW@240Hz
    Still better than having FP32 sit there idling. FP16 is apparently the way to go, because the push for 8K is not going to work in any good way without it.
     
  17. fantaskarsef

    fantaskarsef Ancient Guru

    Messages:
    10,467
    Likes Received:
    2,695
    GPU:
    1080Ti @h2o
    AFAIK the Step-Up program was bound to purchases within the last 12 months before the next generation arrives... When the time between generations increases (like it has since Maxwell's release), it becomes ultimately impossible to be an early adopter and still use the Step-Up program. I thought that's basically what it was for - not to buy in late in the cycle and then step up.
     
  18. Stefem

    Stefem Member

    Messages:
    40
    Likes Received:
    4
    GPU:
    16
    It's not a magic bullet: there's a finite range you can represent, and it's much narrower in FP16 than in FP32, and sometimes the lower precision generates rendering artifacts. With the rise of HDR, which (hardly surprisingly) needs a wide range, sure it will be useful to extract any possible performance, but its use will be somewhat limited.
    I'm also quite sceptical about the extreme-high-resolution thing (and I laugh at the console makers' claims just because a box supports 8K output - even a GT 1030 supports 8K at 60 fps that way :rolleyes:). At 8K you end up "wasting" computation time on a lot of identical pixels, and I'm really curious what DLSS can do, as the higher the resolution, the better it works - so here Tensor cores become even more interesting.
    Dedicated half-precision units occupy some die space but are more efficient. Tensor cores are extremely dense but are a bit wasted if you can't fully leverage them, although they have an interesting advantage over standard FP16 units: the ability to output the results of operations on FP16 values in FP32, which helps a lot in avoiding excessive approximation in machine learning, but may also become useful in other ways if NVIDIA enables this feature on consumer cards.
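    The point about FP16's narrower range and precision is easy to demonstrate by round-tripping values through IEEE 754 half precision. This is just an illustrative sketch using Python's stdlib `struct` module ('e' is the binary16 format); the example values are my own, not from the post:

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision ('e' format).
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))   # 65504.0 - the largest finite FP16 value
print(to_fp16(2049.0))    # 2048.0 - integers above 2048 lose exactness
print(to_fp16(0.1))       # 0.0999755859375 - far coarser than FP32's 0.1
```

    Anything above 65504 is out of FP16's finite range entirely, versus roughly 3.4e38 for FP32 - which is the gap that matters for HDR-style wide-range values.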
     
  19. Unreal--

    Unreal-- New Member

    Messages:
    6
    Likes Received:
    2
    GPU:
    gtx1050ti
    I hope Nvidia doubles the ray-tracing performance in the next gen while keeping the price reasonable, so that developers could use ray-traced global illumination in games.
     
  20. angelgraves13

    angelgraves13 Maha Guru

    Messages:
    1,177
    Likes Received:
    247
    GPU:
    RTX 2080 Ti FE
    If they can do 2.5x the RT cores, then RTX can be "free" and again only limited by rasterized performance.
     
