Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. EdKiefer

    EdKiefer Ancient Guru

    Messages:
    3,141
    Likes Received:
    400
    GPU:
    ASUS TUF 3060ti
    Here's my take on asynchronous compute: the feature helps make sure the GPU stays at high utilization (GPU % usage).
    AMD seems to have a lot of efficiency problems keeping usage up, which is most likely why they're pushing it in DX12.
    If a card is already at high usage, then IMO you're not going to see improvements from this feature, and since Nvidia's DX11 driver already does very well here, along with its multi-core support, you can't get water from a rock.

    This is also probably why we saw AMD with higher power usage than Nvidia for the same performance: their cards' utilization was probably lower.

    DX12 is a double-edged sword: it's all up to the developer to code the game engine for the hardware, no matter which company's.
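
    For what it's worth, at the API level "async compute" just means the engine feeds the GPU from a second, compute-only queue so that otherwise idle units can pick up work. A minimal D3D12 sketch of the two-queue setup (illustrative only; the function name and the omission of error handling are mine, not from any shipping engine):
    Code:
    // Minimal sketch: one graphics (direct) queue plus one compute queue.
    // Work submitted to the compute queue may run concurrently with graphics
    // work, if the hardware and driver actually overlap them.
    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    void CreateQueues(ID3D12Device* device,
                      ComPtr<ID3D12CommandQueue>& graphicsQueue,
                      ComPtr<ID3D12CommandQueue>& computeQueue)
    {
        D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
        gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
        device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

        D3D12_COMMAND_QUEUE_DESC compDesc = {};
        compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute + copy only
        device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));
    }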
     
  2. TheRyuu

    TheRyuu Guest

    Messages:
    105
    Likes Received:
    0
    GPU:
    EVGA GTX 1080
    I don't think the sample size is large enough to draw that conclusion yet but the current evidence does seem to point in that direction. If the trend continues in other titles then maybe.

    At the same time I wouldn't discount Nvidia's DX11 performance either. They have a very efficient DX11 implementation and their general superiority in the software stack (not just the driver) can't be overstated.
     
  3. nevcairiel

    nevcairiel Master Guru

    Messages:
    875
    Likes Received:
    369
    GPU:
    4090
    Just look at the DX11 vs. DX12 comparison in the latest AT benchmark. It seems obvious to me why DX12 doesn't really benefit NV much: their DX11 already massively outperforms AMD's. Looking at overall performance, I think the NV cards land about where their raw hardware power says they should relative to AMD; it's just that AMD needs DX12 to get there, while NV can compete against that with DX11.
     
  4. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,107
    Likes Received:
    2,611
    GPU:
    3080TI iChill Black

    This should explain a lot. I really doubt it's going to be any different with Pascal. It's a CUDA feature after all, one that started with GK110.

    It's hardware all the way, and apparently very powerful too; the only software part is the driver telling the hardware when to use it.

    Devs need to implement it too, kind of like the CUDA-based water in Just Cause 2, etc.

    With AMD, devs also have to implement it, only through the DX API instead of CUDA; NV will give devs instructions for its path just like MS or AMD do for async in the DX API.




    Oxide is glued to AMD, and NV had to "beg" them to use their approach, but back then it was still a bit so-so on the software/driver side. Now it's apparently fixed, but Oxide hasn't used it yet. Or something along those lines.

    I personally wouldn't jump to any conclusions based on one crappy benchmark.

    They're shady; look how they deliberately crippled Star Swarm further when NV's DX11 path ran circles around Mantle.
     
    Last edited: Feb 26, 2016

  5. otimus

    otimus Member Guru

    Messages:
    171
    Likes Received:
    1
    GPU:
    GTX 1080
    DX12 seems great and all, really, it does, but the real problem is going to be if it DOES turn out to be a real boon for AMD and not much for current Nvidia. People take this to mean good things for AMD, but really, it just means bad things for DX12 adoption. It'll be Mantle all over again until Nvidia pushes out better hardware, which in turn means almost no one will seriously use DX12 on anything worth mentioning, outside of some AMD partners and Microsoft, until, like... probably 2019. Whether AMD fans like it or not, Nvidia has a hell of a lot of market share, and expecting developers to just say "Whatever, screw everyone, let's make their games not run as well because PROGRESS! <3 AMD!" isn't very realistic at all.

    I just really hope that's not the case at all. We desperately need things like Vulkan and DX12.
     
  6. SabotageX

    SabotageX Active Member

    Messages:
    78
    Likes Received:
    16
    GPU:
    EVGA RTX 3090Ti
    Probably only Pascal will support it. They'll hold off until Pascal's launch and then finally tell us that Maxwell doesn't support it and you have to upgrade.
     
  7. dgrigo

    dgrigo Guest

    Messages:
    17
    Likes Received:
    0
    GPU:
    TitanX
  8. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    There is no way it will be like DX10. All the major engines support it already, and all the major developer houses too. We have the magic combination of a lower process node after almost 4 years (28 to 16/14nm) and the introduction of two mainstream low-level APIs. One of the reasons DX12 was introduced was easier porting and a reduction of the driver complexity that has reached ridiculous levels with DX11. Even niche things like emulators have already adopted it; I've been playing Mario Kart Wii using DX12 since December. There's no going back. The only company that seems reluctant about it, and to be holding it back, is NVIDIA. My guess is that it's for the reasons explained below.

    NVIDIA has had problems with context switching since Kepler. They traded away scheduling hardware for better control of scheduling via the driver, and for better thermals. NVIDIA has had simultaneous compute since Fermi; the problem is switching from one context to the other. GCN has zero performance penalty for this, while NVIDIA themselves say that context switching is a very costly operation.
    [image: NVIDIA slide on context switching]

    Make no mistake, these choices have served NVIDIA really well. They owe their current market share to them. NVIDIA GPUs have been more efficient, and therefore cooler and faster, than their AMD counterparts since Kepler, more or less. The problem NVIDIA faces now is that those cards were cooler for a reason: they lack specific hardware. Everyone is working at 28nm, there is no magic, and AMD's designers aren't idiots who simply produce hotter cards. It was a matter of choice. AMD chose to invest in a single architecture spanning the consoles and all of their GPUs; NVIDIA chose to go for increased CPU efficiency under DX11.
    The charts here are most enlightening about the differences.
    This is exactly what I believe too. As I said above, there is a reason for Maxwell's better thermals: it literally has less hardware, and that hardware is already close to 100% utilized under DX11. That means it shouldn't see much difference with the lower-level APIs. As for AMD, it seems that their deep and parallel architecture has had problems being fed (I was the guy who made the overhead thread in the AMD subforum, I would know), and the lower-level APIs are finally able to feed the cards properly, hence the tremendous performance increases, since the cards have much more hardware on board than their NVIDIA counterparts.
     
  9. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc

    Don't be disappointed, AMD must be missing some DX12 features that run only on Nvidia and Intel. JK3 is getting those features, Tomb Raider and many more games too. :banana:
     
  10. Carfax

    Carfax Ancient Guru

    Messages:
    3,973
    Likes Received:
    1,462
    GPU:
    Zotac 4090 Extreme
    NVidia needs to find a way to enable the use of the GMU under DX12. That's the only way this thing makes sense. It seems like a glaring design oversight if they can't get it to work properly with DX12.
     

  11. RzrTrek

    RzrTrek Guest

    Messages:
    2,547
    Likes Received:
    741
    GPU:
    -
    Nvidia's shareholders agree...

    Backward compatibility?

    What's that?

    :banana:
     
  12. Carfax

    Carfax Ancient Guru

    Messages:
    3,973
    Likes Received:
    1,462
    GPU:
    Zotac 4090 Extreme
    Not really true. NVidia uses something called a GMU, or Grid Management Unit, which is hardware with a similar function to AMD's ACEs. The problem is that for some reason it's not compatible with DX12.

    Hopefully that will change in the future, because it's the only way NVidia will ever have truly concurrent graphics/compute workloads on Maxwell v2.

    GCN has a performance penalty for context switching too; it's just smaller than NVidia's.

    As I said earlier, NVidia has a functioning hardware scheduling unit called the GMU which can do concurrent graphics and compute tasks; the only problem is that it works only under CUDA at the moment.

    The GMU works, which is why Maxwell v2 has much higher performance than Kepler when using hardware-accelerated PhysX (which runs on CUDA). A single GTX 980 gets almost as many FPS as GTX 780 Ti SLI:

    [image: Gamegpu PhysX benchmark chart]

    Like I said, that's only for DX12. The GMU has some limitation, for whatever reason, that makes it incompatible with DX12.

    After thinking about it some more, I no longer believe my initial claim. Something does seem to be holding back Maxwell v2, but I seriously doubt it's because the chip is tapped out under DX11.

    GPUs are practically never tapped out, especially in DX11. And while AMD does indeed have a much more parallel architecture than Maxwell v2, it also has lower clock speeds.

    One of the reasons Maxwell v2 is so fast is that it has significantly higher clock speeds than AMD's Fury. Aftermarket GTX 980 Tis/980s/970s easily boost above 1400MHz at stock settings and sustain it, for instance.
     
  13. AsiJu

    AsiJu Ancient Guru

    Messages:
    8,958
    Likes Received:
    3,474
    GPU:
    KFA2 4070Ti EXG.v2
    - nvm -
     
    Last edited: Feb 26, 2016
  14. RzrTrek

    RzrTrek Guest

    Messages:
    2,547
    Likes Received:
    741
    GPU:
    -
    Was that image taken before or after the Kepler fix?
     
  15. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    From a development point of view, Fermi drivers are more awaited than "full" hardware multi-engine support on Maxwell 2.0. On pre-Maxwell 2.0 GPUs, no improvement at all is expected when running graphics and compute jobs concurrently (just a little less driver overhead on Maxwell 1.0).
    Anyway, remember that "async compute" support can improve performance only when the graphics and compute jobs access complementary hardware resources. If both types of job contend for the same hardware resources, no performance improvement can be obtained; overlapping a geometry/ROP-bound graphics pass with an ALU-bound compute job can help, while overlapping two passes that are both bandwidth-bound will not (see the sketch below).
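
    A hedged sketch of what that overlap looks like on the D3D12 side (not from any particular engine; the SubmitFrame function, the pass names and the choice of workloads are just illustrative): independent compute work goes to the compute queue, graphics work that doesn't touch its output overlaps freely, and a fence introduces a wait only where the dependency really exists.
    Code:
    #include <d3d12.h>

    // Illustrative only: overlap an ALU-bound compute job (e.g. light culling)
    // with a geometry/ROP-bound graphics pass (e.g. shadow maps) on separate
    // queues, synchronizing only where the compute output is consumed.
    void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                     ID3D12CommandQueue* computeQueue,
                     ID3D12Fence* computeFence, UINT64 fenceValue,
                     ID3D12CommandList* cullingList,   // compute queue
                     ID3D12CommandList* shadowList,    // graphics, independent
                     ID3D12CommandList* lightingList)  // graphics, needs culling result
    {
        // Kick off the compute job; it can run alongside the graphics queue.
        computeQueue->ExecuteCommandLists(1, &cullingList);
        computeQueue->Signal(computeFence, fenceValue);

        // Graphics work with no dependency on the compute output overlaps freely.
        graphicsQueue->ExecuteCommandLists(1, &shadowList);

        // GPU-side wait only for the pass that actually consumes the result.
        graphicsQueue->Wait(computeFence, fenceValue);
        graphicsQueue->ExecuteCommandLists(1, &lightingList);
    }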
     

  16. Carfax

    Carfax Ancient Guru

    Messages:
    3,973
    Likes Received:
    1,462
    GPU:
    Zotac 4090 Extreme
    Pretty sure it was after. That benchmark was taken at the beginning of this year as part of Gamegpu's 2015 re-benchmarks. Here is the same benchmark at 1440p; the GTX 980 Ti nearly doubles the Kepler Titan:

    [image: same Gamegpu PhysX benchmark at 1440p]
     
  17. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    When you say "same hardware resources", and since we're talking about compute, I assume you mean the shading units, right?

    It is quite obvious that this unit handles compute queues. The problem is the switching between graphics and compute tasks. The image I put in my previous reply is NVIDIA themselves saying that they can only context switch at draw call boundaries, and even then there is a performance hit.
    [image: NVIDIA slide on context switching]

    CUDA is really a compute-only thing. It's quite obvious that that unit can't handle switching between graphics and compute queues, only between compute queues. The ACEs can do both.

    Take a look at this thread at Beyond3D. You can see that all NVIDIA hardware is very fast on compute-only or graphics-only tasks, but when switches between them are involved, the latencies go up by at least 60%. If you read the thread a bit and look at the results people are reporting, you will see that GCN has zero performance penalty for context switching. To quote one of the posts with a 290:
    Code:
    Compute only:       52.71ms
    Graphics only:      26.25ms (63.90G pixels/s)
    Graphics + compute: 53.32ms (31.47G pixels/s)
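
    (Roughly, a test like the one quoted above is structured like this; a hedged sketch, not the actual Beyond3D code, and the helper name and fence/event plumbing are mine. You time a compute-only submission, a graphics-only submission, and then both submitted to their own queues at once: if the combined time lands near the larger of the two, the GPU is overlapping them; if it lands near their sum, it is serializing.)
    Code:
    #include <windows.h>
    #include <d3d12.h>
    #include <chrono>

    // Time one submission from ExecuteCommandLists until the GPU signals the fence.
    // For the "graphics + compute" case, submit to both queues first and then wait
    // on both fences, measuring the whole thing with one pair of timestamps.
    double TimeSubmissionMs(ID3D12CommandQueue* queue, ID3D12CommandList* list,
                            ID3D12Fence* fence, UINT64 value, HANDLE doneEvent)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        queue->ExecuteCommandLists(1, &list);
        queue->Signal(fence, value);
        fence->SetEventOnCompletion(value, doneEvent);  // CPU blocks until the GPU is done
        WaitForSingleObject(doneEvent, INFINITE);
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
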
    No, it cannot. It can only switch between compute tasks. That's why it's not enabled for DX12.

    The reason is that it only does compute switching, and for lightweight PhysX tasks at that. The top Tesla card for NVIDIA is a Kepler one, lest you forget. And the main differentiating factor between Maxwell and Kepler is doing away with even more hardware scheduling, making room for more efficient graphics units in Maxwell. Let me quote Anandtech's Maxwell architecture review:

    There is no shared resource management, and the scheduling units have even less capability than Kepler's (hence no Maxwell Tesla).

    You don't seem to have any faith in the people who designed this awesome hardware. All the indicators show that Maxwell 2.0 is practically fully utilized under DX11, which is a miracle on its own. If it weren't, it would see performance increases with DX12. I believe NVIDIA has a PR problem and nothing else.
     
  18. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    H/w is just some microcode commands. What is accessible from CUDA can be made accessible from DX12 or whatever. The question is: is the HyperQ h/w fit to handle DX12's requirements? That's an interesting question, and I'm pretty sure there are some problems there, hence the long wait for a concurrent compute implementation from NV.

    In any case, it is certain that GCN has a much more advanced implementation of the feature, and due to architectural choices it will always get more performance out of it. But the whole situation is upside down at the moment thanks to AMD's rather aggressive PR on the topic.

    We have two DX12 architectures: Maxwell and GCN 1.1+.
    - Maxwell is more advanced in the rendering features it supports (FL12_1) and is able to reach its peak performance more often, in any API.
    - GCN is more advanced in how it handles task execution (ACEs), but it actually needs to run several compute tasks in parallel with a graphics task to reach its peak performance, and for that it requires the DX12 and Vulkan APIs.

    That's the main difference between the two. If NV just leaves Maxwell as-is in Pascal and only adds HBM2 and GDDR5X support, they'll need to spend more transistors on SIMDs to overtake GCN in DX12 and Vulkan. Considering that Fiji contains ~10% more transistors than GM200, this shouldn't be a problem. A Maxwell chip with ~10% more general performance would be at Fiji's performance level even in the DX12 AotS benchmark, which isn't a very representative title anyway. The same can be said about GM204 vs. the Grenada chips.

    So what we have here is an architecture which is good only in the new APIs versus an architecture which is good everywhere. And there is a rather big possibility that the latter won't be kept as-is in Pascal and will be further improved. I don't know, I don't see any big wins for AMD here; at best they'll be on par with the same chips they have now.

    Firstly, there is a Maxwell Tesla. Four of them, actually: M4, M40, M6 and M60.
    Secondly, the reason GK110/GK210 is still used in the Tesla range is that the Maxwell range has very limited support for double-precision operations, which was done because otherwise it wouldn't have been much different from Kepler in gaming performance. AMD did the very same thing with Fiji, as both were limited by the 28nm production process. Compute task scheduling has nothing to do with it, as all Maxwell chips support HyperQ just like GK110.
     
    Last edited: Feb 26, 2016
  19. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    There is a point about people who are buying cards this year, cards they may want to keep for 2-3 years. The way things are starting to look, AMD's offerings in every price range except the very top look better. Would anyone really consider a 980 over a 390X? Or over a Fury (as both have 4GB of VRAM)? Or a 960 over a 380X? Even further down, things get harder. The 270 demolishes the 750 Ti, and the gap only increases with DX12. The only space where an NVIDIA card might make more sense is the ultra high end, and the only reason for that is the 6GB vs. 4GB of VRAM.
     
  20. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,107
    Likes Received:
    2,611
    GPU:
    3080TI iChill Black
    With Maxwell they removed DP (double precision) to give more room to SP. It's not that they crippled SP or compute in general any further.

    Also, that pic you keep posting only applies when there are loooong commands; otherwise there are no switching-overhead issues.
     
    Last edited: Feb 26, 2016
