Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. fantaskarsef

    fantaskarsef Ancient Guru

    Messages:
    15,766
    Likes Received:
    9,667
    GPU:
    4090@H2O
    I would not have any problem with putting cards from two vendors in my rig. Would give me bragging rights in both sub forums.


    Not sure about 1) (if they ever make it work, this limitation won't be the issue). But with 2), it's not only about installing the drivers, but also setting them up, tweaking them, and trying things out with new driver versions (at least with Nvidia, as far as I'm aware).

    It takes double the time to even get such a setup running properly compared with any normal CFX / SLI config, let alone if you have trouble with games - and then you can't even tell for sure which of the drivers is causing the problem :eyes:
     
  2. stereoman

    stereoman Master Guru

    Messages:
    887
    Likes Received:
    182
    GPU:
    Palit RTX 3080 GPRO
    Mixing cards from different vendors is a nice feature, but I'm more interested in the memory pooling than anything else. I've always thought it's a huge waste to only be able to use the memory of one card in an SLI setup. Can't wait till they implement this feature.
     
  3. fantaskarsef

    fantaskarsef Ancient Guru

    Messages:
    15,766
    Likes Received:
    9,667
    GPU:
    4090@H2O
    Yep, one of my biggest hopes for dx12 / vulkan. Not too optimistic about it though...
     
  4. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    They kinda both depend on the same feature, which is SFR. You will usually get the one with the other. The mixed GPUs really make sense at the point where the integrated Intel GPU can shave off some post-processing and give an extra 5-15% of performance, which would be great since it's hardware that most people already have but don't use.
     

  5. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    Memory pool sharing is possible with linked-adapter setups (i.e. SLI or Crossfire), and the level of sharing is determined by the related tier (https://msdn.microsoft.com/en-us/library/dn914408.aspx).
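    A minimal D3D12 sketch of how that sharing tier can be queried at runtime; the helper name is just a placeholder, `device` is assumed to be an already-created ID3D12Device*, and error handling is omitted:

        #include <windows.h>
        #include <d3d12.h>
        #include <cstdio>

        // Query how much resource sharing a linked-adapter (SLI/Crossfire-style)
        // device exposes. A node count above 1 means the adapters are linked into
        // one logical device; CrossNodeSharingTier tells you what may be shared.
        void PrintCrossNodeSharing(ID3D12Device* device)
        {
            D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
            if (SUCCEEDED(device->CheckFeatureSupport(
                    D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
            {
                std::printf("nodes: %u, cross-node sharing tier: %d\n",
                            device->GetNodeCount(),
                            static_cast<int>(options.CrossNodeSharingTier));
            }
        }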
     
  6. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    This source is talking to you right now -) You may believe what you want obviously.

    You keep saying something about some schedulers (and I can just repeat myself here and say again that Anand's quote doesn't have anything to do with global thread / queue scheduling at all), while in reality it's not an issue of the scheduler but more an issue of the architecture design in general. Even if Maxwell had a global scheduler structure similar to what GCN cards have, this would not lead to nearly the same performance gains from running compute asynchronously on Maxwell. Because there are actual reasons why concurrent compute shows gains on GCN, and these reasons are almost completely absent from the Maxwell architecture.

    That would be Kepler and Fermi. Kepler and Fermi can't run compute queues concurrently with graphics. And I'd bet that this quote is about them.

    So here's the thing - you can't do "in the CPU" something which is happening in the GPU. The driver, while it runs on the CPU, controls the GPU. If the GPU doesn't allow something to happen in principle, it can't happen no matter how fast a CPU you have.

    High-level thread scheduling happens in both the driver and the GPU.

    Maxwell (and even Kepler) cards show a lot of benefit from an API which uses CPU resources better - the benefits are actually pretty close to what GCN h/w shows in CPU-limited situations:

    [benchmark images]

    And yeah, you are getting the maximum utilization from Maxwell under DX11 already because of how Maxwell h/w operates and how NV's DX11 driver is able to use CPU resources much more effectively than AMD's. Hence the general lack of performance boosts from concurrent compute in DX12 on Maxwell, hence the apparent tie between DX11 and DX12 on NV's h/w.
     
  7. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,107
    Likes Received:
    2,611
    GPU:
    3080TI iChill Black
    Last edited: Mar 1, 2016
  8. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    And I could claim I'm another source, but that doesn't work that way and for good reason.

    So, in the end, no matter the exact reason, you are basically agreeing with me that Maxwell doesn't seem to get as much of an uplift from DX12 as GCN does. My main point is that Maxwell is already maxed out under DX11 and won't see any great performance increases under DX12. I don't see us disagreeing on that.


    Look at the Do's and Don'ts guide for DX12 from NVIDIA themselves; it's quite interesting. On the first list there is the "admission" that the NVIDIA DX11 driver already handles multithreaded submission for developers.
    So basically they say that under DX11, their driver pretty much manages what a lower-level API would do.

    The ending quote is very interesting, and it's not for Kepler only, since this is their DX12 guide and they don't seem to be making any differentiation. In fact, at the end they state it applies to both Maxwell and Kepler.
    That kinda answers it about Async Compute right there. The benchmarks seem to indicate the same also.

    This is Kepler, and if my comment about its scheduler being more flexible than Maxwell's is correct, then Kepler might see bigger percentage increases over DX11 than Maxwell will.
    Furthermore, the image supplied is from Dolphin. Dolphin, being an emulator, has various restrictions, the main one being that it can't really use more than three threads for any kind of logic. A very good summary of the situation in Dolphin is given in this post, which is part of the Multithreading topic on Dolphin's forums.
    To get a quote from Dolphin's FAQ
    If I understand correctly, the DX12 renderer uses less CPU than the DX11 one, therefore helping in a CPU-bound situation. You pasted the pictures from the post of the guy who wrote the DX12 renderer, yet you didn't paste his own words, where he says that the extra performance is gained from the extra CPU time saved for emulation in graphics-intensive tasks, and not because there was any kind of amazing uplift in GPU performance.
    So, in the end, we are saying the exact same thing :)
    Maxwell won't see any great performance boosts from DX12, so people should stop being surprised by benchmarks showing small or zero performance improvements from DX12 on Nvidia hardware.
     
    Last edited: Mar 1, 2016
  9. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,872
    Likes Received:
    446
    GPU:
    RTX3080ti Founders
    Well, I finished my popcorn. Going to play my dx11 games now. Apparently, it's fun.
     
  10. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    Guys, the number of shader-group units is completely irrelevant. What matters is how the different jobs are scheduled and handled by the hardware. Having 31-32-63-64 compute groups means nothing.
     

  11. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    Such elegance and grace.

    Did any of us argue that? (I honestly don't understand why you would refer to that).
     
  12. Deasnutz

    Deasnutz Guest

    Messages:
    174
    Likes Received:
    0
    GPU:
    Titan X 12GB
    Surely upcoming games like Hitman and Gears of War will have it enabled, right? I can't find confirmation other than from the red side.
     
  13. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    Mapping a ludicrous number of command queues to the number of shader group units (or whatever the IHV calls them) will not gain any performance but will result in overhead (both driver and hardware). Most of the time, using 1 graphics and 1 compute queue (2 compute queues if you have different priorities for different jobs) will deliver better performance. Same for copy queues. DX12 allows you to set queue execution priority too (actually there are only two values allowed: 0, aka normal priority, and 100, aka high priority), but that is just a hint to the driver.
    The driver and hardware should be able to handle and split the queues onto the architecture in the best way (except when the driver is buggy).
    Also, please note there is only 1 graphics queue allowed per device node, and as far as I am aware none of the current GPUs should be able to run multiple graphics queues in parallel per node (which is why there is only one graphics queue).
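    A minimal sketch of that queue setup, assuming a valid ID3D12Device* named `device` (the helper name is a placeholder; error handling and cleanup are omitted): one direct (graphics) queue at normal priority and one compute queue at high priority - the 0 / 100 values mentioned above map to D3D12_COMMAND_QUEUE_PRIORITY_NORMAL and D3D12_COMMAND_QUEUE_PRIORITY_HIGH.

        #include <windows.h>
        #include <d3d12.h>

        // One graphics queue and one compute queue; the compute queue gets the
        // high-priority hint (100). How the driver/hardware honours it is up to them.
        void CreateQueues(ID3D12Device* device,
                          ID3D12CommandQueue** gfxQueue,
                          ID3D12CommandQueue** computeQueue)
        {
            D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
            gfxDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
            gfxDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;   // == 0

            D3D12_COMMAND_QUEUE_DESC computeDesc = {};
            computeDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
            computeDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH; // == 100, just a hint

            device->CreateCommandQueue(&gfxDesc,     IID_PPV_ARGS(gfxQueue));
            device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(computeQueue));
        }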
     
    Last edited: Mar 1, 2016
  14. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    GM200 is just a beefier version of GM204; there is no serious feature difference like there was between GK10x and GK110.

    You really can't because you don't know what you're saying -)

    There are several reasons why DX12 may provide an uplift to a videocard compared to DX11.

    Concurrent async compute is the lesser one of them all, as it's the most complex to implement without issues and provides rather moderate performance increases even on GCN h/w (link). Maxwell doesn't support concurrent async compute as well as GCN does because Maxwell doesn't have the same issues in its graphics pipeline that GCN has. Because of the lack of those issues, Maxwell doesn't have as advanced a global scheduler as GCN - it simply doesn't need one.

    The second and most important reason for DX12 uplifts is the much more efficient CPU utilization under DX12 -- especially multicore CPU utilization. Here both vendors are enjoying more or less the same uplifts in CPU limited situations. The difference here lies in AMD being more CPU limited in DX11 on average because their DX11 driver is, well, crap. Hence there are titles - AotS being one of them - where NV is GPU limited in DX11 while AMD is CPU limited. In such cases you will see uplifts on AMD and won't see them on NV because NV's GPUs are already maxed out and they don't need more CPU power. In a case where both vendors will be CPU limited - like the Dolphin emulation benchmark I've provided above - you will see both vendors having a similar performance boost from switching to DX12. This is the biggest performance boost which DX12 provides over DX11.

    The third reason for performance increases in DX12 is the new h/w features introduced in DX12 h/w - specifically those found in FL12_1 feature level which AMD chips do not support at all at the moment. These features are mostly there for performance optimizations and I expect them to provide NV with more or less the same performance increases as AMD will get from their concurrent async compute implementation. Note that these features can be used to get a better performance in DX11/DX12/OGL/VK - all modern APIs, not just the newer ones.
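    For what it's worth, whether a given card actually exposes FL12_1 and its headline features (conservative rasterization, ROVs) can be checked at runtime. A minimal sketch, assuming a valid ID3D12Device* `device`; the function name is a placeholder and return codes are ignored for brevity:

        #include <windows.h>
        #include <d3d12.h>

        // Check the maximum supported feature level and the two FL12_1 headline features.
        void CheckFL121(ID3D12Device* device)
        {
            D3D_FEATURE_LEVEL requested[] = {
                D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1
            };
            D3D12_FEATURE_DATA_FEATURE_LEVELS levels = {};
            levels.NumFeatureLevels        = _countof(requested);
            levels.pFeatureLevelsRequested = requested;
            device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS, &levels, sizeof(levels));
            bool fl12_1 = levels.MaxSupportedFeatureLevel >= D3D_FEATURE_LEVEL_12_1;

            D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
            device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &opts, sizeof(opts));
            bool conservativeRaster = opts.ConservativeRasterizationTier
                                      != D3D12_CONSERVATIVE_RASTERIZATION_TIER_NOT_SUPPORTED;
            bool rovs = (opts.ROVsSupported == TRUE);
            (void)fl12_1; (void)conservativeRaster; (void)rovs; // use as needed
        }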

    Wrong again -)
    They are saying that the threading of requests from an app to the GPU, which is done by their DX11 driver, will have to be performed manually by the app developer in the app's code in DX12. The DX12 driver doesn't do any threading, and that's actually the reason it has such low CPU overhead - it just doesn't do a hell of a lot of the work that is being done in the DX11 driver.
    NV is warning developers that they can't rely on their driver to perform the threading in DX12 - that's all. That's by DX12 design, nothing unique to NV's h/w here.
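    A rough sketch of what that app-side threading looks like; RecordScenePart is a hypothetical stand-in for the actual draw-call recording, and device/queue creation, synchronization and resource cleanup are omitted:

        #include <windows.h>
        #include <d3d12.h>
        #include <thread>
        #include <vector>

        // Each worker thread records its own command list with its own allocator;
        // the DX12 driver will not spread this work across threads for you.
        void RecordAndSubmit(ID3D12Device* device, ID3D12CommandQueue* queue, int workerCount)
        {
            std::vector<ID3D12CommandAllocator*>    allocators(workerCount);
            std::vector<ID3D12GraphicsCommandList*> lists(workerCount);
            std::vector<std::thread>                workers;

            for (int i = 0; i < workerCount; ++i)
            {
                device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                               IID_PPV_ARGS(&allocators[i]));
                device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                          allocators[i], nullptr, IID_PPV_ARGS(&lists[i]));
                workers.emplace_back([i, &lists] {
                    // RecordScenePart(lists[i], i); // hypothetical per-slice recording
                    lists[i]->Close();
                });
            }
            for (auto& t : workers) t.join();

            // One submission of everything that was recorded in parallel.
            std::vector<ID3D12CommandList*> submit(lists.begin(), lists.end());
            queue->ExecuteCommandLists(static_cast<UINT>(submit.size()), submit.data());
        }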

    This quote specifically mentions the same command queue, which already means that it has nothing to do with async compute.
    As for the quote itself - a context switch of the pipeline is a costly operation and you should avoid it as much as possible on any h/w, GCN included.
    The main feature of concurrent async compute is that it happens without a context switch, because you have several command queues running in parallel - one with the graphics context and others with compute contexts - and the chip's multiprocessors can be fed with commands from both queues without actually switching the queue context.
    So while this quote relates to switches between graphics and compute contexts in general, it doesn't say much about concurrent async compute. What's even more interesting - you can actually run compute inside the graphics queue context without switching, which is what NV seems to have been doing with PhysX for ages.
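    To make that concrete, a minimal sketch of the two-queue submission; all the names are placeholders, the queues, fence and command lists are assumed to be valid and already recorded, and whether the hardware actually overlaps the work is up to the GPU/driver:

        #include <windows.h>
        #include <d3d12.h>

        void SubmitAsyncCompute(ID3D12CommandQueue* gfxQueue,
                                ID3D12CommandQueue* computeQueue,
                                ID3D12CommandList*  gfxList,
                                ID3D12CommandList*  computeList,
                                ID3D12Fence*        fence,
                                UINT64&             fenceValue)
        {
            // Compute runs on its own queue...
            ID3D12CommandList* computeLists[] = { computeList };
            computeQueue->ExecuteCommandLists(1, computeLists);
            computeQueue->Signal(fence, ++fenceValue);   // mark the compute pass as done

            // ...while independent graphics work is submitted to the direct queue.
            ID3D12CommandList* gfxLists[] = { gfxList };
            gfxQueue->ExecuteCommandLists(1, gfxLists);

            // Graphics work submitted after this point waits on the GPU timeline
            // (the CPU is not blocked) until the compute result is ready.
            gfxQueue->Wait(fence, fenceValue);
        }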

    Kepler doesn't support compute queues in parallel to graphics at all. Your comments on this are completely irrelevant.

    What difference does it make? If you're getting more CPU cycles in DX12 then you can spend these CPU cycles either in your application to speed up some CPU intensive stuff or in the DX12 driver to push more commands to the GPU. In both cases you will get a speed up if you were CPU limited. If you weren't - you won't, on any GPU, as you were GPU limited and this limit hasn't changed.

    Kinda. But I'm saying that it will see smaller improvements for a number of reasons (most of which are the same reasons Maxwell is so good in DX11 right now, which is actually a big plus), and you're saying it won't because it doesn't support concurrent async compute - which is wrong on many levels.

    Hitman most likely will, as it's an AMD Gaming Evolved title.
    Gears most likely won't, as it doesn't seem to run properly on AMD's current DX12 driver at all at the moment. I wouldn't expect a game which can't get basic rendering in DX12 right to use such a complex feature as async compute.
     
    Last edited: Mar 2, 2016
  15. narukun

    narukun Master Guru

    Messages:
    228
    Likes Received:
    24
    GPU:
    EVGA GTX 970 1561/7700
    Hey guys a random question here.

    I got my 970 OC'ed to 1561MHz. With that clock, does it mean I have 5,195 GFLOPS?

    core clock (MHz) * shader units * 2 / 1,000 = GFLOPS

    right?

    I'm asking because I'm not getting close in FPS to the R9 390 in The Division (cutscenes), and I'm supposedly getting the same level of GFLOPS. It has to do with the shaders, I guess?

    //R9 390 5,120 GFLOPS
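    A quick back-of-the-envelope check of that formula, assuming the stock GTX 970 shader count of 1664 (this is peak FP32 throughput on paper, nothing more):

        // Peak FP32 throughput: shaders * clock * 2 ops per clock (FMA).
        constexpr int    shaders   = 1664;    // GTX 970 CUDA cores
        constexpr int    clock_mhz = 1561;    // the overclock mentioned above
        constexpr double gflops    = shaders * clock_mhz * 2.0 / 1000.0;
        // gflops ≈ 5195 (about 5.2 TFLOPS), vs. ~5120 GFLOPS for a stock R9 390
        // (2560 shaders @ 1000 MHz) - near-identical on paper.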
     
    Last edited: Mar 8, 2016

  16. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    Flops are not the sole metric of performance. The R9 390 has twice the RAM and more memory bandwidth.
     
  17. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
  18. Denial

    Denial Ancient Guru

    Messages:
    14,207
    Likes Received:
    4,121
    GPU:
    EVGA RTX 3080
  19. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
    But it was actually a Fury - they edited the name of the video card from Fury to 390. So I think AMD is faster with the Fury?
     
    Last edited: Mar 10, 2016
  20. Undying

    Undying Ancient Guru

    Messages:
    25,502
    Likes Received:
    12,902
    GPU:
    XFX RX6800XT 16GB
    Edit the name? The 390 is performing great in that benchmark. When they put in a Fury it will probably even match the 980 Ti.

    It would be funny to see how the 970 stands vs the 390.
     
