Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. HeavyHemi

    HeavyHemi Ancient Guru

    Messages:
    6,954
    Likes Received:
    959
    GPU:
    GTX1080Ti
  2. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,895
    Likes Received:
    767
    GPU:
    Inno3D RTX 3090
    As for Polaris needing Async less:

    [image: chart of async compute gains, Polaris vs. Fiji]

    It practically has the same gains as Fiji. The gains also seem to scale up with frequency.
     
  3. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,701
    Likes Received:
    363
    GPU:
    MSI GTX1070 GamingX
    Wouldn't the gains scale up with frequency for all cards?
     
  4. HeavyHemi

    HeavyHemi Ancient Guru

    Messages:
    6,954
    Likes Received:
    959
    GPU:
    GTX1080Ti
    It's like Groundhog Day over and over.
     

  5. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,701
    Likes Received:
    363
    GPU:
    MSI GTX1070 GamingX
    Yep :bang:
     
  6. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,895
    Likes Received:
    767
    GPU:
    Inno3D RTX 3090
    The gains as absolute numbers do. The percentage gains don't, really. The largest percentage difference is between a 4096-shader/1050MHz Fury X and a 2304-shader/1266MHz Polaris chip. As Polaris clocks up (even by a tiny bit), its percentage gains are similar to the rest, indicating that Polaris improves about as much as the previous architectures. I posted it because dr_rus said that Polaris is more efficient and needs this method less, which seems to be incorrect. The differences in the benefit from Async between Polaris and Fiji appear to be within the margin of error, no matter the CU configuration or the frequency. Async seems to give a roughly constant percentage gain to each card.
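    To illustrate the absolute-vs-relative point with made-up round numbers (not taken from the chart): the same 5% async gain is 0.05 × 90 fps = 4.5 fps on a faster card but only 0.05 × 60 fps = 3 fps on a slower one, so the absolute gain scales with performance (and thus clocks) even while the percentage stays flat.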
     
    Last edited: Sep 12, 2016
  7. Carfax

    Carfax Ancient Guru

    Messages:
    2,913
    Likes Received:
    465
    GPU:
    NVidia Titan Xp
    @Feynman or dr_rus, can one of you guys verify or explain what this guy is saying about Pascal's asynchronous compute capability with Vulkan?

    Source

    So he's saying that asynchronous compute will not be of much benefit for Pascal in Doom, and that it will actually add latency.
     
  8. Carfax

    Carfax Ancient Guru

    Messages:
    2,913
    Likes Received:
    465
    GPU:
    NVidia Titan Xp
    Strange, Guru3D got a much greater difference with AC on/off than PC Perspective:

    [image: Guru3D vs. PC Perspective AC on/off benchmark comparison]
     
  9. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    No, it's not. You're confusing context switch latency with compute stream execution latency, as does everyone who keeps posting these graphs. Since GCN execution units do not switch depending on the current context, executing a stream of compute wavefronts can and will happen on them in parallel with the execution of the current context (presumably graphics here, although it may be other compute as well), so the latency you're referring to is not context switch latency, it's the "async" compute execution latency.

    When you actually force GCN to "switch" contexts by issuing a preemption call, you get worse latency than what NV had back on Maxwell. Thankfully, you don't really need to do that on GCN, so this comparison is basically pure theory, although there are some applications where this plays against GCN (ATW and high-priority contexts being crap on GCN until recently is one example).

    ACEs don't store any contexts; the work involved in a context switch is done at the CU level. An ACE is just a global dispatch port which issues commands from a queue to available CUs; those commands go into the CU's active workload, from which the CU's local dispatch units fetch waves to be executed on the execution units in a serial fashion. An ACE may send a preemption call to a CU, which forces the CU to finish the current job, save its current state to memory (L1->L2->VRAM->RAM, depending on where there's space), accept the new workload and start processing it. Nothing is different here compared to NV afaik.

    ACEs do not perform scheduling tasks; they just dispatch commands from queues on a FIFO basis. The global scheduler (aka the Graphics Command Processor) performs the scheduling. HWS is an extended ACE which is able to handle two queues simultaneously (which kind of confirms that they don't really need that many global dispatch units right now) and has additional functionality such as CU reservation, context prioritization and so on. It is also microcode-programmable and thus can be updated with drivers, while ACEs can't.

    All queues are exposed by the Maxwell driver; the "serialization" (not really a proper term, as GCN executes everything serially as well) happens in NV's driver, not in DX12. This also has nothing to do with context switching, since there is none: Maxwell runs both graphics and compute from the same graphics h/w context.
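    To make "exposed queues" concrete, here's a minimal D3D12 sketch (hypothetical variable names, error handling omitted) that creates a direct (graphics) queue plus a separate compute queue; whether the h/w runs them concurrently or the driver folds the compute queue back into the graphics engine, as described above for Maxwell, is invisible to the application:

    ```cpp
    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // Create one DIRECT (graphics) queue and one COMPUTE queue on an existing
    // ID3D12Device. The API guarantees both queue types can be created; how they
    // map onto the hardware's engines is entirely up to the driver.
    void CreateQueues(ID3D12Device* device,
                      ComPtr<ID3D12CommandQueue>& gfxQueue,
                      ComPtr<ID3D12CommandQueue>& computeQueue)
    {
        D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
        gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics + compute + copy
        device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

        D3D12_COMMAND_QUEUE_DESC computeDesc = {};
        computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute + copy only
        device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
    }
    ```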

    This isn't a context switch; warps/wavefronts being swapped out is a normal mode of operation for a GPU, since this is the main way to hide the latency of a memory access.

    You don't need that at all; you can schedule compute work inside the graphics h/w context just fine, and in this case Maxwell will run compute pretty much the same way GCN does it, by launching the compute warps on whatever SM is ready to execute them. It won't get benefits from async compute, because Maxwell's general utilization on graphics is already very high and this type of scheduling makes asynchronous execution hard, but that's pretty much it. Most benchmarks confirm this, with Maxwell losing 1-3 frames when going from DX11 to DX12, and this loss is mostly attributed to DX12 CPU scheduling overhead; it will probably go down even further with driver/OS updates.
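    For contrast, a minimal sketch of the "compute inside the graphics context" case (hypothetical PSO and root-signature objects): the dispatch is simply recorded on the direct command list, so it executes in queue order alongside the draw calls and no second queue or cross-queue synchronization is involved:

    ```cpp
    #include <d3d12.h>

    // Record a compute dispatch on the DIRECT command list, i.e. inside the
    // graphics context: no separate compute queue, no cross-queue sync.
    void RecordInlineCompute(ID3D12GraphicsCommandList* cmdList,
                             ID3D12PipelineState* computePso,
                             ID3D12RootSignature* computeRootSig,
                             UINT groupsX, UINT groupsY)
    {
        cmdList->SetComputeRootSignature(computeRootSig);
        cmdList->SetPipelineState(computePso);
        cmdList->Dispatch(groupsX, groupsY, 1);   // runs in queue order with the draws
    }
    ```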

    You can't. When you throw different workloads at the same device they will fight with each other: they will thrash your cache access patterns, they will create mutual stalls, etc. You seem to think that being able to run different contexts on the same CUs at the same time is some magical unicorn, while in practice it's not; even with great amounts of tweaking you still get unpredictable results, because you can't tweak for future h/w, for example, and you can't even tweak for every memory b/w of the current h/w on the market. These small differences will affect the execution latency of your secondary stream, and they will influence the execution of your primary stream as well. There is no magic bullet for this problem.

    GCN in general requires a lot more tweaking than NV's h/w to extract its peak performance. When you throw async compute into this mix, it quickly approaches nightmare levels of tweaking. The big win for AMD here is the fact that everyone needs to do this anyway, because they need to extract that performance from consoles. PCs get some leftovers which sometimes work and sometimes don't, even on GCN h/w. When it's a PC exclusive you usually get a catastrophe (like TWW, for example, where AMD is still slower in DX12 than NV is in DX11), because most people on PC don't care about all the peculiar GCN optimizations and hacks needed to make it perform on the same level as NV's h/w.

    So saying that you can just "throw" anything at GCN is true only in the sense that it probably won't lead to a BSOD and will be executed even in its worst form, but chances are that NV h/w will be several times faster doing the same thing, even if it may BSOD or TDR at first.

    NV's h/w is more specialized in general, while GCN was a lot more "generic" back when it was introduced, and this division persists even now with Pascal vs Polaris; it will probably only go away with Volta. The complexity overhead of that "generalization" of GCN is why it underperformed against Kepler back then and why it needs some "hacky" s/w compute optimizations now to fix the slow frontend and help the slow backend. NV's h/w is usually very "timely": they never include stuff which can't be used right now and tend to fix issues with the simplest possible option (like the static->dynamic SM allocation between Maxwell and Pascal). That's why they earn more money: they make simpler h/w which performs better in modern workloads.

    Pascal doesn't have any of the issues you're describing, even though it still is unable to run different contexts on one SM. Right now that is a questionable benefit anyway, especially for gaming, where a mix of smart static scheduling with high utilization of the execution units, coupled with simpler h/w, does in fact beat GCN's "generic" approach, which, while looking great on paper, leads to energy, efficiency, utilization and frequency issues.

    3DMark is not the best example here; if you look at the games using async on AMD h/w, you'll notice that most of them show smaller gains on the 480 than on Tonga/Hawaii/Fiji.

    This is, all in all, expected behavior. It may become even worse if AMD ever implements the same clock boost functionality as NV, as async compute tends to affect the maximum boost clocks negatively (due to the higher h/w utilization percentage), and the net gain with async on may actually end up being negative compared to async off with higher clocks.

    I have no idea what he's talking about, but in every discussion about Doom it's worth remembering that AMD's gains there are 50/50 due to OGL driver crappiness and console intrinsics being used. AMD's h/w gains ~5% from async compute there, so if NV's h/w gains ~2-3%, I'd say that's about as much as we should expect in any case.
     
    Last edited: Sep 12, 2016
  10. Yxskaft

    Yxskaft Maha Guru

    Messages:
    1,481
    Likes Received:
    119
    GPU:
    GTX Titan Sli
    With the exception of Kepler, Nvidia now also has nice gains in DOOM under Vulkan, for both Maxwell and Pascal, surpassing its OpenGL performance.
    So perhaps people can finally stop claiming that Nvidia's OpenGL drivers are so good that only AMD needs Vulkan.
     

  11. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    NV always had nice gains from Vulkan in Doom in CPU-limited situations. It's the GPU-limited situations where NV's OpenGL driver really shines compared to AMD's:

    [image: GPU-limited Doom benchmark chart, NV vs. AMD]
     
  12. siriq

    siriq Master Guru

    Messages:
    790
    Likes Received:
    14
    GPU:
    Evga GTX 570 Classified
    Keeping it short because I'm tired again. ACEs can be programmed; look up how and what to do. It already happened with Fiji. So many things have happened; I'm just too tired to go through it all and write a lot here.
     
  13. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    Fiji has two HWS / four ACEs, and ACEs can't be programmed. Seriously, it's time to stop posting.
     
  14. narukun

    narukun Master Guru

    Messages:
    217
    Likes Received:
    24
    GPU:
    EVGA GTX 970 1561/7700
  15. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    Unless I'm mistaken, both Maxwell and GCN can preempt only on draw call / wavefront group boundaries. However, Maxwell's context switching is faster in general. All modern GPUs have on-die memory for such operations, and I don't think that GCN has anything specifically dedicated to this, as that would be very much a waste of transistors for GCN's way of operation. GCN does not perform context switching in a typical gaming workload, as it can run several contexts in parallel with little to no issue.

    Well, this quote doesn't say much beyond the fact that ACEs dispatch commands to CUs, which is what I'm saying. If this can be called scheduling then sure, it's scheduling, but usually scheduling involves a lot more work than just dispatching streams to available resources. The fact that each ACE may have up to 8 streams available doesn't say much either, as this is a known fact. They do mention that ACEs build the task queues by themselves, which I thought they didn't, so we can agree that there's at least some scheduling happening there. This, however, goes against PrMinisterGR's claim that DX12/VK are somehow suited to GCN's internal scheduling, as it means that GCN h/w will still reschedule the queues the way it wants.


    That's not context related; that's just there to keep the wavefronts which are currently running on the CU active. Flushing of context works in the same way on all GPUs: the data goes to L1/L2/memory, wherever it will fit.


    They actually are, from the API side of things, and that's the only side of things for any s/w running on the system. You can't have a DX12 GPU without exposing the multi-engine functionality; it wouldn't be DX12-compatible. How the commands going into these queues are executed is completely up to the h/w, however, and on Maxwell all queues are rescheduled by the driver into the h/w graphics engine.

    This still seems to be a source of some misunderstanding around the whole async compute topic, even though there are several good posts on it. ALL DX12 h/w MUST expose the required engines to the API. This means that compute queues are exposed on Maxwell and even Kepler when you check their capabilities under DX12 (you can't actually check this, as it's required and presumed to just be there, but for the sake of better understanding let's assume that you can).

    What the h/w does with these queues is completely up to the h/w (and the driver, which is a part of the h/w). That's why a DX12 game using async compute can run fine on Maxwell: Maxwell's driver just reschedules the compute queues into the graphics engine of the GPU.

    This is also why Ashes, for example, can't really tell whether the h/w can or can't gain performance from async compute and just switches to a single-queue submission model when it detects any NV h/w ID, including Pascal, which actually can gain that performance.
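    For illustration only (this is not Oxide's actual code, just the general technique described above), a vendor-ID fallback might look like this; 0x10DE is NVIDIA's PCI vendor ID:

    ```cpp
    #include <dxgi1_4.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // Illustrative only: decide whether to use a separate async compute queue
    // based on the adapter's PCI vendor ID, as some engines reportedly did.
    bool ShouldUseAsyncCompute()
    {
        ComPtr<IDXGIFactory4> factory;
        CreateDXGIFactory1(IID_PPV_ARGS(&factory));

        ComPtr<IDXGIAdapter1> adapter;
        factory->EnumAdapters1(0, &adapter);       // first adapter, for brevity

        DXGI_ADAPTER_DESC1 desc = {};
        adapter->GetDesc1(&desc);

        const UINT kNvidiaVendorId = 0x10DE;       // NVIDIA PCI vendor ID
        return desc.VendorId != kNvidiaVendorId;   // fall back to single-queue on NV
    }
    ```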

    This is also why you get slightly different results in Time Spy on Maxwell when toggling async compute on and off: when it's "on", it's the driver doing the "serialization" on Maxwell; when it's "off", it's the 3DMark engine switching to a single graphics queue submission mode, and depending on which of the two is more efficient you may see slightly different results.
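    To make the on/off difference concrete, here's a hedged sketch of the two submission models (hypothetical command lists, error handling omitted): with async "on" the compute work goes to its own queue and only the dependent graphics pass waits on a fence; with async "off" everything is submitted on the single direct queue:

    ```cpp
    #include <d3d12.h>

    // Async ON: compute work runs on its own queue; graphics work that doesn't
    // depend on it can overlap, and only the dependent pass waits on the fence
    // (a GPU-side wait, the CPU is not blocked).
    void SubmitWithAsync(ID3D12CommandQueue* gfxQueue,
                         ID3D12CommandQueue* computeQueue,
                         ID3D12CommandList* computeWork,
                         ID3D12CommandList* independentGfxWork,
                         ID3D12CommandList* dependentGfxWork,
                         ID3D12Fence* fence, UINT64& fenceValue)
    {
        ID3D12CommandList* c[] = { computeWork };
        computeQueue->ExecuteCommandLists(1, c);
        computeQueue->Signal(fence, ++fenceValue);     // mark compute completion

        ID3D12CommandList* g0[] = { independentGfxWork };
        gfxQueue->ExecuteCommandLists(1, g0);          // can overlap with compute

        gfxQueue->Wait(fence, fenceValue);             // wait only before consuming results
        ID3D12CommandList* g1[] = { dependentGfxWork };
        gfxQueue->ExecuteCommandLists(1, g1);
    }

    // Async OFF: everything is submitted on the single direct queue, so ordering
    // is implicit and no cross-queue fence is needed.
    void SubmitWithoutAsync(ID3D12CommandQueue* gfxQueue,
                            ID3D12CommandList* combinedWork)
    {
        ID3D12CommandList* lists[] = { combinedWork };
        gfxQueue->ExecuteCommandLists(1, lists);
    }
    ```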
     

  16. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    HWS was added in GCN3 (Tonga, Fiji) and is sort of a programmable (by the driver, not by software), upgradeable ACE which can handle two streams instead of one. HWS is what made high-priority queues and CU reservation possible. I also haven't heard anything about changes here in Polaris, beyond the microcode update which was pushed to GCN3 cards as well with a new driver. Each ACE can handle up to 8 queues and work with one of them each clock; HWS is able to handle 16 queues and work with two of them each clock.

    AMD is being purposefully misleading here. A context switch on the scheduling side of things (the ACEs) means precisely jack **** and can happen however fast it wants; it won't matter, since the real context switch, which actually consumes lots of cycles, happens on the execution unit side of things, where you have to flush the units and save their state. With GCN's execution units being agnostic to the context they are running, GCN's "context switch" is precisely that: a change of the working stream by the ACE and the dispatch of this new stream to the same CUs. This is not actually a "context switch" per se, so you can't say that it's fast or slow compared to Maxwell, since it's just a completely different operation.

    The DX12 spec says nothing about which queues, and how many of them, the h/w should have. All it says is that the driver must expose graphics, compute and copy engines to the API. How the driver maps these to the h/w is completely up to the driver and the h/w. Maxwell's way of handling DX12 is 100% within spec. Kepler's way of handling DX12 is 100% within spec, and Kepler does not have any compute engines it can use in parallel with graphics at all.

    I'm pretty sure that it's the driver and not D3D. D3D doesn't know the capabilities of the h/w; it works with whatever the driver reports, and there is no way of reporting a lack of a compute queue - that would be out of spec.

    Afaik, Fable Legends' async was programmed into the graphics engine; it was just overlapping some calls inside one graphics queue (which is also a possibility completely forgotten in this whole discussion; people seem to think the async compute hack is the only way to have things running in parallel on a GPU, while this couldn't be further from the truth).
     
  17. aufkrawall2

    aufkrawall2 Maha Guru

    Messages:
    1,112
    Likes Received:
    231
    GPU:
    3060 TUF
  18. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    Actually, **** off. I won't try to explain things to you anymore. Go be certain that boost is dependent on temperatures and not the other way around.
     
    Last edited: Sep 29, 2016
  19. aufkrawall2

    aufkrawall2 Maha Guru

    Messages:
    1,112
    Likes Received:
    231
    GPU:
    3060 TUF
    I don't need your help with how to benchmark; I consider you totally incompetent in such matters.
     
  20. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,108
    Likes Received:
    462
    GPU:
    RTX 3080
    AotS doesn't use async compute on any NV h/w, via vendor ID detection, no matter what you change in any config file. So your benchmark shows precisely jack ****, aka margin of error. Good luck.
     
