Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
    Good news indeed, this has little to do with async shaders though
     
  2. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,125
    Likes Received:
    969
    GPU:
    Inno3D RTX 3090
    But this is the whole point: instruction-level granularity. It means that it can switch tasks at the instruction level, like GCN. Maxwell and the rest (I'm not sure about Fermi) can't.
     
  3. Denial

    Denial Ancient Guru

    Messages:
    14,206
    Likes Received:
    4,118
    GPU:
    EVGA RTX 3080
    Actually I'm not sure this is true. My understanding is:

    Async compute on AMD basically means that they can queue and execute both graphics and compute commands at the same time. The problem on Nvidia wasn't the execute part -- Nvidia can actually execute 1 graphics and 31 compute commands simultaneously. The problem with Nvidia, though, is that they couldn't issue another graphics command until the compute commands were finished. So if a compute command was taking a long time, it would stall the entire pipeline until it was done.

    With Pascal they can swap the compute commands out to memory, pipe in another graphics command, and restart the compute where it left off in conjunction with the graphics.

    There will probably be a switching penalty for the swap, but it will be significantly less than what it is on Maxwell, because Maxwell has to wait until the compute is completely finished, or time it out, before working on another graphics command.
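
    If it helps, here's the queue-vs-execute distinction in rough CUDA terms (just a minimal sketch with made-up kernels and sizes, not the D3D12 case itself). Both streams are queued asynchronously on any hardware; whether the two kernels actually execute concurrently is the part that differs between architectures:

    Code:
    #include <cuda_runtime.h>

    __global__ void longCompute(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            for (int k = 0; k < 1000; ++k)  // artificially long "compute" job
                d[i] = d[i] * 1.0001f + 0.5f;
    }

    __global__ void shortGraphicsLike(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;            // short "graphics-like" job
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        cudaStream_t compute, graphics;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&graphics);

        // Queuing is concurrent either way; concurrent execution is up
        // to the hardware scheduler, which is the whole argument here.
        longCompute<<<n / 256, 256, 0, compute>>>(a, n);
        shortGraphicsLike<<<n / 256, 256, 0, graphics>>>(b, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(compute);
        cudaStreamDestroy(graphics);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }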
     
  4. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,886
    Likes Received:
    1,015
    GPU:
    RTX 4090
    Instruction-level pre-emption is good news for async timewarp only. And only if it actually applies to the graphics context. The way they've worded it, it may only be true for the compute context and compute-vs-compute kernels.

    No, it's not. Pre-emption isn't about task switching and it's definitely not about running something in parallel.
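
    To show what I mean by priority rather than parallelism, in CUDA terms (a minimal sketch with made-up kernels; stream priorities are the closest user-visible knob to what pre-emption buys you):

    Code:
    #include <cuda_runtime.h>

    __global__ void bulkWork(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            for (int k = 0; k < 2000; ++k)  // long-running background job
                d[i] += 0.001f;
    }

    __global__ void urgentWork(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;            // small latency-critical job
    }

    int main() {
        int least, greatest;                // numerically lower = higher priority
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        cudaStream_t bulk, urgent;
        cudaStreamCreateWithPriority(&bulk,   cudaStreamNonBlocking, least);
        cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);

        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        bulkWork<<<n / 256, 256, 0, bulk>>>(a, n);
        // The scheduler prefers the high-priority stream at the next
        // point it is allowed to switch - per draw, per thread block,
        // or per instruction depending on the hardware generation.
        urgentWork<<<n / 256, 256, 0, urgent>>>(b, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(bulk);
        cudaStreamDestroy(urgent);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }

    The high-priority kernel doesn't run in parallel with the bulk one any more than before; finer pre-emption just shortens how long it waits for the switch.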
     

  5. Denial

    Denial Ancient Guru

    Messages:
    14,206
    Likes Received:
    4,118
    GPU:
    EVGA RTX 3080
    Since Maxwell, Nvidia's hardware has supported running both graphics and compute commands in parallel, but it had to wait until the compute/graphics command was finished before processing another graphics command. In Pascal this is fixed. Hence why they gave the example they did.
     
  6. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
    You're describing async SHADERS on AMD.

    Instruction-level pre-emption means finer granularity, but to enable the async shader functionality available on GCN you'd need a different scheduling pipeline.

    Maxwell can do concurrent graphics + compute, but it would need CPU work to resolve synchronization points between the two queues because the GMU doesn't support fences.

    dr_rus is correct, and the biggest effect this would have is on the responsiveness of async TW for VR. It is about context switching though, I don't know why he said it isn't.

    It's definitely not about running things in PARALLEL, because if they were running in parallel there'd be no need for the context switch.
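
    To make the fence point concrete, this is what a cross-queue sync point looks like at the CUDA level (a sketch with made-up kernels; it says nothing about how the GMU handles it in hardware):

    Code:
    #include <cuda_runtime.h>

    __global__ void produce(int *d) { d[0] = 42; }
    __global__ void consume(int *d) { d[1] = d[0] + 1; }

    int main() {
        int *d;
        cudaMalloc(&d, 2 * sizeof(int));

        cudaStream_t a, b;
        cudaStreamCreate(&a);
        cudaStreamCreate(&b);

        cudaEvent_t fence;
        cudaEventCreateWithFlags(&fence, cudaEventDisableTiming);

        produce<<<1, 1, 0, a>>>(d);
        cudaEventRecord(fence, a);          // signal: producer finished
        cudaStreamWaitEvent(b, fence, 0);   // GPU-side wait, no CPU round trip
        consume<<<1, 1, 0, b>>>(d);

        // The CPU-resolved version of the same sync point would instead
        // block the host and then launch the consumer:
        //   cudaEventSynchronize(fence);
        //   consume<<<1, 1, 0, b>>>(d);

        cudaDeviceSynchronize();
        cudaEventDestroy(fence);
        cudaStreamDestroy(a);
        cudaStreamDestroy(b);
        cudaFree(d);
        return 0;
    }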
     
    Last edited: Apr 22, 2016
  7. Denial

    Denial Ancient Guru

    Messages:
    14,206
    Likes Received:
    4,118
    GPU:
    EVGA RTX 3080
    But that doesn't sound like it's particularly hard to do in a driver. On the other hand, if you can't copy out the compute mid-stream, it would be impossible to do.

    Why would they give that graphics example if this wasn't the case?

    It's an in-order pipeline. The entire thing stalls if one command stalls. With Pascal, it can copy out the stalled thread, refill the pipeline and continue the stalled thread where it left off.
     
    Last edited: Apr 22, 2016
  8. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
    I don't see any references to graphics in the whitepaper related to pre-emption. What they're saying is that if you're using a Pascal GPU as both a display device and a compute processor, your life is going to be easier now.

    Also, notice how in the whitepaper they mention context swapping to DRAM? That's WAY slower than context switches on GCN, which swap to L1 cache (or maybe registers attached to the ACEs, I can't remember).

    Fable Legends did it; they had async shaders working on NV hardware by coding it statically. It's not impossible, just very hard and very time-consuming for very little gain.
     
  9. Denial

    Denial Ancient Guru

    Messages:
    14,206
    Likes Received:
    4,118
    GPU:
    EVGA RTX 3080
    Well, DRAM in this case is HBM2 on the GP100 and yeah, I agree, it's going to be slower -- but still faster than stalling the pipeline like it does on Maxwell.

    And yeah, idk, that's not my understanding of the pipeline.

    Right now, on Maxwell, the graphics pipeline is set up to run 1 graphics command and 31 compute commands in parallel. But if one of those compute commands takes 4ms and the graphics only takes 1ms, the entire pipeline takes 4ms, because it cannot issue another command to either until that longer compute is done.

    AMD's GCN, on the other hand, can issue 1 graphics and 8 compute commands whenever it wants. So going back to our workload, GCN would do 4 graphics commands and 1 compute command in 4ms, while Nvidia only did 1 graphics and 1 compute in 4ms. This is because AMD can continue to queue in graphics while the compute is running in parallel.

    On Pascal, Nvidia can now pause the compute pipeline, drop its state to HBM2/GDDR5, toss in another graphics command along with the unfinished compute, and continue going. It doesn't stall the pipeline waiting for the compute to finish.

    Currently, devs can work around this issue on Nvidia's GPUs by keeping the compute commands small by slicing them up: if each command only takes 2ms to execute, then in that 4ms window Maxwell can do 2 graphics and 2 compute commands instead of 1 and 1. But, as you said, that requires extra work on the devs' part, and they have to do it specifically for Nvidia, whereas on GCN it's irrelevant. With Pascal they don't have to worry about it, because when the graphics command finishes it will just save the state of the compute, pipe another graphics command in, then continue where it left off. So the compute commands can be as big as the dev wants.
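
    Here's the slicing workaround sketched in CUDA terms (kernel and numbers made up; the real case is D3D12 compute dispatches, but the idea is the same):

    Code:
    #include <cuda_runtime.h>

    __global__ void computeSlice(float *d, int offset, int count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            for (int k = 0; k < 500; ++k)   // stand-in for the heavy compute
                d[offset + i] = d[offset + i] * 1.0001f + 1.0f;
    }

    int main() {
        const int n = 1 << 22;
        const int slices = 8;               // one long dispatch -> 8 short ones
        const int per = n / slices;

        float *d;
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Between any two slices the scheduler has a natural point to
        // feed in other commands, instead of being blocked behind one
        // long job.
        for (int i = 0; i < slices; ++i)
            computeSlice<<<per / 256, 256, 0, s>>>(d, i * per, per);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s);
        cudaFree(d);
        return 0;
    }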

    The end result isn't as good as AMD's implementation, but the penalty should be a lot less severe for more complex compute workloads than it is on Maxwell, and the dev shouldn't have to spend any time optimizing for Nvidia's compute implementation.
     
    Last edited: Apr 22, 2016
  10. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,872
    Likes Received:
    446
    GPU:
    RTX3080ti Founders
    Is this the first sign that a "no AC on Maxwell 2" announcement is coming soon, due to the speed penalty?
     

  11. Denial

    Denial Ancient Guru

    Messages:
    14,206
    Likes Received:
    4,118
    GPU:
    EVGA RTX 3080
    The only way I can see Nvidia doing anything, given Maxwell's current limitations, is some kind of binary translation of DX12 compute commands into an optimized format, preferably with smaller compute slices.

    It wouldn't get around the lack of what people are calling "async compute", but it would improve efficiency in situations where the pipeline is stalled, as the stall would be shorter.

    Edit: Also please note, if my understanding of how the pipeline works is wrong, then all of what I'm saying is completely wrong and I'm a moron. I'm sure some Nvidia engineer is sitting here reading all our posts face palming.
     
    Last edited: Apr 22, 2016
  12. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
    The main thing is that Pascal still doesn't have compute + 3D engine concurrency unless the GMU is used, because they use the same unit.

    If the GMU were used, you would have 31 asynchronous compute engines running concurrently with the 1 3D engine, with as many queues as memory can support, but you would not have native fence support, so the CPU would have to resolve synchronization points. This is presumably how CUDA/GameWorks effects have been working until now.
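
    For what it's worth, the CUDA-visible side of those hardware queues is the CUDA_DEVICE_MAX_CONNECTIONS environment variable (default 8, up to 32 on Hyper-Q hardware). A sketch, with a made-up kernel:

    Code:
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void tinyKernel(float *d) { d[0] += 1.0f; }

    int main() {
        // Must be set before the CUDA context is created to take
        // effect (POSIX setenv; Windows would use _putenv_s).
        setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

        const int numStreams = 32;
        cudaStream_t s[numStreams];
        float *d;
        cudaMalloc(&d, numStreams * sizeof(float));

        // More streams than hardware connections simply alias onto the
        // same queue; still correct, but no independent scheduling.
        for (int i = 0; i < numStreams; ++i) {
            cudaStreamCreate(&s[i]);
            tinyKernel<<<1, 1, 0, s[i]>>>(d + i);
        }

        cudaDeviceSynchronize();
        for (int i = 0; i < numStreams; ++i) cudaStreamDestroy(s[i]);
        cudaFree(d);
        return 0;
    }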

    I'll have to check the Maxwell/Kepler whitepapers to refresh my memory.

    With Maxwell 2 (GM2xx), Hyper-Q can dynamically lift the reservation of execution units by the GCP (graphics engine).
     
    Last edited: Apr 22, 2016
  13. pharma

    pharma Ancient Guru

    Messages:
    2,485
    Likes Received:
    1,180
    GPU:
    Asus Strix GTX 1080
  14. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
    It's weird, because in their original AotS review the difference between the 980 Ti and Fury X at the 1440p CRAZY preset was way more pronounced, something like 35 vs 46. I'm guessing they didn't use a 980 Ti with different clocks, and why would the Fury X perform worse on a 5820K @ 4.5GHz vs a 6700K at 4.5GHz?

    who knows

    By the way, I'm becoming increasingly convinced AotS is a great benchmark, and also good for stability testing.
     
  15. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,886
    Likes Received:
    1,015
    GPU:
    RTX 4090
    Commands are created by the API, and they can be created in parallel since DX11.
    Having several command queues of compute context in addition to the one queue of graphics context on a GPU has been supported since Maxwell.
    This doesn't mean that they can be executed in parallel though, and as of right now Maxwell can't do this - although technically it should be able to.
    Pre-emption means that the previous workload is stopped and the new one is started. This isn't about running anything in parallel, as you obviously can't run something in parallel to something which was stopped. Finer-grained pre-emption means that the latency of starting new work which has a higher priority than what the GPU is currently running will be lower, and the only graphics-related task where this is needed is async timewarp in VR.

    No it's not. There are thousands of active threads on a chip at any given moment. When some thread is stalled, another thread is launched to hide the latency of the stall inside one context, be it graphics or compute. Pre-emption allows the user to control which threads have higher priority compared to others and thus should be launched as soon as they appear in the queue.

    It doesn't. Secondary compute queues on Maxwell are statically compiled into the main graphics queue at the moment. There is no context switching. You can run compute in the graphics context, and that's what current drivers are doing.

    Queues are not commands. At the moment Maxwell can't do this, as it is disabled in the drivers, which just pack compute commands into the graphics context instead. If one of those compute commands stalls while fetching data, another kernel from the same context will be launched to hide the latency. The whole thing is much more complex than most of you think. The static compiler creates a lot of concurrency by itself, even without any compute.

    This is fully dependent on the workload in question, on whether GCN has underutilized resources to perform compute in parallel with graphics, and on whether this trashes the caches so hard that the net result is negative. As I've said, this is way more complex than whatever AMD PR was telling people.

    This isn't async compute. Async compute is basically the opposite of pre-emption, as "async" means that you can launch something before a previous task has finished. There's no stopping; a task is launched while another task is active, in parallel, on the same h/w.

    Maxwell can execute compute in parallel to graphics. Whether it ever will remains to be seen, as NV has a lot of rather obvious reasons not to enable it on Maxwell and to keep it for Pascal only. Some of these reasons are technical, some are pure marketing.

    I was talking about task switching, and for all we know this pre-emption isn't it, as you can run a high-priority kernel from the same queue - as NV is doing with async timewarp on Maxwell and Kepler right now.

    DRAM means the whole memory hierarchy, as the cache is transparent to the code. Swapping out to DRAM means REG->L1->L2->DRAM.

    I don't think the GMU's capabilities have anything to do with Maxwell's AC support or lack thereof.
     

  16. TheRyuu

    TheRyuu Guest

    Messages:
    105
    Likes Received:
    0
    GPU:
    EVGA GTX 1080
  17. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
    I think he makes some good points, then ruins it with superficial statements like 'Pascal is more GCN-like' just because the ALU:register ratio has increased to match it... Pascal and GCN are still hugely different.

    Warp concurrency is indeed an area in which GCN has an advantage, but compute performance is determined by many more factors, including memory atomics, the datapath, and so on.

    If GCN is the better architecture for compute, why is Nvidia so dominant, both in terms of market share and performance in compute benchmarks?

    Some interesting insights, but rather superficial analysis, I'd say.
     
  18. Carfax

    Carfax Ancient Guru

    Messages:
    3,956
    Likes Received:
    1,450
    GPU:
    Zotac 4090 Extreme
    I honestly wouldn't give that guy much credence. Some things he says are undoubtedly true, but he's too biased to be considered a good source of information.
     
  19. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,886
    Likes Received:
    1,015
    GPU:
    RTX 4090
    He's starting with warps per SM, which isn't something that can even be controlled by s/w in graphics, so I'm not sure how he gets the result of "Maxwell loses performance once we move higher than 16 concurrent warps per SM". The graph he's showing is about shared memory throughput, which isn't really SM performance. Technically speaking, the more warps you have active, the higher the register/memory pressure, as each warp comes with its own allocation state. This has zero to do with async compute though.

    The rest of his memory rant is just plain bull****.
     
  20. Carfax

    Carfax Ancient Guru

    Messages:
    3,956
    Likes Received:
    1,450
    GPU:
    Zotac 4090 Extreme
    I wouldn't mind seeing you debate him on Anandtech forums. Those forums are heavily biased towards AMD, so you'd have your work cut out for you :D
     
