Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. pharma

    pharma Ancient Guru

    Messages:
    2,496
    Likes Received:
    1,197
    GPU:
    Asus Strix GTX 1080
Any Maxwell 2 chips you can still buy are simply inventory being cleared out, according to what Nvidia said a month ago. Fury is part of the new AMD lineup and, based on the Nvidia presentation, will not be a competitive part.
     
  2. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
It will have async support, as any DX12 chip on the planet does.

It will have concurrent async compute support as well. Why they didn't mention it during the unveil is a good question - NV's marketing is all kinds of crap lately - but I think it just shows how little they care about that "rrrrevolutionary" feature, which gives some +10% of performance even on Radeons and will likely give even less than that on Pascal.

Here's a nice post from sebbbi explaining some of the stuff around the whole async compute debacle. It also kinda explains how AMD can slow down competitors' GPUs even with one queue, without any async compute, i.e. in DX11 (and this is basically what's happening to Kepler / Maxwell because of console code optimizations in some recent titles).
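
For context: at the API level, "async compute" in DX12 is just a second command queue of type COMPUTE submitted alongside the usual DIRECT (graphics) queue - whether anything actually runs concurrently is entirely up to the driver and the h/w. A minimal C++/D3D12 sketch (names are illustrative, error handling omitted):

[code]
// Creating the graphics (DIRECT) and compute queues in D3D12.
// 'device' is assumed to be a valid ID3D12Device*.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;       // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
[/code]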
     
    Last edited: May 7, 2016
  3. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
Let's hope it will; that would bring some standardization and will probably make these chips much more viable for a longer period. I'm looking at that 1070 quite intently.
     
  4. fellix

    fellix Master Guru

    Messages:
    252
    Likes Received:
    87
    GPU:
    MSI RTX 4080
    Looks like Pascal is able to preempt individual SMs at run-time and switch between graphics and compute tasks. On Maxwell, the SM task assignment is static for the duration of the current batch.
     

  5. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
I'm still pretty puzzled why they decided to disclose this during today's closed-off press event instead of stating it from the stage yesterday.

It's also kinda interesting that, as far as I can tell, they don't actually enable this on Maxwell in the current drivers, even on a per-application basis.
     
  6. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
Decided to run these tests again on a fresh Windows install.

Very interesting results.

I'll quote myself here and post results from the launch of AotS.

Currently running the same system (only slightly different RAM settings) on W10 & 365.10; the game version may have been updated.

I tested at 1300/7000.

I tested async on vs. async off.

I'll leave you guys to guess which is which.

[benchmark screenshots: async on vs. async off at 1300/7000]
[benchmark screenshot: 1490/8000]
     
    Last edited: May 7, 2016
  7. netkas

    netkas Guest

    Messages:
    55
    Likes Received:
    0
    GPU:
    Many
From my point of view, the whole async compute story is about one thing: increasing the GPU's silicon utilization by the driver/app. As AMD has stated many times, their GPUs are not fully loaded during many stages of the rendering pipeline. Nvidia had the same issue; they just used a different approach to work around it.

AMD decided to add ACEs to its GPUs to process additional work on the CUs in parallel with graphics, but this approach needed support from the 3D API and applications. Initially this worked out on the new-gen consoles.

Nvidia decided to do it a different, easier way: make the "low silicon load" stages run faster. Higher clocks. And when the GPU is not reaching its power budget (i.e. not under a 100% stress load), let it overclock itself to finish those tasks faster, so it spends less time in situations where only a small part of the silicon is doing the job. Hello GPU Boost 1/2/3.

AMD won, thanks to consoles and Mantle (giving birth to DX12/Vulkan), and now Nvidia has to catch up.
     
    Last edited: May 9, 2016
  8. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
A bunch of PR-flavored bull**** again. You don't need ACEs to launch anything asynchronously, and NV GPUs do this just fine -- SMs have been able to launch different jobs whenever some h/w is available since prehistoric times.

What really happened is this: NV has been working on improving the utilization of their h/w since the introduction of G80 and has reached nearly peak levels in Maxwell (and probably Pascal, but that remains to be seen). There's nothing easy about this task, and it's the more proper way because it reaps results in all applications straight away, without the need for any support on the s/w side.

AMD's graphics utilization hasn't really improved since Tahiti, where it was already quite a bit worse than on Kepler (which was the third generation of NV's architecture). So since they obviously can't (or don't want to?) improve the graphics part of their GPUs, they've decided to use the ability they do have to their advantage: they can slot compute kernels into the same SM that is running graphics. This way an SM that is partially or fully idling in graphics, due to the inefficiency of their architecture, can fill that bubble with a compute kernel.

This is a worse approach because a) it is completely dependent on s/w support - not just game code but APIs as well - and because of this it will only help in games that use these APIs and this ability; it can also make those games run worse on other h/w. b) Its performance is very unpredictable, not only on all the AMD h/w available right now but on any future h/w from AMD or any other vendor as well.

The ability to run different contexts on the same SM is a good one, and it's something NV h/w still lacks and will have to get at some point. But using this ability to speed up the GPU because of how badly it handles a single graphics context is a backwards solution to the problem. A good GPU with this ability wouldn't get performance boosts from such context interleaving, as it would already be able to run at maximum capacity in one context.

So the situation right now is this:
- AMD's h/w sucks in the graphics context but is able to execute any contexts on the same SM at any given time. If they improve their graphics engine, they won't gain any performance from running compute in parallel anymore.
- NV's h/w is almost always at 100% utilization in graphics or compute, but it can't run both on the same SMs at any given time. If they improve their SM execution abilities, they won't take a performance hit when some compute warp needs to run in the middle of a graphics shader.

My own stance here is that NV's approach is the better one and AMD's approach will fade away in time, but right now, with current console h/w being the main target for engine optimization, it will affect NV's (and Intel's, btw) h/w performance negatively - although not by much, as there are ways of getting around this with clever driver / h/w scheduling. Pascal's improvements should help a bit as well.
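
To make the "based on s/w support" point concrete: the game itself has to split work onto the compute queue and join it back with a fence before any dependent pass. A rough C++/D3D12 sketch of the submission pattern (the pass names and the single fence value are purely illustrative):

[code]
// Submitting independent passes to two queues and synchronizing on the GPU
// timeline. The driver *may* overlap the first two submissions; on h/w that
// can't run graphics and compute on the same SMs it can simply serialize them.
#include <d3d12.h>

void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList*  gbufferPass,       // independent graphics work
                 ID3D12CommandList*  asyncComputePass,  // e.g. light culling / SSAO
                 ID3D12CommandList*  lightingPass,      // consumes the compute result
                 ID3D12Fence*        fence,
                 UINT64              fenceValue)
{
    // These two submissions have no dependency on each other.
    gfxQueue->ExecuteCommandLists(1, &gbufferPass);
    computeQueue->ExecuteCommandLists(1, &asyncComputePass);
    computeQueue->Signal(fence, fenceValue);

    // The dependent pass waits on the GPU timeline, not on the CPU.
    gfxQueue->Wait(fence, fenceValue);
    gfxQueue->ExecuteCommandLists(1, &lightingPass);
}
[/code]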

    Man, can you PLEASE reduce that image in your post? Horizontal scroll isn't something I'm used to on my 30" display.
     
    Last edited: May 9, 2016
  9. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
Async on 1490/8000:
[benchmark screenshot]

Async off:
[benchmark screenshot]
     
  10. TheRyuu

    TheRyuu Guest

    Messages:
    105
    Likes Received:
    0
    GPU:
    EVGA GTX 1080
They actually disclosed this during the GP100 reveal back at the beginning of April [1]. Although it's not specifically referring to the ability to preempt individual SMs, the wording (sort of) implies it's something new over Maxwell (which didn't have this functionality):
This doesn't mention anything about async, only pure compute tasks, but it is still a new addition over Maxwell, since you can't preempt at the SM level on Maxwell.

CUDA does use this ability (or at least what I think you're referring to). The on-board ARM chip manages how CUDA work is dispatched, although obviously individual SMs cannot be preempted on Maxwell.
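
For what it's worth, the closest an application gets to steering this from the CUDA side is stream priorities - how aggressively a higher-priority stream can displace work that is already running (thread block boundaries on Maxwell, instruction-level with Pascal's compute preemption, per NVIDIA's GP100 material) is up to the h/w and driver. A small C++ sketch against the CUDA runtime API, purely as an illustration:

[code]
#include <cuda_runtime.h>

// Create one low-priority and one high-priority stream. Work launched into
// the high-priority stream is scheduled ahead of pending work in the
// low-priority one; whether it can also interrupt work that is already
// executing depends on the GPU generation.
void CreatePrioritizedStreams(cudaStream_t& lowPrio, cudaStream_t& highPrio)
{
    int leastPrio = 0, greatestPrio = 0;  // note: numerically lower = higher priority
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPrio);
}
[/code]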

    [1] https://techreport.com/news/29946/pascal-makes-its-debut-on-nvidia-tesla-p100-hpc-card
     

  11. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
I'm not so sure this is even related to the topic, as pre-emption means that some warp goes into execution with high priority before the previous one finishes. It's basically the opposite of running something concurrently on the same SM.

It can also be unrelated for two reasons: a) it may only hold for execution within one context. You can pre-empt a workload if it's from the same context - cool for compute, running some debug stuff and async timewarp, but pretty useless in graphics otherwise. b) There is no indication at the moment that the feature is present outside of the GP100 chip, which has its own CUDA compute capability compared to the rest of the GP10x line.

This has nothing to do with CUDA and/or the ARM chip in the Tesla P100 system. The chip there is most likely needed because GP100 can't work without a host CPU running the OS and managing system memory.

What I'm talking about are the details of the concurrent compute implementation provided by NV during the Pascal launch. They're comparing Pascal's improvements in this area to Maxwell's capabilities, but AFAIK Maxwell isn't actually using this capability, as it's disabled in the drivers right now.
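
To keep the two ideas apart: concurrency is the separate COMPUTE queue discussed earlier, while pre-emption is roughly what an app is hinting at when it asks for a high-priority queue (the async timewarp case). A tiny C++/D3D12 sketch just to show where that distinction lives in the API - the helper name is made up:

[code]
#include <d3d12.h>

// A DIRECT queue with a HIGH priority hint. The priority says nothing about
// concurrency; it only tells the scheduler that this queue's work should get
// ahead of (and, if the h/w can, pre-empt) normal-priority work.
HRESULT CreateHighPriorityQueue(ID3D12Device* device, ID3D12CommandQueue** queue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    return device->CreateCommandQueue(&desc, IID_PPV_ARGS(queue));
}
[/code]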
     
  12. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
He's talking about the GMU on Maxwell; it's an ARM microcontroller.
     
  13. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
Low score. Is your CPU at stock?
     
  14. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
    Highly doubtful. Why would they use a CPU for a dedicated h/w engine?
     
  15. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc

  16. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
That's because those 980Ti results are from a heavily OC'd 980Ti card (>1500 MHz). The 1080 results in the AotS database seem to be from early engineering samples as well.

Wait for proper benchmarks, ffs. These "leaks" are all kinds of confusing and misleading.
     
  17. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
What are the clock speeds? I can't read them in this article. Do you have pictures or other proof?
     
  18. Denial

    Denial Ancient Guru

    Messages:
    14,207
    Likes Received:
    4,121
    GPU:
    EVGA RTX 3080
  19. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
That's not proof. It's just speculation from people here.
     
  20. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
The 980Ti results used in that post are the top 980Ti results in the AotS benchmark database. You can be damn sure they're from a heavily OC'd 980Ti card. And why are you asking us for details of the information presented in that post? Go ask its author.

That's actually proof enough, as it shows that even with a 1525 MHz core / 3950 MHz memory OC, a 980Ti isn't hitting the 50 fps number used in the mobipicker.com post.
     
