Discussion on Async Compute from an AMD Perspective

Discussion in 'Videocards - AMD Radeon Drivers Section' started by Eastcoasthandle, Sep 20, 2018.

  1. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    I want to know more about how AMD uses Async Compute vs how Nvidia does it. With Turing cards released yesterday, it's very possible we will see more games that utilize AC in DX12 and Vulkan. However, AC isn't used the same way by AMD and Nvidia, and each implementation hampers the other's performance when it's the only approach used in a game. I.e., if a developer uses AMD's method, AMD gains the best performance.

    Below is the information I've gathered about it so far. However, it's far from complete. Any input would be appreciated.
    ----------------------------------------------------------------------------------------------------------------------------------------

    It was all explained way back when FM got called out for implementing async compute via concurrent execution in Time Spy (something nvidia GPUs are good at). The thread was eventually locked. You can guess as to why.
    https://steamcommunity.com/app/223850/discussions/0/366298942110944664/

    It was also discussed in other places:
    https://www.extremetech.com/gaming/...tions-of-bias-in-new-directx-12-time-spy-test
    https://www.overclock.net/forum/227...doom-vulkan-benchmarked-23.html#post_25351958

    There were several other threads and articles about it and FM had to respond to it.
    https://benchmarks.ul.com/news/a-closer-look-at-asynchronous-compute-in-3dmark-time-spy

    But in a nutshell, Time Spy did not implement code that was friendly to AMD's uarch approach to async compute, which involves parallel execution.
    ----------------------------------------------------------------------------------------------------------------------------------

    Even though this is over-simplified, I want to lay out what I found about the difference between AMD's use of Async Compute and Nvidia's when used in DX12 or Vulkan.

    From AMD's point of view, you want a developer to favor parallel execution in DX12/Vulkan.
    From Nvidia's point of view, you want a developer to favor concurrent execution, which uses context switching, in DX12/Vulkan.


    Here is an example of the difference between the two Async Compute approaches: concurrent vs. parallel execution.

    [image: timeline diagram of concurrent vs. parallel execution of graphics and compute]
    The above is not my drawing; it can be found here: https://www.overclock.net/forum/25351958-post222.html

    The lines in the graph above represent (if I'm not mistaken) both graphics and compute. In a nutshell:

    AMD's GPU uarch can handle heavy loads of both graphics and compute at the same time, 'in parallel'.
    https://en.wikipedia.org/wiki/Parallel_computing

    vs.

    Nvidia's uarch can handle lighter loads of both graphics and compute 'concurrently' through context switching.
    https://en.wikipedia.org/wiki/Concurrent_computing
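
    To make the API side of this concrete, here is a minimal D3D12 sketch (my own illustration, not code from Time Spy or any game). Async compute starts with the application creating a second, compute-only queue; whether the two queues then truly run in parallel or get context-switched is decided by the hardware/driver, not by this code:

    Code:
        // Build with a Windows SDK; link d3d12.lib.
        #include <d3d12.h>
        #include <wrl/client.h>
        using Microsoft::WRL::ComPtr;

        int main()
        {
            ComPtr<ID3D12Device> device;
            D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

            // DIRECT (graphics) queue: accepts draw, compute, and copy work.
            D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
            gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
            ComPtr<ID3D12CommandQueue> gfxQueue;
            device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

            // COMPUTE queue: accepts compute and copy work only. Work submitted
            // here *may* overlap work on the DIRECT queue.
            D3D12_COMMAND_QUEUE_DESC computeDesc = {};
            computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
            ComPtr<ID3D12CommandQueue> computeQueue;
            device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
            return 0;
        }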


    Games like Doom, AoTS, Strange Brigade, Wolfenstein II, and Sniper Elite 4 (from what I've understood) do use async compute. There are some results found here: https://www.computerbase.de/2018-09...el-vs-low-level-api-epic-infiltrator-techdemo
     
    Last edited: Sep 20, 2018
  2. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Did you mean "(something nvidia GPUs are not good at)"? Because older generations did not even have adequate queues and it was all a dummy implementation.
    (That's the reason GTX 9x0 cards got a big poop instead of a bonus in Time Spy.)

    The image shows the difference. Async Compute behaves in only one way: context switching. (Check the queues for it and the details on what content can be queued.)
    The only real parallelism you get in graphics is (by nV's statement) their parallel rasterization and raytracing execution, as those run on different HW blocks.
     
  3. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    Well, I'm really seeking understanding of it all. The information I found is all I could find at the time. Any additional input would be insightful.

    At a guess, I'm assuming that we might see an uptick in async compute games with the release of Turing.
     
    Last edited: Sep 20, 2018
  4. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    @Hilbert Hagedoorn : Please, do us all a little (bIG) favor and add Async ON/OFF results for Time Spy. It looks like other sites did not compare the benefit between 9x0/10x0/20x0 either.
    And 20x0 does not have a comparison at all. Likely because 10x0 cards already showed some benefit.
    The question is how big it is for 20x0 (Ti) in comparison to older generations.
     

  5. kondziowy

    kondziowy Master Guru

    Messages:
    247
    Likes Received:
    147
    GPU:
    7800XT
    Non-parallel execution is beyond my understanding. How can you gain performance when nothing is running concurrently? It sounds like a workaround to schedule tasks on one thread even though they were supposed to run on two. How can you even write a program that gains performance from this implementation when no horsepower is added?
     
  6. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    Granted, I'm learning this too. Here is what I found.

    https://hothardware.com/news/amd-touts-asynchronous-shader-technology-in-its-gcn-architecture




    https://www.extremetech.com/extreme...ading-amd-nvidia-and-dx12-what-we-know-so-far




    https://gpuopen.com/concurrent-execution-asynchronous-queues/
     
  7. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Gains come from pushing other scheduled work into those times when the GPU would otherwise do nothing. It is a utilization thing. That's why you have those different queues which accept different types of work.
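
    Roughly, in D3D12 terms (a sketch with made-up names, not real game code), a frame that fills those gaps looks like this: independent compute work is kicked off on the async queue, graphics runs alongside it, and only the work that consumes the compute result waits, on the GPU side, via a fence:

    Code:
        #include <d3d12.h>

        // Queues, lists, and the fence are assumed created up-front
        // (e.g. as in the earlier snippet in this thread).
        void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                         ID3D12CommandQueue* computeQueue,
                         ID3D12CommandList* computeWork,   // e.g. AO, particles
                         ID3D12CommandList* gfxWork,       // independent of computeWork
                         ID3D12CommandList* consumerWork,  // reads computeWork's output
                         ID3D12Fence* fence, UINT64& fenceValue)
        {
            // Kick the async compute pass; it can fill GPU idle time left by gfxWork.
            computeQueue->ExecuteCommandLists(1, &computeWork);
            computeQueue->Signal(fence, ++fenceValue);   // mark compute as done

            // Graphics work that doesn't depend on the compute result overlaps it.
            gfxQueue->ExecuteCommandLists(1, &gfxWork);

            // GPU-side wait: only work submitted after this point waits on the fence.
            gfxQueue->Wait(fence, fenceValue);
            gfxQueue->ExecuteCommandLists(1, &consumerWork);
        }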
     
    -Tj- likes this.
  8. OnnA

    OnnA Ancient Guru

    Messages:
    17,787
    Likes Received:
    6,687
    GPU:
    TiTan RTX Ampere UV
    Technically:
    Fiji has 8 ACEs (1-way, so it can do 8 tasks simultaneously)
    Vega has 4 nACEs (4-way, so it can do 16 tasks simultaneously)
    Many don't know that the last old-gen GCN was Fiji; Vega is next-gen GCN.

    The best games (so far) use up to 4 tasks.
    Doom, Wolf2 and Forza are the most next-gen'ish in terms of asynchronous compute (there are more, of course).

    So, theoretically, we can have in one pass simultaneously:
    0. The GPU does its usual stuff...
    1. MSAA or other AA
    2. Shader specifics
    3. Shadows
    4. Ambient occlusion, like HDAO or others

    -> And this is only 4-way ;)

    5. TressFX
    6. Raytrace FX
    and the list can go on, up to 8/16 tasks

    -> I'd love to see one game that can pull off such an endeavour :D
    (the first 4 are Doom/Wolf2 async, BTW)
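
    A rough D3D12 sketch of that "N tasks at once" idea (names are mine): each pass records its own compute command list and they all go to the async queue in one submission. How many actually run simultaneously is up to the GPU's ACE setup, not the application:

    Code:
        #include <d3d12.h>

        // e.g. passes[] = { AA, shaderWork, shadows, AO } for the 4-way case above.
        void SubmitAsyncPasses(ID3D12CommandQueue* computeQueue,
                               ID3D12CommandList* const passes[], UINT passCount)
        {
            // One call hands all N independent passes to the compute queue.
            computeQueue->ExecuteCommandLists(passCount, passes);
        }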
     
    Last edited: Sep 23, 2018
  9. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    @OnnA

    Good info, thanks.
    Wait, are you saying that Forza 7 uses async compute? I could never find anything relating to that for their ForzaTech engine. I've checked a few places and cannot find where the ForzaTech engine supports A/C. What about Horizon 4? I'm not 100% sure, but I think that also uses the ForzaTech engine.
     
  10. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,095
    Likes Received:
    2,601
    GPU:
    3080TI iChill Black
    This.

    NV GPUs are already fully utilized; that's why there's minimal difference with it enabled.


    Turing and Volta have a separate compute pipeline for that true async. Lol, took them a while though.
     
    Fox2232 likes this.

  11. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    In parallel? From what I've seen, I thought it still uses concurrent execution.
     
  12. OnnA

    OnnA Ancient Guru

    Messages:
    17,787
    Likes Received:
    6,687
    GPU:
    TiTan RTX Ampere UV
    Almost every MS game (Windows 10 DX12) uses AC.
    Vulkan & DX12 games often have a minimum of 2 things thrown into AC -> don't forget about the console roots that low-level APIs have :D
    Thus asynchronous compute is present & used in those titles.

    Note:
    I'm very surprised by the FH4 performance (even at 4K with Extreme/Ultra, it's playable).
    And you can take that to the bank: it packs more than 2 things in AC :rolleyes: I bet 4, like in Doom & Wolf.
     
    Last edited: Sep 21, 2018
    Eastcoasthandle likes this.
  13. OnnA

    OnnA Ancient Guru

    Messages:
    17,787
    Likes Received:
    6,687
    GPU:
    TiTan RTX Ampere UV
    We need to wait for Oxide Games to make a statement on this, but maybe it just reroutes to the INT 16/32 pipeline (an alternative task manager).

    -> http://oxidegames.com/
     
  14. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Have you seen Turing results? I am only guessing that it has it, since the last gen benefited from async too. But I would love to see how big the difference is. I hope Hilbert saw my request and can shine some light on it.
    And I hope people can keep it at a civil level here, as all reasons for all kinds of bad emotions ended with the GTX 10x0 releases.
     
  15. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    Awesome sauce. I thought that may have been the case, but I was trying to find details on the console game engines used. But yeah, that makes sense to me. So that's why Forza 7 and Horizon 4 work so well on AMD cards...


    I hope they address it then. Perhaps with Fox2232's request we might get some insight here.

    EDIT:

    I also found this video explaining async compute. It's dated, so chime in if you like.





     
    Last edited: Sep 21, 2018
    RealNC and OnnA like this.

  16. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,095
    Likes Received:
    2,601
    GPU:
    3080TI iChill Black
    I saw a few async on/off tests (3DMark) and the difference is around the same as with Pascal: minimal.
     
    Fox2232 likes this.
  17. RealNC

    RealNC Ancient Guru

    Messages:
    4,893
    Likes Received:
    3,168
    GPU:
    RTX 4070 Ti Super
    That video makes a lot of sense, actually.

    Another video from the same guy that seems to explain a lot when considering nvidia's perf advantage in DX11 games:



    AMD seemed to have enough compute power in their chips to perform as well as nvidia, but they never did. They needed DX12 to actually get the perf out of their chips.
     
    Last edited: Sep 22, 2018
    Eastcoasthandle likes this.
  18. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    @RealNC

    Very good video. Thanks. This is what I've gathered from it:

    So the reason nV doesn't have issues like AMD when it comes to DX11 is that their drivers have an active server process (AKA CMDList) that monitors draw calls. If draw calls are not intended for the command list, the active server process intercepts them, slices up the workloads, and sends them to worker threads, whose output is eventually fed into the command list. This is particularly helpful in games that use only 1-2 CPU cores. Very interesting indeed.

    This tells me that there is an added cost/overhead to this for DX11. But due to the nature of the active server process (AKA CMDList) in the GeForce driver, this is mitigated. This sparked the "DX11 MT CMDList Support" claims in the Civ game back then.
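
    For reference, this is what a DX11 "command list" looks like at the application level: a deferred context records work on a worker thread and the immediate context replays it. A minimal sketch (my own illustration; by the video's account, nV's driver performs a similar split internally and transparently behind the API):

    Code:
        #include <d3d11.h>
        #include <wrl/client.h>
        using Microsoft::WRL::ComPtr;

        void RecordAndExecute(ID3D11Device* device, ID3D11DeviceContext* immediate)
        {
            // Worker-thread side: record state and draw calls into a deferred context.
            ComPtr<ID3D11DeviceContext> deferred;
            device->CreateDeferredContext(0, &deferred);
            // ... deferred->Draw(...) etc. goes here ...

            ComPtr<ID3D11CommandList> cmdList;
            deferred->FinishCommandList(FALSE, &cmdList);

            // Main-thread side: replay the recorded commands on the immediate context.
            immediate->ExecuteCommandList(cmdList.Get(), FALSE);
        }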

    And it further explains why nV cards get similar single-threaded and multi-threaded scores in the 3DMark API Overhead test, while AMD cards do not and overall do not do well there. That helped foster the belief that AMD's drivers have a huge API/driver overhead, which now clearly appears to be false. It's GeForce, through the use of the active server process, that needs game engines to use only 1-2 HW threads/cores so the driver can use the remaining ones; that makes it more of an overhead in DX11, it's just utilized very well.

    The GeForce scheduler is software-based. However, being software really doesn't matter; it could be done in hardware. What does matter is its ability to reschedule workloads, which determines how work is distributed to each SMX group.

    AMD, although its scheduler is hardware-based, cannot reschedule workloads in that fashion. AMD's HWS (hardware scheduler) simply passes the workload to the GPU. That's the difference that makes DX11 work so much better on GeForce.

    AMD's uarch simply cannot properly handle command lists the way DX11 wants them handled. The GCN arch is not capable of multi-threading well under DX11's restrictions. See below (from the video):


    [image from the video: GCN's handling of DX11 command lists]
    What AMD needs are game engines that are multi-threaded/multi-core, with less focus on "everything on the primary thread" and on the order of what goes to which core. This is why you see AMD's scores increase in the API Overhead test when MT is used vs. single-threaded.

    nV wants everything on the primary thread so their active server process can do the work (as illustrated in the pic above). I.e., the GeForce driver 'hacks' the way DX11 is supposed to work on DX11 game engines. Albeit it's starting to look like DX11 is simply inefficient without those driver 'hacks'.

    Because a game using 1-2 CPU cores doesn't take up many CPU cycles, the rest of the CPU is left open for the CMDList to utilize the remaining available HW threads. Nothing wrong with that, until the CPU hits 100% utilization.

    nV found a way to utilize the remainder of the CPU with the active server process/CMDList. It utilizes command lists in such a way as to increase the overall performance of the game at the cost of higher CPU core usage. Which, BTW, was/is mistaken for "the game using it" when looking at OSD monitoring apps like AB. Take a look at the pic below as an example (from the video):


    [image from the video: CPU usage with the GeForce driver's worker threads active]
    Very interesting indeed. So GeForce does indeed need more cores, while the game itself only needs 1-2 cores. I have to wonder why they waited until about two weeks ago to optimize their GeForce drivers for AMD multi-core CPUs. But I digress.





    [image from the video: GeForce multi-threading a game with low multi-core utilization]
    The above shows us how GeForce multi-threads games that have very low multi-core CPU utilization (from the video).



    [image from the video: a game engine utilizing many cores/threads]
    This is what AMD GPUs need: a game engine that fully utilizes more than just 1-2 cores/HW threads. Mantle, AKA Vulkan, and DX12, on the other hand, are designed so the game engine can fully utilize a multi-core processor, and thus benefit AMD's GCN uarch through its ACEs. However, due to nV's mind share and dominant market position, adoption of DX12/Vulkan game engines has been slow, be it for this or other reasons, i.e. technical know-how, cost, etc. Which isn't an excuse, just an explanation (from the video).
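
    A rough sketch of the engine-side multi-threading DX12 encourages (illustrative names, assuming the command lists and queue already exist): recording, the expensive part, is spread across CPU threads, and submission happens once:

    Code:
        #include <d3d12.h>
        #include <thread>
        #include <vector>

        void RecordAndSubmit(std::vector<ID3D12GraphicsCommandList*>& lists,
                             ID3D12CommandQueue* queue)
        {
            std::vector<std::thread> workers;
            for (ID3D12GraphicsCommandList* list : lists)
                workers.emplace_back([list] {
                    // ... record this thread's share of the frame's draw calls ...
                    list->Close();   // a list must be closed before execution
                });
            for (std::thread& t : workers)
                t.join();

            // One submission for everything the worker threads recorded.
            queue->ExecuteCommandLists(
                static_cast<UINT>(lists.size()),
                reinterpret_cast<ID3D12CommandList* const*>(lists.data()));
        }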



    [image from the video: nV's software scheduler under DX12/Vulkan]
    This explains (to me at least) what happens when nV's software scheduler encounters a DX12/Vulkan game engine using the API (from the video). It creates more of an overhead, as it wasn't designed to handle draw calls in parallel, which can turn the expected performance uptick into a regression. That might explain those instances where nV GPUs don't reach full 100% GPU utilization. However, I'm sure there is a workaround.



    This appears to show lower GPU utilization with a DX12 engine using async compute in parallel.





    Here is an update to the GeForce drivers using the same game. However, there is stuttering/hitching observed.




    This isn't to say that nV cards cannot function well in DX12/Vulkan. But it does explain the history of DX11 to DX12 and its slow adoption rate, IMO.

    A fascinating video, to say the least. Take note that well-threaded games are more often console ports than games designed specifically for PC. That doesn't mean the latter don't exist, but it says a lot about AMD's influence on those console ports.

    On the flip side of the coin, if you load more game logic onto the primary thread, you hurt AMD's performance: PhysX, tessellation, etc. If you add more game logic to the primary thread of an engine only using 1-2 threads, you can stall draw calls, etc., on AMD cards. This is because AMD's GPUs want game engines to be more multi-threaded. Take a look at the pic below (from the video):


    Edit, with examples of how AMD GPUs can be crippled in DX11 but how that "magically" gets fixed under DX12:


    From the developer of pCars:
    ...
    https://hardforum.com/threads/proje...ormance-issues.1861353/page-2#post-1041593386

    In my own opinion, "CPU" there is code for "primary hardware thread". This is a prime example of what is conveyed in the video and of how AMD GPUs were hampered.



    https://www.reddit.com/r/pcgaming/comments/366iqs/nvidia_gameworks_project_cars_and_why_we_should/

    Videos of the performance uptick can be found in this post:
    https://forums.guru3d.com/threads/a...ownload-discussion.422961/page-3#post-5585792
     
    Last edited: Sep 23, 2018
    Jackalito and iakoboss7 like this.
  19. JonasBeckman

    JonasBeckman Ancient Guru

    Messages:
    17,564
    Likes Received:
    2,961
    GPU:
    XFX 7900XTX M'310
    Newer games also use additional threads for deferred contexts under DX11. I don't know how badly AMD is affected by this, but it doesn't sound like they benefit much, if at all, from their use, although it doesn't outright harm performance either.

    https://github.com/GPUOpen-LibrariesAndSDKs/AGS_SDK/issues/20
    https://developer.nvidia.com/sites/...dev/docs/GDC_2013_DUDASH_DeferredContexts.pdf

    So far, from what I've been reading, many of the newer DX11 games rely on some of these. The biggest would be Assassin's Creed Origins, using up to 8 of them and thus also scaling better on an 8-core CPU, although it uses many other threads too and balances the CPU load pretty effectively overall. (Though it doesn't scale down as effectively, so quad cores get a bit limited here.)
    And while AMD keeps up all right in the game most of the time, cities tend to throw GPU usage into a bit of a chaotic mess, as the CPU has to pull an increased workload in these areas.

    I know the current version of the Anvil engine was built to be forward-compatible and to make use of newer tech, including DX12 and probably also Vulkan, although I'm expecting Odyssey, the upcoming game on the engine, to keep to DX11 again; we'll see. From how I'm seeing it, having a good D3D12 or Vulkan implementation would really help, and as seen with Shadow of the Tomb Raider, it also benefits NVIDIA GPUs, giving a nice boost in performance.

    Things could have improved since that GitHub discussion, though I wouldn't expect too much; it feels like the D3D11 driver overall could do with an overhaul, but those are complicated and time-consuming long-term projects.
    But DX11 isn't going away, so it would be beneficial to have optimized code for the API in the driver, though the numerous driver hacks sometimes required for compatibility with certain games probably interfere a bit.
    (Crossfire/SLI, for example, since DX11 doesn't natively support multi-GPU setups, unlike DX12 and Vulkan as of version 1.1 and higher; and that's just one part, although a pretty large one.)

    EDIT: I still have much more to learn on this subject, so this is pretty basic, but it's interesting to read up on how each GPU vendor does certain things and how the hardware and drivers actually work, even if a lot is covered by various disclosure agreements and is closed source.
    (Or is very high-level and incredibly technical, which is also an important factor.)
     
  20. Eastcoasthandle

    Eastcoasthandle Guest

    Messages:
    3,365
    Likes Received:
    727
    GPU:
    Nitro 5700 XT
    I remember something along the lines that MS's motivation for adding DX12 to Win10 was due to Mantle.
    Be it true or false, it was clear way back that AMD's GPU department needed a way to fully utilize their ACEs, and DX11 wasn't it.
    So it would seem that AMD has been fighting this battle for a long time.

    @JonasBeckman
    I'm sure there are some bottlenecks in Assassin's Creed Origins where GeForce's CMDList can shine. But some are pointing at VMProtect as the cause. It's not clear.


    4-core/4-thread CPUs don't fare well in this game with an nV GPU:




    Look at that 100% usage on the primary core. 8700K w/ GTX 1080:



    But if you look at a similar rig with an 8700K and a Vega 64, the CPU usage is more normal:



    Here is a Vega 56 with an 1800X:



    As you can start to see, it seems the high CPU usage is coming from nvidia users. Although AMD isn't faster, its CPU utilization is much, much lower. Those maxed-out CPU cores are, IMO, the result of the CMDList overhead in the GeForce drivers. By all means, I'm not implying that AMD doesn't have issues, but this particular pattern of 100% (or close to it) CPU usage is something I've seen reported in a lot of games. And there is a well-known thread on the GeForce forums discussing it:
    All games stuttering with FPS drops since Windows 10 Creators Update
    4,772 Replies
    1,113,647 Views

    Which, to this day, based on my understanding, hasn't been resolved. I'm not saying that CMDList is the direct or only cause; that's a topic for another thread. However, I can see how this approach to addressing DX11's shortcomings can run into its limits. And for that reason, in my opinion, this might spark more interest in Turing's use of DX12/Vulkan with async compute support.

    I am also not saying that AMD doesn't have these issues. More research is definitely required.

    However, AMD's approach doesn't necessarily make the game run "faster" on their GPUs, and I use quotation marks because the examples show it being faster but not as smooth. Not to say AMD doesn't have their own issues to contend with; just pointing out the obvious from the videos provided.
     
    Last edited: Sep 23, 2018
