Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

Page 3 of 57

dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

PrMinisterGR said: ↑

There is a point about people who are buying cards this year, cards that they may want to hold for 2-3 years. As things are starting to look, all of AMD's offerings in all price ranges except the top one look better. Would anyone really consider a 980 over a 390X? Or over a Fury (as both have 4GB of VRAM)? Or a 960 over a 380X? Further down, things are even harder. The 270 demolishes the 750 Ti, and the gap just increases with DX12. The only space where an NVIDIA card might make more sense is the ultra top, and the only reason for that is the 6 vs 4GB of VRAM.
Click to expand...

I've heard this point of view many times but the fact is - it's just plain wrong.

Here are the results of the latest VR benchmark for example:

Fact is - NV's cards are very competitive with AMD's in price/perf even if you disregard the more advanced software ecosystem and better overall compatibility with APIs (+FL12_1, +PhysX, +CUDA) and the considerably higher power requirements of AMD's cards.

AMD has to sell much more advanced cards with twice the VRAM to even compete with what NV is offering - and how is that good for AMD exactly? Even in this situation where a choice between 970 and 390 seems to be an easy one - AMD still loses the market. They need something which will be plain better, always and everywhere, like the 9700 Pro again, to turn the tide - and I don't see anything from them that actually can.

Last edited: Feb 26, 2016

dr_rus, Feb 26, 2016

#41
Alessio1989 Ancient Guru

Messages:

2,952

Likes Received:

1,244

GPU:

.

PrMinisterGR said: ↑

When you say "same hardware resources" and since we speak about compute, I assume you refer to shading units, right?
Click to expand...

I do, but that is just my guess (I do not have any documentation about AMD's implementation): graphics work involves different parts of the hardware, like the geometry and rasterizer units as well as the texture filtering units, while compute jobs do not (AFAIK). Setting up the pipeline for graphics work may take some time during which some parts of the GPU essentially do nothing. This time is perfect for running compute jobs (i.e. compute shaders) concurrently.
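
A toy model may make this idea more concrete. The sketch below is only an illustration with made-up numbers: the millisecond values and the function names `serial_frame_time` and `overlapped_frame_time` are assumptions, not measurements or anything taken from a vendor's documentation. It compares a frame where compute waits for graphics to finish with a frame where compute fills the time the shader units would otherwise spend idle.

```python
def serial_frame_time(graphics_ms, compute_ms):
    """Compute jobs run only after all graphics work is done."""
    return graphics_ms + compute_ms


def overlapped_frame_time(graphics_ms, idle_ms, compute_ms):
    """Compute jobs first fill the idle gaps inside the graphics work.

    idle_ms is the part of graphics_ms during which the shader units
    are waiting (for example on pipeline setup). Only the compute work
    that does not fit into those gaps extends the frame.
    """
    if idle_ms > graphics_ms:
        raise ValueError("idle time cannot exceed total graphics time")
    leftover_compute = max(0.0, compute_ms - idle_ms)
    return graphics_ms + leftover_compute


# Hypothetical numbers: 10 ms of graphics containing 2 ms of idle shader
# time, plus 3 ms of independent compute work.
serial = serial_frame_time(10.0, 3.0)
overlapped = overlapped_frame_time(10.0, 2.0, 3.0)
print(serial, overlapped)
```

In this simplified picture the gain is capped by the idle time: once the gaps are full, extra compute work lengthens the frame just as it would in the serial case.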

Last edited: Feb 26, 2016

Alessio1989, Feb 26, 2016

#42
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

-Tj- said: ↑

With Maxwell they removed DP (double precision) to give more room to SP. Not that they crippled SP or compute in general any further.

Also, that pic you keep posting only applies when there are loooong commands, otherwise there are no switching overhead issues.
Click to expand...

Nope, it clearly states that commands cannot be given within draw calls without big penalties. Meaning that the card has to give the draw command before it takes compute tasks. That's it.

dr_rus said: ↑

I've heard this point of view many times but the fact is - it's just plain wrong.

Here are the results of the latest VR benchmark for example:

Click to expand...

What you posted shows performance per dollar, and in the top five, 3 of the cards are AMD. Not to mention that this is a synthetic benchmark and it doesn't take into account things like VRAM sizes that DO matter in purchasing decisions. I'm sure that if any rational person had the choice between the 390 and the 970, they would get the 390, no questions asked.

-Tj- said: ↑

Fact is - NV's cards are very competitive with AMD's in price/perf even if you disregard the more advanced software ecosystem and better overall compatibility with APIs (+FL12_1, +PhysX, +CUDA) and the considerably higher power requirements of AMD's cards.
Click to expand...

FL12_1 doesn't matter, especially for Maxwell cards that aren't getting any extra performance from DX12. The rest are proprietary things that involve investment in the NVIDIA ecosystem. CUDA is nice, but OpenCL is catching up, and it's used everywhere by everyone. Same for PhysX, as engines like Havok or even PhysX itself have to run properly on a variety of hardware, otherwise nobody will use them.

-Tj- said: ↑

AMD has to sell much more advanced cards with twice the VRAM to even compete with what NV is offering - and how is that good for AMD exactly? Even in this situation where a choice between 970 and 390 seems to be an easy one - AMD still loses the market. They need something which will be plain better, always and everywhere, like the 9700 Pro again, to turn the tide - and I don't see anything from them that actually can.
Click to expand...

I'm not sure that AMD will be losing the market any time soon. They have shown signs of recovery since the introduction of the 300 series actually, and I expect their Q4 2015 results to be even better. This response also misses the point.

The point being that NVIDIA chose to go with less complex scheduling hardware to get more performance per watt, and now they see no performance benefit from a lower level API, because their driver is basically fulfilling this role for most games. Why is that so hard to grasp? Look at the things I paraphrased from Anandtech. NVIDIA has said so themselves in the Maxwell architecture presentation to websites. They have explicitly said that the Maxwell scheduling hardware is inferior, and it was a conscious design choice. Why does everybody pretend not to read it?

I'll post it again here:

Anandtech said:

Starting with the Maxwell 1 SMM, NVIDIA has adjusted their streaming multiprocessor layout to achieve better efficiency. Whereas the Kepler SMX was for all practical purposes a large, flat design with 4 warp schedulers and 15 different execution blocks, the SMM has been heavily partitioned. Physically each SMM is still one contiguous unit, not really all that different from an SMX. But logically the execution blocks which each warp scheduler can access have been greatly curtailed.
The end result is that in an SMX the 4 warp schedulers would share most of their execution resources and work out which warp was on which execution resource for any given cycle. But on an SMM, the warp schedulers are removed from each other and given complete dominion over a far smaller collection of execution resources. No longer do warp schedulers have to share FP32 CUDA cores, special function units, or load/store units, as each of those is replicated across each partition. Only texture units and FP64 CUDA cores are shared.
Among the changes NVIDIA made to reduce power consumption, this is among the greatest. Shared resources, though extremely useful when you have the workloads to fill them, do have drawbacks. They’re wasting space and power if not fed, the crossbar to connect all of them is not particularly cheap on a power or area basis, and there is additional scheduling overhead from having to coordinate the actions of those warp schedulers. By forgoing the shared resources NVIDIA loses out on some of the performance benefits from the design, but what they gain in power and space efficiency more than makes up for it.
Click to expand...

How hard is it to get?

PrMinisterGR, Feb 26, 2016

#43
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

Alessio1989 said: ↑

I do, but that is just my guess (I do not have any documentation about AMD's implementation): graphics work involves different parts of the hardware, like the geometry and rasterizer units as well as the texture filtering units, while compute jobs do not (AFAIK). Setting up the pipeline for graphics work may take some time during which some parts of the GPU essentially do nothing. This time is perfect for running compute jobs (i.e. compute shaders) concurrently.
Click to expand...

Thanks for the explanation Alessio.

PrMinisterGR, Feb 26, 2016

#44
-Tj- Ancient Guru

Messages:

18,103

Likes Received:

2,606

GPU:

3080TI iChill Black

Most of that stuff you quoted as me wasn't my text. Fix it if you wanna quote multiple sentences and debate with the one posting it. Just FYI.

And I'll post this again since you seem to ignore it

Despite the limitations, the use of compute shaders should still be considered. The reduced overhead and effectively higher level of concurrency compared to classic draw calls with proxy geometry can still yield remarkable performance gains.

Additional care is required to cleanly separate the render pipeline into batches.
If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.

With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware as you would expect.

Ask your personal Nvidia engineer for how to share GPU side buffers between DX12 and CUDA.
Click to expand...

For a safe bet, go with the batched approach recommended for Nvidia hardware:

Choose sufficiently large batches of short running shaders.
Long running shaders can complicate scheduling on Nvidia's hardware. Ensure that the GPU can remain fully utilized until the end of each batch. Tune this for Nvidia's hardware, AMD will adapt just fine.
Use multiple compute engines when applicable.
If the result of a compute job or an entire chain isn't needed until much later, offload it. This will start execution early on with AMD, while Nvidia gets a chance to batch multiple command lists.
Be careful with GCN 1.0 cards, as they can only allocate two compute engines.
Signal early, signal often.
Nvidia will only update signals and fences at the end of each command list, but AMD will do so much sooner. By using additional signals, the compute engine can be given a head start on GCN. This is also true for synchronizing multiple compute engines.
Don't worry about the overhead of additional fences for synchronisation purposes. The cost for waiting on more than one signal is just the same as waiting for a single one.
While signaling eagerly, be conservative about waiting. Each additional synchronisation point comes at a cost.
Commit early and be responsive.
Consider that AMD needs a higher level of concurrency, and many of your jobs are in fact independent. Don't wait for milestones in your render loop. Commit your work early and place fences instead.
Signals may arrive in a different order depending on the hardware. Make sure that your CPU side code is aware of that. An event driven approach will work better than a classic procedural one.
Keep 3D related jobs central.
Make especially sure that no compute jobs are required in between.
Your application may still be heavy on draw calls or utilize complex fragment shaders, as long as these are happening in bulk and dependent compute jobs have already been scheduled.
Click to expand...

source: http://ext3h.makegames.de/DX12_Compute.html
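
The advice in the quoted list to commit early and handle signals in an event driven way can be sketched on the CPU side roughly as follows. This is only a minimal illustration: the threads stand in for GPU queues, a function returning plays the role of a fence being signaled, and the job names, durations and helpers `fake_gpu_job` and `run_event_driven` are all hypothetical rather than part of DX12 or any driver API.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_gpu_job(name, duration_s):
    """Stand-in for a submitted batch; returning plays the role of its fence signaling."""
    time.sleep(duration_s)
    return name


def run_event_driven(jobs):
    """Commit all jobs early, then handle each completion as it arrives.

    jobs is a list of (name, duration_in_seconds) pairs. The returned list
    holds the names in completion order, which may differ from submit order.
    """
    completed = []
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(fake_gpu_job, name, duration) for name, duration in jobs]
        for future in as_completed(futures):
            # React to whichever fence signals first; never assume an order.
            completed.append(future.result())
    return completed


# Hypothetical job names and durations, for illustration only.
jobs = [("lighting", 0.03), ("particles", 0.01), ("post_fx", 0.02)]
order = run_event_driven(jobs)
print(order)
```

The point of the design is that nothing in the handler depends on the order of `order`, so the same CPU code keeps working if one vendor's hardware signals fences earlier or later than another's.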

-Tj-, Feb 26, 2016

#45
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

I'll keep just this:

NVIDIA said:

Additional care is required to cleanly separate the render pipeline into batches.
If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.
Click to expand...

You realize that they basically admit that their hardware can't really handle DX12, right? I also love (like you do), that everybody pretends to be blind about the architecture posts on Maxwell that I'm quoting. How they said themselves that the scheduling hardware is inferior.

PrMinisterGR, Feb 26, 2016

#46
CalinTM Ancient Guru

Messages:

1,689

Likes Received:

18

GPU:

MSi GTX980 GAMING 1531mhz

Can't handle DX12? Too bad, buy Pascal. The R9 200/300/Fury series will be low end by the time some proper DX12 games come.

What's all the fuss about? Nvidia has its 80% market share because of this: making people buy new stuff. They left those DX12 things out of the hardware on purpose. A company's objective is making money... it's normal.

CalinTM, Feb 26, 2016

#47
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

CalinTM said: ↑

Can't handle DX12? Too bad, buy Pascal. The R9 200/300/Fury series will be low end by the time some proper DX12 games come.

What's all the fuss about? Nvidia has its 80% market share because of this: making people buy new stuff. They left those DX12 things out of the hardware on purpose. A company's objective is making money... it's normal.
Click to expand...

No, they made an architectural compromise. It gave them one and a half years of almost complete market domination. It was a good choice. Not expecting to get more from DX12 doesn't mean that the cards are bad.

PrMinisterGR, Feb 26, 2016

#48
-Tj- Ancient Guru

Messages:

18,103

Likes Received:

2,606

GPU:

3080TI iChill Black

I'll judge whether it supports DX12 or not when I get to play some proper DX12 games, and I'll let you know then if it runs OK.

CUDA is CUDA. If they say to use async at the CUDA level instead of the DX12 API level, that doesn't mean it's not DX12 capable. Running in CUDA is faster anyway, far more direct to the metal than the DirectX API ever will be.

Now interpret this how you wish

-Tj-, Feb 26, 2016

#49
Singleton99 Maha Guru

Messages:

1,071

Likes Received:

125

GPU:

Gigabyte 3080 12gb

So with all this said, what's the future looking like for my 980 Tis? Once DX12 is being used more and more, will this generation of Maxwell become obsolete very quickly? Should we sell our cards now while they're worth something and hold onto the money? I do hope this isn't the case, as when I got these cards I wanted at least 3 years out of them.

Are we as Nvidia customers doomed by the introduction of DX12 and asynchronous compute?

As you can tell, all this goes over my head a lot, but I'm starting to learn, or trying to.

Last edited: Feb 26, 2016

Singleton99, Feb 26, 2016

#50
Alessio1989 Ancient Guru

Messages:

2,952

Likes Received:

1,244

GPU:

.

I still do not see why they would not be able to bring a better dispatcher to the D3D12 driver too, like the one in the CUDA driver... Or I am simply unaware of some implementation details...

Alessio1989, Feb 26, 2016

#51
Carfax Ancient Guru

Messages:

3,972

Likes Received:

1,462

GPU:

Zotac 4090 Extreme

PrMinisterGR said: ↑

CUDA is a compute-only thing really. It is quite obvious that that unit can't handle switching between graphics and compute queues, but compute only. The ACEs can do both.
Click to expand...

That's pre-Maxwell v2 only. With Maxwell v2, the GPU can do 1 graphics plus 31 compute tasks in mixed mode, or 32 compute tasks in compute mode.

This information comes straight from Anandtech as well, since you like to quote them a lot

On a side note, part of the reason for AMD's presentation is to explain their architectural advantages over NVIDIA, so we checked with NVIDIA on queues. Fermi/Kepler/Maxwell 1 can only use a single graphics queue or their complement of compute queues, but not both at once – early implementations of HyperQ cannot be used in conjunction with graphics. Meanwhile Maxwell 2 has 32 queues, composed of 1 graphics queue and 31 compute queues (or 32 compute queues total in pure compute mode). So pre-Maxwell 2 GPUs have to either execute in serial or pre-empt to move tasks ahead of each other, which would indeed give AMD an advantage.
Click to expand...

Source
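
The queue limits in that quote can be written down as a small lookup, which may help when comparing the architectures. This is a rough sketch that encodes only what the AnandTech text states: the table name `QUEUE_LIMITS` and the helper `can_run_concurrently` are hypothetical, and because the quote gives no compute queue count for the older architectures, that value is left as `None` and not checked.

```python
# Queue limits as described in the AnandTech quote. The quote gives no
# queue count for the older architectures, so total_queues is None there.
QUEUE_LIMITS = {
    "Fermi": {"mixed_mode": False, "total_queues": None},
    "Kepler": {"mixed_mode": False, "total_queues": None},
    "Maxwell 1": {"mixed_mode": False, "total_queues": None},
    "Maxwell 2": {"mixed_mode": True, "total_queues": 32},
}


def can_run_concurrently(arch, graphics_queues, compute_queues):
    """Check a requested queue mix against the limits in the quote."""
    if graphics_queues < 0 or compute_queues < 0:
        raise ValueError("queue counts must not be negative")
    limits = QUEUE_LIMITS[arch]
    # Every architecture in the quote exposes at most one graphics queue.
    if graphics_queues > 1:
        return False
    # Graphics plus compute at the same time needs mixed mode.
    if graphics_queues == 1 and compute_queues > 0 and not limits["mixed_mode"]:
        return False
    total = limits["total_queues"]
    if total is not None and graphics_queues + compute_queues > total:
        return False
    return True


print(can_run_concurrently("Maxwell 2", 1, 31))  # mixed mode
print(can_run_concurrently("Kepler", 1, 1))      # graphics and compute together
```

Note that this only models which queue combinations the hardware exposes; it says nothing about how efficiently work on those queues is actually scheduled.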

You can see that all NVIDIA hardware is very fast on compute-only or graphics-only tasks, but when switches between them are involved, the latencies go up 60% at least. If you read the thread a bit and see people reporting their results, you will see that there is zero performance penalty for GCN for context switching.
Click to expand...

I'm actually subscribed to that thread, so I'm familiar with it. The guy who wrote that benchmark/test app was a novice programmer. Also, asynchronous compute wasn't enabled in the GPU drivers for NVidia.

We'll have to wait until AC is fully enabled before we can come to any final conclusions.

No, it cannot. It can only switch between compute tasks. That's why it's not enabled for DX12.
Click to expand...

That's pre-Maxwell v2 only. That limitation is lifted for Maxwell v2.

Why do you think Maxwell v2 is so much faster than Kepler with PhysX workloads? Because it can execute both graphics and compute in parallel.

The reason is that it's compute switching only, and for lightweight PhysX tasks. The top Tesla card from NVIDIA is a Kepler one, lest you forget. And the main differentiating factor between Maxwell and Kepler is doing away with even more hardware scheduling, making space for more efficient graphics units in Maxwell. Let me quote Anandtech's Maxwell architecture review:
Click to expand...

This has already been addressed by TJ and dr_rus. The Tesla being a Kepler based variant has nothing to do with the supposed lack of a hardware scheduler.

You don't seem to have any faith in the people who designed this awesome hardware. All indicators show that Maxwell 2.0 literally works at 100% under DX11, which is a miracle on its own. If it didn't, it would get performance increases with DX12. I believe that NVIDIA has a PR problem and nothing else.
Click to expand...

Here's the thing though. One other DX12 benchmark (Fable Legends) shows no discrepancy with NVidia. In fact, the GTX 980 Ti is faster than the Fury X in that benchmark, and this is using reference cards.

With aftermarket cards, the gap would be even larger. Also, Fable Legends uses asynchronous compute for the dynamic global illumination.

Carfax, Feb 26, 2016

#52
fellix Master Guru

Messages:

252

Likes Received:

87

GPU:

MSI RTX 4080

The warp scheduling in Maxwell is indeed simplified because it doesn't need to be complex anymore, since the purpose of the new SMM layout is to boost the perf/W ratio with a more balanced architecture. Naturally, a big part of the downsizing was the significant reduction in the number of FP64 units compared to Kepler. As compensation, many improvements to the memory pipeline were made in Maxwell, like doubling the LDS size and generally streamlining the data caching routines.

The only advantage of Kepler (GK110 and GK210) is the high FP64 throughput and the official HPC validation for the Tesla line of SKUs.

fellix, Feb 26, 2016

#53
Carfax Ancient Guru

Messages:

3,972

Likes Received:

1,462

GPU:

Zotac 4090 Extreme

Alessio1989 said: ↑

I still do not see why they would not be able to bring a better dispatcher to the D3D12 driver too, like the one in the CUDA driver... Or I am simply unaware of some implementation details...
Click to expand...

It's probably being developed as we speak. These things take time.

I remember it took NVidia about two years to come up with a working DX11 multithreading driver, something which AMD still lacks.

Knowing NVidia, they are waiting until it's as great as they can make it before they release it. Since no final release DX12 game is available yet, they are not really under any pressure to release it before it's ready.

Carfax, Feb 26, 2016

#54
-Tj- Ancient Guru

Messages:

18,103

Likes Received:

2,606

GPU:

3080TI iChill Black

I checked with AIDA64 and it has 2 async engines, dunno whether that's as a whole or per block, though.

CUDA

OpenCL

I see this driver has OpenCL 2.0 at 62%. Maybe the full profile also adds some async parts, and they are also "waiting" (optimizing) for that to make it complete?

-Tj-, Feb 26, 2016

#55
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

Singleton99 said: ↑

So with all this said, what's the future looking like for my 980 Tis? Once DX12 is being used more and more, will this generation of Maxwell become obsolete very quickly? Should we sell our cards now while they're worth something and hold onto the money? I do hope this isn't the case, as when I got these cards I wanted at least 3 years out of them.

Are we as Nvidia customers doomed by the introduction of DX12 and asynchronous compute?

As you can tell, all this goes over my head a lot, but I'm starting to learn, or trying to.
Click to expand...

If you want my bet, your cards will perform roughly the same under DX12 as under DX11. The only "problem" is in the comparison with AMD cards, because NVIDIA users expect a similar performance uplift from them, which is completely unreasonable in my opinion.

PrMinisterGR, Feb 26, 2016

#56
Barry J Ancient Guru

Messages:

2,803

Likes Received:

152

GPU:

RTX2080 TRIO Super

PrMinisterGR said: ↑

If you want my bet, your cards will perform roughly the same under DX12 as under DX11. The only "problem" is in the comparison with AMD cards, because NVIDIA users expect a similar performance uplift from them, which is completely unreasonable in my opinion.
Click to expand...

I agree. NVidia cards are very well optimised in DX11, so if the hardware is already being used to almost its maximum, DX12 will only give slight/no improvement. AMD has huge room for improvement due to poor DX11 usage.
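
As a back-of-the-envelope illustration of that headroom argument: if performance is assumed to scale linearly with how busy the GPU is kept (a strong simplification), the best possible gain from a new API is bounded by the utilization already reached. The function name `max_uplift` and the utilization figures below are hypothetical, not measured values for any card.

```python
def max_uplift(utilization):
    """Upper bound on the relative gain from going to 100% utilization.

    utilization is the fraction of the GPU already kept busy, in (0, 1].
    Assumes performance scales linearly with utilization.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization - 1.0


# Hypothetical figures: one GPU already 95% busy, another only 70% busy.
well_fed = max_uplift(0.95)
under_fed = max_uplift(0.70)
print(f"{well_fed:.1%} {under_fed:.1%}")
```

Under these made-up figures the well-fed GPU has only about 5% left to gain, while the under-fed one has over 40%, which is the shape of the argument being made here.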

Barry J, Feb 27, 2016

#57
fellix Master Guru

Messages:

252

Likes Received:

87

GPU:

MSI RTX 4080

-Tj- said: ↑

I checked with AIDA64 and it has 2 async engines, dunno whether that's as a whole or per block, though.
Click to expand...

That feature could be referring to the ability of the GPU to use the PCIe bus for full-duplex (bi-directional) data transfers. For that, the GPU has to have two independent interface controllers (engines) to facilitate async requests.

fellix, Feb 27, 2016

#58
Yxskaft Maha Guru

Messages:

1,495

Likes Received:

124

GPU:

GTX Titan Sli

Alessio1989 said: ↑

I still do not see why they would not be able to bring a better dispatcher to the D3D12 driver too, like the one in the CUDA driver... Or I am simply unaware of some implementation details...
Click to expand...

Shouldn't Nvidia be able to do that through NVAPI?

Yxskaft, Feb 27, 2016

#59
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

Yxskaft said: ↑

Shouldn't Nvidia be able to do that through NVAPI?
Click to expand...

So ignoring DX12, like they are actually saying for heavy workloads in their documentation? Doesn't that completely negate the usage of a common API?

PrMinisterGR, Feb 27, 2016

#60
