Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. fantaskarsef

    fantaskarsef Ancient Guru

    Messages:
    15,766
    Likes Received:
    9,667
    GPU:
    4090@H2O
    I would not have any problem with putting cards from two vendors in my rig. Would give me bragging rights in both sub forums.


    Not sure about 1) (if they ever make it work, this limitation won't be the issue). But with 2), it's not only about installing the drivers, but also setting them up, tweaking them, and trying things out with new driver versions (at least with Nvidia, as far as I'm aware).

    It takes double the time to even get such a setup running properly compared with any normal CFX / SLI config, let alone if you have trouble with games - and then you can't even tell for sure which of the drivers is causing the problem :eyes:
     
  2. stereoman

    stereoman Master Guru

    Messages:
    887
    Likes Received:
    182
    GPU:
    Palit RTX 3080 GPRO
    Mixing cards from different vendors is a nice feature, but I'm more interested in the memory pooling than anything else. I've always thought it's a huge waste to only be able to use the memory of one card in an SLI setup. Can't wait till they implement this feature.
     
  3. fantaskarsef

    fantaskarsef Ancient Guru

    Messages:
    15,766
    Likes Received:
    9,667
    GPU:
    4090@H2O
    Yep, one of my biggest hopes for dx12 / vulkan. Not too optimistic about it though...
     
  4. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    They kinda both depend on the same feature, which is SFR. You will usually get the one with the other. The mixed GPUs really make sense at the point where the integrated Intel GPU can shave off some post-processing and give an extra 5-15% of performance, which would be great since it's hardware that most people already have but don't use.
     

  5. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    Memory pool sharing is possible with linked-adapter setups (i.e. SLI or Crossfire), and the level of sharing is determined by the related tier (https://msdn.microsoft.com/en-us/library/dn914408.aspx).
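    A minimal D3D12 sketch of how that sharing tier can be queried at runtime; the helper name is just a placeholder, `device` is assumed to be an already-created ID3D12Device*, and error handling is omitted:

        #include <windows.h>
        #include <d3d12.h>
        #include <cstdio>

        // Query how much resource sharing a linked-adapter (SLI/Crossfire-style)
        // device exposes. A node count above 1 means the adapters are linked into
        // one logical device; CrossNodeSharingTier tells you what may be shared.
        void PrintCrossNodeSharing(ID3D12Device* device)
        {
            D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
            if (SUCCEEDED(device->CheckFeatureSupport(
                    D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
            {
                std::printf("nodes: %u, cross-node sharing tier: %d\n",
                            device->GetNodeCount(),
                            static_cast<int>(options.CrossNodeSharingTier));
            }
        }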
     
  6. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    This source is talking to you right now -) You may believe what you want obviously.

    You keep saying something about some schedulers (and I can just repeat myself here and say again that Anand's quote doesn't have anything to do with global thread / queue scheduling at all), while in reality it's not an issue of the scheduler but more an issue of the architecture design in general. Even if Maxwell had a global scheduler structure similar to what GCN cards have, this would not lead to nearly the same performance gains from running compute asynchronously on Maxwell. Because there are actual reasons why concurrent compute shows gains on GCN, and these reasons are almost completely absent from the Maxwell architecture.

    That would be Kepler and Fermi. Kepler and Fermi can't run compute queues concurrently with graphics. And I'd bet that this quote is about them.

    So here's the thing - you can't do "in the CPU" something which is happening in the GPU. The driver, while it runs on the CPU, controls the GPU. If the GPU doesn't allow something to happen in principle, it can't happen no matter how fast a CPU you have.

    High-level thread scheduling happens in both the driver and the GPU.

    Maxwell (and even Kepler) cards show a lot of benefit from an API which uses CPU resources better - the benefits are actually pretty close to what GCN h/w shows in CPU-limited situations:

    [benchmark images]

    And yeah, you are getting the maximum utilization from Maxwell under DX11 already because of how Maxwell h/w operates and how NV's DX11 driver is able to use CPU resources much more effectively than AMD's. Hence the general lack of performance boosts from concurrent compute in DX12 on Maxwell, hence the apparent tie between DX11 and DX12 on NV's h/w.
     
  7. -Tj-

    -Tj- Ancient Guru

    Messages:
    18,107
    Likes Received:
    2,611
    GPU:
    3080TI iChill Black
    Last edited: Mar 1, 2016
  8. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    And I could claim I'm another source, but that doesn't work that way and for good reason.

    So, in the end, no matter the exact reason, you are basically agreeing with me that Maxwell doesn't seem to get as much of an uplift from DX12 as GCN does. My main point is that Maxwell is already maxed out under DX11 and won't see any great performance increases under DX12. I don't see us disagreeing on that.


    Look at the Do's and Don'ts guide for DX12 from NVIDIA themselves; it's quite interesting. On the first list there is the "admission" that the NVIDIA DX11 driver already handles multithreaded submission for developers.
    So basically they say that under DX11, their driver pretty much manages what a lower-level API would do.

    The ending quote is very interesting, and it's not for Kepler only, since this is their DX12 guide and they don't seem to be making any differentiation. In fact, at the end they state it applies to both Maxwell and Kepler.
    That kinda answers it about Async Compute right there. The benchmarks seem to indicate the same also.

    This is Kepler, and if my comment about its scheduler being more flexible than Maxwell's is correct, then Kepler might see bigger percentage increases over DX11 than Maxwell will.
    Furthermore, the image supplied is from Dolphin. Dolphin, being an emulator, has various restrictions, the main one being that it can't really use more than three threads for any kind of logic. A very good summary of the situation in Dolphin is given in this post, which is part of the Multithreading topic on Dolphin's forums.
    To get a quote from Dolphin's FAQ
    If I understand correctly, the DX12 renderer uses less CPU than the DX11 one, therefore helping in a CPU-bound situation. You pasted the pictures from the post of the guy who wrote the DX12 renderer, yet you didn't paste his own words, where he says that the extra performance is gained from the extra CPU time saved for emulation in graphics-intensive tasks, and not because there was any kind of amazing uplift in GPU performance.
    So, in the end, we are saying the exact same thing :)
    Maxwell won't see any great performance boosts from DX12, so people should stop being surprised by benchmarks showing small or zero performance improvements from DX12 on Nvidia hardware.
     
    Last edited: Mar 1, 2016
  9. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,872
    Likes Received:
    446
    GPU:
    RTX3080ti Founders
    Well, I finished my popcorn. Going to play my dx11 games now. Apparently, it's fun.
     
  10. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    Guys, the number of shader-group units is completely irrelevant. What matters is how the different jobs are scheduled and handled by the hardware. Having 31-32-63-64 compute groups means nothing.
     

  11. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
    Such elegance and grace.

    Did any of us argue that? (I honestly don't understand why you would refer to that).
     
  12. Deasnutz

    Deasnutz Guest

    Messages:
    174
    Likes Received:
    0
    GPU:
    Titan X 12GB
    Surely upcoming games like Hitman and Gears of War will have it enabled, right? I can't find confirmation other than from the red side.
     
  13. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    Mapping a ludicrous number of command queues to the number of shader group units (or whatever the IHV calls them) will not gain any performance but will result in overhead (both driver and hardware). Most of the time, using 1 graphics and 1 compute queue (2 compute queues if you have different priorities for different jobs) will deliver better performance. Same for copy queues. DX12 allows you to set queue execution priority too (actually there are only two values allowed: 0, aka normal priority, and 100, aka high priority), but that is just a hint to the driver.
    The driver and hardware should be able to handle and split the queues onto the architecture in the best way (except when the driver is buggy).
    Also, please note there is only 1 graphics queue allowed per device node, and as far as I am aware none of the current GPUs should be able to run multiple graphics queues in parallel per node (which is why there is only one graphics queue).
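    A minimal sketch of that queue setup, assuming a valid ID3D12Device* named `device` (the helper name is a placeholder; error handling and cleanup are omitted): one direct (graphics) queue at normal priority and one compute queue at high priority - the 0 / 100 values mentioned above map to D3D12_COMMAND_QUEUE_PRIORITY_NORMAL and D3D12_COMMAND_QUEUE_PRIORITY_HIGH.

        #include <windows.h>
        #include <d3d12.h>

        // One graphics queue and one compute queue; the compute queue gets the
        // high-priority hint (100). How the driver/hardware honours it is up to them.
        void CreateQueues(ID3D12Device* device,
                          ID3D12CommandQueue** gfxQueue,
                          ID3D12CommandQueue** computeQueue)
        {
            D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
            gfxDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
            gfxDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;   // == 0

            D3D12_COMMAND_QUEUE_DESC computeDesc = {};
            computeDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
            computeDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH; // == 100, just a hint

            device->CreateCommandQueue(&gfxDesc,     IID_PPV_ARGS(gfxQueue));
            device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(computeQueue));
        }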
     
    Last edited: Mar 1, 2016
  14. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    GM200 is just a beefier version of GM204; there is no serious feature difference like there was between GK10x and GK110.

    You really can't because you don't know what you're saying -)

    There are several reasons why DX12 may provide an uplift to a videocard compared to DX11.

    Concurrent async compute is the lesser one of them all, as it's the most complex to implement without issues and provides rather moderate performance increases even on GCN h/w (link). Maxwell doesn't support concurrent async compute as well as GCN does because Maxwell doesn't have the same issues in its graphics pipeline that GCN has. Because of the lack of those issues, Maxwell doesn't have as advanced a global scheduler as GCN - it simply doesn't need one.

    The second and most important reason for DX12 uplifts is the much more efficient CPU utilization under DX12 -- especially multicore CPU utilization. Here both vendors are enjoying more or less the same uplifts in CPU limited situations. The difference here lies in AMD being more CPU limited in DX11 on average because their DX11 driver is, well, crap. Hence there are titles - AotS being one of them - where NV is GPU limited in DX11 while AMD is CPU limited. In such cases you will see uplifts on AMD and won't see them on NV because NV's GPUs are already maxed out and they don't need more CPU power. In a case where both vendors will be CPU limited - like the Dolphin emulation benchmark I've provided above - you will see both vendors having a similar performance boost from switching to DX12. This is the biggest performance boost which DX12 provides over DX11.

    The third reason for performance increases in DX12 is the new h/w features introduced in DX12 h/w - specifically those found in FL12_1 feature level which AMD chips do not support at all at the moment. These features are mostly there for performance optimizations and I expect them to provide NV with more or less the same performance increases as AMD will get from their concurrent async compute implementation. Note that these features can be used to get a better performance in DX11/DX12/OGL/VK - all modern APIs, not just the newer ones.
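    For what it's worth, whether a given card actually exposes FL12_1 and its headline features (conservative rasterization, ROVs) can be checked at runtime. A minimal sketch, assuming a valid ID3D12Device* `device`; the function name is a placeholder and return codes are ignored for brevity:

        #include <windows.h>
        #include <d3d12.h>

        // Check the maximum supported feature level and the two FL12_1 headline features.
        void CheckFL121(ID3D12Device* device)
        {
            D3D_FEATURE_LEVEL requested[] = {
                D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1
            };
            D3D12_FEATURE_DATA_FEATURE_LEVELS levels = {};
            levels.NumFeatureLevels        = _countof(requested);
            levels.pFeatureLevelsRequested = requested;
            device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS, &levels, sizeof(levels));
            bool fl12_1 = levels.MaxSupportedFeatureLevel >= D3D_FEATURE_LEVEL_12_1;

            D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
            device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &opts, sizeof(opts));
            bool conservativeRaster = opts.ConservativeRasterizationTier
                                      != D3D12_CONSERVATIVE_RASTERIZATION_TIER_NOT_SUPPORTED;
            bool rovs = (opts.ROVsSupported == TRUE);
            (void)fl12_1; (void)conservativeRaster; (void)rovs; // use as needed
        }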

    Wrong again -)
    They are saying that the threading of requests from an app to the GPU, which is done by their DX11 driver, will have to be performed manually by the app developer in the app's code in DX12. The DX12 driver doesn't do any threading, and that's actually the reason it has such low CPU overhead - it just doesn't do a hell of a lot of the work that is being done in the DX11 driver.
    NV is warning developers that they can't rely on their driver to perform the threading in DX12 - that's all. That's by DX12 design, nothing unique to NV's h/w here.
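    A rough sketch of what that app-side threading looks like; RecordScenePart is a hypothetical stand-in for the actual draw-call recording, and device/queue creation, synchronization and resource cleanup are omitted:

        #include <windows.h>
        #include <d3d12.h>
        #include <thread>
        #include <vector>

        // Each worker thread records its own command list with its own allocator;
        // the DX12 driver will not spread this work across threads for you.
        void RecordAndSubmit(ID3D12Device* device, ID3D12CommandQueue* queue, int workerCount)
        {
            std::vector<ID3D12CommandAllocator*>    allocators(workerCount);
            std::vector<ID3D12GraphicsCommandList*> lists(workerCount);
            std::vector<std::thread>                workers;

            for (int i = 0; i < workerCount; ++i)
            {
                device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                               IID_PPV_ARGS(&allocators[i]));
                device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                          allocators[i], nullptr, IID_PPV_ARGS(&lists[i]));
                workers.emplace_back([i, &lists] {
                    // RecordScenePart(lists[i], i); // hypothetical per-slice recording
                    lists[i]->Close();
                });
            }
            for (auto& t : workers) t.join();

            // One submission of everything that was recorded in parallel.
            std::vector<ID3D12CommandList*> submit(lists.begin(), lists.end());
            queue->ExecuteCommandLists(static_cast<UINT>(submit.size()), submit.data());
        }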

    This quote specifically mentions the same command queue, which already means that it has nothing to do with async compute.
    As for the quote itself - a context switch of the pipeline is a costly operation and you should avoid it as much as possible on any h/w, GCN included.
    The main feature of concurrent async compute is that it happens without a context switch, because you have several command queues running in parallel - one with the graphics context and others with compute contexts - and the chip's multiprocessors can be fed with commands from both queues without actually switching the queue context.
    So while this quote relates to switches between graphics and compute contexts in general, it doesn't say much about concurrent async compute. What's even more interesting - you can actually run compute inside the graphics queue context without switching, which is what NV seems to have been doing with PhysX for ages.
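    To make that concrete, a minimal sketch of the two-queue submission; all the names are placeholders, the queues, fence and command lists are assumed to be valid and already recorded, and whether the hardware actually overlaps the work is up to the GPU/driver:

        #include <windows.h>
        #include <d3d12.h>

        void SubmitAsyncCompute(ID3D12CommandQueue* gfxQueue,
                                ID3D12CommandQueue* computeQueue,
                                ID3D12CommandList*  gfxList,
                                ID3D12CommandList*  computeList,
                                ID3D12Fence*        fence,
                                UINT64&             fenceValue)
        {
            // Compute runs on its own queue...
            ID3D12CommandList* computeLists[] = { computeList };
            computeQueue->ExecuteCommandLists(1, computeLists);
            computeQueue->Signal(fence, ++fenceValue);   // mark the compute pass as done

            // ...while independent graphics work is submitted to the direct queue.
            ID3D12CommandList* gfxLists[] = { gfxList };
            gfxQueue->ExecuteCommandLists(1, gfxLists);

            // Graphics work submitted after this point waits on the GPU timeline
            // (the CPU is not blocked) until the compute result is ready.
            gfxQueue->Wait(fence, fenceValue);
        }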

    Kepler doesn't support compute queues in parallel to graphics at all. Your comments on this are completely irrelevant.

    What difference does it make? If you're getting more CPU cycles in DX12 then you can spend these CPU cycles either in your application to speed up some CPU intensive stuff or in the DX12 driver to push more commands to the GPU. In both cases you will get a speed up if you were CPU limited. If you weren't - you won't, on any GPU, as you were GPU limited and this limit hasn't changed.

    Kinda. But I'm saying that it will see smaller improvements for a number of reasons (most of which are the same reasons Maxwell is so good in DX11 right now, which is actually a big plus), and you're saying it won't because it doesn't support concurrent async compute - which is wrong on many levels.

    Hitman most likely will, as it's an AMD Gaming Evolved title.
    Gears most likely won't, as it doesn't seem to run properly on AMD's current DX12 driver at all at the moment. I wouldn't expect a game which can't get basic rendering in DX12 right to use such a complex feature as async compute.
     
    Last edited: Mar 2, 2016
  15. narukun

    narukun Master Guru

    Messages:
    228
    Likes Received:
    24
    GPU:
    EVGA GTX 970 1561/7700
    Hey guys a random question here.

    I got my 970 OC'ed to 1561MHz. With that clock, does it mean I have 5,195 GFLOPS?

    core clock (MHz) * shader units * 2 / 1,000 = GFLOPS

    right?

    I'm asking because I'm not getting close in FPS to the R9 390 in The Division (cutscenes), and I'm supposedly getting the same level of GFLOPS. It has to do with the shaders, I guess?

    //R9 390 5,120 GFLOPS
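    A quick back-of-the-envelope check of that formula, assuming the stock GTX 970 shader count of 1664 (this is peak FP32 throughput on paper, nothing more):

        // Peak FP32 throughput: shaders * clock * 2 ops per clock (FMA).
        constexpr int    shaders   = 1664;    // GTX 970 CUDA cores
        constexpr int    clock_mhz = 1561;    // the overclock mentioned above
        constexpr double gflops    = shaders * clock_mhz * 2.0 / 1000.0;
        // gflops ≈ 5195 (about 5.2 TFLOPS), vs. ~5120 GFLOPS for a stock R9 390
        // (2560 shaders @ 1000 MHz) - near-identical on paper.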
     
    Last edited: Mar 8, 2016

  16. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,938
    Likes Received:
    1,047
    GPU:
    RTX 4090
    Flops are not the sole metric of performance. The R9 390 has twice the RAM and more memory bandwidth.
     
  17. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
  18. Denial

    Denial Ancient Guru

    Messages:
    14,207
    Likes Received:
    4,121
    GPU:
    EVGA RTX 3080
  19. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
    But it was actually a Fury - they edited the name of the video card from Fury to 390. So I think AMD is faster with the Fury?
     
    Last edited: Mar 10, 2016
  20. Undying

    Undying Ancient Guru

    Messages:
    25,502
    Likes Received:
    12,902
    GPU:
    XFX RX6800XT 16GB
    Edit the name? The 390 is performing great in that benchmark. When they put in a Fury it will probably even match the 980 Ti.

    It would be funny to see how the 970 stands vs the 390.
     
