Log in or Sign up

Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

Page 55 of 57

Darren Hodgson Ancient Guru

Messages:

17,222

Likes Received:

1,541

GPU:

NVIDIA RTX 4080 FE

I was shocked to see the ASync Compute option in Gears of War 4 but even if it does only add a handful of frames per second to the game it's still nice to have. Also, the game runs like a dream, looks great and is jammed packed full of every option you could ever need. The developers should be commended for this IMO as so many PC mutliplatform games seem like afterthoughts with barebones options and lacklustre performance/optimization.

Darren Hodgson, Oct 11, 2016

#1081
dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

Darren Hodgson said: ↑

I was shocked to see the ASync Compute option in Gears of War 4 but even if it does only add a handful of frames per second to the game it's still nice to have. Also, the game runs like a dream, looks great and is jammed packed full of every option you could ever need. The developers should be commended for this IMO as so many PC mutliplatform games seem like afterthoughts with barebones options and lacklustre performance/optimization.
Click to expand...

And the reason for this is, partially, at least: https://blogs.nvidia.com/blog/2016/10/07/dx12-gears-of-war-4/

It's actually kind of an interesting result with NV's title working good on all h/w again, just a month after AMD's title (DXMD) running like **** on NV's h/w even in DX11, let alone 12, don't you think?

dr_rus, Oct 11, 2016

#1082
Stormyandcold Ancient Guru

Messages:

5,872

Likes Received:

446

GPU:

RTX3080ti Founders

Well, Mafia 3 has finally appeared on Geforce.com, but, if you look at it's positioning, they're basically shuffling it away.

http://www.geforce.com/games-applications

Stormyandcold, Oct 11, 2016

#1083
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

dr_rus said: ↑

And the reason for this is, partially, at least: https://blogs.nvidia.com/blog/2016/10/07/dx12-gears-of-war-4/

It's actually kind of an interesting result with NV's title working good on all h/w again, just a month after AMD's title (DXMD) running like **** on NV's h/w even in DX11, let alone 12, don't you think?
Click to expand...

Mafia III is an NVIDIA title and runs like dogsh*t. Some would say that it's the developer that actually matters, and not the company sponsoring.

PrMinisterGR, Oct 11, 2016

#1084
dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

PrMinisterGR said: ↑

Mafia III is an NVIDIA title and runs like dogsh*t. Some would say that it's the developer that actually matters, and not the company sponsoring.
Click to expand...

It runs like dog **** on everything and it's actually running a bit better on AMD cards. It's also DX11 so I don't see how it's relevant to this thread. I think it's time you've noticed that there are Nvidia titles and Nvidia titles and some of them are actually just games which were tested by NV for compatibility and nothing more.

dr_rus, Oct 11, 2016

#1085
Denial Ancient Guru

Messages:

14,207

Likes Received:

4,121

GPU:

EVGA RTX 3080

angelgraves13 said: ↑

It has everything to do with Async Compute. Pre-emption is simply a dated and legacy method of dividing up workload for the GPU. It starts and stops work to do other compute work, whereas Async is able to do compute work without stopping, using idle resources.

If and when...should just say when at this point, Nvidia goes fully hardware based Async with Volta, we should expect to see massive gains in performance for games that support Async.
Click to expand...

http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10

Read pages 10 and 11.

Denial, Oct 12, 2016

#1086
dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

angelgraves13 said: ↑

It has everything to do with Async Compute. Pre-emption is simply a dated and legacy method of dividing up workload for the GPU. It starts and stops work to do other compute work, whereas Async is able to do compute work without stopping, using idle resources.

If and when...should just say when at this point, Nvidia goes fully hardware based Async with Volta, we should expect to see massive gains in performance for games that support Async.
Click to expand...

All of this is just completely wrong, starting with "dated and legacy preemption" and ending with "fully h/w based async". Please, read at least something from this thread before posting such bull****.

dr_rus, Oct 12, 2016

#1087
Stormyandcold Ancient Guru

Messages:

5,872

Likes Received:

446

GPU:

RTX3080ti Founders

PrMinisterGR said: ↑

Mafia III is an NVIDIA title and runs like dogsh*t. Some would say that it's the developer that actually matters, and not the company sponsoring.
Click to expand...

It looks like the deal's off. M3 wasn't a "way it's meant to be played" game as far as I'm aware. It must be so s**t that it has been removed from GeForce.com now. Nvidia must've known that it performed like a turd because M3 receive little to no advertising on Nvidia sites.

Even Nvidia's facebook only had 3 posts, 1 for system requirements and 2 posts about the game that was actually only shared links.

EDIT: I've asked both Nvidia and M3 facebook sites to confirm whether M3 is an Nvidia game or not, no answer yet. Both parties are silent on the issue.

Last edited: Oct 12, 2016

Stormyandcold, Oct 12, 2016

#1088
Redemption80 Guest

Messages:

18,491

Likes Received:

267

GPU:

GALAX 970/ASUS 970

Yeah, looks like it was tested by Nvidia purely to get settings for something like GFE and that was it.

The fact neither Nvidia or AMD were involved with Mafia 3 is probably why it's so bad.

I have to laugh at the post above that is still claiming async compute is magic and can bring huge performance increases from thin air.

Redemption80, Oct 12, 2016

#1089
Redemption80 Guest

Messages:

18,491

Likes Received:

267

GPU:

GALAX 970/ASUS 970

The technical side has already been explained in this thread.

Personally, I like my dumbed down, layman's way of looking at async compute.
It's pretty much just a way of making better/efficient use of underutilised GPU hardware.

Since AMD hardware is underutilised, async brings nice performance gains, but since Nvidia hardware is much better utilised/efficient it gains much less.

I've been informed that depending on the engine this may not be the case all the time, but I bet that it's the case in every game so far, and the idea that "hardware async" makes GPU's run at 110-120% is a strange one, and logic seems to be getting lost.

Last edited: Oct 12, 2016

Redemption80, Oct 12, 2016

#1090
Redemption80 Guest

Messages:

18,491

Likes Received:

267

GPU:

GALAX 970/ASUS 970

It helps on AMD hardware that has 5-15% of underutilised hardware, not all hardware has that breathing space/weakness.

Redemption80, Oct 13, 2016

#1091
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

Stormyandcold said: ↑

It looks like the deal's off. M3 wasn't a "way it's meant to be played" game as far as I'm aware. It must be so s**t that it has been removed from GeForce.com now. Nvidia must've known that it performed like a turd because M3 receive little to no advertising on Nvidia sites.

Even Nvidia's facebook only had 3 posts, 1 for system requirements and 2 posts about the game that was actually only shared links.

EDIT: I've asked both Nvidia and M3 facebook sites to confirm whether M3 is an Nvidia game or not, no answer yet. Both parties are silent on the issue.
Click to expand...

This only reinforces the opinion that neither AMD nor NVIDIA endorsements really matter, just what the developers do.

PrMinisterGR, Oct 13, 2016

#1092
Denial Ancient Guru

Messages:

14,207

Likes Received:

4,121

GPU:

EVGA RTX 3080

Yeah - in one of the interviews The Coalition dev said Epic had a big hand in bringing the game up on UE4 - which is most likely the reason why it runs so well. They probably optimized the hell out of it as it's the first really big AAA title on the engine.

Denial, Oct 13, 2016

#1093
Stormyandcold Ancient Guru

Messages:

5,872

Likes Received:

446

GPU:

RTX3080ti Founders

PrMinisterGR said: ↑

This only reinforces the opinion that neither AMD nor NVIDIA endorsements really matter, just what the developers do.
Click to expand...

There's AAA, then, there's wannabes.

There's also this article for GOW4; https://blogs.nvidia.com/blog/2016/10/07/dx12-gears-of-war-4/

I'm of the opinion there are games that do have hands on engineers from Vendors who do help work on games and make them run better. Then, there's games that just want the "branding" which is what I believe M3 is.

Stormyandcold, Oct 13, 2016

#1094
dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

angelgraves13 said: ↑

I've already read it and have no desire to read it again. Maybe you can just tell me what's wrong about my understanding of pre-emption.
Click to expand...

Again? I've said it several times already in this thread. Preemption is a technique which makes multitasking possible so unless you want everything to run in order you need it and because of that there's nothing "legacy" in it. Pascal's way of handling pre-emption on pixel/instruction level is actually the best way there is in GPUs at the moment, GCN is behind on this now.

As for pre-emption having nothing to do with async compute: async compute runs "concurrently" only when it runs on a dedicated SM partition which is precisely the way Pascal handles it (and GCN3+Polaris got the ability as well although it's kinda not really enabled in general I think). There's no pre-emption of any kind in play here as different contexts run on different execution units, this is very much like multicore CPU handling several contexts at the same time.

GCN's way of running compute wavefronts on the same CUs which run graphics at the same time is nice as it improves the overall utilization of CU execution units - but one thing to understand here is that the actual execution is still happening serially as a SIMD can't process two wavefronts per clock. So when a graphics wave is running - the compute wave is waiting for it's turn and vice versa. This execution is "concurrent" only in the scheduling part, not the actual processing of wavefronts.

As for NV h/w getting "massive gains in performance" when it will be able to run compute warps on the same SMs as graphics warps - this won't happen, you're looking at +5% to Pascal at best (much less is more likely even), because there's not much idle units in NV's SMs when running graphics, partially because there's a lot of scheduling level concurrency going on in NV's SM without any async compute already.

And async compute gains will go down with h/w progressing further, not up, both for AMD and NV. Both NV and AMD are handling async compute "in hardware", there is no other way of doing this.

dr_rus, Oct 13, 2016

#1095
dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

angelgraves13 said: ↑

Ok... I'm not a programmer, but it makes sense. AMD fans think Async is the be all and end all of DX12 graphics and Nvidia is behind because they don't have it. I've always just said that their gains in DX12 are mainly from the low-level API, which they've been praising for years from Mantle onward.

So, if Nvidia was to improve on their pre-emption further, where exactly would they go with it? My 1080 GTX is a very good card, but I'm curious how Volta will perform when it's out in 2018.
Click to expand...

AMD's gains in new APIs are mostly because AMD's drivers for old APIs sucks donkey balls. So instead of improving their driver they said "**** that, let the devs handle all this ****, we can't be bothered". Results are speaking for themselves really, with most devs being actually unable to provide as efficient solution as NV have in their drivers and thus falling behind in new APIs on NV h/w. Async compute general contribution to this is small, around +5% average even on AMD's h/w. Bulk gain is coming from better resource and CPU management than what AMD have in their driver.

NV can't improve pre-emption further because there's nowhere to improve it from Pascal. Volta will most likely add the ability to run compute warps on the same SMs which are already running graphics but in case of NV this is unlikely to lead to significant performance gains because the issue which AMD have in their GCN architecture which do lead to these gains is just absent from NV's h/w. Basically, there's not much to utilize with compute in NV's SM when it's already doing graphics. Some corner cases will certainly benefit -- most likely the same which already benefit on Pascal's async implementation though. So the overall gain over Pascal will probably be very small.

dr_rus, Oct 13, 2016

#1096
PrMinisterGR Ancient Guru

Messages:

8,129

Likes Received:

971

GPU:

Inno3D RTX 3090

dr_rus said: ↑

GCN's way of running compute wavefronts on the same CUs which run graphics at the same time is nice as it improves the overall utilization of CU execution units - but one thing to understand here is that the actual execution is still happening serially as a SIMD can't process two wavefronts per clock. So when a graphics wave is running - the compute wave is waiting for it's turn and vice versa. This execution is "concurrent" only in the scheduling part, not the actual processing of wavefronts.
Click to expand...

So you mean that there is concurrency in the CU level, and not in the SIMD level, right? A GCN CU has four SIMDs in it, so it could basically run four different things at once, per clock. A GCN SIMD has to finish its current task, to grab another task, so there is no concurrency in the SIMD level.

On Pascal the concurrency exists in the GPU level, as each SM can do one thing until you switch it (which you actually can outside of draw call boundaries in Pascal), right? So GCN offers more fine grained thread/job control, but it pays the price for that by having more ALUs that stay idle.

There is also a nice perspective in from Anandtech's Ryan Smith:

Ryan Smith said:

Meanwhile, because this is a question that I’m frequently asked, I will make a very high level comparison to AMD. Ever since the transition to unified shader architectures, AMD has always favored higher ALU counts; Fiji had more ALUs than GM200, mainstream Polaris 10 has nearly as many ALUs as high-end GP104, etc. All other things held equal, this means there are more chances for execution bubbles in AMD’s architectures, and consequently more opportunities to exploit concurrency via async compute. We’re still very early into the Pascal era – the first game supporting async on Pascal, Rise of the Tomb Raider, was just patched in last week – but on the whole I don’t expect NVIDIA to benefit from async by as much as we’ve seen AMD benefit. At least not with well-written code.[...]
Finally, getting back to the subject of dynamic scheduling, I’ve spent some time mulling over what’s probably the obvious question: if dynamic scheduling is so great, why didn’t NVIDIA do this sooner? It’s not a question I have an answer to, but I strongly suspect it’s another one of those tradeoffs that’s rooted in balancing costs and benefits. Dynamic scheduling requires a greater management of hazards that simply weren’t an issue with static scheduling, as now you need to handle everything involved with suddenly switching an SM to a different queue. Meanwhile NVIDIA more than likely paid a die space penalty for implementing dynamic scheduling. GPUs continually sit on the fence between being an ultra-fast staticly scheduled array of ALUs and an ultra-flexible somewhat smaller array of ALUs, and GPU vendors get to sit in the middle trying to figure out which side to lean towards in order to deliver the best performance for workloads that are 2-5 years down the line. It is, if you’ll pardon the pun, a careful balancing act for everyone involved.
Click to expand...

PrMinisterGR, Oct 13, 2016

#1097
pharma Ancient Guru

Messages:

2,496

Likes Received:

1,197

GPU:

Asus Strix GTX 1080

One GCN CU can have up to 40 waves running concurrently (10 per SIMD). It doesn't matter where each wave has originated. There can be any mix of pixel/vertex/geometry/hull/domain/compute shader waves in flight at the same time (from any amount of queues). Instructions from these 40 waves are scheduled to the CU SIMDs in a round robin manner. If some of these 40 waves is waiting for memory, the GPU simply jumps over it in the round robin scheduling.

There is no need to store wave's data in off-chip memory. Each CU has enough on-chip storage for the metadata of these waves. Waves are grouped as thread groups. A single thread group needs to execute on a single CU (this is true for all GPUs, including Intel and Nvidia). This is because threads in the same thread group can use barrier synchronization and share data through LDS (64 KB on-chip buffer on each CU). All GPU architectures use static register allocation. The maximum count of registers used during a shader life time (even if some branch was never taken) needs to be allocated for the wave. Simplified: The GPU scheduler keeps track of available resources (free registers, free LDS, free waves) on each CU. When there's enough registers on some CU for a new thread group (thread group = 1 to 16 waves), the scheduler spawns a thread group for that CU. Each GCN CU has 256 KB of registers, 40 wave slots and 64 KB of LDS. There is no need to context swap kernels (*). Each thread group is guaranteed to finish execution once started. GPU programming model doesn't support thread groups waiting for other thread groups (atomics are supported, but the programmer is not allowed to write spinning locks). Nvidia GPUs work similarly. Intel has per wave (they call them threads) register files. Their register allocation works completely differently (shader compiler outputs different SIMD widths based on register count).

(*) Exceptional cases might need context switch (= store GPU state to memory and restore later). Normal flow of execution doesn't.
Click to expand...

https://forum.beyond3d.com/posts/1939976/

pharma, Oct 14, 2016

#1098
dr_rus Ancient Guru

Messages:

3,930

Likes Received:

1,044

GPU:

RTX 4090

PrMinisterGR said: ↑

So you mean that there is concurrency in the CU level, and not in the SIMD level, right? A GCN CU has four SIMDs in it, so it could basically run four different things at once, per clock. A GCN SIMD has to finish its current task, to grab another task, so there is no concurrency in the SIMD level.

On Pascal the concurrency exists in the GPU level, as each SM can do one thing until you switch it (which you actually can outside of draw call boundaries in Pascal), right? So GCN offers more fine grained thread/job control, but it pays the price for that by having more ALUs that stay idle.

There is also a nice perspective in from Anandtech's Ryan Smith:
Click to expand...

Yes, you could say that GCN's "granularity" of assigning execution units to different contexts is higher than Pascal's. But this wouldn't really be an advantage for NV's h/w as NV's h/w doesn't have as much (state change) idle bubbles in graphics execution on the SMs and because of this there's little gain in assigning (stateless) compute warps to the same SMs as graphics. This "granularity" choice is a conscious one, in both architectures. What results in a possible performance gain for GCN would not result in the same gain on Pascal.

The biggest plus of GCN's approach for NV's h/w is the generalization of execution scheduling which should make it simpler for a driver+h/w combo to run mixed contexts workloads resulting in, potentially, simpler driver and more universally robust h/w execution. But it's unlikely that GCN's approach will actually bring performance improvements for NV's architecture outside of some corner cases. I don't expect that from Volta, and I'm pretty sure that if Volta will bring some significant performance gains over Pascal - it certainly won't happen because of async compute or the ability to run different contexts on the same SM. This is mostly a convenience feature, not performance one.

dr_rus, Oct 14, 2016

#1099
stevevnicks Guest

Messages:

1,440

Likes Received:

11

GPU:

Don't need one

Maybe the site should be changed to guru geeks, the average gamer cares more about playing the game and not worry about the rest, still i guess it's just away to make the geeks feel they know best all the time lol, if the geeks spent as much time playing their games as they did benchamrking and gloating, they wouldnt be a issues about what dose AS best.

stevevnicks, Oct 14, 2016

#1100

(You must log in or sign up to reply here.)

Page 55 of 57

Share This Page