Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. pharma

    pharma Ancient Guru

    Messages:
    2,496
    Likes Received:
    1,197
    GPU:
    Asus Strix GTX 1080
Any Maxwell 2 chips you can still buy are simply inventory being cleared out, according to what Nvidia said a month ago. Fury is part of the new AMD lineup and, based on the Nvidia presentation, will not be a competitive part.
     
  2. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
It will have async support, as any DX12 chip on the planet does.

It will have concurrent async compute support as well. Why they didn't mention it during the unveil is a good question - NV's marketing is all kinds of crap lately - but I think it just shows how little they care about that "rrrrevolutionary" feature, which gives some +10% of performance even on Radeons and will likely give even less than that on Pascal.

Here's a nice post from sebbbi explaining some of the stuff around the whole async compute debacle. It also kinda explains how AMD can slow down competitors' GPUs even with one queue, without any async compute, i.e. in DX11 (and this is basically what's happening to Kepler / Maxwell because of console code optimizations in some recent titles).
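
For context: at the API level, "async compute" in DX12 is just a second command queue of type COMPUTE submitted alongside the usual DIRECT (graphics) queue - whether anything actually runs concurrently is entirely up to the driver and the h/w. A minimal C++/D3D12 sketch (names are illustrative, error handling omitted):

[code]
// Creating the graphics (DIRECT) and compute queues in D3D12.
// 'device' is assumed to be a valid ID3D12Device*.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;       // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
[/code]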
     
    Last edited: May 7, 2016
  3. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,132
    Likes Received:
    974
    GPU:
    Inno3D RTX 3090
Let's hope it will; that would bring some standardization and will probably make these chips much more viable for a longer period. I'm looking at that 1070 quite intently.
     
  4. fellix

    fellix Master Guru

    Messages:
    252
    Likes Received:
    87
    GPU:
    MSI RTX 4080
    Looks like Pascal is able to preempt individual SMs at run-time and switch between graphics and compute tasks. On Maxwell, the SM task assignment is static for the duration of the current batch.
     

  5. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
I'm still pretty puzzled why they decided to disclose this during today's closed-off press event instead of stating it from the stage yesterday.

It's also kinda interesting that, as far as I can tell, they don't actually enable this on Maxwell in the current drivers, even on a per-application basis.
     
  6. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
Decided to run these tests again on a fresh Windows install.

Very interesting results.

I'll quote myself here and post results from the launch of AotS.

Currently running the same system (only slightly different RAM settings) on W10 & 365.10; the game version may have been updated.

I tested at 1300/7000.

I tested async on vs. async off.

I'll leave you guys to guess which is which.

[benchmark screenshots: async on vs. async off at 1300/7000]
[benchmark screenshot: 1490/8000]
     
    Last edited: May 7, 2016
  7. netkas

    netkas Guest

    Messages:
    55
    Likes Received:
    0
    GPU:
    Many
From my point of view, the whole async compute story is about one thing: increasing the GPU's silicon utilization by the driver/app. As AMD has stated many times, their GPUs are not fully loaded during many stages of the rendering pipeline. Nvidia had the same issue; they just used a different approach to work around it.

AMD decided to add ACEs to its GPUs to process additional work on the CUs in parallel with graphics, but this approach needed support from the 3D API and applications. Initially this worked out on the new-gen consoles.

Nvidia decided to do it a different, easier way: make the "low silicon load" stages run faster. Higher clocks. And when the GPU is not reaching its power budget (i.e. not under a 100% stress load), let it overclock itself to finish those tasks faster, so it spends less time in situations where only a small part of the silicon is doing the job. Hello GPU Boost 1/2/3.

AMD won, thanks to consoles and Mantle (giving birth to DX12/Vulkan), and now Nvidia has to catch up.
     
    Last edited: May 9, 2016
  8. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
A bunch of PR-flavored bull**** again. You don't need ACEs to launch anything asynchronously, and NV GPUs do this just fine -- SMs have been able to launch different jobs whenever some h/w is available since prehistoric times.

What really happened is this: NV has been working on improving the utilization of their h/w since the introduction of G80 and has reached nearly peak levels in Maxwell (and probably Pascal, but that remains to be seen). There's nothing easy about this task, and it's the more proper way because it reaps results in all applications straight away, without the need for any support on the s/w side.

AMD's graphics utilization hasn't really improved since Tahiti, where it was already quite a bit worse than on Kepler (which was the third generation of NV's architecture). So since they obviously can't (or don't want to?) improve the graphics part of their GPUs, they've decided to use the ability they do have to their advantage: they can slot compute kernels into the same SM that is running graphics. This way an SM that is partially or fully idling in graphics, due to the inefficiency of their architecture, can fill that bubble with a compute kernel.

This is a worse approach because a) it is completely dependent on s/w support - not just game code but APIs as well - and because of this it will only help in games that use these APIs and this ability; it can also make those games run worse on other h/w. b) Its performance is very unpredictable, not only on all the AMD h/w available right now but on any future h/w from AMD or any other vendor as well.

The ability to run different contexts on the same SM is a good one, and it's something NV h/w still lacks and will have to get at some point. But using this ability to speed up the GPU because of how badly it handles a single graphics context is a backwards solution to the problem. A good GPU with this ability wouldn't get performance boosts from such context interleaving, as it would already be able to run at maximum capacity in one context.

So the situation right now is this:
- AMD's h/w sucks in the graphics context but is able to execute any contexts on the same SM at any given time. If they improve their graphics engine, they won't gain any performance from running compute in parallel anymore.
- NV's h/w is almost always at 100% utilization in graphics or compute, but it can't run both on the same SMs at any given time. If they improve their SM execution abilities, they won't take a performance hit when some compute warp needs to run in the middle of a graphics shader.

My own stance here is that NV's approach is the better one and AMD's approach will fade away in time, but right now, with current console h/w being the main target for engine optimization, it will affect NV's (and Intel's, btw) h/w performance negatively - although not by much, as there are ways of getting around this with clever driver / h/w scheduling. Pascal's improvements should help a bit as well.
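
To make the "based on s/w support" point concrete: the game itself has to split work onto the compute queue and join it back with a fence before any dependent pass. A rough C++/D3D12 sketch of the submission pattern (the pass names and the single fence value are purely illustrative):

[code]
// Submitting independent passes to two queues and synchronizing on the GPU
// timeline. The driver *may* overlap the first two submissions; on h/w that
// can't run graphics and compute on the same SMs it can simply serialize them.
#include <d3d12.h>

void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList*  gbufferPass,       // independent graphics work
                 ID3D12CommandList*  asyncComputePass,  // e.g. light culling / SSAO
                 ID3D12CommandList*  lightingPass,      // consumes the compute result
                 ID3D12Fence*        fence,
                 UINT64              fenceValue)
{
    // These two submissions have no dependency on each other.
    gfxQueue->ExecuteCommandLists(1, &gbufferPass);
    computeQueue->ExecuteCommandLists(1, &asyncComputePass);
    computeQueue->Signal(fence, fenceValue);

    // The dependent pass waits on the GPU timeline, not on the CPU.
    gfxQueue->Wait(fence, fenceValue);
    gfxQueue->ExecuteCommandLists(1, &lightingPass);
}
[/code]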

    Man, can you PLEASE reduce that image in your post? Horizontal scroll isn't something I'm used to on my 30" display.
     
    Last edited: May 9, 2016
  9. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
Async on 1490/8000:
[benchmark screenshot]

Async off:
[benchmark screenshot]
     
  10. TheRyuu

    TheRyuu Guest

    Messages:
    105
    Likes Received:
    0
    GPU:
    EVGA GTX 1080
They actually disclosed this during the GP100 reveal back at the beginning of April [1]. Although it's not specifically referring to the ability to preempt individual SMs, the wording (sort of) implies it's something new over Maxwell (which didn't have this functionality):
This doesn't mention anything about async, only pure compute tasks, but it is still a new addition over Maxwell, since you can't preempt at the SM level on Maxwell.

CUDA does use this ability (or at least what I think you're referring to). The on-board ARM chip manages how CUDA work is dispatched, although obviously individual SMs cannot be preempted on Maxwell.
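
For what it's worth, the closest an application gets to steering this from the CUDA side is stream priorities - how aggressively a higher-priority stream can displace work that is already running (thread block boundaries on Maxwell, instruction-level with Pascal's compute preemption, per NVIDIA's GP100 material) is up to the h/w and driver. A small C++ sketch against the CUDA runtime API, purely as an illustration:

[code]
#include <cuda_runtime.h>

// Create one low-priority and one high-priority stream. Work launched into
// the high-priority stream is scheduled ahead of pending work in the
// low-priority one; whether it can also interrupt work that is already
// executing depends on the GPU generation.
void CreatePrioritizedStreams(cudaStream_t& lowPrio, cudaStream_t& highPrio)
{
    int leastPrio = 0, greatestPrio = 0;  // note: numerically lower = higher priority
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPrio);
}
[/code]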

    [1] https://techreport.com/news/29946/pascal-makes-its-debut-on-nvidia-tesla-p100-hpc-card
     

  11. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
I'm not so sure this is even related to the topic, as pre-emption means that some warp goes into execution with high priority before the previous one finishes. It's basically the opposite of running something concurrently on the same SM.

It can also be unrelated for two reasons: a) it may only hold for execution within one context. You can pre-empt a workload if it's from the same context - cool for compute, running some debug stuff and async timewarp, but pretty useless in graphics otherwise. b) There is no indication at the moment that the feature is present outside of the GP100 chip, which has its own CUDA compute capability compared to the rest of the GP10x line.

This has nothing to do with CUDA and/or the ARM chip in the Tesla P100 system. The chip there is most likely needed because GP100 can't work without a host CPU running the OS and managing system memory.

What I'm talking about are the details of the concurrent compute implementation provided by NV during the Pascal launch. They're comparing Pascal's improvements in this area to Maxwell's capabilities, but AFAIK Maxwell isn't actually using this capability, as it's disabled in the drivers right now.
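
To keep the two ideas apart: concurrency is the separate COMPUTE queue discussed earlier, while pre-emption is roughly what an app is hinting at when it asks for a high-priority queue (the async timewarp case). A tiny C++/D3D12 sketch just to show where that distinction lives in the API - the helper name is made up:

[code]
#include <d3d12.h>

// A DIRECT queue with a HIGH priority hint. The priority says nothing about
// concurrency; it only tells the scheduler that this queue's work should get
// ahead of (and, if the h/w can, pre-empt) normal-priority work.
HRESULT CreateHighPriorityQueue(ID3D12Device* device, ID3D12CommandQueue** queue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    return device->CreateCommandQueue(&desc, IID_PPV_ARGS(queue));
}
[/code]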
     
  12. Ieldra

    Ieldra Banned

    Messages:
    3,490
    Likes Received:
    0
    GPU:
    GTX 980Ti G1 1500/8000
He's talking about the GMU on Maxwell; it's an ARM microcontroller.
     
  13. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
Low score. Is your CPU at stock?
     
  14. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
    Highly doubtful. Why would they use a CPU for a dedicated h/w engine?
     
  15. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc

  16. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
That's because those 980Ti results are from a heavily OC'd 980Ti card (>1500 MHz). The 1080 results in the AotS database seem to be from early engineering samples as well.

Wait for proper benchmarks, ffs. These "leaks" are all kinds of confusing and misleading.
     
  17. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
What are the clock speeds? I can't read them in this article. Do you have pictures or other proof?
     
  18. Denial

    Denial Ancient Guru

    Messages:
    14,207
    Likes Received:
    4,121
    GPU:
    EVGA RTX 3080
  19. Keesberenburg

    Keesberenburg Master Guru

    Messages:
    886
    Likes Received:
    45
    GPU:
    EVGA GTX 980 TI sc
That's not proof. It's just speculation from people here.
     
  20. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,940
    Likes Received:
    1,048
    GPU:
    RTX 4090
The 980Ti results used in that post are the top 980Ti results in the AotS benchmark database. You can be damn sure they're from a heavily OC'd 980Ti card. And why are you asking us for details of the information presented in that post? Go ask its author.

That's actually proof enough, as it shows that even with a 1525 MHz core / 3950 MHz memory OC, a 980Ti isn't hitting the 50 fps number used in the mobipicker.com post.
     
