AMD: “There’s no such thing as ‘full support’ for DX12 today”

Discussion in 'Frontpage news' started by (.)(.), Sep 1, 2015.

  1. Denial

    Denial Ancient Guru

    Messages:
    14,207
    Likes Received:
    4,121
    GPU:
    EVGA RTX 3080
    I'm familiar with Beyond3D. I'm just pointing out that there is a huge argument going on in that thread about how to interpret the results of that program. So posting the reddit thread here and saying anything definitive about the results is misleading.
     
  2. Barry J

    Barry J Ancient Guru

    Messages:
    2,803
    Likes Received:
    152
    GPU:
    RTX2080 TRIO Super
I agree, but it is interesting.
     
  3. BedantP

    BedantP Guest

    Messages:
    220
    Likes Received:
    0
    GPU:
    1660Ti
I bought a 960 because of DX12. Damn, I could've gotten a refurb 290 or a new 280X.
No real difference now, is there?
And now the AMD cards will get even more powerful because of better async support and lower latency; consoles will get a boost too. #PCMR will end soon.
     
  4. Chillin

    Chillin Ancient Guru

    Messages:
    6,814
    Likes Received:
    1
    GPU:
    -
Does any of this even matter?

Seriously, you guys are at each other's necks over HYPOTHETICAL DETAILS! FFS, even if Nvidia did support every goddamn thing in DX12(_1) and AMD only supported the most basic things, it wouldn't really matter much today. These architectures are literally a year old at this point (even older if you count that they are revisions), and will be two years old by the time some real DX12 games come around; not to mention how much longer they were in development before the final DX12 spec. If you bought a year-old architecture known to have varying levels of support right now, just to play hypothetical games a year from now with future features, then you are throwing away your money.

Pascal is what you should be screaming at if it doesn't support nearly everything, and the same goes for whatever AMD's next architecture is.
     

  5. Turanis

    Turanis Guest

    Messages:
    1,779
    Likes Received:
    489
    GPU:
    Gigabyte RX500
Pascal is like a ghost; nobody knows anything. And HBM2 is still in AMD's and Hynix's hands.
Until it comes to market we still have the old-"gen" DX11.2/DX12_0 cards, which are still good.
     
  6. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
That is because they run a lot of tests for compute alone, from 1 to 128 queue depth.
Then there is just ONE value for graphics rendering time, so one should not overlook it.
And finally a 1~128 test with graphics + compute together.

The proper display for that data is three graphs:
1st: 1~128 for compute only (lowest values; the best-case scenario if async shaders were magical)
2nd: the theoretical 1~128 result obtained by taking the graphics rendering time and adding the values from the 1st graph (highest values; the worst-case scenario where execution is not parallel at all)
3rd: the 1~128 values for compute + graphics running at once (the real-world result, comparable against the best/worst cases)

And I would add a 4th graph showing the percent difference for each 1~128 value, i.e. how far each result moved from the worst case toward the best case (see the sketch below).
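Here is a minimal sketch of how those four graphs relate, in C++ with made-up placeholder timings (my illustration of the math, not the actual benchmark code):

[code]
// Graph 1: compute alone per queue depth (best case, if overlap were perfect)
// Graph 2: compute alone + the single graphics-only time (worst case, serial)
// Graph 3: the measured combined run (reality, somewhere in between)
// Graph 4: how far the real result moved from worst case toward best case
#include <cstdio>

int main() {
    const int kDepths = 128;
    double graphicsOnly = 25.0;  // measured once: graphics alone, in ms
    for (int i = 0; i < kDepths; ++i) {
        double computeOnly = 10.0 + i * 0.05;  // placeholder measurement
        double combined    = 28.0 + i * 0.05;  // placeholder measurement
        double best  = computeOnly;                 // graph 1
        double worst = computeOnly + graphicsOnly;  // graph 2
        double real  = combined;                    // graph 3
        double pct   = (worst - real) / (worst - best) * 100.0;  // graph 4
        printf("depth %3d: best %.1f worst %.1f real %.1f -> %.1f%%\n",
               i + 1, best, worst, real, pct);
    }
}
[/code]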
     
  7. degazmatic

    degazmatic Guest

    Messages:
    1
    Likes Received:
    0
    GPU:
    HD 7xxx
    send them to me.
     
  8. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,924
    Likes Received:
    1,040
    GPU:
    RTX 4090
AMD's in full-scale damage control mode, it seems.

While he's technically right that supporting the full stack of DX12 features is close to impossible (some of them are actually designed for different h/w architectures, so you would need a chip that combines several of those architectures to support them all), it doesn't mean that one GPU can't support fewer DX12 features than another GPU.

So, for example, a Fermi-based GPU supports the DX12 runtime but doesn't support many features above the FL11_0 level, while Maxwell 2 supports the same runtime with the features listed in FL12_1. Does this mean they're the same because they both support the DX12 runtime?

AMD's FUD is getting tiresome to read.

This is completely wrong.
A. FL12_1 is made mostly of features which should help with performance. Not supporting FL12_1 means you'll lose more performance trying to render the same effects.
B. Asynchronous shaders aren't universally needed for performance; it depends on the architecture in question. Your architecture may be able to run the code just fine without async shaders, or with them handled in a less efficient way. It's still up for discussion how much performance async shaders will even bring to PC h/w; most estimates come from consoles, which are quite different from PCs in both the h/w and the s/w they run.
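The runtime-vs-feature-level distinction is easy to see in code. A minimal C++ sketch using the standard D3D12 API (the structure of the example is mine; both a Fermi-class and a Maxwell-2-class card would pass the device creation, and only the feature-level query tells them apart):

[code]
#include <windows.h>
#include <d3d12.h>
#include <cstdio>
#pragma comment(lib, "d3d12.lib")

int main() {
    // Any "DX12 card" can create a device on the DX12 runtime at FL11_0.
    ID3D12Device* device = nullptr;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device))))
        return 1;

    // What the hardware actually supports is a separate question.
    const D3D_FEATURE_LEVEL levels[] = {
        D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1,
    };
    D3D12_FEATURE_DATA_FEATURE_LEVELS info = {};
    info.NumFeatureLevels = 4;
    info.pFeatureLevelsRequested = levels;
    device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS,
                                &info, sizeof(info));
    printf("Max supported feature level: 0x%x\n",
           info.MaxSupportedFeatureLevel);
    device->Release();
}
[/code]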
     
    Last edited: Sep 1, 2015
  9. warezme

    warezme Master Guru

    Messages:
    237
    Likes Received:
    37
    GPU:
    Evga 970GTX Classified
Lack of full DX12 is a calculated move

If you include enough features to claim DX12 support, you sell more of the "new" GPU. If you leave enough out, you can sell future full-support DX12 GPUs. Win/win. Both sides know this. Surprise!... not really.
     
  10. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
One way to display it properly:
The blue bars show how big a portion (in percent) of the graphics rendering could be squeezed in between compute operations.
One should note several things:
- Compute code is for some reason executed much faster on nV hardware (leaving less room to squeeze in graphics processing); on the other hand, it had room to squeeze compute in between rendering (where space was left)
- I used an improvement scale from 0 to 100%, but in reality nV results range from negative to positive due to statistical error between runs
- The real average improvement across the entire 1~128 execution is:
- > GTX 960: 1.96%
- > GTX 980Ti: 9.81%
- > R9-390X: 92.51%
- > Fury X: 73.45% (seems like some driver/OS/other HW problem)
- nVidia's execution time for compute goes up with batch size (logical, since it is more work)
- AMD's execution time stays practically the same over the entire range, and is way too high (50~52ms) for a workload which even a GTX 960 can process in 10 to 40ms
- > it seems AMD has some static driver overhead for compute, which is why there is so much room left for rendering
- the Fury X results are not mine, since I have no access to the files on the beyond3d forums

GTX 960: [graph]
GTX 980Ti: [graph]
R9-390X: [graph]
Fury X: [graph]

Edit: To make it very easy to understand, the blue background shows what percentage of the rendering time has been absorbed by free time slots in between compute tasks.
     
    Last edited: Sep 1, 2015

  11. BedantP

    BedantP Guest

    Messages:
    220
    Likes Received:
    0
    GPU:
    1660Ti
    lol, that Fury X graph doe.
    @Chillin, you are right and wrong. Why don't we get what we *paid* for? Here, I did not waste my money.
NVIDIA's box art includes that big-****ing-font-size "DIRECTX 12", but the card does not have all the features, as mentioned.
It's how you'd feel if you got fewer potato chips in your packet.
     
  12. Dazz

    Dazz Maha Guru

    Messages:
    1,010
    Likes Received:
    131
    GPU:
    ASUS STRIX RTX 2080
Although it's hard to say; no one can say for sure how the program really works except its creator. From what I can make of the thread, NV's work is done serially, with the CPU feeding it data one batch after another, hence the smaller delay in between, since there is less delay for the CPU to look up and pull the data to feed the GPU (it's cached along the way, one batch to the next), while AMD is waiting for data to be fed by the CPU, hence the longer delay; people are saying it's executing two commands at a time (in parallel). People there are also saying there should be no variance in the delays if it's working correctly, yet with nVidia it's 10~40ms and with AMD it's a static 50ms.

    https://www.reddit.com/r/pcmasterra...e_all_jump_to_conclusions_and_crucify/cumlmwv

    Also from the original thread there is an explanation of it.

    A lower percentage is better. If it's at or near 100% it means it's doing it pretty much serially, no benefit from asynchronously running them together.
    tl;dr: OP missed the point. Maxwell is good at compute, that wasn't the point. Maxwell just cannot benefit from doing compute + graphics asynchronously. GCN can.
    Extra point: all of the NVidia cards show a linear increase in time when you increase the number of compute kernels, stepping up every 32 kernels since Maxwell has 32 thread blocks. The 980Ti took 10ms~ for 1-31 kernels, 21ms~ for 32-63 kernels, 32ms~ for 64-95 kernels, and 44ms~ for 96-127 kernels.
    The Fury X took 49ms~ for all 1...128 kernel runs, didn't even budge. It looks like the 49ms is some kind of fixed system overhead and we haven't even seen it being strained by the compute calls at all yet.
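The step pattern described in that quote is easy to model. A minimal C++ sketch (my illustrative numbers, assuming roughly 11ms per group of 32 kernels on Maxwell and a flat ~49ms on Fury X, as quoted above):

[code]
#include <cstdio>

int main() {
    for (int kernels = 1; kernels <= 128; kernels += 31) {
        // Maxwell-like: time steps up once per 32-kernel group.
        double stepped = (kernels / 32 + 1) * 11.0;
        // Fury-X-like: a fixed cost regardless of kernel count.
        double flat = 49.0;
        printf("%3d kernels: stepped ~%2.0f ms, flat ~%2.0f ms\n",
               kernels, stepped, flat);
    }
}
[/code]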
     
    Last edited: Sep 1, 2015
  13. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
The delays for compute are explained incorrectly. nVidia's 'delays' (properly named: time to finish the compute batch) rise gradually because the batch is more complex and therefore takes longer to finish.
For AMD, meanwhile, it is cake, and it is finished almost instantly once dispatched. And that dispatch takes around:
Fury X: 49.66ms to finish the compute task
R9-390x: 52.28ms to finish the compute task
The 2.62ms difference comes from the performance jump between the R9-390x and the Fury X. The Fury X is 31.25% stronger in compute, so 2.62ms is that share of the entire calculation.
The entire calculation time for the Fury X is 2.62ms / 0.3125 = 8.384ms.
And that makes the stable delay, from giving the order via software to getting through the API & driver into the GPU, 49.66 - 8.384 = 41.276ms (see the sketch below).
Therefore in this test, where graphics rendering takes 25~27ms, there is no problem processing it in between compute tasks.
Which, as usual, shows that AMD has room to improve... what a faux pas again.

And btw, rendering time on the 980Ti is 17.88ms, and on the 960 it is 41.8ms. 17.88ms vs 25ms is again quite a big difference.
That means this 'benchmark' sides with AMD in showing async shader functionality, but then sides with nV in total performance (AMD's usual overhead).

Edit: In my graphs the blue background shows how much rendering you could squeeze in between compute tasks, so a higher percentage = better!!!
The guy in the link uses the simplest way to calculate the difference, so you are right that lower = better there, but calculated his way, the scale can never reach 0%.
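The overhead estimate above as explicit arithmetic, in a minimal C++ sketch (a reconstruction of the post's reasoning; the 31.25% compute advantage and the assumption that both cards share the same fixed overhead are the post's own premises):

[code]
#include <cstdio>

int main() {
    double furyTotal = 49.66;   // ms, Fury X compute batch
    double r390Total = 52.28;   // ms, R9-390x compute batch
    double advantage = 0.3125;  // Fury X assumed 31.25% stronger in compute

    // If the fixed overhead is identical on both cards, the whole
    // difference comes from the actual compute work.
    double diff        = r390Total - furyTotal;   // 2.62 ms
    double furyCompute = diff / advantage;        // 8.384 ms of real work
    double overhead    = furyTotal - furyCompute; // 41.276 ms fixed cost
    printf("real compute: %.3f ms, fixed overhead: %.3f ms\n",
           furyCompute, overhead);
}
[/code]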
     
    Last edited: Sep 1, 2015
  14. Dazz

    Dazz Maha Guru

    Messages:
    1,010
    Likes Received:
    131
    GPU:
    ASUS STRIX RTX 2080
Which is right, as the graphics part will always take longer than the compute, so it's correct that it will never reach 0%. But as the guy who created it said, it's not a benchmark; it's only meant to test the functionality.
     
  15. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
Apparently I used the scenario where compute absorbs rendering, because this test is made that way, but you can easily do the reverse math and say that rendering absorbs compute in between draw calls.

You can again get from 0% on nV to 100% on AMD if a proper workload is used.
Saying it in reverse: async shaders allowed the R9-390x to absorb (on average) 48.75% of the compute workload into the rendering space,
while the GTX 980Ti could only absorb 5.61% of the compute task in between rendering calls.
But again, since AMD took so long to finish anyway, there were surely plenty of opportunities to schedule other kinds of workload in between.

Btw, does someone have a link to the benchmark download? I see another thing I would like to test myself.
     

  16. tsunami231

    tsunami231 Ancient Guru

    Messages:
    14,748
    Likes Received:
    1,868
    GPU:
    EVGA 1070Ti Black
We already knew this, or at least I already knew it; I would think most people who keep up with this stuff know there is no full support for DX12 on any card yet.
     
  17. Enticles

    Enticles Guest

    Messages:
    242
    Likes Received:
    10
    GPU:
    Asus RTX 3070ti
this made me lol.

Both companies are equally guilty of misleading their customers; it's just that this time AMD appears to be more honest about it. The next scandal will have people hating on AMD for something or other (if I had to guess, it would be AMD's performance claims getting debunked - again).

My point is, twisting the truth, and in some cases outright bull****ing, is rife in the technology marketplace. So let's all relax about what nvidia did or didn't do, because it'll be AMD's turn to mess up next! :)
     
  18. DmitryKo

    DmitryKo Master Guru

    Messages:
    446
    Likes Received:
    159
    GPU:
    ASRock RX 7800 XT
You are mixing things up. Each of the individual optional features, such as resource binding, tiled resources, conservative rasterization etc., has its own separate tiers with a well-defined set of requirements at each tier.

When you see "tier 1" and "tier 3" in the table, those are the tiers for these individual optional features - not for the Direct3D 12 API in general.

https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#matrix

All DX12 cards support Resource Binding at tier 1 and Resource Heap at tier 1; feature level 12_0 additionally requires Resource Binding at tier 2, and feature level 12_1 requires Conservative Rasterization at tier 1.

Yes, GCN supports Resource Binding tier 3, and so does Skylake integrated graphics.
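You can query those per-feature tiers directly. A minimal C++ sketch using the standard D3D12 options structure (the printout format is mine):

[code]
#include <windows.h>
#include <d3d12.h>
#include <cstdio>
#pragma comment(lib, "d3d12.lib")

int main() {
    ID3D12Device* dev = nullptr;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&dev))))
        return 1;

    // Tiers are reported per optional feature, not for "DX12" as a whole.
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    dev->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                             &opts, sizeof(opts));
    printf("Resource Binding tier:           %d\n", opts.ResourceBindingTier);
    printf("Tiled Resources tier:            %d\n", opts.TiledResourcesTier);
    printf("Conservative Rasterization tier: %d\n",
           opts.ConservativeRasterizationTier);
    dev->Release();
}
[/code]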


    It's simply an earlier version of the table.

No, there were actually major changes for Skylake (feature level 12_1, Resource Binding at tier 3, Tiled Resources at tier 3, Conservative Rasterization at tier 3, PS reference value), as well as a small change for Maxwell-1 (it now supports Typed UAV loads for additional formats with the latest drivers).
    https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#matrix

I removed features from the table because they are not optional in Direct3D 12 anymore.

For example:
1) "UAVs at every stage" is supported on all DX12-capable hardware (this feature was tied to the UAV slot count of 64 in Direct3D 11.2, but in Direct3D 12 these "slots" are just memory pointers and you can use them at every pipeline stage);
2) a maximum sample count of 16 for UAV-only rendering is supported by all DX12 hardware;
3) cross-node sharing is only exposed in multi-GPU configurations, and current drivers don't seem to support it, so there is no way to test the tiers supported by actual hardware;
4) "async shaders" is not an optional capability in Direct3D 12 - it's an internal hardware feature that can be exposed by the WDDM driver, [post="5094415"]as I explained in an earlier thread.[/post]

I did not actually remove the logical blend operations cap bit. Also, I didn't bother to add a few smaller cap bits from the MSDN docs.

No, someone kept adding "async shaders" to this table when [post="5094415"]there is no such optional capability in Direct3D.[/post]
     
    Last edited: Sep 1, 2015
  19. Denial

    Denial Ancient Guru

    Messages:
    14,207
    Likes Received:
    4,121
    GPU:
    EVGA RTX 3080
Since you obviously seem to know a bit about Direct3D and whatnot, could you shed some light on what Oxide is claiming about Nvidia's async stuff, or about AoS's implementation of DX12 in general?
     
  20. DmitryKo

    DmitryKo Master Guru

    Messages:
    446
    Likes Received:
    159
    GPU:
    ASRock RX 7800 XT
Anyone who is not a graphics driver developer has very little chance of getting the right answer to that question.

If I had free time to investigate, I would need to learn the internals of DXGK first, which is quite a bit lower-level than the main Direct3D API, and unfortunately the MSDN documentation for WDDM 2.0 driver development is far from complete as of now.
     
