Asynchronous Compute

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Carfax, Feb 25, 2016.

  1. HeavyHemi

    HeavyHemi Ancient Guru

    Messages:
    6,954
    Likes Received:
    959
    GPU:
    GTX1080Ti
    Yeah, it's an old story: you keep getting corrected, and apparently you're not really among those who 'know how it works'.
     
  2. Undying

    Undying Ancient Guru

    Messages:
    16,834
    Likes Received:
    5,715
    GPU:
    Aorus RX580 XTR 8GB
    Changing from SMAA to TSSAA brings an extra 15 fps, and that's with only 2 ACEs on my card.
     
  3. siriq

    siriq Master Guru

    Messages:
    790
    Likes Received:
    14
    GPU:
    Evga GTX 570 Classified
    You're mixing something up here. There is a general code path to follow with DX12, just as with DX11. AMD and NVIDIA don't share the same capabilities in the same areas. So whichever one falls outside the directive (which can matter because of feature levels and so on) has to do some extra work and optimization to achieve the same or near-identical performance. That lies in the hardware, not just the software. Each is better in one scenario or another, but one of them is better in most. You can't tweak the driver the way the companies did with DX11 before; it won't work. It can help, but not the same way as before.


    We've ended up with the hardware limitation as a factor here. However, the Vulkan API gives companies much more freedom than DX12. It is a lot easier to implement the same thing on both vendors' GPUs with similar performance. Hardware-level limitations will still show up, but you can implement Vulkan fairly easily even on Fermi, with a relative boost from day one.

    One more thing: DOOM and Deus Ex have not received full optimization on AMD hardware. Vulkan support is still in beta testing; the next step should come this fall, for Deus Ex as well. Still a work in progress.
     
    Last edited: Sep 10, 2016
  4. siriq

    siriq Master Guru

    Messages:
    790
    Likes Received:
    14
    GPU:
    Evga GTX 570 Classified
    I haven't been corrected by anyone with valid info so far :D
     

  5. Redemption80

    Redemption80 Ancient Guru

    Messages:
    18,495
    Likes Received:
    266
    GPU:
    GALAX 970/ASUS 970
    Yeah, Vulkan is even worse than DX12.

    It only looks good because of AMD's failure with OGL, but you have to hand it to them for fooling people into turning their OGL issues into a win; many (stupid) people even believe that async compute is responsible for the performance increase over OGL.
     
  6. siriq

    siriq Master Guru

    Messages:
    790
    Likes Received:
    14
    GPU:
    Evga GTX 570 Classified
    I vote you as a next major api developer! No question about it.
     
  7. Redemption80

    Redemption80 Ancient Guru

    Messages:
    18,495
    Likes Received:
    266
    GPU:
    GALAX 970/ASUS 970
    I would rather be next major api BS filterererer....
     
  8. aufkrawall2

    aufkrawall2 Maha Guru

    Messages:
    1,131
    Likes Received:
    239
    GPU:
    3060 TUF
    Compare no AA vs. FXAA.
    It was ~10% difference on my 390.
     
  9. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,900
    Likes Received:
    773
    GPU:
    Inno3D RTX 3090
    Same here.

    Don't even compare Deus Ex to DOOM. DOOM even uses GCN shader intrinsics. Deus Ex is a beta port that the developers themselves said had a bug affecting a lot of GPUs; it has yet to prove itself a proper port. DOOM is the absolute best, most complete example to date of what AMD hardware can do.

    In what sense is Vulkan worse than DX12? AMD's OGL driver has never really been great, but OGL has always been one big, hot mess, and one of the primary reasons for the mess were custom extensions by vendors.

    On paper it had all the credentials to be a universal API, but every vendor (NVIDIA is more at fault here, but AMD was doing it too) created their own extensions, up until the point that the API lost its purpose, which was to be a generalized platform. DirectX has always been a much "cleaner" API with much better support and documentation, and in recent years it has also been ahead of OpenGL in spec, despite said extensions.

    The "OpenGL Next" project had been in the papers for years, yet everything was completely stuck. The idea behind OpenGL would have been impossible to resurrect without AMD gifting Mantle to Khronos (whose president is an NVIDIA guy, btw). One of the concessions paid to NVIDIA for accepting an AMD API was that third-party extensions could be used. I sincerely hope the same sh*t doesn't sink the boat a second time, but I can't see NVIDIA allowing both major APIs to be run against their architecture.

    The way that DX12 divides the pipeline (Graphics/Compute/Copy) is a software image of GCN's internal design (Graphics Scheduler/ACEs/DMA Engines). The same goes for the multiengine requirements, which let multiple programs run on the GPU in a way that is very natural and easy for programmers, but require a GPU with fast context switching, the only one being GCN.
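    As a rough illustration of that queue split, here is a toy Python model of three independent queues with a fence enforcing cross-queue ordering. All names here are illustrative stand-ins, not the real D3D12 API; it only sketches the scheduling idea.

```python
class Fence:
    """Monotonic counter used for cross-queue synchronization."""
    def __init__(self):
        self.value = 0

    def signal(self, v):
        self.value = max(self.value, v)

    def reached(self, v):
        return self.value >= v


class Queue:
    """Toy stand-in for one engine: graphics, compute, or copy."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, work, wait_fence=None, wait_value=0):
        # On real hardware the queue would stall here until the
        # fence is signaled; in this toy model we just check it.
        if wait_fence is not None:
            assert wait_fence.reached(wait_value), "GPU would stall here"
        self.log.append(work)


copy_q = Queue("copy")
compute_q = Queue("compute")
gfx_q = Queue("graphics")
fence = Fence()

copy_q.execute("upload vertex data")   # DMA engine works independently
fence.signal(1)                        # copy queue signals completion
compute_q.execute("cull instances")    # async compute: no dependency
gfx_q.execute("draw scene", wait_fence=fence, wait_value=1)

print(gfx_q.log)  # ['draw scene']
```

    The point is that a queue only waits where a fence says so; everything else is free to overlap, which is exactly what "async compute" exploits.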

    NVIDIA is getting monstrous raw performance out of their current design, which is a very smart and elegant one, but it lacks that context switching, which is so intrinsic to the architecture that it's impossible to "bolt on". NVIDIA has the upper hand due to the crazy frequencies, the basically limitless front end of the card, and their tile-based renderer. My feeling is that AMD is closing the front-end gap more and more (the RX 480 has almost triple the front-end performance of the 380X, closing in on Maxwell), and that NVIDIA's ceiling at 2GHz is a hard stop, similar to Intel's at 4.5-4.8GHz on the desktop. NVIDIA will have to change their scheduling and make it more "GCN"-like; there is no way around that, especially after the PS4 Pro and Scorpio just cemented AMD in the console space for more than a decade (and possibly longer if they keep the architectures steady and follow a mobile-like path with frequent hardware updates that keep backwards compatibility).

    TL;DR: Vulkan was the only way to get out of the clutches of Microsoft, OpenGL never stood a chance. I hope NVIDIA won't f*ck it up with custom extensions, and NVIDIA will have to change their architecture to survive, eventually.

    Here is a brief history of the whole DirectX vs OpenGL thing, for anyone who might care to read.

    Many of the answers here are really, really good. But the OpenGL and Direct3D (D3D) issue should probably be addressed. And that requires... a history lesson.

    And before we begin, I know far more about OpenGL than I do about Direct3D. I've never written a line of D3D code in my life, and I've written tutorials on OpenGL. So what I'm about to say isn't a question of bias. It is simply a matter of history.

    Birth of Conflict

    One day, sometime in the early 90's, Microsoft looked around. They saw the SNES and Sega Genesis being awesome, running lots of action games and such. And they saw DOS. Developers coded DOS games like console games: direct to the metal. Unlike consoles however, where a developer who made an SNES game knew what hardware the user would have, DOS developers had to write for multiple possible configurations. And this is rather harder than it sounds.

    And Microsoft had a bigger problem: Windows. See, Windows wanted to own the hardware, unlike DOS which pretty much let developers do whatever. Owning the hardware is necessary in order to have cooperation between applications. Cooperation is exactly what game developers hate because it takes up precious hardware resources they could be using to be awesome.
    In order to promote game development on Windows, Microsoft needed a uniform API that was low-level, ran on Windows without being slowed down by it, and most of all cross-hardware. A single API for all graphics, sound, and input hardware.

    Thus, DirectX was born.

    3D accelerators were born a few months later. And Microsoft ran into a spot of trouble. See, DirectDraw, the graphics component of DirectX, only dealt with 2D graphics: allocating graphics memory and doing bit-blits between different allocated sections of memory.

    So Microsoft purchased a bit of middleware and fashioned it into Direct3D Version 3. It was universally reviled. And with good reason; looking at D3D v3 code is like staring into the Ark of the Covenant.

    Old John Carmack at Id Software took one look at that trash and said, "Screw that!" and decided to write towards another API: OpenGL.
    See, another part of the many-headed-beast that is Microsoft had been busy working with SGI on an OpenGL implementation for Windows. The idea here was to court developers of typical GL applications: workstation apps. CAD tools, modelling, that sort of thing. Games were the farthest thing on their mind. This was primarily a Windows NT thing, but Microsoft decided to add it to Win95 too.

    As a way to entice workstation developers to Windows, Microsoft decided to try to bribe them with access to these newfangled 3D graphics cards. Microsoft implemented the Installable Client Driver protocol: a graphics card maker could override Microsoft's software OpenGL implementation with a hardware-based one. Code could automatically just use a hardware OpenGL implementation if one was available.

    In the early days, consumer-level videocards did not have support for OpenGL though. That didn't stop Carmack from just porting Quake to OpenGL (GLQuake) on his SGI workstation, as the GLQuake readme describes.
    This was the birth of the miniGL drivers. These evolved into full OpenGL implementations eventually, as hardware became powerful enough to implement most OpenGL functionality in hardware. nVidia was the first to offer a full OpenGL implementation. Many other vendors struggled, which is one reason why developers preferred Direct3D: they were compatible on a wider range of hardware. Eventually only nVidia and ATI (now AMD) remained, and both had a good OpenGL implementation.

    OpenGL Ascendant

    Thus the stage is set: Direct3D vs. OpenGL. It's really an amazing story, considering how bad D3D v3 was.

    The OpenGL Architectural Review Board (ARB) is the organization responsible for maintaining OpenGL. They issue a number of extensions, maintain the extension repository, and create new versions of the API. The ARB is a committee made of many of the graphics industry players, as well as some OS makers. Apple and Microsoft have at various times been a member of the ARB.

    3Dfx comes out with the Voodoo2. This is the first hardware that can do multitexturing, which is something that OpenGL couldn't do before. While 3Dfx was strongly against OpenGL, NVIDIA, makers of the next multitexturing graphics chip (the TNT1), loved it. So the ARB issued an extension: GL_ARB_multitexture, which would allow access to multitexturing.
    Meanwhile, Direct3D v5 comes out. Now, D3D has become an actual API, rather than something a cat might vomit up. The problem? No multitexturing.

    Oops.

    Now, that one wouldn't hurt nearly as much as it should have, because people didn't use multitexturing much. Not directly. Multitexturing hurt performance quite a bit, and in many cases it wasn't worth it compared to multi-passing. And of course, game developers love to ensure that their games works on older hardware, which didn't have multitexturing, so many games shipped without it.
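    The multi-pass fallback worked because a multitexture "modulate" combine and a multiplicative blend of a second pass compute the same pixel. A tiny Python sketch of the arithmetic (the function and variable names are illustrative, not GL calls):

```python
# One pass with a GL_ARB_multitexture-style MODULATE combine vs.
# two passes where the lightmap is blended multiplicatively over
# the base pass (glBlendFunc(GL_DST_COLOR, GL_ZERO) style).

def modulate(a, b):
    """Per-channel multiply: the classic lightmap combine."""
    return a * b

base, lightmap = 0.8, 0.5            # one color channel of each texture

one_pass = modulate(base, lightmap)  # multitexture hardware path

framebuffer = base                   # pass 1: draw the base texture
framebuffer *= lightmap              # pass 2: blend the lightmap over it

assert abs(one_pass - framebuffer) < 1e-9
print(one_pass)  # 0.4
```

    Same result, but twice the draw calls and fill rate for the multi-pass path, which is why shipping without multitexturing was still tolerable.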

    D3D was thus given a reprieve.

    Time passes and NVIDIA deploys the GeForce 256 (not GeForce GT-250; the very first GeForce), pretty much ending competition in graphics cards for the next two years. The main selling point is the ability to do vertex transform and lighting (T&L) in hardware. Not only that, NVIDIA loved OpenGL so much that their T&L engine effectively was OpenGL. Almost literally; as I understand, some of their registers actually took OpenGL enumerators directly as values.

    Direct3D v6 comes out. Multitexture at last but... no hardware T&L. OpenGL had always had a T&L pipeline, even though before the 256 it was implemented in software. So it was very easy for NVIDIA to just convert their software implementation to a hardware solution. It wouldn't be until D3D v7 until D3D finally had hardware T&L support.

    Dawn of Shaders, Twilight of OpenGL

    Then, GeForce 3 came out. And a lot of things happened at the same time.
    Microsoft had decided that they weren't going to be late again. So instead of looking at what NVIDIA was doing and then copying it after the fact, they took the astonishing position of going to them and talking to them. And then they fell in love and had a little console together.

    A messy divorce ensued later. But that's for another time.

    What this meant for the PC was that GeForce 3 came out simultaneously with D3D v8. And it's not hard to see how GeForce 3 influenced D3D 8's shaders. The pixel shaders of Shader Model 1.0 were extremely specific to NVIDIA's hardware. There was no attempt made whatsoever at abstracting NVIDIA's hardware; SM 1.0 was just whatever the GeForce 3 did.

    When ATI started to jump into the performance graphics card race with the Radeon 8500, there was a problem. The 8500's pixel processing pipeline was more powerful than NVIDIA's stuff. So Microsoft issued Shader Model 1.1, which basically was "Whatever the 8500 does."

    That may sound like a failure on D3D's part. But failure and success are matters of degrees. And epic failure was happening in OpenGL-land.
    NVIDIA loved OpenGL, so when GeForce 3 hit, they released a slew of OpenGL extensions. Proprietary OpenGL extensions: NVIDIA-only. Naturally, when the 8500 showed up, it couldn't use any of them.

    See, at least in D3D 8 land, you could run your SM 1.0 shaders on ATI hardware. Sure, you had to write new shaders to take advantage of the 8500's coolness, but at least your code worked.

    In order to have shaders of any kind on Radeon 8500 in OpenGL, ATI had to write a number of OpenGL extensions. Proprietary OpenGL extensions: ATI-only. So you needed an NVIDIA codepath and an ATI codepath, just to have shaders at all.

    Now, you might ask, "Where was the OpenGL ARB, whose job it was to keep OpenGL current?" Where many committees often end up: off being stupid.
    See, I mentioned ARB_multitexture above because it factors deeply into all of this. The ARB seemed (from an outsider's perspective) to want to avoid the idea of shaders altogether. They figured that if they slapped enough configurability onto the fixed-function pipeline, they could equal the ability of a shader pipeline.

    So the ARB released extension after extension. Every extension with the words "texture_env" in it was yet another attempt to patch this aging design. Check the registry: between ARB and EXT extensions, there were eight of these extensions made. Many were promoted to OpenGL core versions.
    Microsoft was a part of the ARB at this time; they left around the time D3D 9 hit. So it is entirely possible that they were working to sabotage OpenGL in some way. I personally doubt this theory for two reasons. One, they would have had to get help from other ARB members to do that, since each member only gets one vote. And most importantly two, the ARB didn't need Microsoft's help to screw things up. We'll see further evidence of that.
    Eventually the ARB, likely under threat from both ATI and NVIDIA (both active members), pulled their head out long enough to provide actual assembly-style shaders.

    Want something even stupider?

    Hardware T&L. Something OpenGL had first. Well, it's interesting. To get the maximum possible performance from hardware T&L, you need to store your vertex data on the GPU. After all, it's the GPU that actually wants to use your vertex data.

    In D3D v7, Microsoft introduced the concept of Vertex Buffers. These are allocated swaths of GPU memory for storing vertex data.

    Want to know when OpenGL got their equivalent of this? Oh, NVIDIA, being a lover of all things OpenGL (so long as they are proprietary NVIDIA extensions), released the vertex array range extension when the GeForce 256 first hit. But when did the ARB decide to provide similar functionality?
    Two years later. This was after they approved vertex and fragment shaders (pixel in D3D language). That's how long it took the ARB to develop a cross-platform solution for storing vertex data in GPU memory. Again, something that hardware T&L needs to achieve maximum performance.

    One Language to Ruin Them All

    So, the OpenGL development environment was fractured for a time. No cross-hardware shaders, no cross-hardware GPU vertex storage, while D3D users enjoyed both. Could it get worse?

    You... you could say that. Enter 3D Labs.

    Who are they, you might ask? They are a defunct company whom I consider to be the true killers of OpenGL. Sure, the ARB's general ineptness made OpenGL vulnerable when it should have been owning D3D. But 3D Labs is perhaps the single biggest reason to my mind for OpenGL's current market state. What could they have possibly done to cause that?

    They designed the OpenGL Shading Language.

    See, 3D Labs was a dying company. Their expensive GPUs were being marginalized by NVIDIA's increasing pressure on the workstation market. And unlike NVIDIA, 3D Labs did not have any presence in the mainstream market; if NVIDIA won, they died.

    Which they did.

    So, in a bid to remain relevant in a world that didn't want their products, 3D Labs showed up to a Game Developer Conference wielding presentations for something they called "OpenGL 2.0". This would be a complete, from-scratch rewrite of the OpenGL API. And that makes sense; there was a lot of cruft in OpenGL's API at the time (note: that cruft still exists). Just look at how texture loading and binding work; it's semi-arcane.

    Part of their proposal was a shading language. Naturally. However, unlike the current cross-platform ARB extensions, their shading language was "high-level" (C is high-level for a shading language. Yes, really).

    Now, Microsoft was working on their own high-level shading language. Which they, in all of Microsoft's collective imagination, called... the High Level Shading Language (HLSL). But there was a fundamental difference in approach between the two languages.

    The biggest issue with 3D Labs's shader language was that it was built-in. See, HLSL was a language Microsoft defined. They released a compiler for it, and it generated Shader Model 2.0 (or later shader models) assembly code, which you would feed into D3D. In the D3D v9 days, HLSL was never touched by D3D directly. It was a nice abstraction, but it was purely optional. And a developer always had the opportunity to go behind the compiler and tweak the output for maximum performance.

    The 3D Labs language had none of that. You gave the driver the C-like language, and it produced a shader. End of story. Not an assembly shader, not something you feed something else. The actual OpenGL object representing a shader.

    What this meant is that OpenGL users were open to the vagaries of driver developers who were just getting the hang of compiling assembly-like languages. Compiler bugs ran rampant in the newly christened OpenGL Shading Language (GLSL). What's worse, if you managed to get a shader to compile correctly on multiple platforms (no mean feat), you were still subjected to the optimizers of the day. Which were not as optimal as they could be.

    While that was the biggest flaw in GLSL, it wasn't the only flaw. By far.
    In D3D, and in the older assembly languages in OpenGL, you could mix and match vertex and fragment (pixel) shaders. So long as they communicated with the same interface, you could use any vertex shader with any compatible fragment shader. And there were even levels of incompatibility they could accept; a vertex shader could write an output that the fragment shader didn't read. And so forth.

    GLSL didn't have any of that. Vertex and fragment shaders were fused together into what 3D Labs called a "program object". So if you wanted to share vertex and fragment programs, you had to build multiple program objects. And this caused the second biggest problem.

    See, 3D Labs thought they were being clever. They based GLSL's compilation model on C/C++. You take a .c or .cpp and compile it into an object file. Then you take one or more object files and link them into a program. So that's how GLSL compiles: you compile your shader (vertex or fragment) into a shader object. Then you put those shader objects in a program object, and link them together to form your actual program.

    While this did allow potential cool ideas like having "library" shaders that contained extra code that the main shaders could call, what it meant in practice was that shaders were compiled twice. Once in the compilation stage and once in the linking stage. NVIDIA's compiler in particular was known for basically running the compile twice. It didn't generate some kind of object code intermediary; it just compiled it once and threw away the answer, then compiled it again at link time.

    So even if you want to link your vertex shader to two different fragment shaders, you have to do a lot more compiling than in D3D. Especially since in D3D the compiling of the C-like language was all done offline, not at the beginning of the program's execution.
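    The compile/link model above can be sketched in a few lines of Python. This is a toy mock, not the GL API; the recompile-at-link behavior mirrors the driver behavior described above.

```python
class ShaderObject:
    """Stands in for a GL shader object (vertex or fragment)."""
    def __init__(self, stage, source):
        self.stage, self.source = stage, source
        self.compile_count = 0

    def compile(self):
        # First compile; some drivers effectively threw this away.
        self.compile_count += 1


class ProgramObject:
    """Stands in for a GL program object linking several stages."""
    def __init__(self, *shaders):
        self.shaders = shaders

    def link(self):
        # Cross-stage optimization needs every stage at once, so
        # drivers like the one described above compiled again here.
        for s in self.shaders:
            s.compile_count += 1


vs = ShaderObject("vertex", "void main() {}")
fs_a = ShaderObject("fragment", "void main() {}")
fs_b = ShaderObject("fragment", "void main() {}")
for s in (vs, fs_a, fs_b):
    s.compile()

# Sharing one vertex shader across two fragment shaders requires
# two program objects, so the vertex shader gets processed again
# in each link:
ProgramObject(vs, fs_a).link()
ProgramObject(vs, fs_b).link()
print(vs.compile_count)  # 3: one compile plus two links
```

    In D3D, by contrast, the already-compiled vertex shader would simply be paired with either pixel shader, with no re-linking at all.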

    There were other issues with GLSL. Perhaps it seems wrong to lay the blame on 3D Labs, since the ARB did eventually approve and incorporate the language (but nothing else of their "OpenGL 2.0" initiative). But it was their idea.

    And here's the really sad part: 3D Labs was right (mostly). GLSL is not a vector-based shading language the way HLSL was at the time. This was because 3D Labs's hardware was scalar hardware (similar to modern NVIDIA hardware), but they were ultimately right in the direction many hardware makers went with their hardware.

    They were right to go with a compile-online model for a "high-level" language. D3D even switched to that eventually.

    The problem was that 3D Labs were right at the wrong time. And in trying to summon the future too early, in trying to be future-proof, they cast aside the present. It sounds similar to how OpenGL always had the possibility for T&L functionality. Except that OpenGL's T&L pipeline was still useful before hardware T&L, while GLSL was a liability before the world caught up to it.
    GLSL is a good language now. But for the time? It was horrible. And OpenGL suffered for it.

    Falling Towards Apotheosis

    While I maintain that 3D Labs struck the fatal blow, it was the ARB itself who would drive the last nail in the coffin.

    This is a story you may have heard of. By the time of OpenGL 2.1, OpenGL was running into a problem. It had a lot of legacy cruft. The API wasn't easy to use anymore. There were 5 ways to do things, and no idea which was the fastest. You could "learn" OpenGL with simple tutorials, but you didn't really learn the OpenGL API that gave you real performance and graphical power.
    So the ARB decided to attempt another re-invention of OpenGL. This was similar to 3D Labs's "OpenGL 2.0", but better because the ARB was behind it. They called it "Longs Peak."

    What is so bad about taking some time to improve the API? This was bad because Microsoft had left themselves vulnerable. See, this was at the time of the Vista switchover.

    With Vista, Microsoft decided to institute some much-needed changes in display drivers. They forced drivers to submit to the OS for graphics memory virtualization and various other things.

    While one can debate the merits of this or whether it was actually possible, the fact remains this: Microsoft deemed D3D 10 to be Vista (and above) only. Even if you had hardware that was capable of D3D 10, you couldn't run D3D 10 applications without also running Vista.

    You might also remember that Vista... um, let's just say that it didn't work out well. So you had an underperforming OS, a new API that only ran on that OS, and a fresh generation of hardware that needed that API and OS to do anything more than be faster than the previous generation.

    However, developers could access D3D 10-class features via OpenGL. Well, they could if the ARB hadn't been busy working on Longs Peak.

    Basically, the ARB spent a good year and a half to two years worth of work to make the API better. By the time OpenGL 3.0 actually came out, Vista adoption was up, Win7 was around the corner to put Vista behind them, and most game developers didn't care about D3D-10 class features anyway. After all, D3D 10 hardware ran D3D 9 applications just fine. And with the rise of PC-to-console ports (or PC developers jumping ship to console development. Take your pick), developers didn't need D3D 10 class features.

    Now, if developers had access to those features earlier via OpenGL on WinXP machines, then OpenGL development might have received a much-needed shot in the arm. But the ARB missed their opportunity. And do you want to know the worst part?

    Despite spending two precious years attempting to rebuild the API from scratch... they still failed and just reverted back to the status quo (except for a deprecation mechanism).

    So not only did the ARB miss a crucial window of opportunity, they didn't even get done the task that made them miss that chance. Pretty much epic fail all around.

    And that's the tale of OpenGL vs. Direct3D. A tale of missed opportunities, gross stupidity, willful blindness, and simple foolishness.
     
    Last edited: Sep 10, 2016
  10. somemadcaaant

    somemadcaaant Master Guru

    Messages:
    392
    Likes Received:
    19
    GPU:
    Red Devil 6900XT LE
    Yeah, I read most and skipped some, though a lot of fanboyism detected, I have to say.

    See @PrMinisterGR's post; that's basically what everyone is currently seeing and thinking.

    More so, just wanting to see what Micro$oft, AMD and Nvidia have been doing with all their time and money in R&D for DX12. In the BF1 beta it was disabled, either figuratively or literally. Not to mention all games hitting the Micro$oft store must be DX12 enabled.

    Exactly. It's all hypothetical and juicy hearsay.

    Yeah that's it.
     

  11. siriq

    siriq Master Guru

    Messages:
    790
    Likes Received:
    14
    GPU:
    Evga GTX 570 Classified
    Finally, someone who's got knowledge about these things.

    One thing: the Vulkan API has already got vendor-specific extensions. Nvidia just announced theirs a week or so ago; AMD did it before. If I remember correctly, the Nvidia extensions relate to memory management.
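    For context, applications have to opt in to such extensions explicitly at device creation, which is why they fragment code paths. Here is a toy Python sketch of the gating logic; the extension names are real Vulkan identifiers (including an NV memory-management one of the kind mentioned above), but everything else is an illustrative stand-in, not the Vulkan API.

```python
# Extensions the (hypothetical) driver reports as available.
available = {"VK_KHR_swapchain", "VK_NV_dedicated_allocation"}

def pick_extensions(wanted, available):
    """Enable an extension only when the driver reports it;
    otherwise the app must fall back to the portable path."""
    return [ext for ext in wanted if ext in available]

enabled = pick_extensions(
    ["VK_KHR_swapchain",            # cross-vendor KHR extension
     "VK_NV_dedicated_allocation",  # NVIDIA-only memory extension
     "VK_AMD_shader_info"],         # AMD-only, absent on this driver
    available,
)
print(enabled)  # ['VK_KHR_swapchain', 'VK_NV_dedicated_allocation']
```

    Every vendor-only branch like this is a separate code path the application has to write and test, which is exactly the fragmentation being argued about in this thread.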
     
  12. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,704
    Likes Received:
    363
    GPU:
    MSI GTX1070 GamingX
    This is a crock of **** tbh. Really, Nvidia will f*ck up Vulkan with custom extensions? Do you even follow Vulkan development? Why don't you talk about all the custom AMD extensions? Oh no, it can't be AMD, because it's an open standard, right? No: when you add extensions that only work on AMD, that is a custom extension.

    Nvidia should add extensions that benefit their hardware to the Vulkan spec as much as possible.

    Mantle failed because it required Nvidia and Intel to adopt the GCN architecture. Vulkan started with a spec that worked on all vendors' hardware, stripping Mantle of its AMD-only features. Slowly, these features are making their way back into Vulkan, but unlike Mantle, where Nvidia had no say, this time Nvidia can also add their Nvidia-only features. To disallow this and only let AMD "steer the ship" would ultimately lead to Vulkan failing as well, and it's not going to be that way. Nvidia would only have to drop support for Vulkan on PC and it would be a dead API on PC overnight.

    The "one-size-fits-all" mentality is a pipe dream. The reality is that by doing this, one or more vendors' hardware will be held back. No one wants that. If everyone adopted GCN it would only stifle competition and limit choice.

    IMHO, Nvidia are more than capable of forging their own path. There's absolutely nothing stopping them from developing features to be added to the DX12 and Vulkan specs, and they should and will continue to do so.

    As we have seen on PC, if those features and techniques are good enough, then, they will be used on PC, regardless of what the consoles are doing.
     
  13. -Tj-

    -Tj- Ancient Guru

    Messages:
    17,268
    Likes Received:
    1,986
    GPU:
    Zotac GTX980Ti OC
    Just FYI, Nvidia has had context switching in CUDA since the Fermi days.

    It's there for CUDA compute: PhysX, FleX, cloth, etc., among other things.
     
    Last edited: Sep 10, 2016
  14. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,900
    Likes Received:
    773
    GPU:
    Inno3D RTX 3090
    If you read my post, I clearly say that custom extensions will once more be the downfall of the alternative-to-DirectX API. NVIDIA and AMD are doing the same, which is equally bad. NVIDIA overdid it with OpenGL.

    No. And neither should AMD. They should leave the spec to Khronos where it belongs. This is folly and idiocy, and it's exactly what happened with OpenGL. Read the large post to understand where I'm coming from.

    Mantle was a proof of concept, not a generalized high-level graphics API. Mantle was gifted to the Khronos Foundation to be the basis for the next OpenGL. Neither AMD nor NVIDIA should f*ck it up with proprietary extensions.

    There are specific coding paradigms that work best with specific architectures. No one said that NVIDIA has to copy GCN, just that they should do context switching with predictable latencies like GCN can. All the rest they have is quite amazing, actually; they could literally make the ultimate GPU if they managed to combine elements from both architectures. The convergence you're so afraid of has practically happened everywhere else, btw. Zen is really copying the micro-op caches and latencies of Intel's post-Sandy Bridge architecture, and AMD is doing similar things to NVIDIA with their GPU front ends. Convergence makes sense, as some paradigms do work better than others. That doesn't limit choice, just performance outliers, and it actually enhances competition.


    No. F*ck that. If I wanted even more vendor lock-in I would have stayed with my Voodoo and Glide. It's wrong, and NVIDIA can't add anything to DX that Microsoft and AMD won't allow, and vice versa. That is a good thing, and unfortunately it's not happening with Vulkan.

    They do, but on the compute queue. They don't have it in the general pipeline where compute, graphics and copy queues can be switched on the compute unit level. leldra's post was actually very informative about that.
     
  15. Stormyandcold

    Stormyandcold Ancient Guru

    Messages:
    5,704
    Likes Received:
    363
    GPU:
    MSI GTX1070 GamingX
    Unfortunately PR, your reply basically stifles competition and also holds back progress, but you can't see that. Basically, you want the competition to have fixed terms. That's not real competition tbh. That's creating a framework to level the playing field when there's no need to do that.
     

  16. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,113
    Likes Received:
    466
    GPU:
    RTX 3080
    Async doesn't have to be "supported" by h/w, since all Intel GPUs and Keplers support DX12 without anything "async" in h/w. There is no requirement or any guidance on how the h/w should support execution of commands coming from several queues. There is no indication that GCN's way of supporting such execution is in any way better than Pascal's either. Considering that Pascal is dancing around GCN even in all DX12 titles, if you consider TDPs and relative chip complexity, I'd say that GCN is far from good here.

    I honestly don't know why you still write anything. The fact that you don't know **** has been established on this forum a long time ago.

    The number of ACEs has nothing to do with how much performance increase you're getting.

    You didn't answer my questions. Please do so.

    And no, I won't read PrMinisterG's posts since he's incorrect as well.

    Again, all of this is completely incorrect. Stop spreading misinformation on things you don't understand.

    Nv isn't "more at fault here" at all; most custom Vulkan extensions are from AMD at the moment. Doom is using them and does not use the NV ones, btw, which is the main reason for its relatively better performance on AMD h/w under VK.

    DX never was "a much cleaner API"; in fact it was the opposite of that in its early days. The reason it got better is rather simple: each new DX version started from scratch instead of building on top of the previous one, and previous ones were emulated on the newer ones. OGL has full backward compatibility between all versions. But none of this has anything to do with DX12 and VK anyway.

    OpenGL Next wasn't "stuck"; it was evolving at quite a rapid pace recently. Vulkan is not "OpenGL Next"; it's a completely separate API. There was nothing impossible in creating a Khronos API without Mantle; they've created lots of them just fine (including OpenCL, you may have heard about that one). Vulkan is not Mantle, nothing was "paid to NV", and IHV extensions exist even in DX11. AMD has them as well, in Vulkan too, more than NV has at the moment.

    DX12 doesn't "divide the pipeline". Graphics, Compute, Copy are s/w constructions of the API. DMA engines have been supported by all h/w since the days of the dinosaurs. A separate compute queue is a hack which fits AMD h/w better than anyone else's, but that doesn't mean it's somehow bad for other h/w, since it was actually NV who first implemented the ability to run several compute queues in parallel. Context switching on Maxwell is faster than on GCN4, let alone Pascal. GCN is not in any way better at running several programs at once, although up till Pascal it was better at running several contexts at once (context =/= program).
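    To make the "s/w constructions" point concrete, here's a toy sketch in Python (my own illustration, invented commands and ordering, nothing to do with actual D3D12 or driver code): the API exposes three queue types, but how they land on hardware engines is entirely up to the driver, and one engine can legally drain all three.

```python
from collections import deque

# Toy sketch: three API-level queue types. Whether they run on one
# hardware engine or several in parallel is the driver's decision;
# the API only defines the queues, not the hardware mapping.
API_QUEUES = {
    "graphics": deque(["draw_shadow_map", "draw_scene"]),
    "compute":  deque(["light_culling", "post_fx"]),
    "copy":     deque(["upload_texture"]),
}

def serialize_onto_one_engine(queues):
    """A 'driver' that funnels every API queue into a single hardware
    engine: everything still executes, just on one serialized timeline."""
    timeline = []
    for name in ("graphics", "compute", "copy"):  # arbitrary merge order
        while queues[name]:
            timeline.append((name, queues[name].popleft()))
    return timeline

timeline = serialize_onto_one_engine(API_QUEUES)
print(len(timeline))  # 5 -> all five commands ran on one engine
print(timeline[0])    # ('graphics', 'draw_shadow_map')
```

    A driver for h/w with independent engines could instead interleave the three deques in parallel; the application-visible API is the same either way.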

    NV doesn't "lack context switching". There's nothing to "bolt on". Pascal's context switching is a generation ahead of GCN's. You seem to completely misunderstand the purpose of a context switch, and b/c of this you confuse context switching with global command dispatch.

    Your feeling that AMD is closing the gap more and more doesn't look like something based on facts, since with Pascal vs Polaris the gap has become larger than it was with SI vs Maxwell. I don't really see where you get the fantasy of 2GHz being a "hard stop", or why you think that NV even needs high frequencies to beat AMD.

    NV won't have to change their scheduling since this isn't a problem; they'll have to change the capabilities of their SMs, and while those will become more "GCN like", it is basically the same as GCN becoming more "NV like" by implementing FB color compression in GCN3: a natural evolution of the h/w, not something that is being forced on NV somehow.

    TL;DR: You don't know what you're talking about and you're heavily AMD biased.

    Not really. Mantle failed b/c it was AMD's proprietary software which no other IHV would ever have supported. I don't know where the idea of Mantle and Vulkan being somehow better suited for AMD came from. Logically, architecturally, they are not. Technically, at present, AMD will get more benefits because of how bad their DX11/OGL drivers are, but that's about it.
     
    Last edited: Sep 10, 2016
  17. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,900
    Likes Received:
    773
    GPU:
    Inno3D RTX 3090
    The point I made is that custom extensions were the bane of OpenGL, and if both vendors keep at it they will be the bane of Vulkan too. Vulkan fits AMD GPUs just fine, if only because of the multicore usage of the CPU. Maybe if you read what I actually wrote, it would be better.

    Nope, it was an API with a much more stable and less "hacky" implementation, much better documentation on top, and a much more robust leadership behind it than the ARB. Maybe if you just read, you would also see that multiple attempts at rebooting OpenGL have happened, precisely because of the cruft and the mess behind it.


    I might be wrong about some things, but at least I try to provide links for other people to check. You seem content that we believe whatever you say. glNext is Vulkan, and it wasn't going anywhere for at least two years until AMD donated Mantle to the Khronos Group.

    I posted a fairly comprehensive history of OpenGL (along with a link to it), which I guess you didn't read. They haven't created anything just "fine". In fact most of the time they were either too slow, or they didn't know what the f*ck they were doing. Custom extensions for Mantle were a concession to NVIDIA, according to a lot of rumors floating around, but I have nothing concrete on that. NVIDIA is already the vendor with the most custom extensions on Vulkan, btw.

    It does. It divides it into three software queues (obviously; it's an API, not a GPU) that map exactly to how GCN is partitioned internally. You call the separate compute queue a "hack" in one sentence, and then you praise NVIDIA for implementing a way to run multiple ones in parallel. I won't even go in depth on why a separate compute queue is a "hack" according to you; it seems fine to people who know more than both of us. Maybe we need to read a bit more, I guess.

    We're back at leldra's post, basically. GCN offers much more predictable performance when you want to context switch. NVIDIA's approach is faster most of the time, but you have to trust the driver on how it statically divides the pipeline between calls. You don't have to do that with GCN, as each CU can do its own "thing", and GCN seems to have very predictable latencies. It's a much more programmer-friendly approach if you want to do multiple things on the GPU. Read this ExtremeTech article; they even have benchmarks. Even the 290 is better at context switching than the 980 Ti. Pascal adds preemption at the pixel level, but no context switching between draw call boundaries.



    Maxwell lacks context switching between draw calls. NVIDIA's own documentation says so. Pascal still needs to finish what it's doing before changing contexts.

    In contrast, GCN's context switching is orders of magnitude faster, as each CU has local memory, and the contents of whatever is switched don't have to leave the die for VRAM or cache. You might want to read more here; really good article.
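    A toy latency model of the distinction being argued here (my own sketch in Python, invented numbers, not a benchmark of any real GPU): the worst-case wait for a preemption request depends on the granularity at which the hardware can stop what it's doing.

```python
# Toy model: worst-case wait before a high-priority request can run,
# depending on preemption granularity. All numbers are invented.
def preemption_delay_us(draw_call_us, progress_us, granularity):
    """Time a preemption request waits before the GPU can switch.

    'draw_call'   -> must run the current draw call to completion first
    'instruction' -> can stop almost immediately (idealized here as 0)
    """
    if granularity == "draw_call":
        return draw_call_us - progress_us  # the remainder of the call
    return 0.0

# A 5000 us draw call interrupted 1000 us in:
print(preemption_delay_us(5000.0, 1000.0, "draw_call"))    # 4000.0
print(preemption_delay_us(5000.0, 1000.0, "instruction"))  # 0.0
```

    The point of the model: with coarse-grained preemption, the worst case scales with the longest draw call in flight, which is exactly why predictability matters for things like VR time warp.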

    That's why it's called a "feeling". Of course none of us has hard facts yet, but I find the 2.05GHz "limit" exhibited by almost all Pascal GPUs to date quite fascinating. If you simply extrapolate performance from 1.7GHz down to 1.3GHz, even the RX 480 could catch the 1070; it's not exactly rocket science as to why NVIDIA's architectures seem to need higher clocks. I don't know how you can say that the Pascal vs Polaris gap is even "greater". In what sense? The RX 480 is probably the best purchase of the two, and depending on the engine even Fury the Hot Dinosaur gets on the heels of the 1070. I would still get the 1070, btw, as I have said multiple times. What a f*cking fanboy am I, right?

    Yes. I agree 95% on this. I disagree about the scheduling hardware, I believe that a few years into DX12's life, NVIDIA will have to adapt that. As for the rest, the use of the word "force" might not be the best, but you do have convergences in design.

    Mantle didn't fail. Mantle proved a point and was adopted as the next OpenGL. How do Vulkan and Mantle not fit AMD? Is multi-threaded submission bad for them? :infinity:

    Don't you wonder sometimes how they can write the APIs and drivers for two consoles, a separate PC API, and provide the basis for the next OpenGL and (possibly) DX12, and yet you honestly believe they can't write a proper DX11 driver? What if a single-core submission API simply doesn't fit a GPU made for small multicore environments, created to run multiple smaller programs on it with no latencies?

    TL;DR: Read.
     
  18. dr_rus

    dr_rus Ancient Guru

    Messages:
    3,113
    Likes Received:
    466
    GPU:
    RTX 3080
    GCN does not switch contexts "quickly"; it is slower at this than even Maxwell. "Preempt" means the same as context switch, with the additional option of doing it in the middle of another context running. All GPUs have "on die buffers"; they are called LDS and caches.

    Nothing of this has anything to do with how GCN is running in DX12 and VK.

    ACEs were always just dispatchers. HWS is not a centralized unit; it's just an updated ACE with more functionality, capable of managing two command streams simultaneously. A global scheduler is present in all GCN GPUs.


    This point is wrong. There is nothing bad in some IHV exposing functionality which is absent from other IHV's h/w.
    "Multicore usage of the CPU" in Vulkan is the same between all IHVs.
    Maybe if you'd actually known what you're talking about there would be something to read in your posts.

    Early DX versions were very bad compared to OpenGL. This is basically the main reason why id, which was among the 3D pioneers, is still using OpenGL now. The situation improved considerably around DX7, but at that point MS switched to a "clean slate" approach, with each next DX version basically substituting the previous one instead of extending it, which made any comparison of new DX versions to previous ones pointless.

    Maybe if you actually knew things instead of spouting fantasies, you'd know this. OpenGL was a much more robust API with better industry support and documentation up till ~DX7. Its problems are the other side of its advantages: with lots of h/w supported back to the pre-3D-acceleration era, it carries loads of stuff which slows it down and complicates things these days.


    Your problem is that you don't actually know anything; you just find links on Google using the words you're saying. I don't post any links because I don't feel like I need to support my words before you, you know. Feel free to educate yourself and you'll see that I'm right.

    OpenGL was and is developing just fine, which you can easily see on the wiki history page. Before OpenGL Next became Vulkan, it was supposed to be OpenGL 5.0, which would have been very much another extension of OpenGL in the vein of the 3.x and 4.x versions. But then it was decided that instead of evolving OpenGL they'd produce Vulkan, which is named this way specifically to distance it from OpenGL, as it's neither a continuation nor a substitute for the latter. There will still be new OpenGL versions alongside Vulkan, in the same way as there already are new DX11 versions alongside DX12.

    Also note that the idea of starting from a clean slate with some next OpenGL version had been around for quite some time, and the fact that AMD gifted the Mantle specs to Khronos doesn't really mean much, as those specs were unusable anywhere but on AMD's h/w. Vulkan is not Mantle, even though all three recent graphics APIs share the same basic concept.


    You posted a link to a rant which has nothing to do with reality. OpenGL's development wasn't and isn't in any way different from any other software development, including DX. There was one time when OpenGL fell behind DX in features (the DX11 launch timeframe), but that's it; no other issues worth even speaking about were ever there.

    And I won't even comment on "rumors floating around" as they are floating inside your head only.

    If you actually knew **** you'd see that even with the recent memory extensions added to the spec (not even to the driver yet) it's still 4 extensions from NV against AMD's 5, but alas you don't and can't.

    The funny part here is that all your rage above against NV and their Vulkan extensions is completely misdirected. But it shows your knowledge level and AMD bias again.


    GCN isn't partitioned internally like this at all. GCN has support for 16-64 command queues of any type, plus dedicated DMA engines for data movement. This differs from NV's (or Intel's) h/w in queue grouping and the total number of queues allowed (which differs between AMD GPUs as well), and that's all.

    A compute queue is a hack because only AMD GPUs have a quirk which allows them to get more performance from running compute commands simultaneously with graphics ones. NV GPUs don't have that issue, and for them this isn't needed. Note that AMD is specifically promoting compute queues for asynchronous execution, while they could in fact be graphics queues as well (Vulkan allows this, I believe; DX12 doesn't), but in that case AMD h/w would be hit with the same issues it already has with the graphics queue.

    So I call that a hack; you noticed, good. This is probably the only AMD-centered thing in the new APIs that I can think of. Once AMD fixes their graphics pipeline they won't need it, and they won't get much from running compute specifically asynchronously. This is already apparent on Polaris, which gains significantly less from async compute than previous GCN gens. As with any hack, it will be less and less beneficial as the h/w fixes its issues.
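    A toy occupancy model of that claim (my own sketch, invented numbers, not measured data): the gain from overlapping "async" compute with graphics is roughly bounded by the fraction of execution slots the graphics workload leaves idle, so h/w that already keeps its units busy has little to gain.

```python
# Toy model: extra throughput available to overlapped compute, assuming
# compute can soak up any slot that graphics leaves empty. All numbers
# are invented for illustration.
def async_gain(graphics_occupancy):
    """Fraction of extra work that async compute could overlap, given the
    fraction of execution slots graphics already keeps busy."""
    assert 0.0 <= graphics_occupancy <= 1.0
    return round(1.0 - graphics_occupancy, 2)

# H/w that fills its units poorly has a lot to gain from async compute...
print(async_gain(0.70))  # 0.3
# ...while h/w that already keeps them busy gains almost nothing.
print(async_gain(0.95))  # 0.05
```

    Under this (simplified) model, a generation-over-generation improvement in graphics-only utilization shrinks the async headroom automatically, which is the "hack becomes less beneficial" argument above.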

    So, A: context switching has nothing to do with asynchronous compute execution on GCN GPUs. One would think that a person talking about this stuff in this thread knows this. Your graphs are irrelevant.

    B: GCN doesn't offer "much more predictable performance when you want to context switch". In fact, it was the opposite with Maxwell (maybe AMD has already fixed it in their driver). With Pascal, NV is a generation ahead here, as they can preempt at the instruction level, meaning the context switch can be pretty much instantaneous.

    I also find the bolded part funny as hell after reading all this crap from you about how GCN is better at context switching. I also have no idea what you mean by the "driver dividing the pipeline between calls". If that's about preemption, then AMD is a gen behind here right now.

    GCN also lacks context switching between draw calls. This just shows how little you actually understand of the thing you're talking about. Every piece of h/w needs to finish what it's doing before doing something else, GCN included.

    You have a soup of misconceptions in your head. Context switching, preemption and asynchronous execution are all different things, even though they are sometimes linked (and sometimes they aren't). None of this has anything to do with GCN's gains in DX12/VK compared to DX11/OGL. Most of this stuff can work and does work just fine in the old APIs as well, including async execution, btw.

    You might want to read less of stuff which is wrong and read more of stuff which is actually correct. Or just listen to what I'm saying instead of linking to some completely incorrect posts on social networks.

    GCN's context switching is considerably slower than Maxwell's and severely slower than Pascal's. However, GCN's CUs are context agnostic, and because of this they can in fact run several contexts in parallel ("in flight"). This is why AMD needs the ACE mechanism: for them, having several contexts on the GPU at once means they need to schedule commands from several contexts at once, since their CUs are able to accept them. An ACE is a global command dispatch unit, and the only part of a GCN GPU which actually switches contexts is the ACE/HWS unit, but it does so only when it has to change the type of queue it is working with.

    GCN CUs do not change context; they are context agnostic. They can accept wavefronts from any active context and execute them when the CU execution pipeline allows it. Nothing is stored anywhere at this point, since everything which is running on a CU (I believe it's 64 wavefronts per CU? don't really remember) is already loaded into the CU's LDS/registers or resides in L1/L2.

    This CU context indifference is the last strength GCN has over Pascal, where SMs are context aware; everything else is worse, and this strength will go away after the Volta launch as well. Although I don't expect Volta to be anywhere close to GCN2/3 in gains from async compute either, because, well, a hack only works on bugged h/w, and I don't expect Volta to be as bugged as GCN2/3, since even GCN4 isn't anymore.

    The only time a CU performs something resembling a "context switch" is when the program asks to preempt the current workload with another one. This happens even slower than on Maxwell (and orders of magnitude slower than on Pascal) on all GCN gens, but it's not really a "context switch"; it's a preemption call. You can preempt a context with a wavefront from the same context, for example. A widely known example of this is an async time warp request or a high priority queue interrupt.


    It's called BS, not a "feeling". An architecture on a given process has an optimal frequency window; news at eleven. Last year people were laughing when I said that Pascal would run at 2GHz; some were saying that NV had reached a "hard stop" at 1.5GHz. I'm 100% sure that next year there will be changes as well, although not necessarily in a northward direction.

    The Pascal vs Polaris gap is even greater in the sense of Pascal providing even higher perf/watt and perf/transistor compared to Polaris. You only need eyes to see that.

    Not sure if you're serious about Fury competing with the 1070 being a good thing for GCN and AMD. I think you're joking, aren't you? Because why would you still get a 1070 if that were actually the case?


    Adapt what? NV doesn't need AMD's ACE dispatch, because NV's SMs are never that idle, so there is no reason to be able to dispatch several warps at once onto different SMs; most of them are loaded just fine with one global dispatch port even right now, when they are able to work with one context only. NV's global dispatcher is context agnostic, so there's no need to have an "ACE" for each running context. If or when AMD fixes their h/w, they won't need as many ACEs either; in fact, I'm not even sure they need 8, since there's not much evidence that this helps compared to the 2 in GCN1.

    Well, if you consider spending years on something which wasn't used much an optimal way of "proving a point", then sure. I myself am not sure that this point needed any proof, since Mantle is basically a port of a console API to PC; that point was obvious for some years before there even was a Mantle. DX12 was in development before Mantle, I think. OpenGL was moving in that direction with AZDO etc., although with Mantle it probably ended up quite a bit different from what would have happened without it.

    What? You seriously think that AMD has anything to do with s/w for consoles? :lol: None of this was written by AMD; even Mantle was basically written by the DICE guys.

    The fact that DX11 can have a much better driver implementation has been proven without a doubt by NV; it has nothing to do with "smaller programs" or other non-existent green men. I know for a fact that AMD is able to use their async compute capabilities in DX11 as well, so even this does not require the new APIs. The new APIs are good at one thing only: they significantly lower the driver overhead on the CPU, because they remove the bulk of runtime validation, hoping that devs will write good code. Up till now this hope seems to be a bit misplaced, as most DX12 efforts are just half-assed.
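    A rough sketch of that overhead argument (my own toy model, all costs invented, not profiled from any real driver): explicit APIs cut CPU-side cost mostly by dropping per-call runtime validation and trusting the application to submit correct work.

```python
# Toy CPU-cost model with invented per-call costs, to show where the
# overhead saving of an explicit API comes from.
SUBMIT_US   = 1.0  # hypothetical cost to record/submit one draw call
VALIDATE_US = 4.0  # hypothetical per-call validation in a DX11-style driver

def cpu_cost_us(draw_calls, validated):
    """Total CPU time spent on submission, with or without runtime
    validation performed by the driver on every call."""
    per_call = SUBMIT_US + (VALIDATE_US if validated else 0.0)
    return draw_calls * per_call

print(cpu_cost_us(10_000, validated=True))   # 50000.0 (DX11-style)
print(cpu_cost_us(10_000, validated=False))  # 10000.0 (DX12/VK-style)
```

    The flip side, as noted above, is that the validation work doesn't disappear; the responsibility for it moves to the application, which is exactly where half-assed ports fall down.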

    Nope. Learn. The fact that you read loads of forum fan fiction or badly written editorials on the subject of this thread doesn't help you understand these issues at all.
     
    Last edited: Sep 11, 2016
  19. Carfax

    Carfax Ancient Guru

    Messages:
    2,913
    Likes Received:
    465
    GPU:
    NVidia Titan Xp
    This is true. I remember Ieldra telling me I was crazy to think that Pascal would reach at least 1.5GHz :) He didn't think Pascal would have such high clock speeds, and neither did many others.

    Anyway, epic post and very informative.. Too bad we don't have a rep system on these forums :bang:
     
  20. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    7,900
    Likes Received:
    773
    GPU:
    Inno3D RTX 3090
    I was preparing a huge reply and the forum didn't save it. And I was an idiot for not keeping it in a temp text file like I usually do. Damn.

    The gist of what I wanted to say is that dr_rus seems to say that GCN doesn't have "fast context switching", because the CUs themselves are context agnostic. That, of course, would present the GPU as a whole as a system you can literally throw whatever you want into, either graphics or compute, and get predictably fast results back; hence it would behave like a system with instant context switching.

    GCN seems to behave much more like a "bucket" partitioned into smaller pieces (each CU), where you can "throw" your code inside, and each smaller partition will run whatever it is you want, either graphics or compute. This approach seems to be slower than NVIDIA's for a lot of things, but it has the plus that, as a programmer, you don't really care up to a point.

    NVIDIA's approach seems to be like a pipe that things pass through really fast, but it's statically divided between draw calls. This means that if you want to do fancy stuff with generalized programming/compute that might require more compute than the static divide allows, you get latencies; or, if your compute work is done and the draw call is not, parts of the GPU remain unused.
     
