Maxwell 2nd Gen to Support DirectX 12?

Discussion in 'Videocards - NVIDIA GeForce' started by DamnedLife, Jan 30, 2015.

  1. DamnedLife

    DamnedLife Guest

    Messages:
    101
    Likes Received:
    0
    GPU:
    Sapphire R9 290X TRI-X OC
    First of all, I scoured the internet to find out whether the GTX 980, 970 and 960 support all the features of DirectX 12 (Direct3D 12, to be specific). Nvidia previously fooled us with clever wording, saying some GPUs supported the DirectX 11 and 11.1 APIs when they did not actually support the hardware-level features, and the same seems to be happening with DirectX 12. The official Nvidia blog at http://blogs.nvidia.com/blog/2015/01/21/windows-10-nvidia-dx12/ says:
    "We’re more than ready. GPUs built on our Maxwell GPU architecture – such as our recently released GeForce GTX 970 and GeForce GTX 980 – fully support DX12"

    But that just means the DX12 API, not all feature levels, surely? All the review sites seem to think that 2nd gen Maxwell supports the DX 11.3 and DX12 features announced so far. As DX12 is not finalized, at least all the previously announced features seem to be supported. So I looked for the most official Nvidia document there is, the GTX 980 whitepaper. I found that the features most commonly listed by review sites are indeed hardware-accelerated "features" (some are really improvements on top of the basic requirements, like Tiled Resources, where level 1 is the baseline and level 2 is exposed through optional cap bits). Those 4 features are supported in 2nd gen Maxwell, but Direct3D 12 may gain additional hardware features as it is not finalized. My opinion is that while Nvidia will have those 4 features added to the DirectX 12 API, AMD will have asynchronous compute and DMA availability added. AMD made asynchronous compute a big deal in their GCN hardware, including the PS4, and DMA is a way to compensate for their much slower CPUs: work can be offloaded via HSA and GPGPU if DMA is enabled along with a memory pool shared by the dGPU and system memory, which would also explain both consoles having AMD APUs. All 6 features should make porting much easier in the end.

    Nvidia's 4 features:
    Conservative Rasterization
    Volume Tiled Resources
    Rasterizer Ordered View
    Typed UAV Load

    And I will show that these are really hardware accelerated, using excerpts from the official whitepaper.

    1- First, Conservative Rasterization, which will be used along with multi-projection to build a new global illumination system, VXGI. Both are hardware accelerated. Conservative rasterization can also be used for accurate tiling and collision detection. It can be done on older hardware too, in software on the general shader hardware, but more slowly; it is an old technique that was rarely used and never had dedicated fixed-function hardware. As it will be required for next-gen applications (such as voxelization, i.e. VXGI), it now gets dedicated hardware acceleration. Multi-projection is also hardware accelerated, so the same geometry can be submitted once and reused. Previously, when instancing the same geometry, game developers were slacking and had each piece of geometry drawn every time it was needed: when the geometry wasn't used it was scrapped, and when it was called for again it had to be drawn again. Also, at any given time each view of the same geometry (as in each face of the cube) had to be calculated separately and couldn't be instanced when creating a voxel.

    First off Official Nvidia GTX 980 Whitepaper: http://international.download.nvidi...nal/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

    Details about these two features from GTX 980 whitepaper:

    Hardware Acceleration for VXGI – Multi-Projection and Conservative Raster

    One exciting property of VXGI is that it is very scalable—by changing the density of the voxel grid, and the amount of tracing of that voxel grid that is performed per pixel, it is possible for VXGI to run across a wide range of hardware, including Kepler GPUs, console hardware, etc. However, for Maxwell it was an important goal to identify opportunities for significant acceleration of VXGI that would enable us to demonstrate its full potential and achieve the highest possible level of realism.
    As described above, VXGI based lighting has three major phases – the first two are new, accomplishing the generation of a new voxel data structure, while the third stage is a modification of the existing lighting phase of real time rendering. Therefore, to enable VXGI as a real-time dynamic lighting technique, it is important that the new work – the creation of the voxel data structure – is as fast as possible, as this is the part of VXGI that is new work for the renderer. Fast voxelization ensures that changes in lighting or the position of objects in the scene can be reflected immediately in the lighting calculation. With this in mind, it was a top priority for Maxwell to implement hardware acceleration for this stage.
    One important observation is that the voxelization stage is challenged by the need to analyze the same scene geometry from many views—each face of the voxel cube—to determine coverage and lighting. We call this property of rendering the same scene from multiple views “multi-projection.” It turns out that multi-projection is a property of other important rendering algorithms as well. For example, cube maps (used commonly for assisting with modelling of reflections) require rendering to six faces. And as will be discussed in more depth later, shadow maps can also be rendered at multiple resolutions.
    Therefore, acceleration of multi-projection is a broadly useful capability.
    Today, multi-projection can be implemented either by explicitly sending geometry to the hardware multiple times, or by expanding geometry in the geometry shader; however, neither approach is particularly efficient. The specific capability that we added to speed up multi-projection is called “Viewport Multicast.” With this feature, Maxwell can use dedicated hardware to automatically broadcast input geometry to any number of desired render targets, avoiding geometry shader overhead. In addition, we added some hardware support for certain kinds of per viewport processing that are important to this application.

    [​IMG]

    “Conservative Raster” is the second feature in Maxwell that accelerates the voxelization process. As illustrated in the following Figure 11, conservative raster is an alternate algorithm for triangle rasterization.

    [​IMG]

    In traditional rasterization, a triangle covers a pixel if it covers a specific sample point within that pixel, for example, the pixel center in the following picture. Therefore with traditional rasterization, the four purple pixels would be considered “covered” by the triangle. With conservative rasterization rules on the other hand, a pixel is considered covered if any part of the pixel is covered by any part of the triangle. In the following picture, the seven orange pixels are also “covered” by conservative rasterization rules. Hardware support for conservative raster is very helpful for the coverage phase of voxelization. In this phase, fractional coverage of each voxel needs to be determined with high accuracy to ensure the voxelized 3D grid represents the original 3D triangle data properly. Conservative raster helps the hardware to perform this calculation efficiently; without conservative raster there are workarounds that can be used to achieve the same result, but they are much more expensive.
    The benefit of these features can be measured by running the voxelization stage of VXGI both ways (i.e., with the new features enabled vs. disabled). Figure 12 below compares the performance of voxelization on “San Miguel,” a popular test scene for global illumination algorithms—GTX 980 achieves a 3x speedup when these features are enabled.

    [​IMG]
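To make the difference between the two coverage rules described in the whitepaper excerpt concrete, here is a rough CPU-side sketch in Python. It is only an illustration, not how the GPU does it: the function names are made up for this example, the conservative test uses a plain separating-axis check, and pixels that merely touch the triangle at an edge or corner are treated as not covered, which is an assumption of this sketch.

```python
"""Toy comparison of center-sample vs. conservative pixel coverage.

Illustrative only: runs on the CPU and does not model GPU hardware.
All function names here are hypothetical.
"""


def edge(a, b, p):
    """Signed area term: positive when p lies to the left of edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def center_covered(tri, px, py):
    """Traditional rule: the pixel counts only if its center is inside."""
    center = (px + 0.5, py + 0.5)
    w = [edge(tri[i], tri[(i + 1) % 3], center) for i in range(3)]
    return all(v >= 0 for v in w) or all(v <= 0 for v in w)


def conservative_covered(tri, px, py):
    """Conservative rule: the pixel square overlaps the triangle at all.

    Separating-axis test between two convex shapes. Touching only at an
    edge or corner is treated as "not covered" (assumption of this sketch).
    """
    corners = [(px, py), (px + 1, py), (px + 1, py + 1), (px, py + 1)]
    axes = [(1, 0), (0, 1)]  # the pixel square's own axes
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        axes.append((a[1] - b[1], b[0] - a[0]))  # normal of triangle edge
    for ax, ay in axes:
        t = [ax * x + ay * y for x, y in tri]
        c = [ax * x + ay * y for x, y in corners]
        if max(t) <= min(c) or max(c) <= min(t):
            return False  # found an axis that separates the two shapes
    return True


def coverage(tri, width, height):
    """Return (center-sampled pixels, conservatively covered pixels)."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    standard = {p for p in pixels if center_covered(tri, *p)}
    conservative = {p for p in pixels if conservative_covered(tri, *p)}
    return standard, conservative


if __name__ == "__main__":
    tri = [(0.25, 0.25), (3.25, 0.25), (0.25, 3.25)]
    standard, conservative = coverage(tri, 4, 4)
    for y in reversed(range(4)):
        row = ""
        for x in range(4):
            if (x, y) in standard:
                row += "#"   # covered under both rules
            elif (x, y) in conservative:
                row += "o"   # covered only under the conservative rule
            else:
                row += "."
        print(row)
```

Every pixel found by the center-sample rule is also found by the conservative rule; the extra pixels are the partially covered ones along the triangle's edges, which is what voxelization needs so that no thin geometry is missed.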

    2- Secondly, Volume Tiled Resources, which will be hardware accelerated. This is really just an old feature used in a new way: Tiled Resources, previously unused except in a handful of games, can be extended into the 3rd dimension and used for voxelization this time around. Tiled Resources already had two levels of hardware support. 2nd gen Maxwell and GCN 1.0 and 1.1 have Level 2 Tiled Resources, so it is backwards compatible. But with Maxwell, Level 2 Tiled Resources will be extended into the 3rd dimension and can also be used along with multi-projection (another Maxwell-only hardware-accelerated feature) for new opportunities, like creating voxels (cubes) with only one side calculated/drawn (fewer calculations) and instancing the rest from that first side, which also reduces the memory footprint.


    Details about those features from GTX 980 whitepaper:

    Multi-Projection and Tiled Resources

    DirectX 11.2 introduced a feature called Tiled Resources that could be accelerated with an NVIDIA Kepler and Maxwell hardware feature called Sparse Texture. With Tiled Resources, only the portions of the textures required for rendering are stored in the GPU’s memory. Tiled Resources works by breaking textures down into tiles (pages), and the application determines which tiles might be needed and loads them into video memory. It is also possible to use the same texture tile in multiple textures without any additional texture memory cost; this is referred to as aliasing. In the implementation of voxel grids, aliasing can be used to avoid redundant storage of voxel data, saving significant amounts of memory.
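The tile (page) mapping and aliasing idea from the excerpt can be sketched as a small page table in Python. This is only a mental model under my own assumptions: `TilePool`, `TiledTexture`, `map_tile` and `read_tile` are hypothetical names for this example and have nothing to do with the actual Direct3D API.

```python
class TilePool:
    """Physical storage: a flat list of tiles (stand-in for video memory)."""

    def __init__(self):
        self.tiles = []

    def allocate(self, data):
        """Store one tile's worth of data and return its physical index."""
        self.tiles.append(data)
        return len(self.tiles) - 1


class TiledTexture:
    """Logical texture whose tiles are mapped on demand (hypothetical)."""

    def __init__(self, pool, tiles_x, tiles_y):
        self.pool = pool
        self.tiles_x = tiles_x
        self.tiles_y = tiles_y
        self.page_table = {}  # (tx, ty) -> index into pool.tiles

    def map_tile(self, tx, ty, physical_index):
        """Point a logical tile at a physical tile; no data is copied."""
        if not (0 <= tx < self.tiles_x and 0 <= ty < self.tiles_y):
            raise IndexError("tile coordinate out of range")
        self.page_table[(tx, ty)] = physical_index

    def read_tile(self, tx, ty):
        """Return the tile's data, or None if the tile is not resident."""
        index = self.page_table.get((tx, ty))
        return None if index is None else self.pool.tiles[index]


def demo():
    pool = TilePool()
    terrain = TiledTexture(pool, tiles_x=4, tiles_y=4)
    detail = TiledTexture(pool, tiles_x=8, tiles_y=8)

    grass = pool.allocate("grass texels")  # one physical tile
    terrain.map_tile(0, 0, grass)
    terrain.map_tile(1, 1, grass)          # aliasing inside one texture
    detail.map_tile(2, 3, grass)           # aliasing across textures
    return pool, terrain, detail


if __name__ == "__main__":
    pool, terrain, detail = demo()
    print("physical tiles in use:", len(pool.tiles))
    print("terrain (3, 3) resident:", terrain.read_tile(3, 3) is not None)
```

Three logical tiles end up backed by a single physical tile, and unmapped tiles cost nothing, which is the memory saving the whitepaper is describing for voxel grids.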

    One interesting application of Tiled Resources is multi resolution shadow maps. In the following Figure 13, the image on the left shows the result of determining shadow information from a fixed resolution shadow map.
    [IMG]
    [IMG]
    In the foreground, the shadow map resolution is not adequate, and blocky artifacts are clearly visible. One solution would be to use a much higher resolution shadow map for the whole scene, but this would be expensive in memory footprint and rendering time. Alternatively, with Tiled Resources it is possible to render multiple copies of the shadow map at different resolutions, each populated only where that level of resolution detail is needed based on the scene. In the image, each resolution of shadow map is illustrated with a different color. The highest resolution shadow map (in red) is only used in the foreground when that high resolution is required. This is another application of multi-projection that will benefit from the hardware acceleration in Maxwell.
    In the future, we also believe that tiled resources can be leveraged within VXGI, to save voxel memory footprint.



    3- Thirdly, Raster Ordered View concerns the order in which objects are rasterized, using special interlocks placed in the 2nd gen Maxwell shader units, just like those in the ROPs, so it is also hardware accelerated. It gives the developer control over the order in which elements are rasterized in a scene, so that elements are drawn in the correct order in the first place (previously everything was drawn first and then sorted afterwards to get a correct image, which is too slow). This feature applies specifically to Unordered Access Views (UAVs) written by pixel shaders, which by their very definition are unordered. ROVs offer an alternative to the unordered nature of UAVs, which would otherwise result in elements being rasterized simply in the order they finished. For most rendering tasks unordered rasterization is fine (deeper elements would be occluded anyhow), but for a certain category of tasks the ability to efficiently control the access order to a UAV is important for rendering a scene both correctly and quickly.

    [​IMG]

    The textbook use case for ROVs is Order Independent Transparency, which allows elements to be rendered in any order and still be blended together correctly in the final result (and quickly, thanks to ROVs). Order Independent Transparency is not new – Direct3D 11 gave the API enough flexibility to accomplish this task – however, these earlier OIT implementations were very slow due to sorting, restricting their usefulness outside of CAD/CAM. The ROV implementation can accomplish the same task much more quickly by getting the order correct from the start, as opposed to having to sort results after the fact. So Order Independent Transparency is finally fast enough to use in real-time rendering in games.

    [​IMG]

    Along these lines, since OIT is just a specialized case of a pixel blending operation, ROVs will also be usable for other tasks that require controlled pixel blending, including certain cases of anti-aliasing.


    Details about those features from GTX 980 whitepaper:

    Raster Ordered View

    To ensure that rendering results are predictable, the DX API has always specified “in order” processing rules for the raster pipeline, in particular the Color and Z units (“ROP”). Given two triangles sent to the GPU in order – first triangle “A,” then “B” – that touch the same XY screen location, the GPU hardware guarantees that triangle “A” will blend its color result before “B” blends it. Special interlock hardware in the ROP is responsible for enforcing this ordering requirement.

    DX11 introduced the capability for the pixel shader to bind “Unordered Access Views” of color and Z buffers, and read and write arbitrary locations within those buffers. However as the name implies, there is no processing order guarantee when multiple pixel shaders are accessing the same UAV.

    The next generation DX API introduces the concept of a “Raster Ordered View,” which supports the same guaranteed processing order that has traditionally been supported by Z and Color ROP units.

    Specifically, given two shaders A and B, each associated with the same raster X and Y, hardware must guarantee that shader A completes all of its accesses to the ROV before shader B makes an access.

    To support Raster Ordered View, Maxwell adds a new interlock unit in the shader with similar functionality to the unit in ROP. When shaders run with access to a ROV enabled, the interlock unit is responsible for tracking the XY of all active pixel shaders and blocking conflicting shaders from running simultaneously.

    One potential application for Raster Ordered View is order independent transparency rendering algorithms, which handle the case of an application that is unable to pre-sort its transparent geometry by instead having the pixel shader maintain a sorted list of transparent fragments per pixel.

    [​IMG]
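The "sorted list of transparent fragments per pixel" idea can be sketched for a single pixel in Python. This is a single-threaded toy under my own assumptions (the class and function names are made up), so it does not model the interlock itself; on a GPU, the ROV guarantee is what would keep concurrent pixel shaders from corrupting this per-pixel list while they update it.

```python
import bisect


def blend_over(dst, src_color, alpha):
    """Standard "over" blend of one transparent fragment onto dst."""
    return tuple(s * alpha + d * (1.0 - alpha) for s, d in zip(src_color, dst))


class PixelFragmentList:
    """Per-pixel list of transparent fragments (hypothetical helper)."""

    def __init__(self):
        self.fragments = []  # kept sorted by depth, nearest first

    def insert(self, depth, color, alpha):
        """Insert a fragment at its sorted position, whatever the arrival order."""
        bisect.insort(self.fragments, (depth, color, alpha))

    def resolve(self, background):
        """Composite back to front, i.e. farthest fragment first."""
        result = background
        for _depth, color, alpha in reversed(self.fragments):
            result = blend_over(result, color, alpha)
        return result


def blend_in_arrival_order(fragments, background):
    """Naive blending with no ordering: the result depends on arrival order."""
    result = background
    for _depth, color, alpha in fragments:
        result = blend_over(result, color, alpha)
    return result


RED = (0.9, (1.0, 0.0, 0.0), 0.5)    # farthest
GREEN = (0.5, (0.0, 1.0, 0.0), 0.5)
BLUE = (0.1, (0.0, 0.0, 1.0), 0.5)   # nearest
BLACK = (0.0, 0.0, 0.0)

if __name__ == "__main__":
    pixel = PixelFragmentList()
    for fragment in (BLUE, RED, GREEN):  # arbitrary submission order
        pixel.insert(*fragment)
    print("sorted list result:", pixel.resolve(BLACK))
    print("arrival order result:", blend_in_arrival_order([BLUE, GREEN, RED], BLACK))
```

The sorted list gives the same color no matter how the three fragments are submitted, while the naive arrival-order blend changes with the submission order, which is why unordered UAV access alone is not enough for this kind of blending.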

    4- Finally, Typed UAV (Unordered Access View) Load is a newer and improved form of the Unordered Access View, which DX11 introduced and Feature Level 11_1 extended. This time around, the unpacking and then ordering of this unordered data will be handled by the GPU instead of the CPU, as was previously the case. With 2nd gen Maxwell, NVIDIA has finally implemented the remaining features required for FL11_1 compatibility and beyond, updating their architecture to support the 16x raster coverage sampling required for Target Independent Rasterization and UAVOnlyRenderingForcedSampleCount. This feature set also extends to Direct3D 11.2, which, although it doesn’t have an official feature level of its own, does introduce some new (and otherwise optional) features that are accessed via cap bits. See the following image for UAV Slots, UAVs at Every Stage, UAV-only rendering, and Tiled Resources Level 2.

    [​IMG]

    Unordered Access Views (UAVs) are a special type of buffer that allows multiple GPU threads to access the same buffer simultaneously without generating memory conflicts. Because of this disorganized nature of UAVs, certain restrictions are in place that Typed UAV Load will address. As implied by the name, Typed UAV Load deals with cases where UAVs are data typed, and how to better handle their use. So in general any hardware that completely supports Feature Level 11_1 (all the optional bits included) will also be supporting Typed UAV Load!

    [​IMG]

    Typed UAV Load attempts to address issues (mostly restrictions) that currently exist in DX11. One of the downsides of UAVs is that specific restrictions are in place due to their unordered nature. Basically, unpacking was handled on the software side, which means the job was put on the CPU. Now the GPU will be able to accomplish the same thing without CPU intervention using Typed UAV Loads.
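For anyone wondering what the "unpacking" step looks like in practice, here is a small Python sketch. It assumes one common packed layout, four 8-bit normalized channels in a 32-bit word with red in the lowest byte (the R8G8B8A8_UNORM layout); the function names are made up for this example. With a typed load, this kind of format conversion is done for you when the value is read, instead of by hand-written code like this.

```python
def unpack_rgba8_unorm(packed):
    """Split a 32-bit word into four floats in [0, 1], red in the low byte."""
    r = packed & 0xFF
    g = (packed >> 8) & 0xFF
    b = (packed >> 16) & 0xFF
    a = (packed >> 24) & 0xFF
    return (r / 255.0, g / 255.0, b / 255.0, a / 255.0)


def pack_rgba8_unorm(rgba):
    """Inverse of unpack_rgba8_unorm: clamp, scale to 0..255, and pack."""
    packed = 0
    for shift, value in zip((0, 8, 16, 24), rgba):
        clamped = min(max(value, 0.0), 1.0)
        packed |= int(clamped * 255.0 + 0.5) << shift
    return packed


if __name__ == "__main__":
    texel = 0x80402010  # arbitrary example value
    print(unpack_rgba8_unorm(texel))
    print(hex(pack_rgba8_unorm(unpack_rgba8_unorm(texel))))
```

The manual version has to know the exact bit layout of every format it touches, which is the kind of restriction the post is talking about.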
     
  2. sykozis

    sykozis Ancient Guru

    Messages:
    22,492
    Likes Received:
    1,537
    GPU:
    Asus RX6700XT
    DirectX 12 is not finalized yet. Stop worrying about what cards will "fully support" DirectX 12 until it's actually finalized.
     
  3. DamnedLife

    DamnedLife Guest

    Messages:
    101
    Likes Received:
    0
    GPU:
    Sapphire R9 290X TRI-X OC
    Yeah, I know. I just theorized that those 4 Nvidia features, along with 2 unannounced AMD features (already in place in Mantle), will be all there is to DX12 when it is finalized.

    Quoting myself: "As DX12 is not finalized, at least all the previously announced features seem to be supported. So I looked for the most official Nvidia document there is, the GTX 980 whitepaper. I found that the features most commonly listed by review sites are indeed hardware-accelerated "features" (some are really improvements on top of the basic requirements, like Tiled Resources, where level 1 is the baseline and level 2 is exposed through optional cap bits). Those 4 features are supported in 2nd gen Maxwell, but Direct3D 12 may gain additional hardware features as it is not finalized. My opinion is that while Nvidia will have those 4 features added to the DirectX 12 API, AMD will have asynchronous compute and DMA availability added. AMD made asynchronous compute a big deal in their GCN hardware, including the PS4, and DMA is a way to compensate for their much slower CPUs: work can be offloaded via HSA and GPGPU if DMA is enabled along with a memory pool shared by the dGPU and system memory, which would also explain both consoles having AMD APUs. All 6 features should make porting much easier in the end."
    DX12 is all about becoming more like console APIs: low-level, multi-threaded, with every detail, including optimizations, up to the developers. Since consoles are also x86-based this generation, it will be even easier to port between PC and consoles if the APIs behave similarly, if not identically.

    All I hypothesize is that Microsoft won't change these features or add much on top of them, and since DX12 was a collaboration between hardware vendors and Microsoft, 2nd gen Maxwell is actually fully DX12 capable at the hardware level, and we will see DX12-capable AMD cards very soon.
     
  4. CK the Greek

    CK the Greek Maha Guru

    Messages:
    1,316
    Likes Received:
    37
    GPU:
    RTX 2060S
    Nice thread though gathering all that info.
     

  5. Carfax

    Carfax Ancient Guru

    Messages:
    3,956
    Likes Received:
    1,450
    GPU:
    Zotac 4090 Extreme
    Very informative post. You get an A+ for effort :)
     
  6. DamnedLife

    DamnedLife Guest

    Messages:
    101
    Likes Received:
    0
    GPU:
    Sapphire R9 290X TRI-X OC
