Discussion in 'Frontpage news' started by Hilbert Hagedoorn, Jan 26, 2015.
So, basically, the performance hit would be similar to using x87 for PhysX vs SSE4?
I think it would be worse. The difference is that applications and APIs are coded in high-level languages, and the programmer knows everything about the algorithm and the desired end result, so a lot of optimisations can be made either manually or by the compiler.
At the driver level, you have no choice but to emulate the missing bytecode instruction exactly the way it was specified, using the assembly code of the machine architecture. For example, a simple swizzle operation takes a single register store on architectures that implement it natively, but requires several register load/store operations to emulate, consuming considerably more instruction slots and registers. It's even worse, or outright impossible, for more complex instructions or features that require dedicated hardware or lots of memory accesses, like virtual page tables, texture filtering etc.
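To make that overhead concrete, here is a toy C++ sketch of my own (not real GPU ISA): a native swizzle is encoded directly into the consuming instruction at no extra cost, while the emulated path has to expand into one move per component through a scratch register.

```cpp
#include <array>
#include <cstddef>

// Toy model of a 4-component GPU register.
using Vec4 = std::array<float, 4>;

// Emulated swizzle: without native support, the driver must expand one
// bytecode instruction into a sequence of per-component moves through a
// scratch register, consuming extra instruction slots and registers.
// 'ops' counts the emitted move operations; on hardware with native
// swizzle support the equivalent cost would be zero extra slots.
Vec4 emulated_swizzle(const Vec4& src, const std::array<int, 4>& pattern,
                      int& ops) {
    Vec4 scratch{};                    // extra register burned by emulation
    for (std::size_t i = 0; i < 4; ++i) {
        scratch[i] = src[pattern[i]];  // one move per component
        ++ops;
    }
    return scratch;
}
```

So a single-slot native operation becomes four moves plus a temporary, which is exactly the kind of blow-up that makes emulating missing bytecode instructions expensive.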
The other approach is to detect and replace affected shader code with an optimized version, which both AMD and NVIDIA did in the early years. But that's basically cheating, and anyway they don't have the resources to analyze and rewrite every shader on the planet that requires a certain missing feature.
What exactly? If you're talking about Windows, no way. You'd be crippled simply by the low I/O. I've tried it on a USB 3.0 stick with Linux and it still was far from enjoyable, and that stick reached 200 Mbps.
Updated my Windows 10 preview to the latest Technical Preview the other day. I think Windows 10 will be the best Windows OS ever! I love it.
I just want to thank you, DmitryKo, for helping to demystify this subject.
That's hardware compatibility you're talking about, but hardware alone does not make DirectX functionality. Without drivers the hardware can't talk to the OS (or subsystems thereof). While existing DX11 drivers will happily install (on most hardware... not all motherboards or video cards are supported yet) and work, they only enable DX11 features. There are no DX12 drivers yet, so the hardware doesn't know how to talk to the DX12-only features. Windows 10's Dxdiag will happily detect that a video card is DX12-ready, but that's all Windows 10 can do with DX12 right now. To do more than that, it needs drivers that are written for DX12. Right now it just falls back to DX11 mode, because any drivers you might have for your video card were written for DX11, not DX12.
Still, the Windows 10 kernel and the DX12 codebase have been improved (well, really they're still in the process of being improved), so supposedly we will see some performance gains with the finished product of Windows 10 compared to Windows 8.1 on the same hardware. But of course it's not the final product yet. And of course you know there will be Windows Update patches after release, not to mention driver optimization. Maybe even a DX12.1 sometime down the road, but that's just pure speculation on my part.
Myself, I will continue to run my games on Windows 8.1 while I play around with Windows 10 builds in a VM shell. I just don't think that dual-booting with an OS that's still in beta is a good idea. I can't afford to have my hardware borked by a bug; there was already one instance of a Windows 10 build that bricked some hard drives, and I'm not rich enough to risk wrecking an entire PC. Now if you'll excuse me, I need to go hug my motherboard: all this talk of bricking and wrecked hardware has got it worried. :infinity:
A USB 1.0 flash drive would be incredibly slow. Faster to install off a DVD.
But yeah a USB 3.0 flash drive would be the preferred media to install the OS from. Or USB 3.1 if you're lucky enough to have one of those (USB 3.1 launched this month at CES but I've yet to see a USB 3.1 flash drive in retail.)
I've got a question: what will DirectX 12 bring to the table? What new features can we expect that weren't previously available in DX10/11?
Thank you sir. Much appreciated.
I checked the GTX 980 whitepaper from the NVIDIA website to see whether those features are really in the hardware or just software implementations. I recommend you read it as well, since you already know a great deal. I've found that those are indeed hardware-accelerated features (some are actually improvements added on top of basic requirements, like Tiled Resources, where Level 1 is the baseline and Level 2 adds optional cap bits). Though those four features are supported in 2nd-gen Maxwell, Direct3D 12 may have additional hardware features, as it is not finalized. My opinion is that while NVIDIA will have those four features added to the DirectX 12 API, AMD will have asynchronous compute and DMA availability added to it. AMD made asynchronous compute a big deal in their GCN hardware, including the PS4, and DMA is a way to compensate for their much slower CPU: it can be supported via HSA with GPGPU if DMA is enabled along with a memory pool shared by the dGPU and system memory, which explains both consoles having AMD APUs. All six features will help make porting much easier in the end.
volume tiled resources
rasterizer ordered view
typed UAV load
1- First, Conservative Rasterization, which will be used along with multi-projection to create a new global illumination system, VXGI. Both are hardware accelerated. Conservative rasterization can also be used for accurate tiling and collision detection. It can actually be used on older hardware too (albeit in software mode), but more slowly: it's an old technique that was rarely used and only available through general shader hardware, not fixed-function hardware. As it will be required for next-gen applications (such as voxelization, i.e. VXGI), it became fixed-function hardware accelerated. Multi-projection is also hardware accelerated, so the same geometry can be instanced once and for all across different views. Previously, when reusing the same geometry, each copy had to be drawn each and every time in the timeline of the application: when the geometry wasn't used it was scrapped, and when it was called once more, it had to be drawn again. Also, for a specific time in the application, each face of the same geometry had to be calculated separately (as in each face of a cube) and couldn't be instanced for creating a voxel.
First off Official Nvidia GTX 980 Whitepaper: http://international.download.nvidi...nal/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF
Details about these two features from GTX 980 whitepaper:
Hardware Acceleration for VXGI – Multi-Projection and Conservative Raster
One exciting property of VXGI is that it is very scalable—by changing the density of the voxel grid, and the amount of tracing of that voxel grid that is performed per pixel, it is possible for VXGI to run across a wide range of hardware, including Kepler GPUs, console hardware, etc. However, for Maxwell it was an important goal to identify opportunities for significant acceleration of VXGI that would enable us to demonstrate its full potential and achieve the highest possible level of realism.
As described above, VXGI based lighting has three major phases—the first two are new, accomplishing the generation of a new voxel data structure, while the third stage is a modification of the existing
lighting phase of real time rendering. Therefore, to enable VXGI as a real-time dynamic lighting technique, it is important that the new work—the creation of the voxel data structure—is as fast as possible, as this is the part of VXGI that is new work for the renderer. Fast voxelization ensures that changes in lighting or the position of objects in the scene can be reflected immediately in the lighting calculation. With this in mind, it was a top priority for Maxwell to implement hardware acceleration for this stage.
One important observation is that the voxelization stage is challenged by the need to analyze the same scene geometry from many views—each face of the voxel cube—to determine coverage and lighting. We call this property of rendering the same scene from multiple views “multi-projection.” It turns out that multi-projection is a property of other important rendering algorithms as well. For example, cube maps (used commonly for assisting with modelling of reflections) require rendering to six faces. And as will be discussed in more depth later, shadow maps can also be rendered at multiple resolutions.
Therefore, acceleration of multi-projection is a broadly useful capability. Today, multi-projection can be implemented either by explicitly sending geometry to the hardware multiple times, or by expanding
geometry in the geometry shader; however, neither approach is particularly efficient. The specific capability that we added to speed up multi-projection is called “Viewport Multicast.” With this feature, Maxwell can use dedicated hardware to automatically broadcast input geometry to any number of desired render targets, avoiding geometry shader overhead. In addition, we added some hardware
support for certain kinds of per viewport processing that are important to this application.
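As a rough CPU-side model of what Viewport Multicast buys (my own sketch, not NVIDIA's implementation): the naive path resubmits the same geometry once per view, while the multicast path submits it once and lets dedicated hardware fan it out to every target.

```cpp
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };

// Counts how often the application has to push the mesh into the pipeline
// ('submissions') and how many per-view vertex streams come out ('emitted').
struct DrawStats { int submissions = 0; std::size_t emitted = 0; };

// Naive multi-projection: resubmit the mesh once per view (e.g. once per
// cube-map face), paying full front-end cost every time.
DrawStats draw_per_view(const std::vector<Vertex>& mesh, int num_views) {
    DrawStats s;
    for (int v = 0; v < num_views; ++v) {
        ++s.submissions;               // geometry sent to the GPU again
        s.emitted += mesh.size();
    }
    return s;
}

// Multicast model: one submission, the hardware broadcasts the geometry
// to all viewports/render targets, avoiding geometry-shader expansion.
DrawStats draw_multicast(const std::vector<Vertex>& mesh, int num_views) {
    DrawStats s;
    ++s.submissions;                   // geometry enters the pipeline once
    s.emitted = mesh.size() * static_cast<std::size_t>(num_views);
    return s;
}
```

Both paths produce the same per-view output; the multicast path just gets there with a single submission, which is the point of the feature.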
“Conservative Raster” is the second feature in Maxwell that accelerates the voxelization process. As illustrated in the following Figure 11, conservative raster is an alternate algorithm for triangle rasterization.
In traditional rasterization, a triangle covers a pixel if it covers a specific sample point within that pixel, for example, the pixel center in the following picture. Therefore with traditional rasterization, the four purple pixels would be considered “covered” by the triangle. With conservative rasterization rules on the other hand, a pixel is considered covered if any part of the pixel is covered by any part of the triangle. In the following picture, the seven orange pixels are also “covered” by conservative rasterization rules. Hardware support for conservative raster is very helpful for the coverage phase of voxelization. In this phase, fractional coverage of each voxel needs to be determined with high accuracy to ensure the voxelized 3D grid represents the original 3D triangle data properly. Conservative raster helps the hardware to perform this calculation efficiently; without conservative raster there are workarounds that can be used to achieve the same result, but they are much more expensive.
The benefit of these features can be measured by running the voxelization stage of VXGI both ways (i.e., with the new features enabled vs. disabled). Figure 12 below compares the performance of voxelization on “San Miguel,” a popular test scene for global illumination algorithms—GTX 980 achieves a 3x speedup when these features are enabled.
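The two coverage rules are easy to sketch in plain C++. This is my own toy rasterizer, and note one simplification: the conservative rule is approximated here by dense supersampling of the pixel, whereas real hardware uses exact edge tests.

```cpp
#include <array>

struct P { double x, y; };
using Tri = std::array<P, 3>;

// Signed area test used for point-in-triangle checks.
double edge(const P& a, const P& b, const P& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

bool inside(const Tri& t, const P& s) {
    double e0 = edge(t[0], t[1], s), e1 = edge(t[1], t[2], s),
           e2 = edge(t[2], t[0], s);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

// Standard rule: the pixel counts as covered only if the triangle contains
// the pixel's centre sample.
bool covered_standard(const Tri& t, int px, int py) {
    return inside(t, P{px + 0.5, py + 0.5});
}

// Conservative rule (approximated by an n*n sample grid): the pixel counts
// as covered if *any* part of it is touched by the triangle.
bool covered_conservative(const Tri& t, int px, int py, int n = 32) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (inside(t, P{px + (i + 0.5) / n, py + (j + 0.5) / n}))
                return true;
    return false;
}
```

A small triangle tucked into the corner of a pixel misses the centre sample, so standard rasterization drops it entirely, while the conservative rule still flags the pixel; that is exactly the case that matters when voxelizing thin or small geometry.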
2- Secondly, Volume Tiled Resources, which will be hardware accelerated. This is actually just the use of an old feature in a new way: Tiled Resources, previously barely used (except in a handful of games), can be extended into the 3rd dimension and used for voxelization purposes this time around. Tiled Resources came with two levels of hardware acceleration; 2nd-gen Maxwell and GCN 1.0/1.1 have Level 2 Tiled Resources, so it is backwards compatible. But with Maxwell, Level 2 Tiled Resources can be extended into the 3rd dimension and also used along with multi-projection (another Maxwell-only hardware accelerated feature) for new opportunities, like creating voxels (cubes) with only one side calculated/drawn (fewer calculations) and the rest instanced from the first side, which also reduces the memory footprint.
Details about those features from GTX 980 whitepaper:
Multi-Projection and Tiled Resources
DirectX 11.2 introduced a feature called Tiled Resources that could be accelerated with an NVIDIA
Kepler and Maxwell hardware feature called Sparse Texture. With Tiled Resources, only the portions of
the textures required for rendering are stored in the GPU’s memory. Tiled Resources works by breaking
textures down into tiles (pages), and the application determines which tiles might be needed and loads
them into video memory. It is also possible to use the same texture tile in multiple textures without any
additional texture memory cost; this is referred to as aliasing. In the implementation of voxel grids,
aliasing can be used to avoid redundant storage of voxel data, saving significant amounts of memory.
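The aliasing idea can be modelled as a simple page table. This is a hypothetical sketch of my own; the 64 KB tile size matches what D3D tiled resources use, everything else is made up for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>

// Toy page table for tiled resources: each (texture, tile) pair maps to a
// physical 64 KB page of video memory. Aliasing lets several tiles point
// at the same page, so identical content (e.g. empty voxel regions) is
// stored only once.
constexpr std::size_t kPageBytes = 64 * 1024;

struct TilePool {
    // (texture id, tile index) -> physical page id
    std::map<std::pair<int, int>, int> mapping;

    void map_tile(int texture, int tile, int physical_page) {
        mapping[{texture, tile}] = physical_page;
    }

    // Memory actually committed: one page per *distinct* physical page,
    // no matter how many tiles alias it.
    std::size_t committed_bytes() const {
        std::set<int> pages;
        for (const auto& kv : mapping) pages.insert(kv.second);
        return pages.size() * kPageBytes;
    }
};
```

Three mapped tiles that share one aliased page commit only two pages of memory, which is the saving the whitepaper is describing for voxel data.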
One interesting application of Tiled Resources is multi resolution shadow maps. In the following Figure
13, the image on the left shows the result of determining shadow information from a fixed resolution
shadow map. In the foreground, the shadow map resolution is not adequate, and blocky artifacts are
clearly visible. One solution would be to use a much higher resolution shadow map for the whole scene,
but this would be expensive in memory footprint and rendering time. Alternatively, with Tiled Resources
it is possible to render multiple copies of the shadow map at different resolutions, each populated only
where that level of resolution detail is needed based on the scene. In the image, each
resolution of shadow map is illustrated with a different color. The highest resolution shadow map (in
red) is only used in the foreground when that high resolution is required. This is another application of multi-projection that will benefit from the hardware acceleration in
Maxwell. In the future, we also believe that tiled resources can be leveraged within VXGI, to save voxel memory.
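The memory argument for multi-resolution shadow maps is easy to sketch. The numbers below are hypothetical, just to illustrate the trade-off: a uniform map must use the maximum resolution everywhere, while the tiled multi-resolution scheme pays only for what each region needs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each entry is the shadow-map resolution (in texels per side) that one
// region of the scene actually needs for artifact-free shadows.

// Uniform map: the whole scene is forced to the worst-case resolution.
std::size_t texels_uniform(const std::vector<int>& needed_res) {
    int max_res = 0;
    for (int r : needed_res) max_res = std::max(max_res, r);
    return needed_res.size() * std::size_t(max_res) * max_res;
}

// Multi-resolution map: each region only commits the tiles for the
// resolution it needs, as tiled resources allow.
std::size_t texels_multires(const std::vector<int>& needed_res) {
    std::size_t total = 0;
    for (int r : needed_res) total += std::size_t(r) * r;
    return total;
}
```

With one foreground region needing 1024x1024 and three background regions fine at 256 or 128, the uniform map costs over three times as many texels as the multi-resolution one.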
3- Thirdly, Raster Ordered View, which is about the order of rasterization of objects, using special interlocks placed in 2nd-gen Maxwell's shader units just like in the ROPs, so it is also hardware accelerated. It gives the developer control over the order that elements are rasterized in a scene, so that elements are drawn in the correct order in the first place (previously they were drawn first and then sorted into the correct order afterwards, which was too slow). This feature specifically applies to Unordered Access Views (UAVs) being written by pixel shaders, which by their very definition are unordered. ROVs offer an alternative to the UAV's unordered nature, which would otherwise result in elements being processed simply in the order they finished. For most rendering tasks unordered rasterization is fine (deeper elements would be occluded anyhow), but for a certain category of tasks, the ability to efficiently control the access order to a UAV is important to render a scene both correctly and quickly.
The textbook use case for ROVs is Order Independent Transparency, which allows elements to be rendered in any order and still blended together correctly in the final result (and quickly, thanks to ROVs). Order Independent Transparency is not new; Direct3D 11 gave the API enough flexibility to accomplish this task. However, those earlier OIT implementations were very slow due to sorting, limiting their usefulness to fields like CAD/CAM. The ROV implementation can accomplish the same task much more quickly by getting the order correct from the start, as opposed to having to sort results after the fact. So now Order Independent Transparency is finally fast enough to use in real-time rendering in games.
Along these lines, since OIT is just a specialized case of a pixel blending operation, ROVs will also be usable for other tasks that require controlled pixel blending, including certain cases of anti-aliasing.
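Why order matters for transparency can be shown in a few lines of C++. This is a toy single-channel "over" blend of my own, not any real OIT algorithm: the same fragments composited in arrival order versus depth-sorted order give different final colours.

```cpp
#include <algorithm>
#include <vector>

// A transparent fragment: colour (one channel for brevity), alpha, depth.
struct Frag { double colour, alpha, depth; };

// Correct OIT resolve: composite back-to-front with the "over" operator.
double blend_in_order(std::vector<Frag> frags) {
    std::sort(frags.begin(), frags.end(),
              [](const Frag& a, const Frag& b) { return a.depth > b.depth; });
    double dst = 0.0;  // black background
    for (const Frag& f : frags)
        dst = f.colour * f.alpha + dst * (1.0 - f.alpha);
    return dst;
}

// The same fragments blended in whatever order they arrived (simulating
// unordered shader completion): generally a different, wrong result.
double blend_as_arrived(const std::vector<Frag>& frags) {
    double dst = 0.0;
    for (const Frag& f : frags)
        dst = f.colour * f.alpha + dst * (1.0 - f.alpha);
    return dst;
}
```

Because the "over" operator is not commutative, unordered UAV writes cannot simply be blended as they complete; either an explicit sort (the slow DX11 approach) or a raster-order guarantee (ROVs) is required.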
Details about those features from GTX 980 whitepaper:
Raster Ordered View
To ensure that rendering results are predictable, the DX API has always specified “in order” processing
rules for the raster pipeline, in particular the Color and Z units (“ROP”). Given two triangles sent to the
GPU in order—first triangle “A,” then “B”—that touch the same XY screen location, the GPU hardware
guarantees that triangle “A” will blend its color result before “B” blends it. Special interlock hardware in
the ROP is responsible for enforcing this ordering requirement.
DX11 introduced the capability for the pixel shader to bind “Unordered Access Views” of color and Z
buffers, and read and write arbitrary locations within those buffers. However as the name implies, there
is no processing order guarantee when multiple pixel shaders are accessing the same UAV.
The next generation DX API introduces the concept of a “Raster Ordered View,” which supports the
same guaranteed processing order that has traditionally been supported by Z and Color ROP units.
Specifically, given two shaders A and B, each associated with the same raster X and Y, hardware must
guarantee that shader A completes all of its accesses to the ROV before shader B makes an access.
To support Raster Ordered View, Maxwell adds a new interlock unit in the shader with similar
functionality to the unit in ROP. When shaders run with access to a ROV enabled, the interlock unit is responsible for tracking the XY of all active pixel shaders and blocking conflicting shaders from running
One potential application for Raster Ordered View is order independent transparency rendering
algorithms, which handle the case of an application that is unable to pre-sort its transparent geometry
by instead having the pixel shader maintain a sorted list of transparent fragments per pixel.
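The interlock behaviour described above can be modelled with a per-pixel "ticket". This is a simplified sequential sketch of the ordering guarantee, not how the hardware is actually built: invocations arrive in arbitrary completion order, but the pixel only admits the primitive whose ID matches the next expected one, blocking the rest.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each arrival is (primitive id, tag). Primitive IDs are assumed to be a
// complete run 0..n-1 for the pixel. The returned string records the order
// in which ROV accesses were actually allowed to happen.
std::string resolve_in_raster_order(
        std::vector<std::pair<int, char>> arrivals) {
    std::string log;
    int next = 0;  // next primitive ID allowed to access the pixel
    while (!arrivals.empty()) {
        for (std::size_t i = 0; i < arrivals.size(); ++i) {
            if (arrivals[i].first == next) {      // interlock lets it through
                log += arrivals[i].second;        // the ordered ROV access
                ++next;                           // release the next primitive
                arrivals.erase(arrivals.begin() + i);
                break;
            }
            // otherwise this invocation stays blocked, like a conflicting
            // shader stalled by the interlock unit
        }
    }
    return log;
}
```

Even if primitive B's shader finishes first, its ROV access is held back until primitive A's accesses complete, which is exactly the guarantee the API specifies.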
4- Finally, Typed UAV (Unordered Access View) Load, an improvement to the Unordered Access Views introduced with Direct3D 11. Until now, shaders could only load directly from a small set of UAV formats (essentially 32-bit ones); other typed formats had to be packed and unpacked manually with extra shader code, and with Typed UAV Load the GPU performs that format conversion itself. With 2nd-gen Maxwell, NVIDIA has finally implemented the remaining features required for FL11_1 compatibility and beyond, updating their architecture to support the 16x raster coverage sampling required for Target Independent Rasterization and UAVOnlyRenderingForcedSampleCount. This extended feature set also covers Direct3D 11.2, which, although it doesn't have an official feature level of its own, does introduce some new (and otherwise optional) features that are accessed via cap bits. Look at the following image for UAV Slots, UAVs at Every Stage, UAV-only rendering, and Tiled Resources Level 2.
Unordered Access Views (UAVs) are a special type of buffer that allows multiple GPU threads to access the same buffer simultaneously without generating memory conflicts. Because of this disorganized nature of UAVs, certain restrictions are in place that Typed UAV Load will address. As implied by the name, Typed UAV Load deals with cases where UAVs are data-typed, and with how to better handle their use. So in general, any hardware that completely supports Feature Level 11_1 (all the optional bits included) will also support Typed UAV Load!
Typed UAV Load attempts to address restrictions that currently exist in DX11. One of the downsides of UAVs is that, due to their unordered nature, loads from most typed formats weren't supported directly, so the unpacking had to be handled in software with extra shader instructions. With Typed UAV Loads, the GPU's hardware can do the format conversion directly.
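What "manual unpacking" means in practice: without typed UAV loads, a shader can only read a raw 32-bit word from the UAV and must decode it itself. Here is that decode for an RGBA8 texel, sketched in C++ for illustration (a real shader would do the same bit-twiddling in HLSL; with typed UAV loads the texture hardware does it for free).

```cpp
#include <array>
#include <cstdint>

// Manually unpack an R8G8B8A8_UNORM texel stored in a raw 32-bit word
// (little-endian channel order: R in the low byte) into normalized floats.
// This is the extra per-load work that typed UAV loads eliminate.
std::array<float, 4> unpack_rgba8(std::uint32_t word) {
    return {
        ((word >> 0)  & 0xFF) / 255.0f,  // R
        ((word >> 8)  & 0xFF) / 255.0f,  // G
        ((word >> 16) & 0xFF) / 255.0f,  // B
        ((word >> 24) & 0xFF) / 255.0f,  // A
    };
}
```

Multiply that handful of shifts, masks and divides by every load in every thread, and it's clear why moving the conversion into hardware matters.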
I still think these are minor additions and as always there will be a considerable time lag before content creators and game programmers start using any of these features, even though they will be available in Direct3D 11.3 as well.
So again, right now any D3D11 card - specifically Radeon HD77xx, HD85xx, R5 240 and up (GCN cards) and GeForce GT420 and up (Fermi/Kepler/Maxwell) - is perfectly compatible with Direct3D 12 and all the performance enhancements it offers.
You are welcome.
So pumped for DX12, since DX12 will use more threads/cores and be more efficient at evening out the load across them. All DX versions before it were heavily single-threaded, which made AMD's main weakness show all the more. DX12 should help eliminate that, and hopefully Windows 10 as an OS is more core-friendly as well.
Turns out Conservative Raster can be used on 2nd-gen Maxwell even with Direct3D 11 via NVAPI, i.e. hardware support even without DirectX 12.
What is it?
Conservative raster is a cool new feature that came along with NVIDIA’s new Maxwell architecture, which means it is accessible today on GTX980/GTX970 boards. It allows rasterization to generate fragments for every pixel touched by a primitive. This post shows how to enable the feature, and highlights a couple of example use cases: ray traced shadows and voxelization.
How can it be enabled?
Use the NVIDIA NvAPI to create an ID3D11RasterizerState, the interface for this is shown below:
NVAPI_RS_DESC.ConservativeRasterEnable = TRUE; // Enable conservative raster
Direct3D11.3 & Direct3D12
These future Direct3D APIs will include full support for this feature, as detailed on slides 24 and 25 of this presentation: http://blogs.msdn.com/cfs-file.ashx...irect3D-12-_2D00_-New-Rendering-Features.pptx
In OpenGL, use the GL_NV_conservative_raster extension.
NVIDIA also has an OpenGL GameWorks SDK sample demonstrating how to set up Conservative Rasterization.
Is it possible to achieve Conservative Raster without HW support?
Yes, it is indeed possible to do this, and there is a very good article describing it here.
However both approaches add performance overhead, and as such usage of conservative rasterization in real time graphics has been pretty limited so far.
So what can I do with it?
There are no doubt many applications for the conservative raster, but in this article I’d like to highlight a couple of interesting usage cases:
NEW INFO FROM HERE ON OUT
Ray Traced Shadows
An interesting approach to ray tracing that avoids the use of bounding volume hierarchies, is to store primitives in a regular grid data structure as seen from the light’s point of view. Essentially you can think of this like a shadow map, but instead of storing depth, you store an array of triangles, as depicted in figure 7 below. Of course it is important to note that this kind of data structure would not scale to cater for a highly complex game scene, but instead would be more appropriate for a discrete dynamic object, such as the main game character.
Conservative rasterization is definitely needed in this case, to ensure that every triangle that touches a primitive map texel is stored. Without conservative raster enabled triangles will be missing from the map, and you’ll wind up with holes in the shadow. To perform the ray tracing, just compute primitive map lookups in a similar way as you would for shadow mapping, but instead perform ray-triangle intersection tests to determine if a screen pixel is occluded. The resulting quality of shadow, as compared to regular shadow mapping is very high. Figures 8 to 10 compare regular shadow mapping with the ray traced result, and show the effect of switching conservative raster on:
Figure 8: Regular Shadow Map
Figure 9: Ray Traced - Conservative Raster OFF
Figure 10: Ray Traced - Conservative Raster ON
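The per-texel occlusion test described above is just a ray/triangle intersection. Here is a standard Moller-Trumbore implementation in C++, as a sketch of the work done for each triangle stored in a primitive-map texel (my own code, not taken from the article).

```cpp
#include <cmath>

struct V3 { double x, y, z; };
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 cross(V3 a, V3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}
static double dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Moller-Trumbore ray/triangle test. Returns true if the ray
// orig + t*dir (t > 0) hits triangle (v0, v1, v2), i.e. the shaded point
// is occluded from the light by that triangle.
bool ray_hits_triangle(V3 orig, V3 dir, V3 v0, V3 v1, V3 v2) {
    const double eps = 1e-9;
    V3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    V3 pvec = cross(dir, e2);
    double det = dot(e1, pvec);
    if (std::fabs(det) < eps) return false;   // ray parallel to triangle
    double inv = 1.0 / det;
    V3 tvec = sub(orig, v0);
    double u = dot(tvec, pvec) * inv;
    if (u < 0.0 || u > 1.0) return false;     // outside barycentric range
    V3 qvec = cross(tvec, e1);
    double v = dot(dir, qvec) * inv;
    if (v < 0.0 || u + v > 1.0) return false;
    double t = dot(e2, qvec) * inv;
    return t > eps;                           // hit in front of the origin
}
```

The shadow pass runs this against every triangle listed in the primitive-map texel the pixel looks up, which is why conservative raster matters: a triangle missing from the list can never report a hit, and you get holes in the shadow.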
Voxelization
This is the process of converting a triangular mesh into a voxel representation, and can be thought of as a way of storing a lower level of detail for a scene, as highlighted by the photograph below:
Voxelization has many potential usage scenarios, but an obvious one is Global Illumination (GI). GI is the process of computing the effect of light as it bounces around a scene, the result of which is far more life-like lighting than traditional direct lighting solutions, as demonstrated by the comparison shots below:
Figure 12: Traditional Direct Lighting
Figure 13: Global Illumination
Conservative rasterization comes into play during the voxelization process, to ensure that small triangles are not dropped out of voxel information. Voxelization tends to be performed at a lower resolution than the full scene for performance reasons, which only increases the need for conservative raster to be enabled.