Discussion in 'Videocards - NVIDIA GeForce' started by pharma, Sep 17, 2018.
Nvidia's Lead Exceeds Intel's in Cloud
Nvidia’s GPUs now account for 97.4% of infrastructure-as-a-service (IaaS) instance types of dedicated accelerators deployed by the top four cloud services. By contrast, Intel’s processors are used in 92.8% of compute instance types, according to one of the first reports from Liftr Cloud Insights’ component tracking service.
AMD’s overall processor share of instance types is just 4.2%. Cloud services tend to keep older instance types in production as long as possible, so we expect AMD to increase its share as deployments of its second-generation Epyc processor, aka Rome, arrive in the second half of this year.
Among dedicated accelerators, AMD GPUs currently have only a 1.0% share of instance types, the same share as Xilinx’s Virtex UltraScale+ FPGAs. AMD will have to up its game in deep-learning software to make significant headway against the Nvidia juggernaut and its much deeper, more mature software capabilities.
Intel’s Arria 10 FPGA accounts for only 0.6% of dedicated accelerator instance types. Xilinx and Intel must combat the same Nvidia capabilities that AMD is facing, but FPGAs face additional challenges in data center development and verification tools.
The announcements shared by NVIDIA are still under embargo, but we also had the chance to speak with Justin Walker, Director of Product Management at NVIDIA, specifically about AMD’s newly revealed technologies. On sharpening, he pointed out that NVIDIA Freestyle has let gamers tweak exactly that since it was introduced in early 2018.
[…] they announced a bunch of technologies, Radeon Sharpening, which is fine. But you know, if you want to go compare that we’ve got, we’ve had sharpening in NVIDIA Freestyle for a very long time. NVIDIA Freestyle has got a whole suite of filters, one of which is sharpening. But there’s also things like HDR toning, and color vibrance, all that stuff. So if you want to play around with sharpening, just keep in mind, you don’t have to wait.
Moving on to the Anti-Lag subject, Walker had this to say:
I think it’s something similar to what we call maximum pre-rendered frames, which is actually something we’ve had in our control panel for some time.
Basically what happens is during the graphics pipeline, the CPU will start processing frames and send them into the pipeline. Now, if you allow it to buffer frames, meaning to see if you just go as fast as you can even if the GPU is not ready, it may send a few frames in the pipeline. You do that to get the max performance, then you can guarantee the GPU is never waiting for the CPU. Because if that happens, you may have to wait a little for the CPU to process before it needs a GPU. So a lot of times, you’ll buffer up a few frames in there, which is great if you’re worried about just straight up performance. However, if you’re sensitive to latency, which if you’re an eSports fan you are, then any mouse movement you make will not affect frames already in that buffer. So if I make a movement, it’ll go into the next frame. But that gets in line behind like a full frame already sitting there in the buffer. And so that can introduce, you know, depending on your frame rate, maybe 20 milliseconds of lag. Now, you can go to our control panel and set it yourself. And I think this is what they are doing, setting the maximum pre-rendered frames to one. And that won’t do any buffering, which may affect your performance a little bit, but it’ll take your latency away.
It should be noted that Walker wasn’t entirely sure AMD’s Radeon Anti-Lag technology was indeed based on this very concept. We’ll have to wait for Anti-Lag to become available to discover whether this is actually the case.
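The latency Walker describes is easy to quantify: every frame already queued in the pre-render buffer delays your input by one frame-time. Here is a minimal sketch of that arithmetic; the function name and numbers are illustrative, not from any NVIDIA API.

```python
def buffered_frame_lag_ms(frame_rate_fps: float, buffered_frames: int) -> float:
    """Each frame already sitting in the pre-render buffer adds one
    frame-time of latency before a new input can reach the screen."""
    frame_time_ms = 1000.0 / frame_rate_fps
    return frame_time_ms * buffered_frames

# At 60 fps, one extra buffered frame adds ~16.7 ms of lag;
# at 50 fps, a single buffered frame is the ~20 ms Walker mentions.
print(buffered_frame_lag_ms(60.0, 1))
print(buffered_frame_lag_ms(50.0, 1))
```

Setting maximum pre-rendered frames to one, as Walker suggests, keeps `buffered_frames` at its minimum, trading a little throughput for that latency.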
Summit, the World’s Fastest Supercomputer, Triples Its Performance Record
DOING THE MATH: THE REALITY OF HPC AND AI CONVERGENCE
June 17, 2019
There is a more direct approach to converging HPC and AI: retrofit some of the matrix math libraries commonly used in HPC simulations so they can take advantage of dot product engines such as the Tensor Core units at the heart of the “Volta” Tesla GPU accelerators, which often power so-called AI supercomputers such as the “Summit” system at Oak Ridge National Laboratory.
As it turns out, a team of researchers at the University of Tennessee, Oak Ridge, and the University of Manchester, led by Jack Dongarra, one of the creators of the Linpack and HPL benchmarks used to gauge the raw performance of supercomputers, has come up with a mixed-precision iterative refinement solver that can make use of the Tensor Core units inside the Volta and get raw HPC matrix math calculations, like those at the heart of Linpack, done quicker than if they used the 64-bit math units on the Volta.
The underlying math behind this iterative refinement approach, now applied to the Tensor Core units, is itself not new; in fact, it dates from the 1940s, according to Dongarra.
The good news is that a new and improved iterative refinement technique is working pretty well by pushing the bulk of the math to the 4×4, 16-bit floating point Tensor Core engines and doing a little 32-bit accumulate and a tiny bit of 64-bit math on top of that to produce an equivalent result to what was produced using only 64-bit math units on the Volta GPU accelerator – but in a much shorter time.
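The refinement idea itself fits in a few lines: solve the system cheaply in low precision, then repeatedly compute the residual in high precision and solve for a correction. The sketch below uses NumPy, with float32 standing in for the Tensor Cores' low-precision step and float64 for the accuracy target (NumPy has no FP16 solver); it is an illustration of the classic technique, not the HPL-AI code.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Iterative refinement: a cheap low-precision solve, corrected
    by high-precision residuals until FP64-level accuracy is reached."""
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    A32 = A.astype(np.float32)
    # Low-precision "fast" solve (stand-in for the Tensor Core step).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b64 - A64 @ x                                # FP64 residual
        d = np.linalg.solve(A32, r.astype(np.float32))   # cheap correction
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200.0 * np.eye(200)  # well conditioned
b = rng.standard_normal(200)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # residual near FP64 machine precision
```

Because each pass through the loop only needs a low-precision solve, the bulk of the flops run on the fast units, exactly the work split the HPL-AI timing breakdown shows.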
To put the iterative refinement solver to the test, techies at Nvidia worked with the team from Oak Ridge, the University of Tennessee, and the University of Manchester to port the HPL implementation of the Linpack benchmark, which is a 64-bit dense matrix calculation that is used by the Top500, to the new solver – creating what they are tentatively calling HPL-AI – and ran it both ways on the Summit supercomputer. The results were astoundingly good.
Running regular HPL on the full Summit worked out to 148.8 petaflops of aggregate compute, while running the HPL-AI variant on the iterative refinement solver in mixed precision worked out to an aggregate of 445 petaflops.
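Those two figures are where the "triples its record" headline comes from, as a quick check shows (numbers from the article):

```python
hpl_pflops = 148.8      # regular FP64 HPL on the full Summit
hpl_ai_pflops = 445.0   # mixed-precision HPL-AI on the same machine
speedup = hpl_ai_pflops / hpl_pflops
print(round(speedup, 2))  # roughly 3x the FP64 result
```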
And to be super-precise, about 92 percent of the calculation time in the HPL-AI run was spent in the general matrix multiply (GEMM) library running in FP16 mode, with a little more than 7 percent of wall time being in the accumulate unit of the Tensor Core in FP32 mode and a little less than 1 percent stressing the 64-bit math units on Volta.
Now, the trick is to apply this iterative refinement solver to real HPC applications, and Nvidia is going to be making it available in the CUDA-X software stack so this can be done. Hopefully more and more work can be moved to mixed precision and take full advantage of those Tensor Core units. It’s not quite like free performance – customers are definitely paying for those Tensor Cores on the Volta chips – but it will feel like it is free, and that means Nvidia is going to have an advantage in the HPC market unless and until both Intel and AMD add something like Tensor Core to their future GPU accelerators.
Monster Hunter: World gets its DLSS implementation on July 17.
July 13, 2019
NVIDIA GeForce RTX 2060 Super Review
July 16, 2019
July 11, 2019
Color me impressed! ... The guy is obviously talented, and I guess with the right workshop anything is possible.