NVIDIA Dramatically Simplifies Parallel Programming With CUDA 6
#1 | Spets (Maha Guru) | 11-16-2013, 03:36
Videocard: GTX780Ti + GTX750Ti + G-Sync
Processor: Intel Core i7 2600K @ 4.5 GHz
Mainboard: GA-Z68X-UD7-B3
Memory: G.Skill Ripjaws 16 GB 2133
PSU: EVGA SuperNOVA

Quote:
NVIDIA Dramatically Simplifies Parallel Programming With CUDA 6




Unified Memory, Drop-In Libraries Among New Programmability Features to Empower Next Wave of GPU Developers


SANTA CLARA, CA - NVIDIA today announced NVIDIA® CUDA® 6, the latest version of the world's most pervasive parallel computing platform and programming model.

The CUDA 6 platform makes parallel programming easier than ever, enabling software developers to dramatically decrease the time and effort required to accelerate their scientific, engineering, enterprise and other applications with GPUs.

It offers new performance enhancements that enable developers to instantly accelerate applications up to 8X by simply replacing existing CPU-based libraries. Key features of CUDA 6 include:

•Unified Memory -- Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.
•Drop-in Libraries -- Automatically accelerates applications' BLAS and FFTW calculations by up to 8X by simply replacing the existing CPU libraries with the GPU-accelerated equivalents.
•Multi-GPU Scaling -- Re-designed BLAS and FFT GPU libraries automatically scale performance across up to eight GPUs in a single node, delivering over nine teraflops of double-precision performance per node and supporting larger workloads than ever before (up to 512 GB). Multi-GPU scaling can also be used with the new BLAS drop-in library; a minimal sketch of this path follows below.
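
As a concrete illustration of the multi-GPU library scaling, here is a minimal sketch using the cuBLAS-XT interface introduced with CUDA 6. The matrix size and the assumption of two GPUs are illustrative, and error checking is omitted for brevity:

[code]
#include <cublasXt.h>
#include <cstdlib>

int main() {
    const size_t n = 4096;   // illustrative matrix dimension
    float *A = (float *)malloc(n * n * sizeof(float));
    float *B = (float *)malloc(n * n * sizeof(float));
    float *C = (float *)malloc(n * n * sizeof(float));
    for (size_t i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; C[i] = 0.0f; }

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Spread the GEMM across two GPUs; the library tiles the matrices
    // and schedules the transfers automatically.
    int devices[2] = {0, 1};   // assumes two GPUs are present
    cublasXtDeviceSelect(handle, 2, devices);

    const float alpha = 1.0f, beta = 0.0f;
    // Plain host pointers go in; cuBLAS-XT handles all the staging.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
[/code]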


"By automatically handling data management, Unified Memory enables us to quickly prototype kernels running on the GPU and reduces code complexity, cutting development time by up to 50 percent," said Rob Hoekstra, manager of Scalable Algorithms Department at Sandia National Laboratories. "Having this capability will be very useful as we determine future programming model choices and port more sophisticated, larger codes to GPUs."

"Our technologies have helped major studios, game developers and animators create visually stunning 3D animations and effects," said Paul Doyle, CEO at Fabric Engine, Inc. "They have been urging us to add support for acceleration on NVIDIA GPUs, but memory management proved too difficult a challenge when dealing with the complex use cases in production. With Unified Memory, this is handled automatically, allowing the Fabric compiler to target NVIDIA GPUs and enabling our customers to run their applications up to 10X faster."

In addition to the new features, the CUDA 6 platform offers a full suite of programming tools, GPU-accelerated math libraries, documentation and programming guides.

Version 6 of the CUDA Toolkit is expected to be available in early 2014. Members of the CUDA-GPU Computing Registered Developer Program will be notified when it is available for download. To join the program, register here.

For more information about the CUDA 6 platform, visit NVIDIA booth 613 at SC13, Nov. 18-21 in Denver, and the NVIDIA CUDA website.
http://nvidianews.nvidia.com/Release...UDA-6-a62.aspx
   
 
#2 | Spets (Maha Guru) | 11-19-2013, 05:08

Quote:
Unified Memory in CUDA 6

With CUDA 6, we’re introducing one of the most dramatic programming model improvements in the history of the CUDA platform: Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that was exactly how the programmer had to view things: data shared between the CPU and GPU had to be allocated in both memories and explicitly copied between them by the program. This adds a lot of complexity to CUDA programs.



Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

In this post I’ll show you how Unified Memory dramatically simplifies memory management in GPU-accelerated applications. The example below is really simple. Both versions load a file from disk, sort the bytes in it, and then use the sorted data on the CPU, before freeing the memory. The second version runs on the GPU using CUDA and Unified Memory. The only differences are that the GPU version launches a kernel (and synchronizes after launching it), and allocates space for the loaded file in Unified Memory using the new API cudaMallocManaged().
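
The blog shows the two versions as a side-by-side image, which is missing here; below is a reconstruction in the spirit of that example. The single-thread sort kernel and the function names are illustrative assumptions, not the blog's exact code:

[code]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static int compare_bytes(const void *a, const void *b) {
    return *(const char *)a - *(const char *)b;
}

// Illustrative stand-in for a GPU sort: a single-thread insertion
// sort, chosen for brevity rather than speed.
__global__ void sort_bytes(char *data, int n) {
    for (int i = 1; i < n; ++i) {
        char key = data[i];
        int j = i - 1;
        while (j >= 0 && data[j] > key) { data[j + 1] = data[j]; --j; }
        data[j + 1] = key;
    }
}

// CPU-only version: allocate, load, sort, use, free.
void sortfile_cpu(FILE *fp, int n) {
    char *data = (char *)malloc(n);
    fread(data, 1, n, fp);
    qsort(data, n, 1, compare_bytes);
    // ... use the sorted bytes on the CPU ...
    free(data);
}

// CUDA 6 version: one managed allocation, one pointer, no copies.
void sortfile_gpu(FILE *fp, int n) {
    char *data;
    cudaMallocManaged(&data, n);    // visible to both CPU and GPU
    fread(data, 1, n, fp);          // CPU writes into managed memory
    sort_bytes<<<1, 1>>>(data, n);  // GPU kernel sorts in place
    cudaDeviceSynchronize();        // wait before the CPU touches data
    // ... use the sorted bytes on the CPU ...
    cudaFree(data);
}
[/code]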



If you have programmed CUDA C/C++ before, you will no doubt be struck by the simplicity of the GPU version. Notice that we allocate memory once, and we have a single pointer to the data that is accessible from both the host and the device. We can read directly into the allocation from a file, and then we can pass the pointer directly to a CUDA kernel that runs on the device. Then, after waiting for the kernel to finish, we can access the data again from the CPU. The CUDA runtime hides all the complexity, automatically migrating data to the place where it is accessed.

What Unified Memory Delivers

There are two main ways that programmers benefit from Unified Memory.

Simpler Programming and Memory Model

Unified Memory lowers the barrier to entry to parallel programming on the CUDA platform by making device memory management an optimization rather than a requirement. With Unified Memory, programmers can now get straight to developing parallel CUDA kernels without getting bogged down in the details of allocating and copying device memory. This makes both learning to program for the CUDA platform and porting existing code to the GPU simpler. But it’s not just for beginners: my examples later in this post show how Unified Memory also makes complex data structures much easier to use with device code, and how powerful it is when combined with C++.
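
For contrast, here is roughly what the same round trip looks like with explicit device memory management, the part Unified Memory turns into an optional optimization. The kernel and sizes are illustrative, and error checking is omitted:

[code]
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

void run_explicit(float *h_data, int n) {
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_data, bytes);                                // separate device allocation
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // copy in
    process<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // copy back
    cudaFree(d_data);
}
[/code]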

Performance Through Data Locality

By migrating data on demand between the CPU and GPU, Unified Memory can offer the performance of local data on the GPU, while providing the ease of use of globally shared data. The complexity of this functionality is kept under the covers of the CUDA driver and runtime, ensuring that application code is simpler to write. The point of migration is to achieve full bandwidth from each processor: the 250 GB/s of GDDR5 memory bandwidth is vital to feeding the compute throughput of a Kepler GPU.

An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory. Understandably so: the CUDA runtime never has as much information as the programmer does about where data is needed and when! CUDA programmers still have access to explicit device memory allocation and asynchronous memory copies to optimize data management and CPU-GPU concurrency. Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.
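
As a taste of that power-user path, here is a minimal sketch of overlapping transfers with execution using streams and cudaMemcpyAsync. The four-way chunking is an illustrative assumption, and h_data must be pinned (e.g. allocated with cudaMallocHost) for the copies to be truly asynchronous:

[code]
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

void run_overlapped(float *h_data, int n) {
    const int nchunks = 4;   // illustrative; assume n divides evenly
    int chunk = n / nchunks;
    size_t cbytes = chunk * sizeof(float);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t streams[nchunks];
    for (int s = 0; s < nchunks; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nchunks; ++s) {
        int off = s * chunk;
        // Work issued to different streams can overlap: while one
        // chunk is being copied, another chunk's kernel can run.
        cudaMemcpyAsync(d_data + off, h_data + off, cbytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, cbytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to finish

    for (int s = 0; s < nchunks; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
}
[/code]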

Unified Memory or Unified Virtual Addressing?

CUDA has supported Unified Virtual Addressing (UVA) since CUDA 4, and while Unified Memory depends on UVA, they are not the same thing. UVA provides a single virtual memory address space for all memory in the system, and enables pointers to be accessed from GPU code no matter where in the system they reside, whether it’s device memory (on the same or a different GPU), host memory, or on-chip shared memory. It also allows cudaMemcpy to be used without specifying where exactly the input and output parameters reside. UVA enables “Zero-Copy” memory, which is pinned host memory accessible by device code directly, over PCI-Express, without a memcpy. Zero-Copy provides some of the convenience of Unified Memory but none of the performance, because the data is always accessed over PCI-Express, with its low bandwidth and high latency.
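
For comparison with Unified Memory, here is a hedged sketch of that Zero-Copy path: pinned, mapped host memory that the kernel reads and writes directly over PCI-Express. Error checking is omitted, cudaSetDeviceFlags is needed on some platforms, and under UVA the host pointer can be passed to the kernel as-is:

[code]
#include <cuda_runtime.h>

__global__ void touch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // every access crosses PCI-Express
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapped host memory

    float *h_data;
    // Pinned host memory, mapped into the device address space.
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 0.0f;

    // With UVA, the same pointer is valid on the device; no cudaMemcpy.
    touch<<<(n + 255) / 256, 256>>>(h_data, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}
[/code]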

UVA does not automatically migrate data from one physical location to another the way Unified Memory does. Because Unified Memory migrates data between host and device memory at the level of individual pages, it required significant engineering to build: it needs new functionality in the CUDA runtime, the device driver, and even the OS kernel. The following examples aim to give you a taste of what this enables.
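
The examples themselves are truncated in the quote (the full post is at the link below), but as a taste, here is a hedged sketch of what page-level migration enables: a data structure with nested pointers that CPU and GPU code traverse through the very same pointers. The DataFrame struct is an illustrative assumption:

[code]
#include <cstdio>
#include <cuda_runtime.h>

struct DataFrame {
    int    n;
    float *samples;   // nested pointer, also placed in managed memory
};

__global__ void scale(DataFrame *df, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < df->n) df->samples[i] *= k;   // GPU follows the same pointers
}

int main() {
    DataFrame *df;
    cudaMallocManaged(&df, sizeof(DataFrame));
    df->n = 1024;
    cudaMallocManaged(&df->samples, df->n * sizeof(float));

    for (int i = 0; i < df->n; ++i) df->samples[i] = 1.0f;   // CPU writes

    scale<<<(df->n + 255) / 256, 256>>>(df, 2.0f);
    cudaDeviceSynchronize();

    printf("%f\n", df->samples[0]);   // CPU reads the result: 2.000000
    cudaFree(df->samples);
    cudaFree(df);
    return 0;
}
[/code]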
More at http://devblogs.nvidia.com/parallelf...ory-in-cuda-6/
   