CUDA

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 87.206.173.8 (talk) at 19:23, 29 July 2008 (→‎Limitations). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

CUDA (Compute Unified Device Architecture) is a C compiler and set of development tools that allow programmers to use the C programming language to code algorithms for execution on the graphics processing unit (GPU). CUDA is developed by NVIDIA; using it requires an Nvidia GPU and recent drivers, which now include the necessary CUDA components. CUDA works with all NVIDIA GPUs from the G8X series onwards, including the GeForce, Quadro and Tesla lines, the last of which is designed specifically for the compute market. NVIDIA states that programs developed for the GeForce 8 series will also work without modification on all future Nvidia video cards (G9X, GTX and later), due to the binary compatibility of the architecture.

CUDA gives developers access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs effectively become powerful, programmable open architectures like today's CPUs (central processing units). Unlike a CPU, however, a GPU has a parallel "many-core" architecture capable of running thousands of threads simultaneously; an application suited to this kind of architecture can see a substantial speed-up. By opening up the architecture, CUDA provides developers with both a low-level API, offering deterministic and repeatable access to the hardware, and a high-level API; such access is necessary to develop essential high-level programming tools such as compilers, debuggers, math libraries, and application platforms.

The initial CUDA SDK was made public on 15 February 2007.[1] The compiler in CUDA is based on Open64.
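As a minimal sketch of the programming model described above, the following is a hypothetical element-wise vector addition (the names, sizes and launch configuration are illustrative, not taken from the source). A `__global__` function, called a kernel, runs on the device, and the `<<<...>>>` launch syntax is one of the "simple extensions" CUDA adds to C:

```cuda
#include <cuda_runtime.h>

// Kernel: each thread computes one element of the result.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Allocate device memory; a real program would fill a and b with
    // cudaMemcpy from host buffers (omitted here for brevity).
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);
    // Launch 4 blocks of 256 threads each (4 * 256 = 1024 threads total).
    vecAdd<<<4, 256>>>(a, b, c, n);
    cudaThreadSynchronize();  // wait for the kernel to finish
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same C source contains both host code (`main`) and device code (the kernel); the CUDA compiler splits them and compiles each for the appropriate processor.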

Hardware

The 8-Series (G8X) GPUs from NVIDIA, found in the GeForce, Quadro and Tesla lines, were the first to support the CUDA SDK. G8X GPUs provide hardware support for 32-bit (single-precision) floating-point vector processing, exposed through the CUDA SDK as the API. (CUDA accepts the C "double" data type, but on G8X-series GPUs doubles are demoted to 32-bit floats.) NVIDIA recently announced its new GT200 architecture, which supports 64-bit (double-precision) floating point. Due to the highly parallel nature of vector processors, GPU-assisted hardware stream processing can have a huge impact in specific data-processing applications. The computer gaming industry anticipates that graphics cards may be used for future game physics calculations (physical effects such as debris, smoke, fire and fluids). CUDA has also been used to accelerate non-graphical applications in computational biology and other fields by an order of magnitude or more.[2][3]

Advantages

CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.

  • It uses the standard C language, with some simple extensions.
  • Scattered writes – code can write to arbitrary addresses in memory.
  • Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.[4]
  • Faster downloads and readbacks to and from the GPU.
  • Full support for integer and bitwise operations.
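The shared-memory advantage above can be sketched as follows. This is an illustrative (not source-provided) kernel in which each block stages data into the 16 KB on-chip region and then performs a tree reduction entirely in that fast memory, touching global memory only once per thread on the way in and once per block on the way out:

```cuda
// Sum 256 elements per block using shared memory as a user-managed cache.
__global__ void sumTile(const float *in, float *out)
{
    __shared__ float tile[256];              // lives in the 16 KB region
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];               // one global read per thread
    __syncthreads();                         // wait until the tile is full

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];           // one global write per block
}
```

Without shared memory, each reduction step would re-read its operands from comparatively slow global memory or through texture lookups.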

Limitations

  • Texture rendering is not supported.
  • Recursive functions are not supported and must be converted to loops.
  • Various deviations from the IEEE 754 standard. Denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest even), and those are specified on a per-instruction basis rather than in a control word; and the precision of division/square root is slightly lower than single precision.
  • The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
  • Threads should be run in groups of at least 32 for best performance. Branches in the program code do not impact performance significantly, provided that all 32 threads in a group take the same execution path; the SIMD execution model becomes a significant limitation for inherently divergent tasks (e.g., traversing a ray-tracing acceleration data structure).
  • CUDA-enabled GPUs are only available from Nvidia (GeForce 8 series and above, Quadro and Tesla[1])
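The divergence limitation above can be made concrete with a small illustrative kernel (not from the source). Threads execute in hardware groups of 32, called warps; if a branch splits a warp, the hardware runs both paths serially with the inactive threads masked off, so the warp pays for both sides:

```cuda
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: this condition splits every warp (even/odd lanes),
    // so both branches are executed serially by each warp.
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;

    // Non-divergent: the condition is uniform across each warp of 32,
    // so every warp takes exactly one path and pays no penalty.
    if ((i / 32) % 2 == 0)
        data[i] -= 1.0f;
}
```

Restructuring conditions so they are uniform within a warp, as in the second branch, is a standard way to work within the SIMD execution model.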

References

  1. ^ CUDA for GPU Computing
  2. ^ Schatz, M.C.; Trapnell, C.; Delcher, A.L.; Varshney, A. (2007). "High-throughput sequence alignment using Graphics Processing Units". BMC Bioinformatics 8: 474. doi:10.1186/1471-2105-8-474.
  3. ^ Manavski, Svetlin A. (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment". BMC Bioinformatics 9 (Suppl 2): S10. doi:10.1186/1471-2105-9-S2-S10.
  4. ^ Silberstein, Mark (2007). "Efficient computation of Sum-products on GPUs" (PDF).
