General Purpose Computation on Graphics Processing Unit

From Wikipedia, the free encyclopedia

General-purpose computation on graphics processing units (GPGPU for short) refers to the use of a graphics processor for calculations beyond its original scope, for example calculations for technical or economic simulations. With parallel algorithms, an enormous increase in speed can be achieved compared to the main processor.

Overview

GPGPU emerged from the shaders of graphics processors. Its strength lies in the simultaneous execution of uniform tasks, such as coloring pixels or multiplying large matrices. Since speed increases in modern processors can currently no longer be achieved (primarily) by raising the clock rate, parallelization is an important factor in reaching higher computing power in modern computers. The advantage of using the GPU over the CPU lies in its higher computing power and higher memory bandwidth. The speed is achieved mainly through the high degree of parallelism of the graphics processor's arithmetic operations.

Theoretical computing power and memory bus data rate of selected processors:

Model                           | Single precision (GFLOPS) | Double precision (GFLOPS) | Memory bus data rate (GB/s) | Memory type | Type
AMD Radeon Pro Duo              | 16,384 | 1,024 | 1,024     | HBM   | GPU
AMD Radeon R9 Fury X            | 8,602  | 538   | 512       | HBM   | GPU
Nvidia GeForce GTX Titan X      | 6,144  | 192   | 336       | GDDR5 | GPU
AMD FirePro W9100               | 5,350  | 2,675 | 320       | GDDR5 | GPU
Nvidia Tesla K20X               | 3,950  | 1,310 | 250       | GDDR5 | GPU
AMD Radeon HD 7970              | 3,789  | 947   | 264       | GDDR5 | GPU
Intel Xeon Phi 7120             | 2,420  | 1,210 | 352       | GDDR5 | Co-processor
PlayStation 4 SoC (AMD)         | 1,860  | -     | 167       | GDDR5 | APU
Nvidia GeForce GTX 580          | 1,581  | 198   | 192.4     | GDDR5 | GPU
Intel Xeon E7-8890 v3           | 1,440  | 720   | 102.4 (?) | DDR4  | CPU
AMD A10-7850K                   | 856    | -     | 34        | DDR3  | APU
Intel Core i7-3930K             | 307.2  | 153.6 | 51.2      | DDR3  | CPU
Intel Pentium 4 (SSE3, 3.6 GHz) | 14.4   | 7.2   | 6.4       | DDR2  | CPU

Fragment and vertex shaders can run at the same time. Another advantage is the low price compared to other similarly fast solutions and the fact that suitable graphics cards can be found in almost every PC today.

History

In the beginning, shaders only performed special functions closely tied to graphical calculations. To speed up the computation of individual pixels, the calculation of many pixels was carried out simultaneously by using multiple identical compute units. Later, the idea arose of expanding the very limited capabilities of the shaders to turn them into massively parallel processing units for arbitrary tasks: the first, more or less, freely programmable shaders emerged. The trend of designing shaders to be freely programmable continues to this day and is pushed forward by chip designers with each new generation of technology. Modern GPUs sometimes have over 1000 of these programmable shader units and can therefore carry out over 1000 computing operations simultaneously.

Criticism

With OpenCL, a uniform interface exists for implementing GPGPU calculations. The disadvantage compared to conventional CPUs is the massive parallelism with which the programs must be executed in order to exploit these advantages. GPUs are also limited in their functionality. There are special graphics models (Nvidia Tesla, AMD FireStream) for the scientific sector. The memory of these graphics cards has error correction, and their accuracy in floating-point calculations is greater, which is also reflected in the cost.

Programming

OpenCL, CUDA and, since 2012, C++ AMP are the main options available for developing GPGPU-capable programs. OpenCL is an open standard available on many platforms, whereas CUDA is a proprietary framework from Nvidia that can only run on GPUs from this manufacturer. AMP is a C++ language extension initiated by Microsoft, in conjunction with a small template library, that is open in the sense that it is restricted neither to Microsoft products nor to particular types of accelerator hardware or particular hardware manufacturers (it thus targets not only GPGPUs, but also CPUs and, in the future, other parallelization options such as cloud computing). In Microsoft's AMP implementation, the GPU is expected to support DirectX version 11, because only with this version was the use of GPUs as GPGPUs given particular consideration. If a program using AMP does not find a sufficiently up-to-date GPU, the algorithm programmed with AMP is automatically executed on the CPU using its parallelization options (multithreading on several processor cores, SIMD instructions). AMP is thus meant to create a complete abstraction layer between an algorithm and the hardware of the executing computer. In addition, the restriction to a few new C++ language constructs and a few new library classes is intended to reduce the previous hurdles and effort in developing parallel algorithms. DirectX 11 is natively hardware-supported by all common GPU series newer than the DirectX 11 introduction (including entry-level GPUs such as Intel's chipset-integrated GPUs), but DirectX 11 was only introduced with Windows 7 and retrofitted for Windows Vista, so older Windows operating systems cannot be used with AMP. Whether C++ AMP will ever be adopted by other platforms or C++ development environments outside the Windows world is currently still completely open.
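As a minimal sketch (function and buffer names are illustrative assumptions, not from the article), a vector addition in C++ AMP can look like this. parallel_for_each dispatches the lambda, marked with restrict(amp), to a DirectX 11 accelerator if one is available; otherwise the runtime falls back to the CPU as described above:

    #include <amp.h>
    #include <vector>

    // Element-wise vector addition with C++ AMP (illustrative example).
    void vector_add(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c)
    {
        using namespace concurrency;

        array_view<const float, 1> av(static_cast<int>(a.size()), a);
        array_view<const float, 1> bv(static_cast<int>(b.size()), b);
        array_view<float, 1>       cv(static_cast<int>(c.size()), c);
        cv.discard_data(); // old contents of c need not be copied to the GPU

        // restrict(amp) limits the lambda to constructs the accelerator supports.
        parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
            cv[i] = av[i] + bv[i];
        });

        cv.synchronize();  // copy the result back into host memory
    }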

A more recent approach is OpenACC, which, like OpenMP, is controlled via compiler pragmas. Ordinary source code, e.g. in C++, is automatically parallelized by placing certain compiler pragmas such as "#pragma acc parallel" in front of serially formulated for loops. The porting effort is relatively small. However, automatic parallelization does not always lead to optimal solutions, so OpenACC can never completely replace explicit parallel programming as in OpenCL. Nevertheless, in many cases it is worthwhile to be able to achieve high acceleration factors on GPGPUs in this simple way. OpenACC is supported by commercial compilers such as PGI and free compilers such as the GNU Compiler Collection.
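To illustrate the pragma-based approach, here is a minimal sketch of a SAXPY loop with OpenACC (function and parameter names are assumptions). The loop stays serially formulated; the pragma requests parallelization, e.g. when compiled with gcc -fopenacc:

    // y = a*x + y, parallelized by the OpenACC compiler (illustrative).
    void saxpy(int n, float a, const float *x, float *y)
    {
        // The data clauses describe the copies to and from GPU memory.
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }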

In order to run programs on a GPU, a host program is needed that controls the flow of information. Usually, the GPGPU code, formulated in a C-like language, is compiled at runtime at the instruction of the host program and sent to the graphics processor for further processing; the graphics processor then returns the calculated data to the host program.
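A minimal OpenCL sketch of this host/device split (kernel and buffer names are assumptions; error handling is omitted for brevity): the kernel source, written in the C-like OpenCL C, is compiled at runtime with clBuildProgram, executed on the GPU, and the result is read back into host memory:

    #include <CL/cl.h>
    #include <stdio.h>

    // The C-like GPGPU code, kept as a string and compiled at runtime.
    static const char *src =
        "__kernel void square(__global const float *in, __global float *out) {\n"
        "    size_t i = get_global_id(0);\n"
        "    out[i] = in[i] * in[i];\n"
        "}\n";

    int main(void)
    {
        float in[4] = {1, 2, 3, 4}, out[4];
        size_t n = 4;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        // Compile the kernel source at the instruction of the host program.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "square", NULL);

        // Send the input data to the graphics processor.
        cl_mem bin  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(in), in, NULL);
        cl_mem bout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(bin), &bin);
        clSetKernelArg(kernel, 1, sizeof(bout), &bout);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        // The graphics processor returns the calculated data to the host program.
        clEnqueueReadBuffer(queue, bout, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);
        for (int i = 0; i < 4; ++i)
            printf("%f\n", out[i]);
        return 0;
    }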

Literature

  • Matt Pharr: GPU Gems 2. Addison-Wesley, 2005, ISBN 0-321-33559-7, Part IV - General-Purpose Computation on GPUs: A Primer.
  • David B. Kirk: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2.
