OpenCL: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 20:42, 11 October 2008

OpenCL (Open Computing Language) is a programming language for heterogeneous data-parallel and task-parallel computing across GPUs and CPUs. It was created by Apple in cooperation with others and is based on C99.

Its purpose is to extend the power of the GPU beyond graphics (GPGPU), in the manner of OpenGL and OpenAL, which are open industry standards for 3D graphics and computer audio respectively.

Apple has proposed OpenCL to the Khronos Group, where the Compute Working Group was formed on June 16, 2008[2] to carry out the standardization work.

OpenCL is scheduled to be introduced in Mac OS X 10.6 ('Snow Leopard').[3] According to the press release:[3]

Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

The initial OpenCL implementation is reportedly built on LLVM and Clang compiler technology.[citation needed]

AMD has decided to back OpenCL (and DirectX 11) instead of its now-deprecated Close to Metal framework.[4][5]

Example

This example computes a fast Fourier transform (FFT):[6]

// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL);

// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
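
The size expressions in the host code above can be checked in isolation. The following is a minimal sketch in plain C, not part of the cited presentation: the helper names are hypothetical, and the divisibility check in `num_work_groups` is an assumption about how work-groups are formed rather than something the example states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers mirroring the size arithmetic of the host code. */

/* Bytes for num_entries complex values stored as interleaved
   (real, imaginary) floats, as in sizeof(float)*2*num_entries. */
size_t complex_buffer_bytes(size_t num_entries) {
    return sizeof(float) * 2 * num_entries;
}

/* Bytes requested for each __local scratch argument, as in
   sizeof(float)*(local_work_size[0]+1)*16. One way to read it:
   16 rows of (local_size + 1) floats, one padding float per row. */
size_t local_scratch_bytes(size_t local_size) {
    return sizeof(float) * (local_size + 1) * 16;
}

/* Number of work-groups for a 1-D range. Assumption: the global size
   is a whole multiple of the local size. */
size_t num_work_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

With 4-byte floats and a local size of 64, each scratch argument comes to 65 × 16 × 4 = 4160 bytes, and a 1024-entry complex buffer to 8192 bytes.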

The actual calculation:

// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into 
// calls to a radix 16 function, another radix 16 function and then a radix 4 function 
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out, 
                          __local float *sMemx, __local float *sMemy) { 
  int tid = get_local_id(0); 
  int blockIdx = get_group_id(0) * 1024 + tid; 
  float2 data[16]; 
  // starting index of data to/from global memory 
  in = in + blockIdx;  out = out + blockIdx; 
  globalLoads(data, in, 64); // coalesced global reads 
  fftRadix16Pass(data);      // in-place radix-16 pass 
  twiddleFactorMul(data, tid, 1024, 0); 
  // local shuffle using local memory 
  localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4))); 
  fftRadix16Pass(data);               // in-place radix-16 pass 
  twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication 
  localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15))); 
  // four radix-4 function calls 
  fftRadix4Pass(data); fftRadix4Pass(data + 4); 
  fftRadix4Pass(data + 8); fftRadix4Pass(data + 12); 
  // coalesced global writes 
  globalStores(data, out, 64); 
}
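
The indexing at the top of the kernel can be emulated on the host to see how one work-group covers its block of 1024 entries. This is only a sketch under a stated assumption: the helper `globalLoads` is not shown in the source, so the code assumes that `globalLoads(data, in, 64)` reads `data[k] = in[k * 64]` for k = 0..15. The function names and constants below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FFT_LEN    1024u  /* entries handled by one work-group */
#define LOCAL_SIZE 64u    /* work-items per work-group */
#define PER_ITEM   16u    /* entries held by each work-item (float2 data[16]) */

/* Global index of the k-th entry read by work-item tid of a work-group.
   Assumption: globalLoads reads with a stride of 64 entries, starting
   at get_group_id(0) * 1024 + tid as computed in the kernel. */
size_t load_index(size_t group, size_t tid, size_t k) {
    assert(tid < LOCAL_SIZE && k < PER_ITEM);
    return group * FFT_LEN + tid + k * LOCAL_SIZE;
}

/* Returns 1 if the 64 work-items of one work-group read each of the
   group's 1024 entries exactly once, 0 otherwise. */
int group_covers_block_once(size_t group) {
    unsigned char seen[FFT_LEN] = {0};
    for (size_t tid = 0; tid < LOCAL_SIZE; ++tid) {
        for (size_t k = 0; k < PER_ITEM; ++k) {
            size_t offset = load_index(group, tid, k) - group * FFT_LEN;
            if (offset >= FFT_LEN || seen[offset]) {
                return 0;
            }
            seen[offset] = 1;
        }
    }
    return 1;
}
```

Under this reading, neighbouring work-items read neighbouring entries on every load, which matches the "coalesced global reads" comment in the kernel.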

See also

References

  1. "Botan changelog". 2002-08-10. Retrieved 2008-06-09.
  2. "Khronos Launches Heterogeneous Computing Initiative". Khronos Group. 2008-06-16. Retrieved 2008-06-18.
  3. "Apple Previews Mac OS X Snow Leopard to Developers". Apple. 2008-06-09. Retrieved 2008-06-09.
  4. "AMD Drives Adoption of Industry Standards in GPGPU Software Development". AMD. 2008-08-06. Retrieved 2008-08-14.
  5. "AMD Backs OpenCL, Microsoft DirectX 11". eWeek. 2008-08-06. Retrieved 2008-08-14.
  6. "OpenCL" (PDF). SIGGRAPH2008. 2008-08-14. Retrieved 2008-08-14.

External Links