#3.8 CUDA编程者使用OpenCL的注意事项

英伟达的CUDA C提供的API与OpenCL类似。代码清单3.6用使用CUDA实现了向量相加，OpenCL和CUDA中的很多命令都可以相互对应。OpenCL API中有更多的参数，这是因为OpenCL需要在运行时去查找平台，并对程序进行编译。CUDA C只针对英伟达的GPU，其只有一个平台可以使用，所以平台的查找自动完成；将程序编译成PTX的过程可以在编译主机端二进制文件的时候进行(CUDA没有运行时编译)。

OpenCL中平台在运行时进行查找，程序需要选择一个目标设备，并在运行时进行编译。程序的编译不能在运行时之外完成，因为不知道哪个具体设备要去执行内核，无法在这种情况下生成中间码(IL/ISA)。比如，一个OpenCL内核在AMD GPU上测试有没有问题，但当其需要运行在Intel的设备上时，编译器生成的中间码就和AMD平台上的不太一样。从而，对与平台的查找，以及在运行时进行编译程序就能避免这样的问题。

OpenCL和CUDA C最大的区别是，CUDA C提供一些特殊的操作完成内核启动，其编译需要使用一套工具链，这套工具链中需要包含英伟达支持的预处理器。预处理器生成的代码，和OpenCL的内核代码十分相近。

{%ace edit=false, lang='c_cpp'%} #include <stdio.h> #include <stdlib.h> #include <math.h>

// CUDA kernel. Eache thread computes one element of C global void vecAdd(int *A, int *B, int *C, int elements){ // Compute the global thread ID using CUDA intrinsics int id = blocakIdx.x * blocakDim.x + threadIdx.x;

// Must check that the thread is not out of bounds if (id < elements) C[id] = A[id] + B[id]; }

int main(int argc, char *argv[]){ // Elements in each array const int elements = 2048;

// Compute the size of the data size_t datasize = sizeof(int) * elements;

// Allocate space for input/output host data int *A = (int *)malloc(datasize); // Input array int *B = (int *)malloc(datasize); // Input array int *C = (int *)malloc(datasize); // Output array

// Device input vectors int *bufA; int *bufB; // Device output vectors int *bufC;

// Allocate memeory for each vector on GPU cudaMalloc(&bufA, datasize); cudaMalloc(&bufB, datasize); cudaMalloc(&bufC, datasize);

int i; // Initialize vectors on host for (i = 0; i < elements; i++){ A[i] = i; B[i] = i; }

// Copy host vectors to device cudaMemcpy(bufA, A, datasize, cudaMemcpyHostToDevice); cudaMemcpy(bufB, B, datasize, cudaMemcpyHostToDevice);

int blockSize, gridSize;

// Number of threads in each thread block blockSize = 256;

// Number of thread blocks in grid gridSize = elements / blockSize;

// Execute the kernel vecAdd<<<gridSize, blockSize>>>(bufA, bufB, bufC, elements);

// Copy array back to host cudaMemcpy(C, bufC, datasize, cudaMemcpyDeviceToHost);

// Release device memeory cudaFree(bufA); cudaFree(bufB); cudaFree(bufC);

// Release host memeoy free(A); free(B); free(C); } {%endace%}

代码清单3.6 CUDA C版本的向量相加

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3.8-chinese.md

3.8-chinese.md

Files

3.8-chinese.md

Latest commit

History

3.8-chinese.md

File metadata and controls