IMPORTANT NOTE: this package is not actively developed, please use CUDAdrv instead!
This package wraps the CUDA runtime API. For a wrapper of the driver API, see CUDAdrv.
CUDAdrv.jl is the preferred way to program a GPU from Julia; Only use CUDArt.jl if you really require the runtime API.
This has been tested on Linux, OSX, and Windows. With Windows, at least Visual Studio 2010/2012/2013/2015 are supported.
First, you need to have an NVIDIA GPU device in your computer (one that is available for computation, i.e., most likely not your graphics card), and the CUDA library installed. You have to perform these steps manually. Choose either 32-bit or 64-bit versions to match your Julia installation.
Install the Julia package using:
Pkg.add("CUDArt")
During installation, it should compile a couple of files in the deps/
directory. These
files provide utility functions necessary for certain functionality in this package. If the
build step fails, try fixing the problems and running Pkg.build("CUDArt")
manually.
After installation, it's probably a good idea to run the test/runtests.jl
script to find
out whether everything is working on your system, or just say Pkg.test("CUDArt")
.
In case of errors, one thing to check is your CUDA installation itself. For example, examine
whether the *.ptx
files are present in deps/
and test/
; look at those files and make
sure they seem appropriate. (E.g., if your computer is 64-bit, are they compiled for
64-bit?)
Start by saying using CUDArt
, or import CUDArt
if you prefer to qualify everything with
the module name. For most use cases, you'll also need to install and import the CUDAdrv
package, which among other things provides functionality to launch kernels.
One or more GPUs can be initialized, used for computations, and freed for other uses. There are some complexities in this process due to the interaction with Julia's garbage collection---a CUDA array object allocated in one "session" should not be usable if you close the device and then open a new "session." Fortunately, CUDArt should make the process transparent, and as a user you shouldn't have to think about this at all.
The easiest way to ensure that you get full functionality, with proper cleanup of resources,
is by using the do
block syntax:
result = devices(dev->true) do devlist
# Code that does GPU computations
end
The argument to devices
is a function that accepts an integer input (the integer
representing the CUDA device, starting with 1) and returns true
or false
, indicating
whether the device should or should not be used, respectively. dev->true
means that very
device will be used. The devlist
variable will be defined inside the block, and is a
Vector{Int}
of the available devices.
If you need to make sure that only devices with sufficient capabilities are used, then use a construct like this:
result = devices(dev->capability(dev)[1]>=2) do devlist
# Code that does GPU computations
end
This will select all devices that have a major capability of 2 or higher. You can query any
of the properties of your device; see the device_properties
and attribute
functions and
the list of
fields.
If you want to restrict your computations to just one device (perhaps leaving other devices
for other users), use the nmax
keyword:
result = devices(func, nmax=1) do devlist
# Code that does GPU computations
end
Finally, you can request only those devices that are not busy with other tasks using:
result = devices(func, status=:free) do devlist
# Code that does GPU computations
end
You can wait for specific devices to become available with wait_free(devlist)
.
The do
block syntax initializes the devices and loads some utility functions (defined in
deps/utils.cu
) onto each GPU; it also ensures proper freeing of memory and unloading of
code when the do
block finishes. Should you want to initialize the utilities manually, you
can do so by calling CUDArt.init(devlist)
and CUDArt.close(devlist)
where devlist
is
an integer device number or a list of them, e.g. 0
or [0,1]
. This can be handy in case
of trouble, because unfortunately the do
syntax does not usually result in ideal
backtraces.
If your work doesn't require any of the utility functions, you can manually manage the device:
device(dev)
# Code that does GPU computations
device_reset(dev)
where dev
is the integer device number.
At any point in your code, the command device(dev)
makes dev
the active device. For
example, commands that allocate device memory will be executed on whichever device is
currently active.
Calling dev = device()
will return the currently-active device
CUDArt supports two main types of device arrays: CudaArray
s and CudaPitchedArray
s.
These correspond to contiguous memory blocks and "pitched pointers", respectively.
To declare an uninitialized array on the device, use:
d_A = CudaArray(Float64, (200,300))
d_B = CudaPitchedArray(Int32, (15, 40, 27))
The d_
is a conventional way of reminding yourself that the array is allocated on the device.
To copy a host array to the device, use any of
d_A = CudaArray(A)
d_AP = CudaPitchedArray(A)
copy!(d_A, A)
copy!(d_AP, A)
To copy a device array back to the host, use either of
A = to_host(d_A)
copy!(A, d_A)
Most of the typical Julia functions, like size
, ndims
, reinterpret
, eltype
, fill!
,
etc., work on CUDA array types. One noteworthy omission is that you can't directly index a
CUDA array: d_A[2,4]
will fail. This is not supported because host/device memory transfers
are relatively slow, and you don't want to write code that (on the host side) makes use of
individual elements in a device array. If you want to inspect the values in a device array,
first use to_host
to copy it to host memory.
You can find out which device is storing an array using:
dev = device(d_A)
Another important array type is the HostArray
, which is allocated by the CUDA library
using pinned memory:
h_A = HostArray(Float32, (1000,1200))
There are circumstances where using a HostArray
may improve the speed of memory transfers,
or allow asynchronous operations using Stream
s.
Warning: using a HostArray
in conjunction with a large memory-mapped file has been
observed to cause segfaults; at the present time there is no known workaround.
This will not teach you about CUDA programming; for that, please refer to the CUDA
documentation and other online sources. You can find an example file in deps/utils.cu
.
You can write and use your own custom kernels, first writing a .cu
file and compiling it
as a ptx
module. On Linux, compilation would look something like this:
nvcc -ptx mycudamodule.cu
You can specify that the code should be compiled for compute capability 2.0 devices or higher using:
nvcc -ptx -gencode=arch=compute_20,code=sm_20 mycudamodule.cu
If you want to write code that will support multiple datatypes (e.g., Float32
and
Float64
), it's recommended that you use C++ and write your code using templates. Then use
extern C
to instantiate bindings for each datatype. For example:
template <typename T>
__device__ void kernel_function1(T *data) {
// Code goes here
}
template <typename T1, typename T2>
__device__ void kernel_function2(T1 *data1, T2 *data2) {
// Code goes here
}
extern "C"
{
void __global__ kernel_function1_float(float *data) {kernel_function1(data);}
void __global__ kernel_function1_double(double *data) {kernel_function1(data);}
void __global__ kernel_function2_int_float(int *data1, float *data2) {kernel_function2(data1,data2);}
}
To easily make your kernels available, the recommended approach is to define something
analogous to the following for each ptx
module (this example uses the kernels described in
the previous section):
module MyCudaModule
import CUDAdrv: CuModule, CuModuleFile, CuFunction, cudacall
using CUDArt
export function1
const ptxdict = Dict()
const mdlist = Array{CuModule}(0)
function mdinit(devlist)
global ptxdict
global mdlist
isempty(mdlist) || error("mdlist is not empty")
for dev in devlist
device(dev)
md = CuModuleFile("mycudamodule.ptx")
ptxdict[(dev, "function1", Float32)] = CuFunction(md, "kernel_function1_float")
ptxdict[(dev, "function1", Float64)] = CuFunction(md, "kernel_function1_double")
ptxdict[(dev, "function2", Int32, Float32)] = CuFunction(md, "kernel_function2_int_float")
push!(mdlist, md)
end
end
mdclose() = (empty!(mdlist); empty!(ptxdict))
function init(f::Function, devlist)
local ret
mdinit(devlist)
try
ret = f(devlist)
finally
mdclose()
end
ret
end
function function1{T}(data::CudaArray{T})
dev = device(data)
cufunction1 = ptxdict[(dev, "function1", T)]
# Set up grid and block, see below
cudacall(cufunction1, grid, block, (Ptr{T},), data)
end
...
end # MyCudaModule
Usage will look something like the following:
using CUDArt, MyCudaModule
A = rand(10,5)
result = devices(dev->capability(dev)[1]>=2) do devlist
MyCudaModule.init(devlist) do dev
device(dev)
function1(CudaArray(A))
end
end
To be written.
One can use streams to manage or synchronize computations between the CPU & GPU, or using
multiple CUDA devices. Using Julia's @sync
and @async
macros, here is a short
demonstration that activates processing on multiple devices:
measured_sleep_time = CUDArt.devices(dev->true, nmax=2) do devlist
sleeptime = 0.5
results = Array{Float64}(3*length(devlist))
streams = [(device(dev); Stream()) for dev in devlist]
# Force one run to precompile
cudasleep(sleeptime; dev=devlist[1], stream=streams[1])
wait(streams[1])
i = 1
nextidx() = (idx=i; i+=1; idx)
@sync begin
for idev = 1:length(devlist)
@async begin
while true
idx = nextidx()
if idx > length(results)
break
end
tstart = time()
dev = devlist[idev]
stream = streams[idev]
cudasleep(sleeptime; dev=dev, stream=stream)
wait(stream)
tstop = time()
results[idx] = tstop-tstart
end
end
end
end
results
end
In a more realistic version of this demonstration, you would "feed" work and collect the results from all of your CUDA devices using a single Julia process to organize the efforts.
Julia convention is that matrices are stored in column-major order, whereas C (and CUDA) use row-major. For efficiency this wrapper avoids reordering memory, so that the linear sequence of addresses is the same between main memory and the GPU. For most usages, this is probably what you want.
However, for the purposes of linear algebra, this effectively means that one is storing the
transpose of matrices on the GPU. (TODO: create CudaMatrix
and CudaPitchedMatrix
types
that will automatically take the transpose when copying between main and GPU memory. This
will be useful for cuBLAS.)
Note that the size of a CudaArray/CudaPitchedArray is represented as the size of the corresponding main-memory object; thus, an array's dimensions (as reported by Julia) will not change when you copy it between main and GPU memory.