The Celerity distributed runtime and API aim to bring the power and ease of use of SYCL to distributed memory clusters.
If you want a step-by-step introduction on how to set up dependencies and implement your first Celerity application, check out the tutorial!
Programming modern accelerators is already challenging in and of itself. Combine it with the distributed memory semantics of a cluster, and the complexity can become so daunting that many leave it unattempted. Celerity wants to relieve you of some of this burden, allowing you to target accelerator clusters with programs that look like they are written for a single device.
Celerity makes it a priority to stay as close to the SYCL API as possible. If you have an existing SYCL application, you should be able to migrate it to Celerity without much hassle. If you know SYCL already, this will probably look very familiar to you:
celerity::buffer<float> buf(celerity::range(1024));
queue.submit([&](celerity::handler& cgh) {
    celerity::accessor acc(buf, cgh,
        celerity::access::one_to_one(), // 1
        celerity::write_only, celerity::no_init);
    cgh.parallel_for(
        celerity::range(1024), // 2
        [=](celerity::item<1> item) { // 3
            acc[item] = sycl::sin(item[0] / 1024.f); // 4
        });
});
1. Provide a range-mapper to tell Celerity which parts of the buffer will be accessed by the kernel.
2. Submit a kernel to be executed by 1024 parallel work items. This kernel may be split across any number of nodes.
3. Kernels can be expressed as C++11 lambda functions, just like in SYCL. In fact, no changes to your existing kernels are required.
4. Access your buffers as if they reside on a single device -- even though they might be scattered throughout the cluster.
The kernel shown above can be run on a single GPU, just like in SYCL, or on a whole cluster -- without having to change anything about the program itself.
For example, if we were to run it on two GPUs using mpirun -n 2 ./my_example, the first GPU might compute the range 0-512 of the kernel, while the second one computes 512-1024. However, as the user, you don't have to care how exactly your computation is being split up.
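To make the two-GPU example concrete, here is a minimal sketch of an even one-dimensional split. The `chunk` type and the `split_range` helper are hypothetical names introduced for illustration; this is not Celerity's actual scheduling logic, which may divide the work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open interval [begin, end) of work-item indices assigned to one node.
// Hypothetical type, used only for this illustration.
struct chunk {
    std::size_t begin;
    std::size_t end;
};

// Splits the 1D range [0, total) into num_nodes contiguous chunks whose
// sizes differ by at most one work item. Illustrative assumption only:
// the real runtime decides on its own how to split a kernel.
std::vector<chunk> split_range(std::size_t total, std::size_t num_nodes) {
    assert(num_nodes > 0 && "need at least one node");
    std::vector<chunk> chunks;
    chunks.reserve(num_nodes);
    const std::size_t base = total / num_nodes;
    const std::size_t remainder = total % num_nodes;
    std::size_t begin = 0;
    for (std::size_t node = 0; node < num_nodes; ++node) {
        // The first `remainder` nodes each take one extra work item.
        const std::size_t size = base + (node < remainder ? 1 : 0);
        chunks.push_back({begin, begin + size});
        begin += size;
    }
    return chunks;
}
```

Calling `split_range(1024, 2)` yields the chunks `[0, 512)` and `[512, 1024)`, matching the ranges mentioned above.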
To see how you can use the result of your computation, look at some of our fully-fledged examples, or follow the tutorial!
Celerity uses CMake as its build system. The build process itself is rather simple, but you have to make sure that a few dependencies are installed first.
- A supported SYCL implementation, either
  - AdaptiveCpp,
  - DPC++, or
  - SimSYCL
- An MPI 2 implementation (tested with OpenMPI 4.0, MPICH 3.3 should work as well)
- CMake (3.13 or newer)
- A C++20 compiler
See the platform support guide for details on which library and OS versions are supported and automatically tested.
Building can be as simple as calling cmake && make. Depending on your setup, however, you might also have to provide some library paths etc.
See our installation guide for more information.
The runtime comes with several examples that can be used as a starting point for developing your own Celerity application. All examples will also be built automatically in-tree when the CELERITY_BUILD_EXAMPLES CMake option is set (true by default).
Simply run make install (or equivalent, depending on build system) to copy all relevant header files and libraries to the CMAKE_INSTALL_PREFIX. This includes a CMake package configuration file which is placed inside the lib/cmake/Celerity directory. You can then use find_package(Celerity CONFIG) to include Celerity into your CMake project.
Once included, you can use the add_celerity_to_target(TARGET target SOURCES source1 source2...)
function to set up the required dependencies for a target (no need to link manually).
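Put together, a consuming project's CMakeLists.txt might look like the following sketch. The project name, target name, and source file are hypothetical placeholders; only `find_package(Celerity CONFIG)` and `add_celerity_to_target` come from the text above, and `REQUIRED` is the standard CMake keyword that makes configuration fail if the package is not found.

```cmake
cmake_minimum_required(VERSION 3.13)
project(my_celerity_app LANGUAGES CXX)  # hypothetical project name

# Locate the package configuration file installed under lib/cmake/Celerity.
find_package(Celerity CONFIG REQUIRED)

# Hypothetical target built from a single hypothetical source file.
add_executable(my_app main.cpp)

# Set up the required Celerity dependencies for the target.
add_celerity_to_target(TARGET my_app SOURCES main.cpp)
```

If Celerity was installed to a non-standard prefix, pointing CMake at it with `-DCMAKE_PREFIX_PATH=<install prefix>` at configure time is the usual way to let `find_package` locate it.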
Celerity is built on top of MPI, which means a Celerity application can be executed like any other MPI application (i.e., using mpirun or equivalent).
There are several environment variables that you can use to influence
Celerity's runtime behavior:
- CELERITY_LOG_LEVEL controls the logging output level. One of trace, debug, info, warn, err, critical, or off.
- CELERITY_PROFILE_KERNEL controls whether SYCL queue profiling information should be queried.
- CELERITY_PRINT_GRAPHS controls whether task and command graphs are logged at the end of execution (requires log level info or higher).
- CELERITY_DRY_RUN_NODES takes a number and simulates a run with that many nodes without actually executing the commands.
- CELERITY_HORIZON_STEP and CELERITY_HORIZON_MAX_PARALLELISM determine the maximum number of sequential and parallel tasks, respectively, before a new horizon task is introduced.
- CELERITY_TRACY controls the Tracy profiler integration. Set to off to disable, fast for light integration with little runtime overhead, and full for integration with extensive performance debug information included in the trace. Only available if integration was enabled at build time through the CMake option -DCELERITY_TRACY_SUPPORT=ON.
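As a small illustration of the accepted CELERITY_LOG_LEVEL values, the sketch below shows how a launcher or test harness could sanity-check the variable before starting an application. The helper names `is_valid_log_level` and `log_level_or` are hypothetical, and this is not how Celerity itself parses its environment; it only encodes the list of values given above.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>
#include <string>
#include <string_view>

// The values documented for CELERITY_LOG_LEVEL.
constexpr std::array<std::string_view, 7> log_levels = {
    "trace", "debug", "info", "warn", "err", "critical", "off"};

// Returns true if `value` is one of the documented log levels.
// Hypothetical helper, not part of the Celerity API.
constexpr bool is_valid_log_level(std::string_view value) {
    for (const auto level : log_levels) {
        if (level == value) return true;
    }
    return false;
}

// Returns the value of CELERITY_LOG_LEVEL if it is set to a documented
// level, and `fallback` otherwise. Hypothetical helper for illustration.
std::string log_level_or(std::string_view fallback) {
    assert(is_valid_log_level(fallback) && "fallback must be a documented level");
    const char* raw = std::getenv("CELERITY_LOG_LEVEL");
    if (raw != nullptr && is_valid_log_level(raw)) return std::string(raw);
    return std::string(fallback);
}
```

A wrapper could call `log_level_or("info")` and warn the user when the variable is set to something unexpected, instead of silently launching with a misspelled level.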