
CUDA stream support


Goals

We want to provide some APIs that support applications using CUDA streams with collections. The main CUDA API we are targeting relies on CUDA events to express dependencies within or across streams. Slide 12 of this presentation provides an overview of the CUDA event API.

The CUDA event API allows asynchronous polling of events, which simplifies the implementation. The basic code looks like this:

    // color_pc, space, and streams come from the surrounding benchmark:
    // one particle container, execution space, and CUDA stream per color.
    std::vector<cudaEvent_t> events;
    std::size_t j = 0;
    for (auto& pc : color_pc) {
      pc.sort(space[j]);                    // enqueue this color's sort on stream j
      cudaEvent_t event;
      cudaEventCreate(&event);
      cudaEventRecord(event, streams[j]);   // mark completion of this color's work
      events.push_back(event);
      ++j;
    }
    // Poll asynchronously until every color's sort has finished.
    for (auto& e : events) {
      while (cudaEventQuery(e) != cudaSuccess) ;
    }

The Kokkos DeviceSpace takes the attached stream at construction, but Kokkos does not allow extracting the stream back out of the device space. Thus, we will need to store both the space and the stream in order to use events, until Kokkos provides an API for accessing that private data member.
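
As a rough sketch, the pairing might look like the following (the StreamSpace name is made up here, and Kokkos::Cuda stands in for whatever DeviceSpace resolves to in the application):

    #include <Kokkos_Core.hpp>
    #include <cuda_runtime.h>

    // Sketch only: keep the raw stream next to the execution space built on
    // it, since the stream cannot be recovered from the space later.
    struct StreamSpace {
      cudaStream_t stream;   // needed for cudaEventRecord / cudaEventQuery
      Kokkos::Cuda space;    // execution space instance bound to the same stream

      explicit StreamSpace(cudaStream_t s) : stream(s), space(s) {}
    };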

Beyond this, CUDA provides cudaStreamWaitEvent(stream, event), which tells CUDA to halt progress on stream until the event has completed. Users can enqueue future operations on the stream, and CUDA automatically holds them back until the event completes. This allows the creation of cross-dependencies between different streams. We do not plan to support this API in the initial implementation, but it might be needed later for more complex dependencies.
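
For reference, the cross-stream pattern looks roughly like this (the function and the cudaMemsetAsync calls are placeholders for real kernels; only cudaStreamWaitEvent itself is the point):

    #include <cstddef>
    #include <cuda_runtime.h>

    // Sketch: work enqueued on stream_b waits for work already enqueued on
    // stream_a, without blocking the host.
    void cross_stream_dependency(
      cudaStream_t stream_a, cudaStream_t stream_b, char* buf, std::size_t n
    ) {
      cudaEvent_t done_a;
      cudaEventCreate(&done_a);

      cudaMemsetAsync(buf, 0, n, stream_a);   // stand-in for the kernel on stream_a
      cudaEventRecord(done_a, stream_a);      // completion point on stream_a

      // Anything enqueued on stream_b after this call is held back by CUDA
      // until done_a completes; the host returns immediately.
      cudaStreamWaitEvent(stream_b, done_a, 0);
      cudaMemsetAsync(buf, 1, n, stream_b);   // dependent stand-in work on stream_b

      cudaEventDestroy(done_a);               // resources freed once the event completes
    }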

Microbenchmarking Sort

I modified the particle sort benchmarks, written by Matt, to use CUDA events instead of explicit stream synchronization. With 8 colors, we are getting nearly a 3x speedup over the baseline of a single chunk of work. With 16 colors, we get about a 2x speedup. The conclusion is that streams can provide substantial speedups by enabling concurrent kernel dispatch across colors.

Kokkos with streams

To use Kokkos with streams, one wraps the RangePolicy with Kokkos::Experimental::require; the resulting policy is then passed as the second argument (after the label) to parallel_for, etc.:

    auto policy = Kokkos::Experimental::require(
      Kokkos::RangePolicy<>(space, 0, num_elms()),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight
    );

The space here is a DeviceSpace(stream) initialized with the CUDA stream. I've done some experiments with the WorkItemProperty, which has three options (no hint, heavy weight, light weight). These options control which memories are used for executing the kernel. With some micro-benchmarking of the sort, I found that the light-weight hint and no hint are nearly identical in performance, while the heavy-weight hint slows down the kernel:

    timing-heavyhint1.log:|   sort2 color: 0.198738 [10]
    timing-heavyhint2.log:|   sort2 color: 0.199663 [10]
    timing-heavyhint3.log:|   sort2 color: 0.200205 [10]
    timing-heavyhint4.log:|   sort2 color: 0.201135 [10]
    timing-heavyhint5.log:|   sort2 color: 0.201383 [10]
    timing-lighthint1.log:|   sort2 color: 0.14192 [10]
    timing-lighthint2.log:|   sort2 color: 0.142516 [10]
    timing-lighthint3.log:|   sort2 color: 0.143825 [10]
    timing-lighthint4.log:|   sort2 color: 0.14077 [10]
    timing-lighthint5.log:|   sort2 color: 0.143892 [10]
    timing-nohint1.log:|   sort2 color: 0.143003 [10]
    timing-nohint2.log:|   sort2 color: 0.140005 [10]
    timing-nohint3.log:|   sort2 color: 0.142604 [10]
    timing-nohint4.log:|   sort2 color: 0.140307 [10]
    timing-nohint5.log:|   sort2 color: 0.140636 [10]
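
Putting the pieces together, dispatching a single color might look roughly like this (Kokkos::Cuda stands in for DeviceSpace; the view and kernel body are placeholders, not the benchmark's actual sort):

    #include <Kokkos_Core.hpp>
    #include <cuda_runtime.h>

    // Sketch: launch one color's work on its own stream and record an event
    // that can later be polled with cudaEventQuery, as in the loop above.
    cudaEvent_t dispatch_color(
      cudaStream_t stream, Kokkos::View<double*, Kokkos::CudaSpace> data
    ) {
      Kokkos::Cuda space(stream);  // execution space bound to this stream

      auto policy = Kokkos::Experimental::require(
        Kokkos::RangePolicy<Kokkos::Cuda>(space, 0, data.extent(0)),
        Kokkos::Experimental::WorkItemProperty::HintLightWeight
      );

      Kokkos::parallel_for("color kernel", policy, KOKKOS_LAMBDA(int const i) {
        data(i) *= 2.0;  // placeholder for the real per-element work
      });

      cudaEvent_t event;
      cudaEventCreate(&event);
      cudaEventRecord(event, stream);  // completion marker for this color
      return event;
    }

Polling the returned events then proceeds exactly as in the first snippet above.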