
Commit bf417dd

jesus-ramos authored and avagin committed
criu/plugin: Add NVIDIA CUDA plugin
Adding support for the NVIDIA cuda-checkpoint utility. This requires an r555 or higher driver along with the cuda-checkpoint binary.

Signed-off-by: Jesus Ramos <[email protected]>
1 parent 5f486d5 commit bf417dd

5 files changed, +578 -4 lines changed


Makefile

+12 -3
@@ -165,7 +165,7 @@ HOSTCFLAGS += $(WARNINGS) $(DEFINES) -iquote include/
 export AFLAGS CFLAGS USERCLFAGS HOSTCFLAGS
 
 # Default target
-all: criu lib crit
+all: criu lib crit cuda_plugin
 .PHONY: all
 
 #
@@ -298,15 +298,19 @@ clean-amdgpu_plugin:
 	$(Q) $(MAKE) -C plugins/amdgpu clean
 .PHONY: clean-amdgpu_plugin
 
+clean-cuda_plugin:
+	$(Q) $(MAKE) -C plugins/cuda clean
+.PHONY: clean-cuda_plugin
+
 clean-top:
 	$(Q) $(MAKE) -C Documentation clean
 	$(Q) $(MAKE) $(build)=test/compel clean
 	$(Q) $(RM) .gitid
 .PHONY: clean-top
 
-clean: clean-top clean-amdgpu_plugin
+clean: clean-top clean-amdgpu_plugin clean-cuda_plugin
 
-mrproper-top: clean-top clean-amdgpu_plugin
+mrproper-top: clean-top clean-amdgpu_plugin clean-cuda_plugin
 	$(Q) $(RM) $(CONFIG_HEADER)
 	$(Q) $(RM) $(VERSION_HEADER)
 	$(Q) $(RM) $(COMPEL_VERSION_HEADER)
@@ -338,6 +342,10 @@ amdgpu_plugin: criu
 	$(Q) $(MAKE) -C plugins/amdgpu all
 .PHONY: amdgpu_plugin
 
+cuda_plugin: criu
+	$(Q) $(MAKE) -C plugins/cuda all
+.PHONY: cuda_plugin
+
 crit: lib
 	$(Q) $(MAKE) -C crit
 .PHONY: crit
@@ -424,6 +432,7 @@ help:
 	@echo ' lint - Run code linters'
 	@echo ' indent - Indent C code'
 	@echo ' amdgpu_plugin - Make AMD GPU plugin'
+	@echo ' cuda_plugin - Make NVIDIA CUDA plugin'
 .PHONY: help
 
 ruff:

Makefile.install

+6 -1
@@ -72,12 +72,16 @@ install-amdgpu_plugin: amdgpu_plugin
 	$(Q) $(MAKE) -C plugins/amdgpu install
 .PHONY: install-amdgpu_plugin
 
+install-cuda_plugin: cuda_plugin
+	$(Q) $(MAKE) -C plugins/cuda install
+.PHONY: install-cuda_plugin
+
 install-compel: $(compel-install-targets)
 	$(Q) $(MAKE) $(build)=compel install
 	$(Q) $(MAKE) $(build)=compel/plugins install
 .PHONY: install-compel
 
-install: install-man install-lib install-crit install-criu install-compel install-amdgpu_plugin ;
+install: install-man install-lib install-crit install-criu install-compel install-amdgpu_plugin install-cuda_plugin ;
 .PHONY: install
 
 uninstall:
@@ -88,4 +92,5 @@ uninstall:
 	$(Q) $(MAKE) $(build)=compel $@
 	$(Q) $(MAKE) $(build)=compel/plugins $@
 	$(Q) $(MAKE) -C plugins/amdgpu $@
+	$(Q) $(MAKE) -C plugins/cuda $@
 .PHONY: uninstall

plugins/cuda/Makefile

+42
@@ -0,0 +1,42 @@
+PLUGIN_NAME := cuda_plugin
+PLUGIN_SOBJ := cuda_plugin.so
+
+DEPS_CUDA := $(PLUGIN_SOBJ)
+
+PLUGIN_INCLUDE := -iquote../../include
+PLUGIN_INCLUDE += -iquote../../criu/include
+PLUGIN_INCLUDE += -iquote../../criu/arch/$(ARCH)/include/
+PLUGIN_INCLUDE += -iquote../../
+
+COMPEL := ../../compel/compel-host
+
+CC := gcc
+PLUGIN_CFLAGS := -g -Wall -Werror -shared -nostartfiles -fPIC
+
+__nmk_dir ?= ../../scripts/nmk/scripts/
+include $(__nmk_dir)msg.mk
+
+all: $(DEPS_CUDA)
+
+cuda_plugin.so: cuda_plugin.c
+	$(call msg-gen, $@)
+	$(Q) $(CC) $(PLUGIN_CFLAGS) $(shell $(COMPEL) includes) $^ -o $@ $(PLUGIN_INCLUDE) $(PLUGIN_LDFLAGS)
+
+clean:
+	$(call msg-clean, $@)
+	$(Q) $(RM) $(PLUGIN_SOBJ)
+.PHONY: clean
+
+mrproper: clean
+
+install:
+	$(Q) mkdir -p $(DESTDIR)$(PLUGINDIR)
+	$(E) " INSTALL " $(PLUGIN_NAME)
+	$(Q) install -m 644 $(PLUGIN_SOBJ) $(DESTDIR)$(PLUGINDIR)
+.PHONY: install
+
+uninstall:
+	$(E) " UNINSTALL" $(PLUGIN_NAME)
+	$(Q) $(RM) $(DESTDIR)$(PLUGINDIR)/$(PLUGIN_SOBJ)
+.PHONY: uninstall
+

plugins/cuda/README.md

+59
@@ -0,0 +1,59 @@
+Checkpoint and Restore for CUDA applications with CRIU
+======================================================
+
+# Requirements
+The cuda-checkpoint utility must be somewhere in your $PATH, and an r555 or
+newer GPU driver is required for CUDA CRIU integration support.
+
+## cuda-checkpoint
+The cuda-checkpoint utility can be found at:
+https://github.com/NVIDIA/cuda-checkpoint
+
+cuda-checkpoint is a binary utility used to issue checkpointing commands to CUDA
+applications. Updating the cuda-checkpoint utility between driver releases
+should not be necessary, as the utility simply exposes some extra driver
+behavior, so driver updates are all that's needed to get access to newer features.
+
+# Checkpointing Procedure
+cuda-checkpoint exposes four actions used in the checkpointing process: lock,
+checkpoint, restore, and unlock.
+
+* lock - Used with the PAUSE_DEVICES hook while a process is still running to
+  quiesce the application into a state where it can be checkpointed
+* checkpoint - Used with the CHECKPOINT_DEVICES hook once a process has been
+  seized/frozen to perform the actual checkpointing operation
+* restore/unlock - Used with the RESUME_DEVICES_LATE hook to restore the CUDA
+  state and release the process back to its running state
+
+These actions are facilitated by a CUDA checkpoint+restore thread that the CUDA
+plugin will re-wake when needed.
+
+# Known Limitations
+* Currently GPU memory contents are brought into main system memory and CRIU
+  then checkpoints that as part of the normal procedure. On systems with many
+  GPUs with high GPU memory usage this can cause memory thrashing. A future
+  CUDA release will add support for dumping the memory contents to files to
+  alleviate this, along with matching support in the CRIU plugin.
+* There's currently a small race: a process may call cuInit() and finish
+  initializing CUDA after the PAUSE_DEVICES hook has been called on it but
+  before it is frozen for checkpointing. In that case cuda-checkpoint reports
+  that the process is in an illegal state for checkpointing. This should be
+  very rare, and the recommended recovery is to attempt the CRIU procedure
+  again.
+* Applications that use NVML will leave some leftover device references as NVML
+  is not currently supported for checkpointing. There will be support for this
+  in later drivers. A possible temporary workaround is to have the
+  {DUMP,RESTORE}_EXT_FILE hook just ignore the remaining /dev/nvidiactl and
+  /dev/nvidia{0..N} references for these applications, as in most cases NVML is
+  used to get info such as GPU count and some capabilities, and these values are
+  never accessed again and are unlikely to change.
+* CUDA applications that fork() without calling exec() and also don't issue any
+  CUDA API calls will have some leftover references to /dev/nvidia* and fail to
+  checkpoint as a result. This can be worked around in a similar fashion to the
+  NVML case, where the leftover references can be ignored, as CUDA is not
+  fork()-safe anyway.
+* Restore currently requires a system with similar GPUs and the same GPU count
+  as the system that was checkpointed.
+* NVIDIA UVM Managed Memory, MIG (Multi-Instance GPU), and MPS (Multi-Process
+  Service) are currently not supported for checkpointing. Future CUDA releases
+  will add support for these.
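
The hook-to-action pairing described under "Checkpointing Procedure" can be sketched in a few lines. This is an illustration only, not the plugin's code (the plugin itself is written in C). The names `HOOK_ACTIONS` and `build_commands` are hypothetical, and the `--action <name> --pid <pid>` argument form is an assumption about the cuda-checkpoint command line that should be checked against the help output of the installed utility.

```python
# Hypothetical sketch: which cuda-checkpoint actions accompany which CRIU
# hook, following the list in the README's "Checkpointing Procedure" section.
HOOK_ACTIONS = {
    "PAUSE_DEVICES": ["lock"],
    "CHECKPOINT_DEVICES": ["checkpoint"],
    "RESUME_DEVICES_LATE": ["restore", "unlock"],
}


def build_commands(hook, pid):
    """Return one argv list per action for the given hook and process ID.

    Assumption: the "--action <name> --pid <pid>" form. Verify it against
    the cuda-checkpoint version you have installed before relying on it.
    """
    if hook not in HOOK_ACTIONS:
        raise ValueError(f"no cuda-checkpoint action for hook {hook!r}")
    return [
        ["cuda-checkpoint", "--action", action, "--pid", str(pid)]
        for action in HOOK_ACTIONS[hook]
    ]


if __name__ == "__main__":
    # Print the command lines without running them; 1234 is a placeholder PID.
    for hook in HOOK_ACTIONS:
        for argv in build_commands(hook, 1234):
            print(hook, "->", " ".join(argv))
```

The sketch only builds argument lists and never executes them, so it can be read or run on a machine without a GPU. The ordering within `RESUME_DEVICES_LATE` (restore, then unlock) follows the order in which the README names the two actions.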

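The workaround suggested for the NVML case, ignoring leftover /dev/nvidiactl and /dev/nvidia{0..N} references, comes down to a path check. Below is a minimal sketch of that check, assuming a hypothetical helper name `is_ignorable_nvidia_path`. How a {DUMP,RESTORE}_EXT_FILE hook would act on the result is not shown here and is not described by this commit.

```python
import re

# Matches /dev/nvidiactl and /dev/nvidia<N> for a non-negative integer N.
# Other NVIDIA device nodes (for example /dev/nvidia-uvm) are deliberately
# not matched, because the README's NVML workaround names only these two
# patterns.
_NVIDIA_DEV_RE = re.compile(r"/dev/nvidia(?:ctl|[0-9]+)")


def is_ignorable_nvidia_path(path):
    """Return True if path is one of the device nodes the README names."""
    return _NVIDIA_DEV_RE.fullmatch(path) is not None
```

Using `fullmatch` rather than `match` keeps the check strict, so a path such as /dev/nvidia0foo is not treated as a GPU device node.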