| 1 | +Checkpoint and Restore for CUDA applications with CRIU |
| 2 | +====================================================== |
| 3 | + |
| 4 | +# Requirements |
| 5 | +The cuda-checkpoint utility must be available in your $PATH, and an r555 or
| 6 | +newer GPU driver is required for CUDA CRIU integration support.
| 7 | + |
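The two requirements can be checked up front. The following is a minimal sketch, not part of the plugin: the helper names are hypothetical, and it assumes that "r555" corresponds to a driver version string whose major component is 555 (for example "555.42.02").

```python
import shutil


def find_cuda_checkpoint():
    """Return the full path of cuda-checkpoint if it is on $PATH, else None."""
    return shutil.which("cuda-checkpoint")


def driver_is_new_enough(version, minimum_major=555):
    """Compare a driver version string such as "555.42.02" with the minimum.

    Assumption: "r555" means driver major version 555 or higher.
    """
    major = int(version.split(".")[0])
    return major >= minimum_major
```

How the driver version string is obtained is left to the caller; this sketch only compares it.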
| 8 | +## cuda-checkpoint |
| 9 | +The cuda-checkpoint utility can be found at: |
| 10 | +https://github.com/NVIDIA/cuda-checkpoint |
| 11 | + |
| 12 | +cuda-checkpoint is a binary utility used to issue checkpointing commands to CUDA
| 13 | +applications. Updating the utility between driver releases should not be
| 14 | +necessary: it simply exposes extra driver behavior, so a driver update is all
| 15 | +that is needed to get access to newer features.
| 16 | + |
| 17 | +# Checkpointing Procedure |
| 18 | +cuda-checkpoint exposes 4 actions used in the checkpointing process: lock, |
| 19 | +checkpoint, restore, unlock. |
| 20 | + |
| 21 | +* lock - Used with the PAUSE_DEVICES hook while a process is still running to |
| 22 | + quiesce the application into a state where it can be checkpointed |
| 23 | +* checkpoint - Used with the CHECKPOINT_DEVICES hook once a process has been |
| 24 | + seized/frozen to perform the actual checkpointing operation |
| 25 | +* restore/unlock - Used with the RESUME_DEVICES_LATE hook to restore the CUDA
| 26 | + state and release the process back to its running state
| 27 | + |
| 28 | +These actions are facilitated by a CUDA checkpoint+restore thread that the CUDA |
| 29 | +plugin will re-wake when needed. |
| 30 | + |
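The mapping between hooks and actions described above can be sketched as follows. This is an illustration only, not the plugin's implementation: the helper names are hypothetical, and the `--action` and `--pid` flags are an assumption about the cuda-checkpoint command line that should be verified against the installed utility.

```python
import subprocess

# Hook-to-action mapping as described in this section.
HOOK_ACTIONS = {
    "PAUSE_DEVICES": ["lock"],
    "CHECKPOINT_DEVICES": ["checkpoint"],
    "RESUME_DEVICES_LATE": ["restore", "unlock"],
}


def build_command(action, pid):
    """Build the argv for one action (flag names are an assumption)."""
    return ["cuda-checkpoint", "--action", action, "--pid", str(pid)]


def commands_for_hook(hook, pid):
    """Return the commands a hook would issue, in order."""
    return [build_command(action, pid) for action in HOOK_ACTIONS[hook]]


def run_hook(hook, pid, runner=subprocess.run):
    """Run every command for a hook, stopping at the first failure."""
    for command in commands_for_hook(hook, pid):
        runner(command, check=True)
```

The `runner` parameter exists so the sequence can be exercised without a GPU by passing a stub in place of `subprocess.run`.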
| 31 | +# Known Limitations |
| 32 | +* Currently, GPU memory contents are copied into main system memory, and CRIU
| 33 | + then checkpoints them as part of its normal procedure. On systems with many
| 34 | + GPUs and high GPU memory usage this can cause memory thrashing. A future
| 35 | + CUDA release will add support for dumping the memory contents to files to
| 36 | + alleviate this, and matching support will be added to the CRIU plugin.
| 37 | +* There is currently a small race: a running process may call cuInit() and
| 38 | + finish initializing CUDA after the PAUSE_DEVICES hook has been called on it
| 39 | + but before it is frozen for checkpointing. In that case cuda-checkpoint
| 40 | + reports that the process is in an illegal state for checkpointing. This
| 41 | + should be very rare, and the recommended response is simply to attempt the
| 42 | + CRIU procedure again.
| 43 | +* Applications that use NVML will leave some leftover device references, as
| 44 | + NVML is not currently supported for checkpointing. Support is planned for
| 45 | + later drivers. A possible temporary workaround is to have the
| 46 | + {DUMP,RESTORE}_EXT_FILE hook ignore the remaining /dev/nvidiactl and
| 47 | + /dev/nvidia{0..N} references for these applications. In most cases NVML is
| 48 | + only used to query information such as GPU count and some capabilities, and
| 49 | + these values are never accessed again and are unlikely to change.
| 50 | +* CUDA applications that fork() without calling exec(), and without issuing
| 51 | + any CUDA API calls, will have some leftover references to /dev/nvidia* and
| 52 | + fail to checkpoint as a result. This can be worked around in a similar
| 53 | + fashion to the NVML case: the leftover references can be ignored, as CUDA is
| 54 | + not fork()-safe anyway.
| 55 | +* Restore currently requires a system with similar GPUs and the same GPU
| 56 | + count.
| 57 | +* NVIDIA UVM Managed Memory, MIG (Multi-Instance GPU), and MPS (Multi-Process
| 58 | + Service) are currently not supported for checkpointing. Future CUDA releases
| 59 | + will add support for these.
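For the cuInit() race listed above, the suggested remedy is to attempt the CRIU procedure again. A minimal retry sketch is shown below. The helper names are hypothetical, the retry count is arbitrary, and the sketch retries on any non-zero exit status rather than detecting the race specifically. It assumes a plain `criu dump` invocation with `-t` (root PID of the process tree) and `-D` (images directory); real invocations usually need further options.

```python
import subprocess


def dump_command(pid, images_dir):
    """Build a basic criu dump command line for the given process tree."""
    return ["criu", "dump", "-t", str(pid), "-D", images_dir]


def dump_with_retry(pid, images_dir, attempts=3, runner=subprocess.run):
    """Run the dump up to `attempts` times; return True on the first success."""
    for _ in range(attempts):
        result = runner(dump_command(pid, images_dir))
        if result.returncode == 0:
            return True
    return False
```

Because every failure is retried the same way, a persistent error (for example one of the other limitations in this list) simply exhausts the attempts and returns False.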