GPU availability inside a nested docker inside dind container #182
I've been looking into this, trying to get the GPUs loaded into the container created by the CI's

To get this far we need to use an nvidia-dind-capable container image. I searched Docker Hub and used an image that seemed recent. While this allowed us to load the Nvidia GPUs inside the dind container, I got stuck at calling
If I list the path:
We can see that nvidia-container-cli seems to pick up the wrong path when trying to load the GPUs from within a container. I wonder whether nvidia-container-cli supports being run from within a container to spawn a nested container. I've found some issues that sound similar to what we ran into, but no clear resolution.
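For reference, a sketch of the commands one might run to check where GPU passthrough breaks at each layer. This assumes an NVIDIA-enabled dind image with the nvidia-container-toolkit installed; the CUDA image tag is illustrative:

```shell
# Inside the dind container: confirm the runtime itself sees the GPUs.
nvidia-smi
nvidia-container-cli info

# Still inside the dind container, try to spawn a nested GPU container.
# This is the step that fails when nvidia-container-cli resolves library
# paths against the wrong root.
docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```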
I feel like trying to make ARC work with PyTorch's existing CI process is causing us a lot of pain here. My gut feeling is that ARC likely wasn't designed for this many nested layers of containers inside other containers. If we could run the build/test from inside the runner container without nesting any further containers, we'd be able to run against the GPUs as expected.
Okay, I made a bit of a breakthrough this morning regarding passing the GPU further down into the container created by

To get past the issue I was having when using
With all that said, I think the resolution here is the following:
One complication that I just realized: from the runner container we cannot see the

Another interesting thing: if we only request GPUs for the dind container but NOT the runner container, then the runner container can see all GPUs available on the host when checking
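To make the "request GPUs for the dind container, not the runner container" idea concrete, here is a sketch of what that could look like in an ARC RunnerDeployment. Field names follow actions-runner-controller's RunnerDeployment CRD (`dockerdContainerResources` for the dind sidecar); the names and values are illustrative and should be verified against the ARC version in use:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runner              # hypothetical name
spec:
  template:
    spec:
      repository: pytorch/pytorch   # example repository
      # The GPU is requested for the dockerd (dind) sidecar, since
      # `docker run --gpus` is serviced by dockerd, not by the runner
      # process itself.
      dockerdContainerResources:
        limits:
          nvidia.com/gpu: "1"
```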
While we've been able to get GPUs assigned inside the runner container, this isn't enough to allow GPU build/test to run, as PyTorch's build/test code actually runs `docker run` to execute the build/test inside a nested container inside the runner. Since we are deploying dind for ARC, this means the dind container is the container that actually requires the GPU, and that dind container then has to pass it through via `docker run --gpus=all` before the build/test jobs can actually utilize the GPU. This further complicates the setup necessary to support PyTorch's CI.