Decouple Docker from PyTorch build pipeline (pytorch/test-infra/calculate-docker action) #319

zxiiro · 2025-01-10T17:20:45Z

The title of this issue needs to be improved, but the idea is this: the current PyTorch CI build and test pipeline assumes that it is NOT running inside a container, i.e. that it runs on either a VM or a dedicated host. When we tried in 2024H1 to migrate to ARC (Actions Runner Controller), a container-based runner autoscaler backed by Kubernetes, this caused problems. We also had to support a multi-level nested container pipeline, because scripts in PyTorch assume they can simply run docker build and docker run as part of the build pipeline.

We had to use things like DIND (Docker-in-Docker) at multiple nested container levels, which forced us to write many workaround scripts to support this effort.

The goal of this issue is to discuss how we can remove the assumption that a job can run docker build|run, and move to a more GHA-native way to build and run PyTorch containers for the build and test pipelines. That would let us adopt container-based self-hosted runner autoscalers more easily.


zxiiro · 2025-01-10T17:34:36Z

I've been looking into the "BUILD" part of the pipeline, specifically _linux-build.yml, test-infra's calculate-docker action, and .ci/docker/build.sh, to try to understand how it all works. build.sh assumes it can run both docker build and docker run. That gets complicated if the GHA runner it runs on is already a container, since it would then require a solution that supports DIND. If we can move the docker CLI commands out of the script and into a separate GHA action, we would not have to deal with DIND.

Docker provides a GHA that handles build and push using buildx (https://github.com/docker/build-push-action). If we can rearchitect how the images are built to use it, we might avoid having to deal with nested Docker in a container-based runner.

Looking closer at build.sh, I think the script serves 3 purposes:

1. Extract the set of parameters to be used for a docker build based on the provided image name.
2. Run docker build with the parameters found in step 1.
3. Run the built image and print out the expected and actual versions of various packages installed.

We'd want to move items 2 and 3 into a GHA step while leaving item 1 in place, as it's necessary to gather the docker build parameters. I'm not sure whether build.sh is intended to run only in CI or whether developers also use it for local builds. If we need to support both paths, then extracting item 1 into a separate script that build.sh can call would allow us to support both methods of building the images.
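To make the idea of splitting out item 1 more concrete, here is a minimal sketch of what such a helper might look like. Everything in it is an assumption for illustration: the function name image_build_args, the image names, and the build-arg names are hypothetical and are not taken from the real .ci/docker/build.sh.

```shell
# Hypothetical helper: map an image name to docker build arguments.
# Image names and build-arg names are illustrative assumptions only;
# they are not the values used by .ci/docker/build.sh.
image_build_args() {
  image="$1"
  case "$image" in
    example-linux-py3.9-gcc11)
      echo "--build-arg PYTHON_VERSION=3.9 --build-arg GCC_VERSION=11"
      ;;
    example-linux-py3.12-clang15)
      echo "--build-arg PYTHON_VERSION=3.12 --build-arg CLANG_VERSION=15"
      ;;
    *)
      # Unknown image: report on stderr and signal failure to the caller.
      echo "unknown image: $image" >&2
      return 1
      ;;
  esac
}
```

With something like this, a local build.sh could keep calling docker build with the printed arguments, while a CI step could instead capture the same output (for example by writing it to "$GITHUB_OUTPUT") and pass it to a separate build action, so the parameter logic lives in one place for both paths.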

If we can achieve this, we will have decoupled at least the image build from the build pipeline, so that we no longer need DIND to build the images.

The harder part of this problem, though, is likely the test pipeline's use of docker run to run the PyTorch tests. That still needs to be analyzed, but if we can get to a point where running a test is a matter of loading the built PyTorch CI image directly onto the self-hosted runner, that would save us from needing DIND in the test pipeline as well.

zxiiro · 2025-01-10T18:24:30Z

One thing I missed in my previous comment: the docker build step above only builds a CI image that prepares the environment for a PyTorch build.

The actual PyTorch build seems to happen in .ci/pytorch/build.sh, which is run afterwards via the docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh' command.
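As a rough illustration of why this step ties the job to a Docker daemon, the two-phase flow might be sketched as below. This is a simplified sketch, not the workflow's actual code: the function name run_build_in_container, the DOCKER override variable, and the bare docker run flags are assumptions, and only the docker exec line comes from the command quoted above.

```shell
# Sketch of the two-phase flow: start a CI image, then run the build inside it.
# DOCKER is an assumed override hook so the flow can be dry-run (DOCKER=echo).
run_build_in_container() {
  image="$1"
  # `docker run --detach` prints the new container's ID on stdout.
  container_name=$("${DOCKER:-docker}" run --detach --tty "$image") || return 1
  # The docker exec command quoted above, which starts the PyTorch build.
  "${DOCKER:-docker}" exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
}
```

Both calls need access to a Docker daemon from wherever the job runs, which is what forces DIND once the runner itself is a container.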

zxiiro · 2025-01-13T16:36:20Z

There are two other interesting files that should be looked at, as they also handle creating and publishing CI images: docker-builds.yml and docker-release.yml.

zxiiro self-assigned this Jan 10, 2025
