Decouple Docker from PyTorch build pipeline (pytorch/test-infra/calculate-docker action) #319
Comments
I've been looking into the "BUILD" part of the pipeline and at the files. Docker provides a GHA to handle build and push using buildx (https://github.com/docker/build-push-action). If we can rearchitect how the images are built to use it, we could avoid having to deal with nested Docker in a container-based runner. Looking closer at build.sh, I think the script serves 3 purposes:
We'd want to move items 2 and 3 into a GHA step while leaving item 1 in place, as it's necessary to gather the docker build parameters. I'm not sure whether build.sh is intended to run only in CI or whether developers also use it for local builds. If we need to support both paths, then extracting item 1 into a separate script that build.sh can call would let us support both methods of building the images. If we can achieve this, we will have decoupled at least the image build from the build pipeline, so that we no longer need DIND methods to build the images. The harder part of this problem, though, is likely the test pipeline's use of |
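The idea of extracting item 1 into its own script can be sketched roughly as follows. This is only an illustration, not how build.sh works today: the function names, the image naming scheme, and the PYTHON_VERSION / GCC_VERSION keys are all assumptions made up for the example. The only real mechanism it relies on is that a GHA step can publish outputs by appending to the file named by the GITHUB_OUTPUT environment variable, using the name<<DELIMITER form for multi-line values.

```python
import os
import re


def gather_build_params(image_name: str) -> dict[str, str]:
    """Derive docker build args from an image name.

    The naming scheme parsed here (pyX.Y, gccN) is hypothetical and
    stands in for whatever build.sh actually derives in "item 1".
    """
    params = {"IMAGE_NAME": image_name}
    py = re.search(r"py(\d+\.\d+)", image_name)
    if py:
        params["PYTHON_VERSION"] = py.group(1)
    gcc = re.search(r"gcc(\d+)", image_name)
    if gcc:
        params["GCC_VERSION"] = gcc.group(1)
    return params


def format_github_output(name: str, params: dict[str, str]) -> str:
    """Render params as one multi-line step output (heredoc-style syntax)."""
    body = "\n".join(f"{key}={value}" for key, value in params.items())
    return f"{name}<<EOF\n{body}\nEOF\n"


def emit(image_name: str) -> None:
    """Write the params to GITHUB_OUTPUT in CI, or print them for local use."""
    rendered = format_github_output("build-args", gather_build_params(image_name))
    out_path = os.environ.get("GITHUB_OUTPUT")
    if out_path:
        # Running inside a GHA step: append so earlier outputs are preserved.
        with open(out_path, "a", encoding="utf-8") as fh:
            fh.write(rendered)
    else:
        # Local run: a wrapper such as build.sh could consume this instead.
        print(rendered, end="")
```

With something like this, a later workflow step could pass the step's build-args output to the build-args input of docker/build-push-action, while build.sh could call the same script for local builds and keep doing the docker build itself.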
One thing I missed in my previous comment is that the docker build step above only builds a CI image to prepare for a PyTorch build. The actual PyTorch build seems to happen in the |
There are 2 other interesting files that should be looked at, as they also handle creating and publishing CI images: |
The title of this issue needs to be improved, but the thought here is that the current PyTorch CI build and test pipeline assumes that it is NOT running from inside a container, but rather on either a VM or a dedicated host. When we tried in 2024H1 to migrate to ARC, a container-based runner autoscaler backed by Kubernetes, this caused us some issues: we then also needed to support a multi-level nested container pipeline, because scripts in PyTorch assumed they can just run docker build and docker run as part of the build pipeline. We had to use things like DIND at multiple nested container levels, which forced us to write many workaround scripts to support this effort.
The goal of this issue is to discuss how we can drop the assumption that a job can run docker build|run and move to a more GHA-native way to build and run PyTorch containers for the build and test pipelines, allowing us to more easily adopt container-based self-hosted runner autoscalers.