runner set fails to fetch the latest image #3292

xuzhao9 · 2024-02-15T03:50:38Z

Hello! We recently upgraded our ARC cluster to runner scale sets. It works great, except that we expect the runners to always pull the latest tag of the image: https://github.com/pytorch/benchmark/blob/main/docker/infra/values.yaml#L226

We rebuild the image nightly, and we want all runners to run the latest tag of the image. This worked well in the legacy mode of ARC.

However, after the upgrade, we found that the runner image is often outdated and is not updated to the latest tag even though we have pushed a new image. For example:
We pushed the dev20240214 image at 10:30 AM EST: https://github.com/pytorch/benchmark/actions/runs/7903121973, https://github.com/pytorch/benchmark/pkgs/container/torchbench/178971035?tag=dev20240214

However, the workflow that started at 12:00 PM EST still used the old dev20240213 image: https://github.com/pytorch/benchmark/actions/runs/7904726803/job/21575716541

Is there a way to make the K8s controller update the runner's image more aggressively?

xuzhao9 · 2024-02-16T17:45:51Z

Author

I checked, and this happens because the Pod has been running for a long time (17 hrs). Newly created Pods do not have this issue.
Can we somehow set an expiration time for the Pods so that they are restarted with the new Docker image?
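
To make the idea concrete: Kubernetes pulls an image only when a container starts, so a Pod that has been up for 17 hrs keeps whatever image it started with. A minimal sketch of the "expiration" check I have in mind is below. The function name `expired_pods`, the `(name, creation_time)` tuple shape, and the 12-hour threshold are all hypothetical and only for illustration; this is not an existing ARC feature.

```python
from datetime import datetime, timedelta, timezone


def expired_pods(pods, now, max_age):
    """Return the names of pods older than max_age.

    `pods` is a list of (name, creation_time) tuples. This shape is a
    hypothetical stand-in for whatever the cluster API would return.
    """
    return [name for name, created in pods if now - created > max_age]


# Illustrative data: one pod up for 17 hours, one created an hour ago.
now = datetime(2024, 2, 16, 17, 0, tzinfo=timezone.utc)
pods = [
    ("runner-old", now - timedelta(hours=17)),
    ("runner-new", now - timedelta(hours=1)),
]

# Assumed threshold: anything older than 12 hours counts as stale.
stale = expired_pods(pods, now, max_age=timedelta(hours=12))
print(stale)  # ['runner-old']
```

The names returned could then be passed to `kubectl delete pod`, ideally only for idle runners, so that their replacements start with the freshly pulled image.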
