Runner to workflow pods take 3 minutes to start on RWX & containerMode: Kubernetes #3834
Comments
@alexgaganashvili @nikola-jokic Hey Nikola & Alex - I've seen y'all encounter similar issues before, let me know if you see something! Deeply appreciated.
I don't think it's slow PV provisioning, since the same PV is shared between the runner and the workflow pod. Maybe K8s is trying to find a node that fits your resource requests (ACTIONS_RUNNER_USE_KUBE_SCHEDULER=true)? Also check the kube-scheduler logs.
Hey @alexgaganashvili - thanks for the comment. I checked the kube-scheduler logs. The workflow pod does have room to be scheduled on the node (5000m CPU available to request, with 1 workflow pod at a 3000m CPU request). I feel it has something to do with this process. If you look at the timestamps, it's stuck for a minute repeating the same pod logs. I wonder what the best way to debug this further would be.
Sorry, hard to tell what's causing it. I have not personally run into this issue. I'd suggest you also ask in the Discussions.
@jonathan-fileread, cc: @Link-, @nikola-jokic
Checks
Controller Version
0.9.3
Deployment Method
Helm
To Reproduce
Describe the bug
After the runner pod initializes (which is fairly immediate), the GitHub Actions jobs (6 of them) seem to get stuck polling for 2-3 minutes while waiting for the workflow pod to spin up and continue the job.
The runner pod logs show a job polling every 5-10 seconds for 2-3 minutes before the container hook is called and the workflow pod is spun up.
See lines 6-52 in the scaleset logs gist below; you'll see this line logged every few seconds:
[WORKER 2024-12-03 19:21:58Z INFO HostContext] Well known directory 'Root': '/home/runner'
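To put a number on the stall, here is a rough Python sketch that measures how long a given message keeps repeating in a saved copy of the runner log. It assumes every line follows the same `[PROCESS yyyy-mm-dd hh:mm:ssZ LEVEL Source] message` layout as the line quoted above. The function names `parse_line` and `repeat_window` are made up for this illustration and are not part of ARC or the runner.

```python
import re
from datetime import datetime, timezone

# Assumed layout, taken from the single log line quoted in this issue:
# [WORKER 2024-12-03 19:21:58Z INFO HostContext] Well known directory 'Root': '/home/runner'
LINE_RE = re.compile(
    r"^\[(?P<proc>\w+) (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z "
    r"(?P<level>\w+) (?P<source>\w+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, message) for a runner log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    return stamp.replace(tzinfo=timezone.utc), match.group("msg")


def repeat_window(lines, needle):
    """Count lines whose message contains `needle` and return
    (count, seconds between the first and last occurrence)."""
    stamps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and needle in parsed[1]:
            stamps.append(parsed[0])
    if not stamps:
        return 0, 0.0
    return len(stamps), (max(stamps) - min(stamps)).total_seconds()
```

For example, calling `repeat_window(open("runner.log"), "Well known directory 'Root'")` on a locally saved log (the file name is a placeholder) returns how many times the line appeared and the span in seconds. That span can then be compared against the creation timestamp of the workflow pod.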
This bug started occurring when we switched to RWX with a new storage class using NFS-based Azure Files. I suppose provisioning a PVC on Azure Files might be slower than the traditional disk-based setup on RWO.
Describe the expected behavior
After the runner pod initializes for a new GitHub Actions job, the workflow pods should spin up almost immediately to process the Docker builds from each GHA job.
Additional Context
Controller Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9
Runner Pod Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9