checkout v4 fails with ACTIONS_RUNNER_CONTAINER_HOOKS #145
Comments
Hey @cgundy, I am failing to reproduce the issue. I forced the runner internal node version like you specified, and this is the output: [...]
Hi @nikola-jokic, thank you very much for testing it out. Yes, I am using the latest runner image.
Did you also test on a runner that uses a custom pod template?
I have not, but it would depend on what the template is, right? The hook template modifies the job pod that you specify. So if the spec for the new pod is invalid, then the action would fail. But I'm more worried about the [...]
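For reference, a hook pod template of the kind being discussed is usually wired up by pointing the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable on the runner at a mounted YAML file. The sketch below is illustrative only; the metadata and resource values are assumptions, not the template from this setup, and the exact schema should be checked against the ARC container hook extension docs.

```yaml
# Illustrative hook extension template (not the actual template from this thread).
# Mounted into the runner pod and referenced via ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE.
apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
spec:
  containers:
    # "$job" targets the job container created by the hook; the fields here
    # are merged into its spec, so an invalid spec makes the job pod fail.
    - name: "$job"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```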
Hi, thanks for the quick response. I think you're onto something. I tested checkout v4 without using the pod template [...]. For completeness, here is my pod template: [...]
And I am using cephfs as the storage class for the work volume.
I'd rather not change our storage class since it has been working well with this setup otherwise, but I am open to any suggestions or debugging steps I can take.
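For context, in Kubernetes container mode the runner's work directory is backed by a PersistentVolumeClaim that both the runner pod and the job pod mount, which is why a multi-attach (ReadWriteMany) storage class like CephFS comes into play. A minimal sketch of how that is typically declared, assuming the gha-runner-scale-set Helm chart; the storage class name and size are assumptions:

```yaml
# Sketch of gha-runner-scale-set values for Kubernetes container mode.
# storageClassName and storage size are assumptions for illustration.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]   # needed when job pods can land on other nodes
    storageClassName: "cephfs"
    resources:
      requests:
        storage: 10Gi
```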
@nikola-jokic this is still an ongoing issue for us. We've kept using checkout@v3 for now, but we're in a situation where we need to upgrade. I've checked that the permissions are all correct. If you have any suggestions for debugging steps please let me know, as the only options we may have left are to stop using the kube scheduler or to move to dind.
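For completeness, the dind alternative mentioned above side-steps the shared work volume entirely, because job containers run via Docker inside the runner pod rather than as separate pods. A sketch, again assuming the gha-runner-scale-set chart:

```yaml
# Sketch: switching the scale set to Docker-in-Docker mode. No RWX work
# volume is needed, since job containers run inside the runner pod itself.
containerMode:
  type: "dind"
```

The trade-off is that dind typically requires privileged containers and gives up the per-job pod scheduling that Kubernetes mode provides.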
Could you share your workflow file? Did you manage to create a minimal reproduction? I'm wondering if the node binary we mount is the issue, but I'm not sure. It works for the ubuntu image, so maybe the check for which node build to mount is wrong (we compile node for alpine in order to mount it into alpine-based containers).
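For anyone trying to reproduce this, a minimal workflow of the shape being discussed might look like the following. The runner label and container image are assumptions, not taken from this thread; swapping the image for an alpine-based one exercises the alpine node mount path mentioned above.

```yaml
# Minimal reproduction sketch; "my-k8s-runner-set" is a hypothetical
# runner scale set label.
name: checkout-v4-repro
on: workflow_dispatch

jobs:
  checkout:
    runs-on: my-k8s-runner-set
    container:
      image: ubuntu:22.04   # try an alpine-based image to compare node mounting
    steps:
      - uses: actions/checkout@v4
```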
Just wanted to add a note here that I have been able to observe this issue under very similar circumstances - running in kube scheduler mode with Rook CephFS for multi-attach. We're attempting to do some debugging on our end in this area, as we're not seeing a consistent link between the checkout and this issue. That is, sometimes checkouts succeed and tasks following the checkout fail (for example). I will ping back here again if we find additional information that may help.
We've done some validation on this issue and have some interesting insights. The tests we performed, with results, are listed below:

1 - Conditions: [...]
2 - Conditions: [...]
3 - Conditions: [...]
4 - Conditions: [...]
5 - Conditions: [...]

Conclusion: It looks like there is some kind of filesystem-level cache, or slight file lag, when workloads running on two different nodes read and write the same file (perhaps some kind of stale data). We have seen some examples where checkouts succeed, but we aren't able to reproduce these successes to narrow down exactly what is different in those cases; for now we're assuming this is just good luck - the successful runs seem to be independent of any changes we make and are extremely uncommon.

Todo: investigate mount options, sync/cache options, and possibly locking options available in ceph.

Hopefully this information is useful / not too unnecessary.
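A minimal sketch of the kind of cross-node test described above: two pods pinned to different nodes, sharing one ReadWriteMany PVC, one writing a file and the other reading it back, to watch for stale reads. The storage class and node names are assumptions.

```yaml
# PVC shared by both pods (storageClassName is an assumption)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-rwx
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: cephfs
  resources:
    requests:
      storage: 1Gi
---
# Writer pod, pinned to one node (node names are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  nodeName: node-a
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "for i in $(seq 1 120); do date > /mnt/test.txt; sleep 1; done"]
      volumeMounts:
        - name: shared
          mountPath: /mnt
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-rwx
---
# Reader pod on a different node: watch for stale or missing data
apiVersion: v1
kind: Pod
metadata:
  name: reader
spec:
  nodeName: node-b
  restartPolicy: Never
  containers:
    - name: reader
      image: busybox
      command: ["sh", "-c", "for i in $(seq 1 120); do cat /mnt/test.txt; sleep 1; done"]
      volumeMounts:
        - name: shared
          mountPath: /mnt
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-rwx
```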
Hi, we are running a similar setup (GitHub runners backed by Ceph volumes), and we had the exact same symptom.
Upgrading our k8s nodes to kernel [...] is what triggered the symptom for us. We suspect the issue lies in some modification made to the Ceph driver in the Linux kernel in [...]. We wanted to test with an even more recent kernel version ([...]). We are not sure that our issue is related to the one described first in this issue, but the symptom and error are exactly the same.
That's very interesting - you may have found the potential cause. Please keep the thread updated with your findings; if you do get to a solution it will be extremely useful for us. It sounds very much like the symptoms you're seeing line up with some of the behaviours we were seeing, and no doubt others have hit this too.
We opened an issue about this on the Ceph issue tracker: https://tracker.ceph.com/issues/69841. The triage was really quick, but they have not looked into it yet. For now we have manually pinned the kernel version in our setup as a short/medium-term solution, but we definitely want to get to the bottom of this and get the issue fixed. What version of the kernel were you running at the time? Does it concur with our findings?
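Regarding the mount/sync options mentioned in the earlier todo: if the volumes are provisioned by the ceph-csi CephFS driver (for example via Rook), mount options for the kernel client can be set per StorageClass. The sketch below is untested for this particular symptom; the cluster, filesystem, and pool names are assumptions, and `wsync` simply forces synchronous directory operations instead of the newer async behaviour.

```yaml
# Illustrative StorageClass for the ceph-csi CephFS driver (e.g. deployed by Rook).
# Cluster, filesystem, and pool names are assumptions; provisioner/node-stage
# secret parameters are omitted for brevity.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-wsync
provisioner: rook-ceph.cephfs.csi.ceph.com   # adjust to your driver's name
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated
  # Mount options passed to the kernel CephFS client. "wsync" forces
  # synchronous directory operations; whether it affects this particular
  # symptom is untested here.
  kernelMountOptions: wsync
reclaimPolicy: Delete
allowVolumeExpansion: true
```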
Original issue description:

When trying to upgrade the GitHub checkout action from v3 to v4 using self-hosted runners with Kubernetes mode, I consistently get the following error: [...]

I've tried upgrading the internal runner node version from 16 to 20 using: [...]

But I still see the same error. I believe this is a somewhat urgent issue, as GitHub Actions won't support node16 after Spring 2024 anymore (post) and we will need to upgrade the checkout action from v3 to v4. Thank you!
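The exact snippet referenced above isn't included here, but forcing the runner's internal node version is generally done via an environment variable on the runner container. The variable name below is the one the runner has used for this purpose in recent releases; treat it as an assumption and confirm against the actions/runner release notes for your version.

```yaml
# Illustrative only: environment variable on the runner container
# (e.g. in the runner pod spec or Helm values); confirm the exact
# name and value against your runner version's documentation.
env:
  - name: ACTIONS_RUNNER_FORCED_INTERNAL_NODE_VERSION
    value: "node20"
```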