checkout v4 fails with ACTIONS_RUNNER_CONTAINER_HOOKS #145
Comments
Hey @cgundy, I am failing to reproduce the issue. I forced the runner internal node version like you specified, and this is the output: [...]
Hi @nikola-jokic, thank you very much for testing it out. Yes, I am using the latest runner image.
Did you also test on a runner that uses a custom pod template?
I have not, but it would depend on what the template is, right? The hook template modifies the job pod that you specify. So if the spec for the new pod is invalid, then the action would fail. But I'm more worried about the [...]
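For reference, a hook pod template of the kind being discussed is usually wired up by pointing the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable on the runner at a mounted YAML file. The sketch below is illustrative only; the metadata and resource values are assumptions, not the template from this setup, and the exact schema should be checked against the ARC container hook extension docs.

```yaml
# Illustrative hook extension template (not the actual template from this thread).
# Mounted into the runner pod and referenced via ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE.
apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
spec:
  containers:
    # "$job" targets the job container created by the hook; the fields here
    # are merged into its spec, so an invalid spec makes the job pod fail.
    - name: "$job"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```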
Hi, thanks for the quick response. I think you're onto something. I tested checkout v4 without using the pod template [...]. For completeness, here is my pod template: [...]
And I am using cephfs as the storage class for the work volume.
I'd rather not change our storage class since it has been working well with this setup otherwise, but I am open to any suggestions or debugging steps I can take.
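For context, in Kubernetes container mode the runner's work directory is backed by a PersistentVolumeClaim that both the runner pod and the job pod mount, which is why a multi-attach (ReadWriteMany) storage class like CephFS comes into play. A minimal sketch of how that is typically declared, assuming the gha-runner-scale-set Helm chart; the storage class name and size are assumptions:

```yaml
# Sketch of gha-runner-scale-set values for Kubernetes container mode.
# storageClassName and storage size are assumptions for illustration.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]   # needed when job pods can land on other nodes
    storageClassName: "cephfs"
    resources:
      requests:
        storage: 10Gi
```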
@nikola-jokic this is still an ongoing issue for us. We've kept using checkout@v3 for now, but we're in a situation where we need to upgrade. I've checked that the permissions are all correct. If you have any suggestions for debugging steps please let me know, as the only options we may have left are to stop using the kube scheduler or to move to dind.
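For completeness, the dind alternative mentioned above side-steps the shared work volume entirely, because job containers run via Docker inside the runner pod rather than as separate pods. A sketch, again assuming the gha-runner-scale-set chart:

```yaml
# Sketch: switching the scale set to Docker-in-Docker mode. No RWX work
# volume is needed, since job containers run inside the runner pod itself.
containerMode:
  type: "dind"
```

The trade-off is that dind typically requires privileged containers and gives up the per-job pod scheduling that Kubernetes mode provides.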
Could you share your workflow file? Did you manage to create a minimal reproduction? I'm wondering if the node binary we mount is the issue, but I'm not sure. It works for the ubuntu image, so maybe the check for which node build to mount is wrong (we compile node for alpine in order to mount it into alpine-based containers).
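For anyone trying to reproduce this, a minimal workflow of the shape being discussed might look like the following. The runner label and container image are assumptions, not taken from this thread; swapping the image for an alpine-based one exercises the alpine node mount path mentioned above.

```yaml
# Minimal reproduction sketch; "my-k8s-runner-set" is a hypothetical
# runner scale set label.
name: checkout-v4-repro
on: workflow_dispatch

jobs:
  checkout:
    runs-on: my-k8s-runner-set
    container:
      image: ubuntu:22.04   # try an alpine-based image to compare node mounting
    steps:
      - uses: actions/checkout@v4
```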
Just wanted to add a note here that I have been able to observe this issue under very similar circumstances - running in kube scheduler mode with Rook CephFS for multi-attach. We're attempting to do some debugging on our end in this area, as we're not seeing a consistent link between the checkout and this issue. That is, sometimes checkouts succeed and tasks following the checkout fail (for example). I will ping back here again if we find additional information that may help.
We've done some validation on this issue and have some interesting insights. The tests we performed, with results, are listed below:

1 - Conditions: [...]
2 - Conditions: [...]
3 - Conditions: [...]
4 - Conditions: [...]
5 - Conditions: [...]

Conclusion: It looks like there is some kind of filesystem-level cache, or slight file lag, when workloads running on two different nodes read and write the same file (perhaps some kind of stale data). We have seen some examples where checkouts succeed, but we aren't able to reproduce these successes to narrow down exactly what is different in those cases; for now we're assuming this is just good luck - the successful runs seem to be independent of any changes we make and are extremely uncommon.

Todo: investigate mount options, sync/cache options, and possibly locking options available in ceph.

Hopefully this information is useful / not too unnecessary.
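A minimal sketch of the kind of cross-node test described above: two pods pinned to different nodes, sharing one ReadWriteMany PVC, one writing a file and the other reading it back, to watch for stale reads. The storage class and node names are assumptions.

```yaml
# PVC shared by both pods (storageClassName is an assumption)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-rwx
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: cephfs
  resources:
    requests:
      storage: 1Gi
---
# Writer pod, pinned to one node (node names are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  nodeName: node-a
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "for i in $(seq 1 120); do date > /mnt/test.txt; sleep 1; done"]
      volumeMounts:
        - name: shared
          mountPath: /mnt
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-rwx
---
# Reader pod on a different node: watch for stale or missing data
apiVersion: v1
kind: Pod
metadata:
  name: reader
spec:
  nodeName: node-b
  restartPolicy: Never
  containers:
    - name: reader
      image: busybox
      command: ["sh", "-c", "for i in $(seq 1 120); do cat /mnt/test.txt; sleep 1; done"]
      volumeMounts:
        - name: shared
          mountPath: /mnt
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-rwx
```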
Hi, we are running a similar setup (GitHub runners backed by Ceph volumes), and we had the exact same symptom.
Upgrading our k8s nodes to kernel [...] is what triggered the symptom for us. We suspect the issue lies in some modification made to the Ceph driver in the Linux kernel in [...]. We wanted to test with an even more recent kernel version ([...]). We are not sure that our issue is related to the one described first in this issue, but the symptom and error are exactly the same.
That's very interesting - you may have found the potential cause. Please keep the thread updated with your findings; if you do get to a solution it will be extremely useful for us. It sounds very much like the symptoms you're seeing line up with some of the behaviours we were seeing, and no doubt others have hit this too.
We opened an issue about this on the Ceph issue tracker: https://tracker.ceph.com/issues/69841. The triage was really quick, but they have not looked into it yet. For now we have manually pinned the kernel version in our setup as a short/medium-term solution, but we definitely want to get to the bottom of this and get the issue fixed. What version of the kernel were you running at the time? Does it concur with our findings?
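Regarding the mount/sync options mentioned in the earlier todo: if the volumes are provisioned by the ceph-csi CephFS driver (for example via Rook), mount options for the kernel client can be set per StorageClass. The sketch below is untested for this particular symptom; the cluster, filesystem, and pool names are assumptions, and `wsync` simply forces synchronous directory operations instead of the newer async behaviour.

```yaml
# Illustrative StorageClass for the ceph-csi CephFS driver (e.g. deployed by Rook).
# Cluster, filesystem, and pool names are assumptions; provisioner/node-stage
# secret parameters are omitted for brevity.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-wsync
provisioner: rook-ceph.cephfs.csi.ceph.com   # adjust to your driver's name
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated
  # Mount options passed to the kernel CephFS client. "wsync" forces
  # synchronous directory operations; whether it affects this particular
  # symptom is untested here.
  kernelMountOptions: wsync
reclaimPolicy: Delete
allowVolumeExpansion: true
```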
Original issue description:

When trying to upgrade the GitHub checkout action from v3 to v4 using self-hosted runners with Kubernetes mode, I consistently get the following error: [...]

I've tried upgrading the internal runner node version from 16 to 20 using: [...]

But I still see the same error. I believe this is a somewhat urgent issue, as GitHub Actions won't support node16 after Spring 2024 anymore (post) and we will need to upgrade the checkout action from v3 to v4. Thank you!
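The exact snippet referenced above isn't included here, but forcing the runner's internal node version is generally done via an environment variable on the runner container. The variable name below is the one the runner has used for this purpose in recent releases; treat it as an assumption and confirm against the actions/runner release notes for your version.

```yaml
# Illustrative only: environment variable on the runner container
# (e.g. in the runner pod spec or Helm values); confirm the exact
# name and value against your runner version's documentation.
env:
  - name: ACTIONS_RUNNER_FORCED_INTERNAL_NODE_VERSION
    value: "node20"
```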