Template expansion may fail in argo 3.5.10 and 3.5.11 #13780

Open
3 of 4 tasks
mist714 opened this issue Oct 18, 2024 · 7 comments
Labels
  • area/looping: `withParams`, `withItems`, and `withSequence`
  • area/templating: Templating with `{{...}}`
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@mist714

mist714 commented Oct 18, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Template expansion may fail in argo 3.5.10 and 3.5.11.

It doesn't fail every time, but seems to occur when there is a certain amount of load.
Also, we manage multiple clusters, and even though they are the same version, there are some clusters where the problem occurs and some where it doesn't.

This problem does not exist in v3.4.6.
I discovered the problem when upgrading from 3.4.6 to 3.5.10; I then tried 3.5.11, but the same error occurred.

Version(s)

v3.5.10,v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sample-
spec:
  entrypoint: entrypoint
  templates:
  - name: entrypoint
    steps:
    - - name: generate-param
        template: generate-param
    - - name: with-param
        template: with-param
        arguments:
          parameters:
          - name: item
            value: '{{item}}'
        withParam: '{{steps.generate-param.outputs.result}}'
  - name: generate-param
    container:
      image: busybox
      command: [echo, '["foo", "bar", "baz"]']
  - name: with-param
    inputs:
      parameters:
      - name: item
    container:
      image: busybox
      command: [echo, '{{inputs.parameters.item}}']

While this workflow works well in many cases, there are some cases where it can fail:

NAME           STATUS      AGE   DURATION   PRIORITY   MESSAGE
sample-twvvt   Failed      38s   20s        0          withParam value could not be parsed as a JSON list: {{steps.generate-param.outputs.result}}: invalid character '{' looking for beginning of object key string
sample-ft2vd   Failed      39s   10s        0          withParam value could not be parsed as a JSON list: {{steps.generate-param.outputs.result}}: invalid character '{' looking for beginning of object key string
sample-hsbz8   Failed      40s   10s        0          withParam value could not be parsed as a JSON list: {{steps.generate-param.outputs.result}}: invalid character '{' looking for beginning of object key string
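The tail of that message reads like a JSON decoding error from Go's standard library. The sketch below is a hedged illustration of that reading, not Argo's actual code: `parseWithParam` is a hypothetical helper that decodes a `withParam` value as a JSON list with `encoding/json`. When the placeholder text itself (rather than the step's output) is handed to the decoder, the first `{` opens an object and the second `{` appears where a key string is expected, which yields the same "invalid character '{' looking for beginning of object key string" text as in the report.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseWithParam is a hypothetical helper for illustration only (it is not
// taken from the Argo code base). It decodes a withParam value as a JSON list.
func parseWithParam(raw string) ([]interface{}, error) {
	var items []interface{}
	if err := json.Unmarshal([]byte(raw), &items); err != nil {
		return nil, err
	}
	return items, nil
}

func main() {
	// Resolved case: the stdout of generate-param in the sample workflow.
	items, err := parseWithParam(`["foo", "bar", "baz"]`)
	fmt.Println(items, err)

	// Unresolved case: the template placeholder reaches the decoder verbatim.
	_, err = parseWithParam(`{{steps.generate-param.outputs.result}}`)
	fmt.Println(err)
}
```

If this reading is right, the failure means the `{{steps.generate-param.outputs.result}}` reference had not been substituted when the `withParam` value was parsed; it does not by itself say why the substitution was skipped.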

Logs from the workflow controller

Nothing showed up in the controller log.

Logs from your workflow's wait container

Since the template expansion failed, init containers are not starting either.
@jswxstw jswxstw added area/templating Templating with `{{...}}` area/looping `withParams`, `withItems`, and `withSequence` labels Oct 18, 2024
@Joibel
Member

Joibel commented Oct 21, 2024

Can you ensure you've got pod RBAC correct: https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/

Can you provide logs from the wait container of the generate-param step?

Are you running the 3.5.11 executor in your created pods?

@jswxstw jswxstw added the problem/more information needed Not enough information has been provide to diagnose this issue. label Oct 22, 2024
@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Oct 22, 2024
@mist714
Author

mist714 commented Oct 23, 2024

> Can you ensure you've got pod RBAC correct: https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/

The same manifests usually work and only sometimes fail, so I don't think RBAC is the cause.

> Can you provide logs from the wait container of the generate-param step?

This error has only occurred in the production environment, and Argo Workflows there has already been rolled back.
We will keep trying to reproduce it in the test environment.

> Are you running the 3.5.11 executor in your created pods?

I also tried the 3.5.11 executor (same version as the controller) and got the same error.

Contributor

github-actions bot commented Nov 6, 2024

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale This has not had a response in some time label Nov 6, 2024
@black-snow

Seeing this on v3.5.12 as well.

@github-actions github-actions bot removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provide to diagnose this issue. labels Nov 9, 2024
@RafaPinzon93

We have the issue on v3.5.12, especially when the system is under heavy load.

> Can you provide logs from the wait container of the generate-param step?

These are the logs when it fails:

time="2024-11-07T17:33:11.128Z" level=info msg="Starting Workflow Executor" version=v3.5.12
time="2024-11-07T17:33:11.132Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-11-07T17:33:11.132Z" level=info msg="Executor initialized" deadline="2024-11-10 17:33:05 +0000 UTC" includeScriptOutput=false namespace=sonar-workflows podName=sonar-wow-v2-clgcv-read-group-561843697 templateName=read-group version="&Version{Version:v3.5.12,BuildDate:2024-10-30T10:56:15Z,GitCommit:8fe8de2e16ec39a5477df17586a3d212ec63a4bd,GitTag:v3.5.12,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-11-07T17:33:11.196Z" level=info msg="Starting deadline monitor"
time="2024-11-07T17:33:17.211Z" level=info msg="Main container completed" error="<nil>"
time="2024-11-07T17:33:17.211Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-11-07T17:33:17.211Z" level=info msg="Saving output parameters"
time="2024-11-07T17:33:17.211Z" level=info msg="Saving path output parameter: groups"
time="2024-11-07T17:33:17.211Z" level=info msg="Copying /mnt/vol/groups.json from volume mount"
time="2024-11-07T17:33:17.223Z" level=info msg="Successfully saved output parameter: groups"
time="2024-11-07T17:33:17.223Z" level=info msg="No output artifacts"
time="2024-11-07T17:33:17.244Z" level=info msg="Alloc=8593 TotalAlloc=13731 Sys=19557 NumGC=4 Goroutines=8"
time="2024-11-07T17:33:17.251Z" level=info msg="Deadline monitor stopped"
time="2024-11-07T17:33:17.251Z" level=info msg="stopping progress monitor (context done)" error="context canceled"

And when it's working correctly:


time="2024-11-08T11:27:05.754Z" level=info msg="Starting Workflow Executor" version=v3.5.12
time="2024-11-08T11:27:05.851Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-11-08T11:27:05.870Z" level=info msg="Executor initialized" deadline="2024-11-11 11:26:36 +0000 UTC" includeScriptOutput=false namespace=sonar-workflows podName=sonar-wow-v2-fvv7h-read-group-764570069 templateName=read-group version="&Version{Version:v3.5.12,BuildDate:2024-10-30T10:56:15Z,GitCommit:8fe8de2e16ec39a5477df17586a3d212ec63a4bd,GitTag:v3.5.12,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-11-08T11:27:06.246Z" level=info msg="Starting deadline monitor"
time="2024-11-08T11:27:53.751Z" level=info msg="Main container completed" error="<nil>"
time="2024-11-08T11:27:53.751Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-11-08T11:27:53.751Z" level=info msg="Saving output parameters"
time="2024-11-08T11:27:53.751Z" level=info msg="Saving path output parameter: groups"
time="2024-11-08T11:27:53.751Z" level=info msg="Copying /mnt/vol/groups.json from volume mount"
time="2024-11-08T11:27:53.753Z" level=info msg="Successfully saved output parameter: groups"
time="2024-11-08T11:27:53.787Z" level=info msg="No output artifacts"
time="2024-11-08T11:27:53.945Z" level=info msg="Alloc=7697 TotalAlloc=13747 Sys=23397 NumGC=4 Goroutines=8"

These two lines, which appear only in the log of the failing run, seem relevant:

time="2024-11-07T17:33:17.251Z" level=info msg="Deadline monitor stopped"
time="2024-11-07T17:33:17.251Z" level=info msg="stopping progress monitor (context done)" error="context canceled"

Let me know if I can provide anything else to help investigate.
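The comparison of the failing and working wait-container logs above can be sketched as a small script. This is a minimal illustration with hypothetical helper names (`messages`, `onlyIn`), assuming the logfmt-style `msg="..."` field seen in the quoted logs; it lists messages present in one log but not in the other. The excerpts in `main` are abbreviated from the logs quoted in this comment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// msgRe captures the value of the msg="..." field, allowing escaped quotes.
var msgRe = regexp.MustCompile(`msg="((?:[^"\\]|\\.)*)"`)

// messages returns the msg field of every log line that has one.
func messages(log string) []string {
	var out []string
	for _, line := range strings.Split(log, "\n") {
		if m := msgRe.FindStringSubmatch(line); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

// onlyIn returns the messages of a that never occur in b, in order.
func onlyIn(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, m := range b {
		seen[m] = true
	}
	var diff []string
	for _, m := range a {
		if !seen[m] {
			diff = append(diff, m)
		}
	}
	return diff
}

func main() {
	// Abbreviated excerpts of the two logs quoted in this comment.
	failing := `time="2024-11-07T17:33:17.223Z" level=info msg="No output artifacts"
time="2024-11-07T17:33:17.251Z" level=info msg="Deadline monitor stopped"
time="2024-11-07T17:33:17.251Z" level=info msg="stopping progress monitor (context done)" error="context canceled"`
	working := `time="2024-11-08T11:27:53.787Z" level=info msg="No output artifacts"`

	for _, m := range onlyIn(messages(failing), messages(working)) {
		fmt.Println(m)
	}
}
```

Messages that embed run-specific numbers (for example the `Alloc=...` line) will always show up as differences with this exact-match approach, and a difference only reflects what each log capture happened to include.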

@tooptoop4
Contributor

@RafaPinzon93 do you have the controller logs? I think this is happening even in 3.4.11, as I encounter #13799 when under load.

@RafaPinzon93

@tooptoop4 It seems we are missing some of the controller logs from that time. The last logs that I can see are these:

time="2024-11-07T18:10:47.693Z" level=warning msg="Non-transient error: <nil>"
time="2024-11-07T16:38:55.136Z" level=info msg="Workflow step group node sonar-wow-v2-p8h4p-sub-thpqx-4266272020 not yet completed" namespace=sonar-workflows workflow=sonar-wow-v2-p8h4p-sub-thpqx
aks-spotbigcpu16-12311640-vmss000001
time="2024-11-07T16:38:55.133Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=sonar-workflows workflow=sonar-wow-v2-p8h4p-sub-rb9kr
