Template expansion may fail in argo 3.5.10 and 3.5.11 #13780

mist714 · 2024-10-18T01:09:08Z

Pre-requisites

I have double-checked my configuration
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug
I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Template expansion may fail in argo 3.5.10 and 3.5.11.

It doesn't fail every time, but it seems to occur when the system is under a certain amount of load.
We also manage multiple clusters on the same version, and the problem occurs on some of them but not on others.

This problem does not exist in v3.4.6.
I discovered it when upgrading from 3.4.6 to 3.5.10. I then tried 3.5.11, but the same error occurred.

Version(s)

v3.5.10, v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sample-
spec:
  entrypoint: entrypoint
  templates:
  - name: entrypoint
    steps:
    - - name: generate-param
        template: generate-param
    - - name: with-param
        template: with-param
        arguments:
          parameters:
          - name: item
            value: '{{item}}'
        withParam: '{{steps.generate-param.outputs.result}}'
  - name: generate-param
    container:
      image: busybox
      command: [echo, '["foo", "bar", "baz"]']
  - name: with-param
    inputs:
      parameters:
      - name: item
    container:
      image: busybox
      command: [echo, '{{inputs.parameters.item}}']

This workflow usually succeeds, but it sometimes fails:

NAME           STATUS      AGE   DURATION   PRIORITY   MESSAGE
sample-twvvt   Failed      38s   20s        0          withParam value could not be parsed as a JSON list: {{steps.generate-param.outputs.result}}: invalid character '{' looking for beginning of object key string
sample-ft2vd   Failed      39s   10s        0          withParam value could not be parsed as a JSON list: {{steps.generate-param.outputs.result}}: invalid character '{' looking for beginning of object key string
sample-hsbz8   Failed      40s   10s        0          withParam value could not be parsed as a JSON list: {{steps.generate-param.outputs.result}}: invalid character '{' looking for beginning of object key string
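
The message quotes the raw placeholder, which suggests the `withParam` string reached the JSON parser without being substituted. The sketch below illustrates that reading. It is written in Python rather than the controller's Go, so the parser's wording differs from the message above, and `parse_with_param` is a hypothetical helper, not an Argo function.

```python
import json


def parse_with_param(value: str) -> list:
    """Hypothetical stand-in for a withParam check: the value must be a JSON list."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"withParam value could not be parsed as a JSON list: {value}: {exc.msg}"
        ) from exc
    if not isinstance(parsed, list):
        raise ValueError(f"withParam value is not a JSON list: {value}")
    return parsed


# What generate-param echoes to stdout in the sample workflow.
resolved = '["foo", "bar", "baz"]'
# What the parser appears to receive in the failing runs (placeholder left as-is).
unresolved = "{{steps.generate-param.outputs.result}}"

items = parse_with_param(resolved)

try:
    parse_with_param(unresolved)
    failed = False
    error_text = ""
except ValueError as err:
    failed = True
    error_text = str(err)
```

The resolved string parses into three items, while the unresolved placeholder is rejected at its second `{`, which matches the character named in the error above.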

Logs from the workflow controller

Nothing showed up in the controller log.

Logs from in your workflow's wait container

Since the template expansion failed, init containers are not starting either.


Joibel · 2024-10-21T11:42:49Z

Can you ensure you've got pod RBAC correct: https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/

Can you provide logs from the wait container of the generate-param step?

Are you running the 3.5.11 executor in your created pods?

mist714 · 2024-10-23T01:29:40Z

> Can you ensure you've got pod RBAC correct: https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/

The same manifests usually work and only sometimes fail, so I don't think RBAC is the cause.

> Can you provide logs from the wait container of the generate-param step?

This error only occurs in the production environment, and the argo-workflows in the production environment have already been rolled back.
We will continue to verify in the test environment whether it can be reproduced.

> Are you running the 3.5.11 executor in your created pods?

I also tried the 3.5.11 executor (the same version as the controller) and got the same error.

github-actions · 2024-11-06T02:22:16Z

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

black-snow · 2024-11-08T11:51:08Z

Seeing this on v3.5.12 as well.

RafaPinzon93 · 2024-11-09T10:11:03Z

We have the issue on v3.5.12, especially when the system is under heavy load.

> Can you provide logs from the wait container of the generate-param step?

These are the logs when it fails:

time="2024-11-07T17:33:11.128Z" level=info msg="Starting Workflow Executor" version=v3.5.12
time="2024-11-07T17:33:11.132Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-11-07T17:33:11.132Z" level=info msg="Executor initialized" deadline="2024-11-10 17:33:05 +0000 UTC" includeScriptOutput=false namespace=sonar-workflows podName=sonar-wow-v2-clgcv-read-group-561843697 templateName=read-group version="&Version{Version:v3.5.12,BuildDate:2024-10-30T10:56:15Z,GitCommit:8fe8de2e16ec39a5477df17586a3d212ec63a4bd,GitTag:v3.5.12,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-11-07T17:33:11.196Z" level=info msg="Starting deadline monitor"
time="2024-11-07T17:33:17.211Z" level=info msg="Main container completed" error="<nil>"
time="2024-11-07T17:33:17.211Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-11-07T17:33:17.211Z" level=info msg="Saving output parameters"
time="2024-11-07T17:33:17.211Z" level=info msg="Saving path output parameter: groups"
time="2024-11-07T17:33:17.211Z" level=info msg="Copying /mnt/vol/groups.json from volume mount"
time="2024-11-07T17:33:17.223Z" level=info msg="Successfully saved output parameter: groups"
time="2024-11-07T17:33:17.223Z" level=info msg="No output artifacts"
time="2024-11-07T17:33:17.244Z" level=info msg="Alloc=8593 TotalAlloc=13731 Sys=19557 NumGC=4 Goroutines=8"
time="2024-11-07T17:33:17.251Z" level=info msg="Deadline monitor stopped"
time="2024-11-07T17:33:17.251Z" level=info msg="stopping progress monitor (context done)" error="context canceled"

And when it's working correctly:


time="2024-11-08T11:27:05.754Z" level=info msg="Starting Workflow Executor" version=v3.5.12
time="2024-11-08T11:27:05.851Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-11-08T11:27:05.870Z" level=info msg="Executor initialized" deadline="2024-11-11 11:26:36 +0000 UTC" includeScriptOutput=false namespace=sonar-workflows podName=sonar-wow-v2-fvv7h-read-group-764570069 templateName=read-group version="&Version{Version:v3.5.12,BuildDate:2024-10-30T10:56:15Z,GitCommit:8fe8de2e16ec39a5477df17586a3d212ec63a4bd,GitTag:v3.5.12,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-11-08T11:27:06.246Z" level=info msg="Starting deadline monitor"
time="2024-11-08T11:27:53.751Z" level=info msg="Main container completed" error="<nil>"
time="2024-11-08T11:27:53.751Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-11-08T11:27:53.751Z" level=info msg="Saving output parameters"
time="2024-11-08T11:27:53.751Z" level=info msg="Saving path output parameter: groups"
time="2024-11-08T11:27:53.751Z" level=info msg="Copying /mnt/vol/groups.json from volume mount"
time="2024-11-08T11:27:53.753Z" level=info msg="Successfully saved output parameter: groups"
time="2024-11-08T11:27:53.787Z" level=info msg="No output artifacts"
time="2024-11-08T11:27:53.945Z" level=info msg="Alloc=7697 TotalAlloc=13747 Sys=23397 NumGC=4 Goroutines=8"

These two lines from the failing run seem to be relevant:

time="2024-11-07T17:33:17.251Z" level=info msg="Deadline monitor stopped"
time="2024-11-07T17:33:17.251Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
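
A comparison like the one above can be automated. The following is a minimal sketch, assuming logrus-style lines with a `msg="..."` field. The helper names `messages` and `only_in_first` are hypothetical, and the sample text is just the tail of the two logs quoted above.

```python
import re

# Capture the value of msg="..." on a logrus-style line, allowing escaped quotes.
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')


def messages(log_text: str) -> list[str]:
    """Return the msg values of all lines that have one, in order."""
    return [m.group(1) for line in log_text.splitlines() if (m := MSG_RE.search(line))]


def only_in_first(first: str, second: str) -> list[str]:
    """Messages in `first` with no counterpart in `second`, ignoring numeric values."""
    def normalize(msg: str) -> str:
        return re.sub(r"\d+", "#", msg)

    seen = {normalize(msg) for msg in messages(second)}
    return [msg for msg in messages(first) if normalize(msg) not in seen]


failing_log = '''
time="2024-11-07T17:33:17.223Z" level=info msg="No output artifacts"
time="2024-11-07T17:33:17.244Z" level=info msg="Alloc=8593 TotalAlloc=13731 Sys=19557 NumGC=4 Goroutines=8"
time="2024-11-07T17:33:17.251Z" level=info msg="Deadline monitor stopped"
time="2024-11-07T17:33:17.251Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
'''

working_log = '''
time="2024-11-08T11:27:53.787Z" level=info msg="No output artifacts"
time="2024-11-08T11:27:53.945Z" level=info msg="Alloc=7697 TotalAlloc=13747 Sys=23397 NumGC=4 Goroutines=8"
'''

extra = only_in_first(failing_log, working_log)
```

Digits are normalized so that the memory statistics line does not show up as a difference. Note that a message missing from one log may only reflect where log capture stopped, so the result is a starting point for investigation rather than proof of a cause.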

Let me know if I can provide something else to help investigate

tooptoop4 · 2024-11-12T19:53:26Z

@RafaPinzon93 do you have the controller logs? I think this is happening even in 3.4.11, as I encounter #13799 when under load.

RafaPinzon93 · 2024-11-12T20:23:33Z

@tooptoop4 It seems we are missing some controller logs from that time. The last entries I can see are these:

time="2024-11-07T18:10:47.693Z" level=warning msg="Non-transient error: <nil>"
time="2024-11-07T16:38:55.136Z" level=info msg="Workflow step group node sonar-wow-v2-p8h4p-sub-thpqx-4266272020 not yet completed" namespace=sonar-workflows workflow=sonar-wow-v2-p8h4p-sub-thpqx
aks-spotbigcpu16-12311640-vmss000001
time="2024-11-07T16:38:55.133Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=sonar-workflows workflow=sonar-wow-v2-p8h4p-sub-rb9kr

mist714 added the type/bug label Oct 18, 2024

jswxstw added area/templating Templating with `{{...}}` area/looping `withParams`, `withItems`, and `withSequence` labels Oct 18, 2024

jswxstw added the problem/more information needed Not enough information has been provide to diagnose this issue. label Oct 22, 2024

agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Oct 22, 2024

github-actions bot added the problem/stale This has not had a response in some time label Nov 6, 2024

github-actions bot removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provide to diagnose this issue. labels Nov 9, 2024

tooptoop4 mentioned this issue Nov 12, 2024

once-off error under load, {{retries}} not replaced #13799

