Template expansion may fail in argo 3.5.10 and 3.5.11 #13780
Can you ensure you've got pod RBAC correct? https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/
Can you provide logs from the wait container? Are you running the 3.5.11 executor in your created pods?
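For reference, the minimum executor RBAC described in the linked docs looks roughly like the sketch below; the Role/RoleBinding names, namespace, and ServiceAccount are placeholders, not values from this report:

```yaml
# Minimal RBAC the v3.4+/v3.5 executor needs so the wait container can
# report outputs via WorkflowTaskResults (per the workflow-rbac docs).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor                 # placeholder name
rules:
  - apiGroups: [argoproj.io]
    resources: [workflowtaskresults]
    verbs: [create, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: executor-default         # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: executor
subjects:
  - kind: ServiceAccount
    name: default                # the ServiceAccount your workflow pods run as
    namespace: default           # placeholder: the namespace your workflows run in
```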
This workflow always runs with the same manifests but only fails sometimes, so I don't think RBAC is the issue.
This error only occurs in our production environment, and the Argo Workflows installation there has already been rolled back.
I also tried the 3.5.11 executor (same version as the controller), but got the same error.
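For context, the executor image can be pinned in the workflow-controller-configmap so it matches the controller version; a minimal sketch (the namespace and tag below are assumptions, not taken from this thread):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo                # assumed install namespace
data:
  # The controller injects this image into the init/wait containers of
  # every workflow pod it creates.
  executor: |
    image: quay.io/argoproj/argoexec:v3.5.11
```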
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
Seeing this on v3.5.12 as well.
We have the issue on v3.5.12, especially when the system is under heavy load.
These are the logs when it fails:
And when it's working correctly:
These two seem to be the relevant ones in the failing case:
Let me know if I can provide anything else to help investigate.
@RafaPinzon93 Do you have the controller logs? I think this is happening even in 3.4.11, as I encountered #13799 when under load.
@tooptoop4 It seems we don't have the controller logs from that time. The last logs that I see are these:
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
Template expansion may fail in Argo 3.5.10 and 3.5.11.
It doesn't fail every time, but seems to occur when there is a certain amount of load.
Also, we manage multiple clusters, and even though they run the same version, the problem occurs on some clusters but not on others.
This problem does not exist in v3.4.6.
I discovered the problem when upgrading from 3.4.6 to 3.5.10; I then tried 3.5.11, but the same error occurred.
Version(s)
v3.5.10, v3.5.11
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
While this workflow works well in many cases, there are some cases where it can fail:
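The reporter's manifest is not shown here; as a stand-in, below is a minimal sketch of a workflow that exercises step/template expansion. All names, the image, and the withItems values are illustrative assumptions, not taken from the report:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: template-expansion-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        # The controller expands this step group into one node per item;
        # that expansion phase is what the issue title refers to.
        - - name: fan-out
            template: echo
            arguments:
              parameters:
                - name: message
                  value: "{{item}}"
            withItems: ["a", "b", "c"]
    - name: echo
      inputs:
        parameters:
          - name: message
      container:
        image: busybox:1.36
        command: [echo, "{{inputs.parameters.message}}"]
```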
Logs from the workflow controller
Logs from your workflow's wait container