[SURE-9228] Fleet pod fleet-cleanup does not receive tolerations from values.yaml #3180

Open
kkaempf opened this issue Jan 7, 2025 · 4 comments

Comments

@kkaempf
Collaborator

kkaempf commented Jan 7, 2025

SURE-9228

Issue description:

When a toleration is added to values.tolerations, the fleet-cleanup-clusterregistrations pod does not use it. See here for the code and the image attached for an example.

Business impact:

The pod cannot run in a cluster where all nodes are tainted.

Repro steps:

  • Create a new cluster using k3s/rke2
  • Taint the node
  • Install rancher

Files, logs, traces

-- see JIRA --

Actual behavior:

Pod does not run in a cluster with all nodes tainted and chart installation fails.

Expected behavior:

Pod uses the tolerations from values.tolerations and can run on a tainted node.

@kkaempf kkaempf added kind/bug JIRA Must shout labels Jan 7, 2025
@kkaempf kkaempf added this to Fleet Jan 7, 2025
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Jan 7, 2025
@kkaempf kkaempf moved this from 🆕 New to 📋 Backlog in Fleet Jan 7, 2025
@kkaempf kkaempf added this to the v2.10.2 milestone Jan 7, 2025
@manno manno modified the milestones: v2.10.2, v2.10.3 Jan 13, 2025
@weyfonk weyfonk self-assigned this Jan 30, 2025
@weyfonk weyfonk moved this from 📋 Backlog to 🏗 In progress in Fleet Jan 30, 2025
@weyfonk weyfonk moved this from 🏗 In progress to 👀 In review in Fleet Jan 30, 2025
@weyfonk
Contributor

weyfonk commented Jan 31, 2025

Additional QA

Problem

Installing Fleet charts in a cluster where all nodes are tainted would fail, because Fleet jobs did not support node tolerations coming from chart values. This would result in Fleet failing to install.
Affected jobs were the cleanup jobs for cluster registrations and completed jobs.

Solution

The Fleet chart now propagates node tolerations from chart values to those jobs, as it does for other deployed pods.

Testing

Engineering Testing

Manual Testing

This has been tested by tainting one node in a k3d cluster, and checking that:

  • deployment was successful
  • jobs were not scheduled on the tainted node, unless the corresponding toleration was set in chart values

Automated Testing

N/A

QA Testing Considerations

It would make sense to test this in a cluster where all nodes are tainted:

  • if no tolerations are specified in chart values, running the above mentioned jobs should still fail
  • setting tolerations matching at least one node's taints should enable successful deployment, with both jobs being created and run successfully.

Regressions Considerations

N/A

@weyfonk weyfonk moved this from 👀 In review to Needs QA review in Fleet Jan 31, 2025
@weyfonk weyfonk removed their assignment Jan 31, 2025
@mmartin24
Collaborator

When checked with Rancher v2.11-1d345932e5f78b3da07ebe13691c8224c62b6240-head / fleet:106.0.0+up0.12.0-alpha.4, the fix could not be observed because the toleration value is not being passed to the fleet-cleanup job via the helm-operation pod. This needs to be addressed by another team.

We tried to pass the tolerations when installing Rancher via the extraTolerations flag and observed that Fleet had those values correctly passed; however, the helm-operation pod did not get scheduled on any of the nodes and therefore no fleet-cleanup job/pod was created.

Then we tried manually adding the toleration to the helm-operation pod and saw that the toleration value was still not being passed to the fleet-cleanup job.

Image

@manno manno moved this from Needs QA review to Blocked in Fleet Feb 12, 2025
@manno
Member

manno commented Feb 12, 2025

Waiting for fixes on helm-operations pods and #3313 (comment).

@weyfonk
Contributor

weyfonk commented Feb 18, 2025

This can also be tested as follows, without needing to install any chart:

  1. Add tolerations to a separate values file, e.g. values.yaml
  2. Ensure that helm repo list lists https://rancher.github.io/fleet-helm-charts/; if not, add it.
  3. Refresh Helm repositories with helm repo update
  4. Check the Fleet Helm repo for its latest available version, e.g. helm search repo fleet-repo --devel
  5. Use that latest version to template a Fleet chart installation with helm template fleet fleet-repo/fleet --version=$latest_version (where $latest_version is the version found in the previous step), and check that no extra tolerations are applied to job fleet-cleanup-clusterregistrations or to the job rendered from job_cleanup_gitrepojobs.yaml
  6. Repeat the previous step with flag -f values.yaml and check that tolerations set in that file are applied to both jobs
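
The steps above could be scripted roughly as follows. This is only a sketch: the toleration in values.yaml is a placeholder, `FLEET_VERSION` is a hypothetical variable standing in for the version found in step 4, and the output file names are arbitrary.

```shell
# Write a values file with a placeholder toleration (assumption: adjust to match your taints).
cat > values.yaml <<'EOF'
tolerations:
  - key: "example-key"
    operator: "Equal"
    value: "example-value"
    effect: "NoSchedule"
EOF

# Render the chart only once a version has been picked from `helm search repo fleet-repo --devel`.
if [ -n "${FLEET_VERSION:-}" ]; then
  helm repo add fleet-repo https://rancher.github.io/fleet-helm-charts/
  helm repo update
  # Render once with default values and once with the extra tolerations.
  helm template fleet fleet-repo/fleet --version="$FLEET_VERSION" > default.yaml
  helm template fleet fleet-repo/fleet --version="$FLEET_VERSION" -f values.yaml > tolerated.yaml
  # The placeholder key should show up only in the lines added by the second rendering.
  diff default.yaml tolerated.yaml | grep -n 'example-key' \
    || echo "toleration not found in rendered manifests"
fi
```

Inspecting tolerated.yaml by hand is still worthwhile, to confirm that the toleration appears under both cleanup jobs and not only under the other deployed pods.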

@kkaempf kkaempf moved this from Blocked to Needs QA review in Fleet Feb 26, 2025
@kkaempf kkaempf moved this from Needs QA review to Blocked in Fleet Feb 26, 2025
@kkaempf kkaempf modified the milestones: v2.10.3, v2.10.4 Feb 26, 2025
Projects
Status: Blocked
Development

No branches or pull requests

4 participants