Potential empty file issue in Ceph Volumes after Talos 1.9.2 Upgrade (Kernel 6.12.9) #10297

Open
WinterNis opened this issue Feb 5, 2025 · 0 comments


Bug Report

On the advice of @smira, we are posting the details of an issue we faced with Ceph volumes while upgrading to Talos 1.9.2.

Description

The issue was not easy to debug, as it involves multiple components. We will try to detail our setup and investigation to be as precise as possible, while isolating the issue we identified.

  • In our Talos clusters, we use Ceph as the CSI storage backend. Ceph is managed with rook-ceph (https://rook.io/docs/rook/latest-release/Getting-Started/intro/). The Ceph version is 19.2.0.
  • We run GitHub Actions runners for our CI (https://github.com/actions/actions-runner-controller). The use of GitHub Actions itself is not relevant here, but the behaviour of the runner is.
  • The runner creates a pod with a volume (provisioned by Ceph). This pod writes a file to the volume.
  • The runner then creates another pod that uses the same volume and reads the file.

After we upgraded our cluster to 1.9.2, the second pod started to have issues reading the file created by the first pod. Most of the time (but not always), the file did exist in the volume but was empty.
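
The write-then-read sequence above can be approximated with a small check like the following. This is only a sketch under stated assumptions, not the runner's actual code: the function names, the `sync` flag, and the status strings are hypothetical, and whether the real runner syncs its writes is not known from this report. The writer function would run in the first pod, and the reader function in the second pod against the same mounted volume.

```python
import hashlib
import os
from pathlib import Path


def write_marker(path: Path, payload: bytes, sync: bool = True) -> str:
    """Writer side (first pod): write the payload and return its SHA-256 digest.

    `sync` is a hypothetical knob; the real runner may or may not fsync.
    """
    with open(path, "wb") as f:
        f.write(payload)
        if sync:
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to storage
    return hashlib.sha256(payload).hexdigest()


def check_marker(path: Path, expected_digest: str) -> str:
    """Reader side (second pod): classify what is visible on the volume."""
    if not path.exists():
        return "missing"
    data = path.read_bytes()
    if len(data) == 0:
        return "empty"  # the symptom described in this report
    if hashlib.sha256(data).hexdigest() != expected_digest:
        return "corrupt"
    return "ok"
```

Running the reader repeatedly and counting the "empty" results on each Talos version would give a rough failure rate to compare between 1.9.1 and 1.9.2.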

The symptom is very close to what is described in this GitHub Actions runner issue: actions/runner-container-hooks#145
That issue never reached a proper conclusion.

This behaviour is consistent. Going back to Talos 1.9.1 makes the issue disappear, and upgrading again to 1.9.2 makes it reappear.

First, we investigated the components of the cluster, in particular the Ceph components. We did not find any obvious errors.
We were not able to display Ceph-related operations using "talosctl dmesg".

To find the source of the issue, we then investigated the changes introduced by Talos 1.9.2 (https://github.com/siderolabs/talos/releases/tag/v1.9.2).
To isolate the change that triggered the issue, we built custom Talos 1.9.2 versions with the package versions from 1.9.1 (containerd, kernel).
We finally identified the culprit: the kernel version. Going back to kernel 6.12.6 fixes the issue.

We then dug into the kernel changes that concern the Ceph driver (https://www.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.12.7).
To be perfectly honest, we do not have expertise in low-level Linux internals, so we were not able to assess whether the behaviour we observed was due to these recent changes.

We also tried to test an even more recent version of the kernel (6.13+), but we have not managed to compile the kernel successfully so far.

We plan to open an issue on the Ceph tracker, but we are still waiting for our access to the issue creation process to be approved.

Logs

Environment

  • Talos version: 1.9.2
  • Kubernetes version: [kubectl version --short]
  • Platform: