Bug Report
On the advice of @smira, we are posting here the details of an issue we faced with Ceph volumes while upgrading to Talos 1.9.2.
Description
The issue was not easy to debug, as it involves multiple components. We will try to detail our setup and investigation as precisely as possible, while isolating the issue we identified.
Our Ceph cluster is deployed with rook-ceph (https://rook.io/docs/rook/latest-release/Getting-Started/intro/); the Ceph version is 19.2.0.
The runner creates a pod with a volume (provisioned by Ceph), and this pod writes a file to the volume.
The runner then creates another pod that uses the same volume and reads the file.
After we upgraded our cluster to 1.9.2, the second pod started to have issues reading the file created by the first pod. Most of the time (but not always), the file did exist in the volume but was actually empty.
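For reference, the pattern can be reduced to the following minimal sketch, independent of the runner. The PVC name, the `rook-ceph-block` storage class, and the busybox image are placeholders; adjust the access mode and storage class to your setup (e.g. a CephFS class with ReadWriteMany).

```bash
# Minimal reproduction sketch. Names (PVC, storage class, image) are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-repro
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
---
apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo 'hello from writer' > /data/test.txt"]
      volumeMounts:
        - { name: data, mountPath: /data }
  volumes:
    - name: data
      persistentVolumeClaim: { claimName: ceph-repro }
EOF

# Once the writer has finished, start a reader pod on the same PVC.
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/writer --timeout=120s
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: reader
spec:
  restartPolicy: Never
  containers:
    - name: reader
      image: busybox
      command: ["sh", "-c", "ls -l /data/test.txt && cat /data/test.txt"]
      volumeMounts:
        - { name: data, mountPath: /data }
  volumes:
    - name: data
      persistentVolumeClaim: { claimName: ceph-repro }
EOF

# On Talos 1.9.2 the reader frequently sees test.txt with size 0.
kubectl logs pod/reader
```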
The symptom is really close to what is described in this GitHub Actions runner issue: actions/runner-container-hooks#145. That issue never reached a proper conclusion.
This behaviour is consistent: going back to Talos 1.9.1 makes the issue disappear, and upgrading again to 1.9.2 makes it reappear.
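For completeness, the switch between releases was done with a plain `talosctl upgrade`; the node address below is a placeholder.

```bash
# Switching a node between the two releases (node address is a placeholder).
talosctl upgrade --nodes 10.0.0.10 --image ghcr.io/siderolabs/installer:v1.9.1   # issue disappears
talosctl upgrade --nodes 10.0.0.10 --image ghcr.io/siderolabs/installer:v1.9.2   # issue comes back
```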
First, we investigated the components of the cluster, in particular the Ceph components, and did not find any obvious errors.
We were also not able to see any Ceph-related operations in the output of `talosctl dmesg`.
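This is roughly what we looked at on the nodes (node address is a placeholder):

```bash
# Look for Ceph/RBD related kernel messages on a node (address is a placeholder);
# neither command showed anything Ceph-related for us.
talosctl -n 10.0.0.10 dmesg | grep -iE 'ceph|rbd'
# Watch live while re-running the workflow.
talosctl -n 10.0.0.10 dmesg --follow | grep -iE 'ceph|rbd'
```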
To find the source of the issue, we then investigated the changes introduced by Talos 1.9.2 (https://github.com/siderolabs/talos/releases/tag/v1.9.2).
To isolate the change that triggered the issue, we built custom 1.9.2 Talos images with the previous package versions from 1.9.1 (containerd, kernel), roughly as sketched below.
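A rough sketch of the kind of rebuild we did; the `PKG_KERNEL` override, the exact make target, and the kernel package tag are assumptions, so check the Makefile of the Talos tag you actually build from.

```bash
# Rough sketch: rebuild the v1.9.2 installer with the kernel package pinned back to
# the one used by v1.9.1. The PKG_KERNEL override, the make target, and the package
# tag are assumptions -- check the Makefile of the tag you check out.
git clone --branch v1.9.2 https://github.com/siderolabs/talos && cd talos
make installer \
  PKG_KERNEL=ghcr.io/siderolabs/kernel:<tag-used-by-v1.9.1> \
  REGISTRY=registry.example.com USERNAME=myorg PUSH=true
# The resulting installer image can then be rolled out with "talosctl upgrade --image ...".
```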
We finally identified the culprit: the kernel version. Going back to kernel 6.12.6 fixes the issue.
We then dug into the kernel changes that concern the Ceph driver (https://www.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.12.7).
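One way to list the relevant commits is against a checkout of the stable kernel tree:

```bash
# In a checkout of the stable linux tree: commits between 6.12.6 and 6.12.7 touching
# the CephFS client, libceph, the netfs layer used by CephFS, and the RBD block driver.
git log --oneline v6.12.6..v6.12.7 -- fs/ceph net/ceph fs/netfs drivers/block/rbd.c
```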
To be perfectly honest, we do not have expertise in low-level Linux internals, so we were not able to assess how the behaviour we were observing relates to these recent changes.
We also tried to test an even more recent kernel version (6.13+), but we have not managed to compile that kernel successfully so far.
We plan to open an issue on the Ceph tracker, but we are still waiting for approval to get access to the issue creation process.
Logs
Environment
Talos version: 1.9.2
Kubernetes version: [kubectl version --short]
Platform: