Thank you for reporting this. If possible, could you check whether you can reproduce it in 2.15? Looking through the panic's traceback, it seems that it came through dskit. Since we don't record routine dependency updates in the changelog, it's hard to tell whether this has already been resolved in the latest Mimir release.
What is the bug?
During a rollout of distributor pods, I noticed that some pods panic during shutdown.
It does not happen on all pods and seems fairly random.
I was able to reproduce this by terminating a single pod, and, oddly, this also caused CPU usage to drop on all other distributors.
I did note some 503s from Prometheus at the time, which is likely why this happened, but it is interesting that a single pod caused this behaviour.
Probably the most important thing to note is a change to the gRPC config to try to align with the GKE graceful termination period of 15 seconds for spot instances.
This could be the cause, in which case running on spot instances would not be possible.
The CPU drop across all pods is confusing, however, as I would expect Prometheus to simply retry if a single distributor pod had terminated the connection unexpectedly.
How to reproduce it?
Unsure.
Possibly with the gRPC changes.
What did you think would happen?
Graceful shutdown of the distributor pod.
What was your environment?
Mimir 2.13
GKE 1.31.5-gke.1068000
Any additional context to share?
I could not see anything related in the release notes through to 2.15, but am happy to upgrade.