Latency spike in distributors but not in ingesters #9709
Labels: question
Describe the bug
At random times we observe latency spikes in distributors that last from 1-2 minutes to 1-2 hours. They come and go (resolve on their own).
When they happen we don't see any corresponding latency spike in ingesters.
We don't seem to be hitting CPU / memory limits in the distributors, and adding more distributor pods doesn't seem to have any effect.
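(For context, a minimal sketch of the kind of comparison behind this observation, i.e. distributor push latency vs ingester push latency. This is not our exact tooling; the Prometheus address, job matchers, and route label values below are assumptions based on a typical Mimir deployment and the standard Mimir dashboards, and may need adjusting.)

```go
// latencycheck compares distributor vs ingester p99 push latency by querying
// Prometheus. Purely illustrative; address and label values are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	queries := map[string]string{
		// p99 latency of the distributor's remote-write endpoint.
		"distributor p99": `histogram_quantile(0.99, sum by (le) (rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route=~"api_(v1|prom)_push"}[5m])))`,
		// p99 latency of the ingester Push gRPC method.
		"ingester p99": `histogram_quantile(0.99, sum by (le) (rate(cortex_request_duration_seconds_bucket{job=~".*ingester.*", route="/cortex.Ingester/Push"}[5m])))`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for name, q := range queries {
		val, _, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatalf("%s: %v", name, err)
		}
		fmt.Printf("%s: %v\n", name, val)
	}
}
```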
Screenshots: Mimir overview, Mimir writes, Mimir writes resources, and Mimir writes networking dashboards.
(The gap in the metrics is unrelated; the Prometheus instance scraping Mimir was stuck in a terminating state.)
From the traces we can see that the time seems to be spent in the distributor and not in the ingesters.
The time seems to be spent between two span events:
We can see it affects requests from the Prometheus replica of the HA pair that is not currently elected, whose data is therefore not written to the ingesters.
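(To make the reasoning explicit: for the non-elected replica the distributor accepts the push but drops the data before forwarding anything to the ingesters, so for those requests all of the observed time is spent inside the distributor. Below is a simplified, hypothetical sketch of that deduplication decision, not Mimir's actual implementation.)

```go
// Simplified, hypothetical sketch of HA deduplication in a distributor-like
// component. This is NOT Mimir's actual implementation; it only illustrates
// why pushes from the non-elected replica never produce ingester writes.
package main

import "fmt"

// haTracker remembers which replica is currently elected per (tenant, cluster).
type haTracker struct {
	elected map[string]string // key: tenant + "/" + cluster
}

// accept reports whether samples from the given replica should be forwarded
// to ingesters. The first replica seen for a cluster becomes the elected one.
func (t *haTracker) accept(tenant, cluster, replica string) bool {
	key := tenant + "/" + cluster
	cur, ok := t.elected[key]
	if !ok {
		t.elected[key] = replica // elect the first replica we see
		return true
	}
	return cur == replica
}

func main() {
	t := &haTracker{elected: map[string]string{}}

	fmt.Println(t.accept("tenant-a", "prod", "prometheus-0")) // true: becomes elected
	fmt.Println(t.accept("tenant-a", "prod", "prometheus-1")) // false: deduplicated;
	// the request is still parsed and validated by the distributor, but no data
	// is sent to ingesters, so any latency shows up only on the distributor side.
}
```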
In our dev environment we deployed a modified version of Mimir with 2 extra span events to narrow down where the time is spent.
Source with the modifications: https://github.com/grafana/mimir/pull/9707/files.
The source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.
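(For illustration, the modification is conceptually just extra span events bracketing the suspect section, so the trace shows how much time elapses between them. A minimal sketch using opentracing-go follows; the span name, event names, and the doWork helper are hypothetical and are not the actual events or functions added in the linked PR.)

```go
// Hypothetical sketch of adding extra span events to narrow down where time
// is spent inside a handler.
package main

import (
	"context"
	"time"

	opentracing "github.com/opentracing/opentracing-go"
	otlog "github.com/opentracing/opentracing-go/log"
)

func handlePush(ctx context.Context) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "handlePush")
	defer span.Finish()

	span.LogFields(otlog.String("event", "before suspect section"))
	doWork(ctx) // the code we suspect of causing the latency
	span.LogFields(otlog.String("event", "after suspect section"))
}

func doWork(_ context.Context) { time.Sleep(10 * time.Millisecond) }

func main() {
	handlePush(context.Background())
}
```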
It happens in both of our environments, production and dev.
It seems to affect a subset of our tenants.
To Reproduce
Steps to reproduce the behavior:
These latency spikes seem to happen at random, so we are not sure how to reproduce them.
Expected behavior
No unexpected latency spikes.
Environment
Additional Context
We didn't find anything in the Mimir distributor logs; the distributor doesn't log each incoming request and doesn't log much in general.