
Latency spike in distributors but not in ingesters #9709

Closed
jmichalek132 opened this issue Oct 22, 2024 · 4 comments
Labels
question Further information is requested

Comments

@jmichalek132
Contributor

Describe the bug


At random times we observe latency spikes in the distributors that last anywhere from 1-2 minutes to 1-2 hours. They come and go, resolving on their own.
When they happen, we don't see any corresponding latency spike in the ingesters.
We don't seem to be hitting CPU or memory limits in the distributors, and adding more distributor pods doesn't seem to have any effect.

[Screenshot: Mimir overview dashboard]

[Screenshot: Mimir writes dashboard]

[Screenshot: Mimir writes resources dashboard]

[Screenshot: Mimir writes networking dashboard]

(The gap in the metrics is unrelated; our Prometheus instance scraping Mimir was stuck in a terminating state.)

From traces we can see that the latency appears to be spent in the distributor and not in the ingesters.

[Screenshot: trace of an affected write request]

The time seems to be spent between 2 span events:

[Screenshot: the two span events bounding the gap]

We can see it affects requests from the Prometheus replica of the HA pair that is not currently elected, so its data is not written to the ingesters.

[Screenshot: affected requests coming from the non-elected HA replica]

In our dev environment we deployed a modified version of Mimir with 2 extra span events to narrow down where the time is spent.
Source with the modifications: https://github.com/grafana/mimir/pull/9707/files.

[Screenshot: trace with the extra span events]

And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.
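
For illustration, a minimal sketch of the kind of span-event instrumentation in question, assuming OpenTelemetry's Go tracing API (the function and event names are illustrative; the real change is in the PR linked above, and Mimir's internal tracing plumbing may differ):

```go
package push

import (
	"io"
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// readBodyWithSpanEvents brackets the request-body read with span events,
// similar in spirit to the extra events added in the PR above.
func readBodyWithSpanEvents(r *http.Request) ([]byte, error) {
	span := trace.SpanFromContext(r.Context())

	span.AddEvent("start reading request body")
	body, err := io.ReadAll(r.Body) // blocks until the client has sent the whole payload
	span.AddEvent("done reading request body")

	// A large gap between the two events means the time is spent waiting for
	// bytes from the client (or the network path in front of it), not in the
	// distributor's own processing.
	return body, err
}
```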

It happens in both of our environments, production and dev.
It seems to affect a subset of our tenants.

To Reproduce

Steps to reproduce the behavior:

These latency spikes seem to happen at random, so we're not sure how to reproduce them.

  1. Start Mimir 2.14
  2. Perform operations (read/write/others)

Expected behavior

No unexpected latency spikes.

Environment

  • Infrastructure: Kubernetes v1.29.5 (AKS, Azure)
  • Deployment tool: Helm chart mimir-distributed 5.5.0

Additional Context

We didn't find anything in the logs of the Mimir distributor, given that it doesn't log each incoming request and doesn't log much in general.

@aknuds1
Contributor

aknuds1 commented Oct 22, 2024

And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.

As that line should be reading from the client connection, is it possible that the latency stems from the client? It certainly looks that way to me, especially considering the trace you included (~18 seconds spent reading the proto payload).
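
One way to test that hypothesis (a hypothetical diagnostic, not something that exists in Mimir; the handler and log line are illustrative) is to wrap the request body in a reader that accumulates the time spent blocked in Read, so slow-client time can be separated from server-side processing time:

```go
package push

import (
	"io"
	"log"
	"net/http"
	"time"
)

// timedReader wraps a reader and accumulates the wall-clock time spent
// blocked inside Read, i.e. waiting for the client to send bytes.
type timedReader struct {
	r       io.Reader
	blocked time.Duration
}

func (t *timedReader) Read(p []byte) (int, error) {
	start := time.Now()
	n, err := t.r.Read(p)
	t.blocked += time.Since(start)
	return n, err
}

func handlePush(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	tr := &timedReader{r: r.Body}

	body, err := io.ReadAll(tr)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	_ = body // decode/forward as usual

	// If "blocked" accounts for most of "total", the latency is on the
	// client/network side rather than in server-side processing.
	log.Printf("push handled: total=%s blocked_reading_body=%s", time.Since(start), tr.blocked)
	w.WriteHeader(http.StatusOK)
}
```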

@aknuds1 aknuds1 added the question Further information is requested label Oct 22, 2024
@aknuds1
Contributor

aknuds1 commented Oct 22, 2024

I suspect this should be a discussion and not an issue, since there's no indication of a Mimir bug yet. The symptoms so far are of an operational issue.

@jmichalek132
Contributor Author

I suspect this should be a discussion and not an issue, since there's no indication of a Mimir bug yet. The symptoms so far are of an operational issue.

Hi, sorry, I forgot about discussions. Do you by any chance have the permissions to change it into one?

@jmichalek132
Contributor Author

And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.

As that line should be reading from the client connection, is it possible that the latency stems from the client? It certainly looks that way to me, especially considering the trace you included (~18 seconds spent reading the proto payload).

Yeah, that's what I am also starting to suspect. The interesting thing is that we don't see any increased latency on the ingress (Contour, based on Envoy), but this might come down to what exactly the Envoy metrics measure.

@grafana grafana locked and limited conversation to collaborators Oct 22, 2024
@aknuds1 aknuds1 converted this issue into discussion #9714 Oct 22, 2024

