
Latency spike in distributors but not in ingesters #9709

Closed
jmichalek132 opened this issue Oct 22, 2024 · 4 comments
Labels
question Further information is requested

Comments

@jmichalek132
Contributor

Describe the bug


At random times we observe latency spikes in the distributors that last anywhere from 1-2 minutes to 1-2 hours. They come and go, resolving on their own.
When they happen, we don't see any corresponding latency spike in the ingesters.
We don't seem to be hitting CPU or memory limits in the distributors, and adding more distributor pods doesn't seem to have any effect.

[Screenshot: Mimir overview dashboard]

[Screenshot: Mimir writes dashboard]

[Screenshot: Mimir writes resources dashboard]

[Screenshot: Mimir writes networking dashboard]

(The gap in the metrics is unrelated; our Prometheus instance scraping Mimir was stuck in a terminating state.)

From traces we can see that the latency appears to be spent in the distributor and not in the ingesters.

[Screenshot: trace of an affected write request]

The time seems to be spent between 2 span events:

[Screenshot: the two span events bounding the gap]

We can see it affects requests from the Prometheus replica of the HA pair that is not currently elected, so its data is not written to the ingesters.

[Screenshot: affected requests coming from the non-elected HA replica]

In our dev environment we deployed a modified version of Mimir with 2 extra span events to narrow down where the time is spent.
Source with the modifications: https://github.com/grafana/mimir/pull/9707/files.

[Screenshot: trace with the extra span events]

And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.
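
For illustration, a minimal sketch of the kind of span-event instrumentation in question, assuming OpenTelemetry's Go tracing API (the function and event names are illustrative; the real change is in the PR linked above, and Mimir's internal tracing plumbing may differ):

```go
package push

import (
	"io"
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// readBodyWithSpanEvents brackets the request-body read with span events,
// similar in spirit to the extra events added in the PR above.
func readBodyWithSpanEvents(r *http.Request) ([]byte, error) {
	span := trace.SpanFromContext(r.Context())

	span.AddEvent("start reading request body")
	body, err := io.ReadAll(r.Body) // blocks until the client has sent the whole payload
	span.AddEvent("done reading request body")

	// A large gap between the two events means the time is spent waiting for
	// bytes from the client (or the network path in front of it), not in the
	// distributor's own processing.
	return body, err
}
```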

It happens in both of our environments, production and dev.
It seems to affect a subset of our tenants.

To Reproduce

Steps to reproduce the behavior:

These latency spikes seem to happen at random, so we're not sure how to reproduce them.

  1. Start Mimir 2.14
  2. Perform operations (read/write/others)

Expected behavior

No unexpected latency spikes.

Environment

  • Infrastructure: Kubernetes v1.29.5 (AKS, Azure)
  • Deployment tool: Helm chart mimir-distributed 5.5.0

Additional Context

We didn't find anything in the logs of the Mimir distributor, given that it doesn't log each incoming request and doesn't log much in general.

@aknuds1
Contributor

aknuds1 commented Oct 22, 2024

And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.

As that line should be reading from the client connection, is it possible that the latency stems from the client? It certainly looks that way to me, especially considering the trace you included (~18 seconds spent reading the proto payload).
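
One way to test that hypothesis (a hypothetical diagnostic, not something that exists in Mimir; the handler and log line are illustrative) is to wrap the request body in a reader that accumulates the time spent blocked in Read, so slow-client time can be separated from server-side processing time:

```go
package push

import (
	"io"
	"log"
	"net/http"
	"time"
)

// timedReader wraps a reader and accumulates the wall-clock time spent
// blocked inside Read, i.e. waiting for the client to send bytes.
type timedReader struct {
	r       io.Reader
	blocked time.Duration
}

func (t *timedReader) Read(p []byte) (int, error) {
	start := time.Now()
	n, err := t.r.Read(p)
	t.blocked += time.Since(start)
	return n, err
}

func handlePush(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	tr := &timedReader{r: r.Body}

	body, err := io.ReadAll(tr)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	_ = body // decode/forward as usual

	// If "blocked" accounts for most of "total", the latency is on the
	// client/network side rather than in server-side processing.
	log.Printf("push handled: total=%s blocked_reading_body=%s", time.Since(start), tr.blocked)
	w.WriteHeader(http.StatusOK)
}
```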

@aknuds1 aknuds1 added the question Further information is requested label Oct 22, 2024
@aknuds1
Contributor

aknuds1 commented Oct 22, 2024

I suspect this should be a discussion and not an issue, since there's no indication of a Mimir bug yet. The symptoms so far are of an operational issue.

@jmichalek132
Contributor Author

I suspect this should be a discussion and not an issue, since there's no indication of a Mimir bug yet. The symptoms so far are of an operational issue.

Hi, sorry, I forgot about discussions. Do you by any chance have the permissions to change it into one?

@jmichalek132
Contributor Author

And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.

As that line should be reading from the client connection, is it possible that the latency stems from the client? It certainly looks that way to me, especially considering the trace you included (~18 seconds spent reading the proto payload).

Yeah, that's what I am also starting to suspect. The interesting thing is that we don't see any increased latency on the ingress (Contour, based on Envoy), but this might come down to what exactly the Envoy metrics measure.

@grafana grafana locked and limited conversation to collaborators Oct 22, 2024
@aknuds1 aknuds1 converted this issue into discussion #9714 Oct 22, 2024

