Bug: Increased Distributor latency in 2.15 #10717

Open
EoinFarrell opened this issue Feb 21, 2025 · 4 comments
Labels: bug (Something isn't working)

Comments

@EoinFarrell

What is the bug?

Hey Grafana team,

We have an alert that fires when we see latency above 1 second from the distributor using the below query:

histogram_quantile(0.99, avg by (le) (rate(cortex_request_duration_seconds_bucket{job=~"(cortex)/((distributor|cortex|mimir|mimir-write))",route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push|otlp_v1_metrics"}[5m]))) > 1

Since upgrading to 2.15 this alert has been firing consistently. We have noticed a large improvement in overall performance in Mimir, but we're wondering whether increased latency is to be expected as a possible side effect?
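For anyone investigating something similar: the same histogram can be broken out per route instead of averaged across all push routes; a sketch, reusing the same selectors as the alert above:

```promql
# p99 push latency per route, from the same histogram the alert uses.
# Same job/route selectors as the alert; grouping by route shows whether
# the regression is confined to a single push path.
histogram_quantile(0.99,
  sum by (le, route) (
    rate(cortex_request_duration_seconds_bucket{
      job=~"(cortex)/((distributor|cortex|mimir|mimir-write))",
      route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push|otlp_v1_metrics"
    }[5m])
  )
)
```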

Side Note: Thanks for all the great work, our team was delighted with the performance improvements we've seen overall in 2.15 aside from this latency jump.

How to reproduce it?

Mimir 2.15

What did you think would happen?

N/A

What was your environment?

Kubernetes

Any additional context to share?

No response

@EoinFarrell added the bug label on Feb 21, 2025
@bboreham (Contributor)

It's not expected, no. What was typical p99 latency in your environment before?

Do you notice increased CPU or memory usage in the distributors or ingesters?

Do you have distributed tracing, so you can see which component(s) have increased latency?
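If it helps, a rough per-pod CPU view (usage against requests) can be pulled from cAdvisor and kube-state-metrics; a sketch only, and the `mimir` namespace and pod regex below are placeholders to adjust for your environment:

```promql
# Per-pod CPU usage as a fraction of the CPU request, for distributor pods.
# Assumes cAdvisor (container_cpu_usage_seconds_total) and kube-state-metrics
# (kube_pod_container_resource_requests) are being scraped; metric names can
# differ on older kube-state-metrics versions.
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="mimir", pod=~"distributor.*", container!=""}[5m])
)
/
sum by (pod) (
  kube_pod_container_resource_requests{namespace="mimir", pod=~"distributor.*", resource="cpu"}
)
```

The same shape works for memory with `container_memory_working_set_bytes` and `resource="memory"`.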

@EoinFarrell (Author)

Hey @bboreham, thanks for the reply.

> It's not expected, no. What was typical p99 latency in your environment before?

The tables below compare mean latency on 2.14 over the two-week period prior to the upgrade against the last two weeks on 2.15.

Distributor

| Latency | 2.14 | 2.15 |
| --- | --- | --- |
| P99 | 1.33s | 2.66s |
| P50 | 27.2ms | 22.9ms |
| Average | 117ms | 250ms |

Ingester

| Latency | 2.14 | 2.15 |
| --- | --- | --- |
| P99 | 5.17ms | 123ms |
| P50 | 2.51ms | 2.54ms |
| Average | 1.52ms | 6.21ms |

> Do you notice increased CPU or memory usage in the distributors or ingesters?

The tables below compare mean CPU and memory usage on 2.14 over the two-week period prior to the upgrade against the last two weeks on 2.15.

Distributor

| Usage | 2.14 | 2.15 |
| --- | --- | --- |
| CPU | 54.9% | 56.7% |
| Memory | 42.8% | 43% |

Ingester

| Usage | 2.14 | 2.15 |
| --- | --- | --- |
| CPU | 45.6% | 38.8% |
| Memory | 66.7% | 66.4% |

> Do you have distributed tracing, so you can see which component(s) have increased latency?

No, we don't have distributed tracing enabled.
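In lieu of tracing, per-component p99 from the same request-duration histogram is the closest substitute; roughly something like the following (a sketch; the route regex is a guess at covering both the distributor and ingester push routes, so adjust it to the route values present in your environment):

```promql
# p99 write-path latency per component (distributor, ingester, ...) from the
# same histogram the alert uses, grouped by job instead of averaged away.
# The route regex is approximate; check the actual route label values first.
histogram_quantile(0.99,
  sum by (le, job) (
    rate(cortex_request_duration_seconds_bucket{route=~".*Push|api_(v1|prom)_push|otlp_v1_metrics"}[5m])
  )
)
```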

@EoinFarrell (Author) commented Feb 24, 2025

We upgraded to 2.15 on the 10th of Feb, which is where the crosshair in the panels below sits.
Latency charts:

[image attached]

CPU/memory charts (legend title is incorrect):

[image attached]

@EoinFarrell (Author)

Hey @bboreham, after a bit more digging we found a correlation between the increased latency and our instance type. Before the 2.15 upgrade we migrated from Intel (m7i) to Graviton (m7g & m8g) based nodes. We were only on Graviton and 2.14 for a couple of days, but during that time we did not see any increase in latency in either the ingester or the distributor.
After upgrading to 2.15, though, the latency spike is confined to ingesters running on the m7g generation of Graviton nodes. Average and max CPU usage on these nodes look normal, so I'm going to raise this with AWS to see if they have any insights.
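In case it's useful to anyone else hitting this, the per-node breakdown of ingester push latency looks roughly like the following; it's a sketch that assumes the Mimir series carry a `pod` label and that kube-state-metrics' `kube_pod_info` is available for the pod-to-node mapping (the `mimir` namespace and the job/route regexes are placeholders):

```promql
# p99 ingester push latency grouped by node, to check which node generation
# (e.g. m7g vs m8g) the slow pods are running on.
# kube_pod_info (value 1) is used only to attach the node label to each pod.
histogram_quantile(0.99,
  sum by (le, node) (
    rate(cortex_request_duration_seconds_bucket{job=~".*ingester.*", route=~".*Push"}[5m])
      * on (pod) group_left (node)
        max by (pod, node) (kube_pod_info{namespace="mimir"})
  )
)
```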
