Bug: Increased Distributor latency in 2.15 #10717

Open
EoinFarrell opened this issue Feb 21, 2025 · 4 comments
Labels: bug (Something isn't working)

Comments

@EoinFarrell

What is the bug?

Hey Grafana team,

We have an alert that fires when we see latency above 1 second from the distributor using the below query:

histogram_quantile(0.99, avg by (le) (rate(cortex_request_duration_seconds_bucket{job=~"(cortex)/((distributor|cortex|mimir|mimir-write))",route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push|otlp_v1_metrics"}[5m]))) > 1

Since upgrading to 2.15 this alert has been firing consistently. We have noticed a large improvement in overall performance in Mimir, but we're wondering whether increased latency is to be expected as a possible side effect?
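For anyone investigating something similar: the same histogram can be broken out per route instead of averaged across all push routes; a sketch, reusing the same selectors as the alert above:

```promql
# p99 push latency per route, from the same histogram the alert uses.
# Same job/route selectors as the alert; grouping by route shows whether
# the regression is confined to a single push path.
histogram_quantile(0.99,
  sum by (le, route) (
    rate(cortex_request_duration_seconds_bucket{
      job=~"(cortex)/((distributor|cortex|mimir|mimir-write))",
      route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push|otlp_v1_metrics"
    }[5m])
  )
)
```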

Side Note: Thanks for all the great work, our team was delighted with the performance improvements we've seen overall in 2.15 aside from this latency jump.

How to reproduce it?

Mimir 2.15

What did you think would happen?

N/A

What was your environment?

Kubernetes

Any additional context to share?

No response

@EoinFarrell added the bug label on Feb 21, 2025
@bboreham (Contributor)

It's not expected, no. What was typical p99 latency in your environment before?

Do you notice increased CPU or memory usage in the distributors or ingesters?

Do you have distributed tracing, so you can see which component(s) have increased latency?
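If it helps, a rough per-pod CPU view (usage against requests) can be pulled from cAdvisor and kube-state-metrics; a sketch only, and the `mimir` namespace and pod regex below are placeholders to adjust for your environment:

```promql
# Per-pod CPU usage as a fraction of the CPU request, for distributor pods.
# Assumes cAdvisor (container_cpu_usage_seconds_total) and kube-state-metrics
# (kube_pod_container_resource_requests) are being scraped; metric names can
# differ on older kube-state-metrics versions.
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="mimir", pod=~"distributor.*", container!=""}[5m])
)
/
sum by (pod) (
  kube_pod_container_resource_requests{namespace="mimir", pod=~"distributor.*", resource="cpu"}
)
```

The same shape works for memory with `container_memory_working_set_bytes` and `resource="memory"`.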

@EoinFarrell (Author)

Hey @bboreham, thanks for the reply.

> It's not expected, no. What was typical p99 latency in your environment before?

The tables below compare mean latency on 2.14 over the two-week period prior to the upgrade against the last two weeks on 2.15.

Distributor

| Latency | 2.14 | 2.15 |
| --- | --- | --- |
| P99 | 1.33s | 2.66s |
| P50 | 27.2ms | 22.9ms |
| Average | 117ms | 250ms |

Ingester

| Latency | 2.14 | 2.15 |
| --- | --- | --- |
| P99 | 5.17ms | 123ms |
| P50 | 2.51ms | 2.54ms |
| Average | 1.52ms | 6.21ms |

> Do you notice increased CPU or memory usage in the distributors or ingesters?

The tables below compare mean CPU and memory usage on 2.14 over the two-week period prior to the upgrade against the last two weeks on 2.15.

Distributor

| Usage | 2.14 | 2.15 |
| --- | --- | --- |
| CPU | 54.9% | 56.7% |
| Memory | 42.8% | 43% |

Ingester

| Usage | 2.14 | 2.15 |
| --- | --- | --- |
| CPU | 45.6% | 38.8% |
| Memory | 66.7% | 66.4% |

> Do you have distributed tracing, so you can see which component(s) have increased latency?

No, we don't have distributed tracing enabled.
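In lieu of tracing, per-component p99 from the same request-duration histogram is the closest substitute; roughly something like the following (a sketch; the route regex is a guess at covering both the distributor and ingester push routes, so adjust it to the route values present in your environment):

```promql
# p99 write-path latency per component (distributor, ingester, ...) from the
# same histogram the alert uses, grouped by job instead of averaged away.
# The route regex is approximate; check the actual route label values first.
histogram_quantile(0.99,
  sum by (le, job) (
    rate(cortex_request_duration_seconds_bucket{route=~".*Push|api_(v1|prom)_push|otlp_v1_metrics"}[5m])
  )
)
```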

@EoinFarrell (Author) commented Feb 24, 2025

We upgraded to 2.15 on the 10th of Feb, which is where the crosshair in the panels below sits.
Latency charts:

[image attached]

CPU/memory charts (legend title is incorrect):

[image attached]

@EoinFarrell (Author)

Hey @bboreham, after a bit more digging we found a correlation between the increased latency and our instance type. Before the 2.15 upgrade we migrated from Intel (m7i) to Graviton (m7g & m8g) based nodes. We were only on Graviton and 2.14 for a couple of days, but during that time we did not see any increase in latency in either the ingester or the distributor.
After upgrading to 2.15, though, the latency spike is confined to ingesters running on the m7g generation of Graviton nodes. Average and max CPU usage on these nodes look normal, so I'm going to raise this with AWS to see if they have any insights.
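In case it's useful to anyone else hitting this, the per-node breakdown of ingester push latency looks roughly like the following; it's a sketch that assumes the Mimir series carry a `pod` label and that kube-state-metrics' `kube_pod_info` is available for the pod-to-node mapping (the `mimir` namespace and the job/route regexes are placeholders):

```promql
# p99 ingester push latency grouped by node, to check which node generation
# (e.g. m7g vs m8g) the slow pods are running on.
# kube_pod_info (value 1) is used only to attach the node label to each pod.
histogram_quantile(0.99,
  sum by (le, node) (
    rate(cortex_request_duration_seconds_bucket{job=~".*ingester.*", route=~".*Push"}[5m])
      * on (pod) group_left (node)
        max by (pod, node) (kube_pod_info{namespace="mimir"})
  )
)
```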
