Documentation feedback: /docs/sources/mimir/manage/run-production-environment/planning-capacity.md #10433

Closed
bjorns163 opened this issue Jan 14, 2025 · 2 comments

Comments

@bjorns163
Contributor

I've been looking at your capacity planning documentation, but it's not very clear to me how this works.

I've got a few tenants already sending data to my system, and when I add a new tenant I'm trying to estimate how much extra resources I'll need.

For the already existing tenants, I don't have their number of active series.
Can I fall back to what I see on the /distributor/all_user_stats page? And if so, do I use the ingest rate value or the series value?

[screenshot: /distributor/all_user_stats page]

If I use the series value, the formula gives me:

distributor: CPU=61.3Core(s) MEM=61.31GB

But this doesn't take into account how many replicas I'll be running, and what the recommended CPU limit per replica would be.
Based on the large values file it's set to 2 CPU, which would mean I need 31 replicas?

If I use the ingest rate value instead, it gives me:

distributor: CPU=2.4Core(s) MEM=2.39GB

So two replicas would be enough?

Taking a look at the scaling dashboard, it's telling me I need 7 replicas based on CPU or 11 based on memory.

[screenshot: Mimir / Scaling dashboard]

Currently I have 5 replicas running:


distributor:
  replicas: 5

  resources:
    limits:
      cpu: 2.5
      memory: 5.7Gi
    requests:
      cpu: 1.2
      memory: 2Gi


Looking at their load, it doesn't seem like I need to scale up:

[screenshot: distributor resource usage]

To summarize my questions:

  • What are the recommended requests/limits values per replica, so I can determine the number of replicas?
  • How do I see the current sum(prometheus_tsdb_head_series) across all tenants? Is <mimir_domain>/distributor/all_user_stats a good source?
@narqo
Contributor

narqo commented Feb 2, 2025

I should note right away that I didn't check the maths below against any running system. Take it with a grain of salt.

For the case of the distributors, the docs you referred to suggest scaling based on received samples/s. The docs assume the reader doesn't yet have a running Mimir to collect any real statistics for the analysis, so they opt for the stats from the Prometheus agent side (i.e. the number of active series in the TSDB's head, and the configured scrape interval).
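
To make the agent-side estimate concrete, here's a minimal sketch of that calculation as a query against the Prometheus agent, assuming a uniform 15s scrape interval (an assumption for illustration; substitute your own interval):

# expected samples/s ≈ active series / scrape interval (15s assumed)
sum(prometheus_tsdb_head_series) / 15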

Since you already have Mimir running, you can grab the data from its metrics (e.g. check the "Mimir / Writes" or "Mimir / Tenants" dashboards), but the "all_user_stats" page should also do it. The maths should work as follows:

# Total samples rate across all tenants (ref "Total ingestion rate" -- the rate all ingesters receive from the distributors)
710K + 185K + 1K ≈ 900K samples/s

# The recommendation is to allocate 1 CPU core and 1GB per 25K samples/s
900 / 25 = 36 CPU cores (36 GB)
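
If you'd rather read the same total from Mimir's own metrics than from the all_user_stats page, a query along these lines should work (a sketch only; cortex_distributor_received_samples_total is the distributor-side samples counter, but double-check the metric name against your Mimir version):

# total samples/s received by the distributors, across all tenants
sum(rate(cortex_distributor_received_samples_total[5m]))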

How to horizontally distribute these 36 cores across the distributor replicas is up to your infrastructure and cost economics. E.g. the values in large.yaml assume a common setup where one can allocate 2 CPU cores and 4 GB per pod's container. The expectation is that the distributor is effective at utilizing whatever it's given.
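
Carrying the arithmetic through as a sketch, assuming the large.yaml sizing of roughly 2 cores per distributor container and the 25K samples/s per core rule of thumb from above, 36 cores works out to about 18 replicas. The same estimate as a query:

# rough replica count: samples/s ÷ 25K per core ÷ 2 cores per replica (assumed sizing)
ceil(sum(rate(cortex_distributor_received_samples_total[5m])) / 25000 / 2)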

Note that this capacity planning doesn't take the available network bandwidth into account. I.e. if one deploys three distributor replicas, giving them 12 cores each, it doesn't mean other resources of the node won't become the bottleneck (the "Mimir / Writes resources" and "Mimir / Writes networking" dashboards can help to monitor the overall capacity).
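
As a rough out-of-band check on the network side, and assuming node_exporter runs on your nodes (an assumption about your setup, not something Mimir ships), per-node transmit throughput can be compared against the NIC capacity:

# per-node transmit throughput in bytes/s (node_exporter)
sum by (instance) (rate(node_network_transmit_bytes_total[5m]))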


As for the "Mimir / Scaling" dashboard, it looks at the usage over the past 24 hours (refer to the corresponding recording rules). Could it be that the one-hour time window you were looking at didn't represent how the system behaves in general?

@dimitarvdimitrov
Contributor

This looks answered, so we decided to close it. Feel free to reopen if you think otherwise.
