Documentation feedback: /docs/sources/mimir/manage/run-production-environment/planning-capacity.md #10433

Closed
bjorns163 opened this issue Jan 14, 2025 · 2 comments

Comments

@bjorns163
Contributor

I've been looking at your capacity planning documentation, but it's not very clear to me how this works.

I've got a few tenants already sending data to my system, and when I add a new tenant I'm trying to estimate how much extra resources I'll need.

For the already existing tenants, I don't have their number of active series.
Can I fall back to what I see on the /distributor/all_user_stats page? And if so, do I use the ingest rate value or the series value?

[screenshot: /distributor/all_user_stats page]

If I use the series value, the formula gives me:

distributor: CPU=61.3Core(s) MEM=61.31GB

But this doesn't take into account how many replicas I'll be running, and what the recommended CPU limit per replica would be.
Based on the large values file it's set to 2 CPU, which would mean I need 31 replicas?

If I use the ingest rate value instead, it gives me:

distributor: CPU=2.4Core(s) MEM=2.39GB

So two replicas would be enough?

Taking a look at the scaling dashboard, it's telling me I need 7 replicas based on CPU or 11 based on memory.

[screenshot: Mimir / Scaling dashboard]

Currently I have 5 replicas running:


distributor:
  replicas: 5

  resources:
    limits:
      cpu: 2.5
      memory: 5.7Gi
    requests:
      cpu: 1.2
      memory: 2Gi


Looking at their load, it doesn't seem like I need to scale up:

[screenshot: distributor resource usage]

To summarize my questions:

  • What are the recommended requests/limits values per replica, so I can determine the number of replicas?
  • How do I see the current sum(prometheus_tsdb_head_series) across all tenants? Is <mimir_domain>/distributor/all_user_stats a good source?
@narqo
Contributor

narqo commented Feb 2, 2025

I should note right away that I didn't check the maths below against any running system. Take it with a grain of salt.

For the case of the distributors, the docs you referred to suggest scaling based on received samples/s. The docs assume the reader doesn't yet have a running Mimir to collect any real statistics for the analysis, so they opt for the stats from the Prometheus agent side (i.e. the number of active series in the TSDB's head, and the configured scrape interval).
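
To make the agent-side estimate concrete, here's a minimal sketch of that calculation as a query against the Prometheus agent, assuming a uniform 15s scrape interval (an assumption for illustration; substitute your own interval):

# expected samples/s ≈ active series / scrape interval (15s assumed)
sum(prometheus_tsdb_head_series) / 15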

Since you already have Mimir running, you can grab the data from its metrics (e.g. check the "Mimir / Writes" or "Mimir / Tenants" dashboards), but the "all_user_stats" page should also do it. The maths should work as follows:

# Total samples rate across all tenants (ref "Total ingestion rate" -- the rate all ingesters receive from the distributors)
710K + 185K + 1K ≈ 900K samples/s

# The recommendation is to allocate 1 CPU core and 1GB per 25K samples/s
900 / 25 = 36 CPU cores (36 GB)
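
If you'd rather read the same total from Mimir's own metrics than from the all_user_stats page, a query along these lines should work (a sketch only; cortex_distributor_received_samples_total is the distributor-side samples counter, but double-check the metric name against your Mimir version):

# total samples/s received by the distributors, across all tenants
sum(rate(cortex_distributor_received_samples_total[5m]))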

How to horizontally distribute these 36 cores across the distributor replicas is up to your infrastructure and cost economics. E.g. the values in large.yaml assume a common setup where one can allocate 2 CPU cores and 4 GB per pod's container. The expectation is that the distributor is effective at utilizing whatever it's given.
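
Carrying the arithmetic through as a sketch, assuming the large.yaml sizing of roughly 2 cores per distributor container and the 25K samples/s per core rule of thumb from above, 36 cores works out to about 18 replicas. The same estimate as a query:

# rough replica count: samples/s ÷ 25K per core ÷ 2 cores per replica (assumed sizing)
ceil(sum(rate(cortex_distributor_received_samples_total[5m])) / 25000 / 2)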

Note that this capacity planning doesn't take the available network bandwidth into account. I.e. if one deploys three distributor replicas, giving them 12 cores each, it doesn't mean other resources of the node won't become the bottleneck (the "Mimir / Writes resources" and "Mimir / Writes networking" dashboards can help to monitor the overall capacity).
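
As a rough out-of-band check on the network side, and assuming node_exporter runs on your nodes (an assumption about your setup, not something Mimir ships), per-node transmit throughput can be compared against the NIC capacity:

# per-node transmit throughput in bytes/s (node_exporter)
sum by (instance) (rate(node_network_transmit_bytes_total[5m]))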


As for the "Mimir / Scaling" dashboard, it looks at the usage over the past 24 hours (refer to the corresponding recording rules). Could it be that the one-hour time window you were looking at didn't represent how the system behaves in general?

@dimitarvdimitrov
Contributor

This looks answered, so we decided to close it. Feel free to reopen if you think otherwise.
