Documentation feedback: /docs/sources/mimir/manage/run-production-environment/planning-capacity.md #10433
I should note right away that I didn't check the math below against any running system, so take it with a grain of salt.

For the distributors, the docs you referred to suggest scaling based on received samples/s. The docs assume the reader doesn't yet have a running Mimir to collect real statistics from for the analysis, so they fall back to stats from the Prometheus agent side (i.e. the number of active series in the TSDB's head, and the configured scrape interval). Since you already have Mimir running, you can grab the data from its metrics (e.g. check the "Mimir / Writes" or "Mimir / Tenants" dashboards), but the "all_user_stats" page should also do it. The math should work as follows:
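Here is a minimal sketch of that calculation, assuming the rule of thumb from the capacity-planning page (roughly 1 CPU core and 1 GB of memory per 25,000 received samples/s for the distributor). The 900,000 samples/s input is illustrative, chosen only to line up with the 36-core figure below:

```python
# Back-of-the-envelope distributor sizing, assuming ~1 CPU core and
# ~1 GB of memory per 25,000 received samples/s (the rule of thumb
# from the capacity-planning docs).
SAMPLES_PER_CORE = 25_000  # samples/s per core (and per GB)

def samples_per_second(active_series: float, scrape_interval_s: float) -> float:
    """Agent-side view the docs assume: active series / scrape interval."""
    return active_series / scrape_interval_s

def distributor_resources(rate: float) -> tuple[float, float]:
    """Return (cpu_cores, memory_gb) for a given ingest rate."""
    return rate / SAMPLES_PER_CORE, rate / SAMPLES_PER_CORE

cores, mem = distributor_resources(900_000)  # illustrative rate
print(f"distributor: CPU={cores:.1f} core(s) MEM={mem:.2f}GB")
# -> distributor: CPU=36.0 core(s) MEM=36.00GB
```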
How to distribute these 36 cores horizontally across the distributor replicas is up to your infrastructure and cost economics.

Note that this capacity planning doesn't take the available network bandwidth into account. I.e. if one deploys three distributor replicas, giving them 12 cores each, it doesn't mean some other resource of the node won't become the bottleneck (the "Mimir / Writes resources" and "Mimir / Writes networking" dashboards can help to monitor the overall capacity).

As for the "Mimir / Scaling" dashboard, it looks at the usage over the past 24 hours (refer to these recording rules). Could it be that the one-hour time window you were looking at didn't represent how the system behaves generally?
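As a hedged illustration of that replica split (the 2-cores-per-replica limit from the large values file and the 36-core total are just the numbers from this thread; any other values work the same way):

```python
import math

# Split a total CPU requirement across replicas with a fixed
# per-replica CPU limit. Both inputs are illustrative.
total_cores = 36.0        # from the samples/s estimate above
cores_per_replica = 2.0   # e.g. the limit in the large values file

replicas = math.ceil(total_cores / cores_per_replica)
print(f"{replicas} replicas at {cores_per_replica} cores each")
# -> 18 replicas at 2.0 cores each
```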
This looks answered, so we decided to close it. Feel free to reopen if you think otherwise.
I've been looking at your capacity planning documentation, but it's not very clear to me how this works.
I've got a few tenants already sending data to my system, and when I add a new tenant I'm trying to estimate how much extra resources I'll need.
For the already-created tenants, I don't have their number of active series.
Can I fall back to what I see on the /distributor/all_user_stats page? And do I use the ingest rate value or the series value?
Say I use the series value; it gives me a result of:
distributor: CPU=61.3Core(s) MEM=61.31GB
But this doesn't take into account how many replicas I'll be running, and what would be the recommended CPU limit value?
Based on the large values file it's set to 2 CPUs, so would this mean I need 31 replicas?
Say I use the rate value instead; it gives me a result of:
distributor: CPU=2.4Core(s) MEM=2.39GB
So two replicas would be enough?
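For what it's worth, here is a minimal sketch of where the two estimates may diverge, assuming the series column is a raw active-series count while the rate column is already in samples/s (the 25,000 samples/s-per-core figure is the docs' rule of thumb; all inputs are back-derived from the numbers above and only illustrative):

```python
SAMPLES_PER_CORE = 25_000

# Feeding the "series" column straight into the per-core formula,
# as if it were a rate (count picked to land on ~61.3):
series = 1_533_000
print(series / SAMPLES_PER_CORE)              # -> 61.32 "cores"

# Feeding the "ingest rate" column (samples/s) into the same formula:
ingest_rate = 60_000
print(ingest_rate / SAMPLES_PER_CORE)         # -> 2.4 cores

# A raw series count must first be divided by the scrape interval
# to become a rate before the per-core figure applies:
scrape_interval_s = 60  # illustrative
print(series / scrape_interval_s / SAMPLES_PER_CORE)  # -> ~1.02 cores
```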
Taking a look at the scaling dashboard, it's telling me I need 7 replicas based on CPU or 11 based on memory.
Currently, I have 5 replicas running.
Looking at their load, it doesn't seem like I need to scale up.
To summarize my questions: I'm essentially looking for
sum(prometheus_tsdb_head_series)
across all tenants. Is <mimir_domain>/distributor/all_user_stats a good source?
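If the query API is reachable on each Prometheus instance, here is a minimal sketch of summing that metric yourself via the standard /api/v1/query endpoint; the instance URLs are placeholders and authentication is omitted:

```python
# Sum prometheus_tsdb_head_series across several Prometheus instances
# via the standard HTTP query API. URLs are placeholders.
import requests

INSTANCES = ["http://prometheus-1:9090", "http://prometheus-2:9090"]
QUERY = "sum(prometheus_tsdb_head_series)"

total = 0.0
for base in INSTANCES:
    resp = requests.get(f"{base}/api/v1/query",
                        params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if result:  # empty if the instance exposes no head-series metric
        total += float(result[0]["value"][1])

print(f"total active series: {total:,.0f}")
```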