Bug: Store-gateway is stuck in starting phase #10381
-
@narqo can you help here?
-
From the store-gateway's logs in the screenshot, can you tell whether the system managed to load all the discovered blocks, or whether it is stuck loading some of the remaining ones (e.g. compare the IDs of the blocks in the …)? You may want to collect a goroutine profile and a Go runtime trace to explore where exactly it's stuck.
Also, could you share more details about the system? Which version of Mimir is this? Can you share the configuration options you're running with (please make sure to redact any sensitive information from the config)? Can you attach the whole log file from the point of the store-gateway's start (as text, not a screenshot)?
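For illustration, a minimal sketch of grabbing a goroutine dump and a short runtime trace from a running Mimir process over its HTTP port; the host and port here are assumptions, adjust them to your -server.http-listen-port:

```go
// Sketch: download a goroutine dump and a 5s runtime trace from the pprof
// endpoints of a running Mimir process. Host/port are assumptions.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func fetch(url, out string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("GET %s: %v", url, err)
	}
	defer resp.Body.Close()

	f, err := os.Create(out)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s", out)
}

func main() {
	base := "http://localhost:8080" // assumed HTTP listen address
	// Human-readable dump of all goroutines (shows where block loading is stuck).
	fetch(base+"/debug/pprof/goroutine?debug=2", "goroutines.txt")
	// 5-second runtime trace, viewable with `go tool trace trace.out`.
	fetch(base+"/debug/pprof/trace?seconds=5", "trace.out")
}
```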
-
[Attachments: config file and log]
-
It is continuously loading new blocks. I've been unable to query anything older than the last 24 hours. Previously I was able to query 90 days of data, but after pushing the last 80 GB of TSDB data it is stuck in the starting phase.
-
@pracucci can you help here?
-
If the store-gateway is stuck in the starting phase and the local disk utilization is also growing, then it means the store-gateway is loading the new blocks (maybe very slowly, but loading). On the other hand, if the disk utilization is not growing, then it does look stuck, as you say. Which one of the two is it?
-
The disk space is growing. In the current scenario we saw something interesting: the blocks at /media/storage/tsdb-sync/anonymous are at …
-
We tested the same thing on a K8s cluster with a default config (we just added S3 credentials), and the store-gateway is still loading new blocks.
-
I counted the TSDB blocks; there are a total of …
-
Based on the points above, it seems that a single instance of …
-
Is there any tool that can validate all the blocks in S3? We had around 1 TB of blocks, so we pushed them directly to S3. (We tested with ingester backfill, but a single instance was unable to handle this volume.) I have tested with the mimirtool bucket validation command, and I don't see any errors there.
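As a rough complement to mimirtool's bucket validation, here is a sketch (not an official tool) that walks a tenant's prefix in S3 and flags any block directory missing its meta.json, which is a common cause of sync trouble. The bucket name and tenant prefix are assumptions:

```go
// Sketch: list block directories under a tenant prefix and report any block
// that has no meta.json. Bucket name and tenant prefix are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx) // credentials from env / shared config
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket := "my-mimir-blocks" // assumption: your blocks bucket
	prefix := "anonymous/"      // assumption: tenant used for direct uploads

	blocks := map[string]bool{} // block ULID -> has meta.json
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: &bucket,
		Prefix: &prefix,
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			rest := strings.TrimPrefix(*obj.Key, prefix)
			parts := strings.SplitN(rest, "/", 2)
			if len(parts) != 2 {
				continue
			}
			ulid := parts[0]
			if _, seen := blocks[ulid]; !seen {
				blocks[ulid] = false
			}
			if parts[1] == "meta.json" {
				blocks[ulid] = true
			}
		}
	}
	for ulid, hasMeta := range blocks {
		if !hasMeta {
			fmt.Println("block missing meta.json:", ulid)
		}
	}
	fmt.Println("total blocks:", len(blocks))
}
```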
-
Also, if you need any metrics from Mimir, let me know; we have Prometheus scraping Mimir.
-
@narqo @pracucci Now the compactor is able to update the bucket index (mimir/pkg/storage/tsdb/bucketindex/updater.go, line 149 at fc8af05). But the tsdb-sync done by the store-gateway to fetch the index-headers is taking a long time (45 minutes per block). Currently the store-gateway has synced 414718 blocks, while the compactor's latest update lists 414728 blocks. Because we're still ingesting data, the compactor keeps compacting and uploading new blocks to S3, and since the tsdb-sync is slow the store-gateway never catches up with the blocks listed in bucket-index.json.gz. Below is the latest config. The average index file in S3 is 50 MB to 160 MB.
[Attachment: latest config]
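If it helps to cross-check those block counts, here is a sketch that counts the blocks listed in a tenant's bucket-index.json.gz downloaded locally. The file path is an assumption, and the "blocks"/"updated_at" field names are my reading of the index format, not something stated in this thread:

```go
// Sketch: count the blocks listed in a locally downloaded bucket-index.json.gz.
// Path and JSON field names are assumptions.
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("bucket-index.json.gz") // downloaded from <bucket>/<tenant>/
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	var idx struct {
		UpdatedAt int64             `json:"updated_at"`
		Blocks    []json.RawMessage `json:"blocks"`
	}
	if err := json.NewDecoder(gz).Decode(&idx); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("blocks in index: %d (updated_at: %d)\n", len(idx.Blocks), idx.UpdatedAt)
}
```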
-
Do you still run a single instance of Mimir? How many CPU cores does the VM have? The compactor can be scaled out both horizontally and vertically (see the "scaling" section of its docs). You can see whether it is overloaded and needs more resources in the "Mimir / Compactor" and "Mimir / Compactor resources" dashboards; could you post screenshots of these dashboards? As for the store-gateway, you may check if its … If I follow you right, the clue to where your store-gateway is busy is in this code path in the …
-
We also noticed that the store-gateway is loading the same blocks again and again.
-
We have now disabled lazy loading. The store-gateway is synced with bucket-index.json. Now it is generating …
-
In the logs you posted, the chunks …
-
No, it is not crashing. After changing the config, we did restart the Mimir process.
-
Also, looking at the durations in the provided logs: if loading one block takes ~450 ms, then for 400K blocks and the …
I.e., in theory, the store-gateway needs about 2 hours just to load the blocks' meta-data from the bucket. Also, from the logs, Mimir is still running in monolithic mode, meaning that all of its components are actively fighting for the VM's resources: CPU, memory AND network bandwidth. Note that your VM only has 8 CPU cores, per your …
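As a back-of-the-envelope check of that estimate, here is a tiny sketch of the arithmetic; the concurrency value is an assumption, not something stated in the thread:

```go
// Rough estimate: total meta-sync time ≈ blocks × per-block latency / concurrency.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		blocks      = 400_000
		perBlock    = 450 * time.Millisecond
		concurrency = 25 // assumed sync concurrency; not from the thread
	)
	total := time.Duration(blocks/concurrency) * perBlock
	fmt.Println(total) // 2h0m0s
}
```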
-
I don't see any network issues regarding S3. Also, now that the store-gateway has loaded all the blocks and generated the index-headers and sparse index-headers, why is it still in the starting phase? What are the possible reasons? We tested the same setup on a 32-CPU, 128 GB RAM VM and it is still in the starting phase.
-
@narqo @pracucci We did a small experiment. We configured a new, empty bucket for Mimir. Because of this, all the components were in the running state.
-
@narqo Why does this happen every 15 minutes?
-
This drop happens consistently every 20 minutes.
-
Is it related to #8166 (comment)?
-
Thanks @pracucci @narqo. This issue is now resolved. We moved to a multi-tenant architecture where we divided the current 400k TSDB blocks into 50k blocks per tenant, and the start time dropped from 2 hours to 5 minutes (sharing a screenshot). Compute consumption also decreased. Previously the compactor was stuck processing one tenant, which was …
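For anyone following along, reads and writes in a multi-tenant Mimir setup are scoped by the X-Scope-OrgID header. A minimal sketch of querying a single tenant follows; the host, port, and tenant ID are assumptions:

```go
// Sketch: run an instant query against one tenant, selected via the
// X-Scope-OrgID header. Host, port, and tenant ID are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	endpoint := "http://localhost:8080/prometheus/api/v1/query"
	params := url.Values{"query": {"up"}}

	req, err := http.NewRequest(http.MethodGet, endpoint+"?"+params.Encode(), nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("X-Scope-OrgID", "tenant-01") // hypothetical tenant ID

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```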
-
What is the bug?
I recently uploaded around 90 GB of TSDB data directly to S3; after that, my store-gateway is stuck in the starting phase. I have enabled debug logs but don't see any errors (sharing a screenshot). I have used this approach more than 7 times before, but now it is causing this problem. [Context: doing an InfluxDB to Mimir migration, using promtool to generate the TSDB blocks.]
How to reproduce it?
Push TSDB blocks to S3 and query the data using Grafana for timestamps older than 24 hours.
What did you think would happen?
I don't know why it is taking so long to load the TSDB blocks. It was working previously.
What was your environment?
Deployment was done using Puppet on a VM. Currently running a single instance on one VM.
Any additional context to share?
No response