Bug: Store-gateway is stuck in starting phase #10381
-
@narqo can you help here?
-
From the store-gateway's logs in the screenshot, can you tell whether the system managed to load all the discovered blocks, or whether it is stuck loading some of the remaining ones (e.g. compare the IDs of the blocks in the …)? You may want to collect a goroutine profile and a Go runtime trace to explore where exactly it's stuck.
Also, could you share more details about the system? Which version of Mimir is this? Can you share the configuration options you're running with (please make sure to redact any sensitive information from the config)? Can you attach the whole log file from the point of the store-gateway's start (as text, not a screenshot)?
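For illustration, a minimal sketch of grabbing a goroutine dump and a short runtime trace from a running Mimir process over its HTTP port; the host and port here are assumptions, adjust them to your -server.http-listen-port:

```go
// Sketch: download a goroutine dump and a 5s runtime trace from the pprof
// endpoints of a running Mimir process. Host/port are assumptions.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func fetch(url, out string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("GET %s: %v", url, err)
	}
	defer resp.Body.Close()

	f, err := os.Create(out)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s", out)
}

func main() {
	base := "http://localhost:8080" // assumed HTTP listen address
	// Human-readable dump of all goroutines (shows where block loading is stuck).
	fetch(base+"/debug/pprof/goroutine?debug=2", "goroutines.txt")
	// 5-second runtime trace, viewable with `go tool trace trace.out`.
	fetch(base+"/debug/pprof/trace?seconds=5", "trace.out")
}
```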
-
[Attachments: config file and log]
-
It is continuously loading new blocks. I've been unable to query anything older than the last 24 hours. Previously I was able to query 90 days of data, but after pushing the last 80 GB of TSDB data it is stuck in the starting phase.
-
@pracucci can you help here?
-
If the store-gateway is stuck in the starting phase and the local disk utilization is also growing, then it means the store-gateway is loading the new blocks (maybe very slowly, but loading). On the other hand, if the disk utilization is not growing, then it does look stuck, as you say. Which one of the two is it?
-
The disk space is growing. In the current scenario we saw something interesting: the blocks at /media/storage/tsdb-sync/anonymous are at …
-
We tested the same thing on a K8s cluster with a default config (we just added S3 credentials), and the store-gateway is still loading new blocks.
-
I counted the TSDB blocks; there are a total of …
-
Based on the points above, it seems that a single instance of …
-
Is there any tool that can validate all the blocks in S3? We had around 1 TB of blocks, so we pushed them directly to S3. (We tested with ingester backfill, but a single instance was unable to handle this volume.) I have tested with the mimirtool bucket validation command, and I don't see any errors there.
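As a rough complement to mimirtool's bucket validation, here is a sketch (not an official tool) that walks a tenant's prefix in S3 and flags any block directory missing its meta.json, which is a common cause of sync trouble. The bucket name and tenant prefix are assumptions:

```go
// Sketch: list block directories under a tenant prefix and report any block
// that has no meta.json. Bucket name and tenant prefix are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx) // credentials from env / shared config
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket := "my-mimir-blocks" // assumption: your blocks bucket
	prefix := "anonymous/"      // assumption: tenant used for direct uploads

	blocks := map[string]bool{} // block ULID -> has meta.json
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: &bucket,
		Prefix: &prefix,
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			rest := strings.TrimPrefix(*obj.Key, prefix)
			parts := strings.SplitN(rest, "/", 2)
			if len(parts) != 2 {
				continue
			}
			ulid := parts[0]
			if _, seen := blocks[ulid]; !seen {
				blocks[ulid] = false
			}
			if parts[1] == "meta.json" {
				blocks[ulid] = true
			}
		}
	}
	for ulid, hasMeta := range blocks {
		if !hasMeta {
			fmt.Println("block missing meta.json:", ulid)
		}
	}
	fmt.Println("total blocks:", len(blocks))
}
```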
-
Also, if you need any metrics from Mimir, let me know; we have Prometheus scraping Mimir.
-
@narqo @pracucci Now the compactor is able to update the bucket index (mimir/pkg/storage/tsdb/bucketindex/updater.go, line 149 at fc8af05). But the tsdb-sync done by the store-gateway to fetch the index-headers is taking a long time (45 minutes per block). Currently the store-gateway has synced 414718 blocks, while the compactor's latest update lists 414728 blocks. Because we're still ingesting data, the compactor keeps compacting and uploading new blocks to S3, and since the tsdb-sync is slow the store-gateway never catches up with the blocks listed in bucket-index.json.gz. Below is the latest config. The average index file in S3 is 50 MB to 160 MB.
[Attachment: latest config]
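If it helps to cross-check those block counts, here is a sketch that counts the blocks listed in a tenant's bucket-index.json.gz downloaded locally. The file path is an assumption, and the "blocks"/"updated_at" field names are my reading of the index format, not something stated in this thread:

```go
// Sketch: count the blocks listed in a locally downloaded bucket-index.json.gz.
// Path and JSON field names are assumptions.
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("bucket-index.json.gz") // downloaded from <bucket>/<tenant>/
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	var idx struct {
		UpdatedAt int64             `json:"updated_at"`
		Blocks    []json.RawMessage `json:"blocks"`
	}
	if err := json.NewDecoder(gz).Decode(&idx); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("blocks in index: %d (updated_at: %d)\n", len(idx.Blocks), idx.UpdatedAt)
}
```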
-
Do you still run a single instance of Mimir? How many CPU cores does the VM have? The compactor can be scaled out both horizontally and vertically (see the "scaling" section of its docs). You can see whether it is overloaded and needs more resources in the "Mimir / Compactor" and "Mimir / Compactor resources" dashboards; could you post screenshots of these dashboards? As for the store-gateway, you may check if its … If I follow you right, the clue to where your store-gateway is busy is in this code path in the …
-
We also noticed that the store-gateway is loading the same blocks again and again.
-
We have now disabled lazy loading. The store-gateway is synced with bucket-index.json. Now it is generating …
-
In the logs you posted, the chunks …
-
No, it is not crashing. After changing the config, we did restart the Mimir process.
-
Also, looking at the durations in the provided logs: if loading one block takes ~450 ms, then for 400K blocks and the …
I.e., in theory, the store-gateway needs about 2 hours just to load the blocks' meta-data from the bucket. Also, from the logs, Mimir is still running in monolithic mode, meaning that all of its components are actively fighting for the VM's resources: CPU, memory AND network bandwidth. Note that your VM only has 8 CPU cores, per your …
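As a back-of-the-envelope check of that estimate, here is a tiny sketch of the arithmetic; the concurrency value is an assumption, not something stated in the thread:

```go
// Rough estimate: total meta-sync time ≈ blocks × per-block latency / concurrency.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		blocks      = 400_000
		perBlock    = 450 * time.Millisecond
		concurrency = 25 // assumed sync concurrency; not from the thread
	)
	total := time.Duration(blocks/concurrency) * perBlock
	fmt.Println(total) // 2h0m0s
}
```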
-
I don't see any network issues regarding S3. Also, now that the store-gateway has loaded all the blocks and generated the index-headers and sparse index-headers, why is it still in the starting phase? What are the possible reasons? We tested the same setup on a 32-CPU, 128 GB RAM VM and it is still in the starting phase.
-
@narqo @pracucci We did a small experiment. We configured a new, empty bucket for Mimir. Because of this, all the components were in the running state.
-
@narqo Why does this happen every 15 minutes?
-
This drop happens consistently every 20 minutes.
-
Is it related to #8166 (comment)?
-
Thanks @pracucci @narqo. This issue is now resolved. We moved to a multi-tenant architecture where we divided the current 400k TSDB blocks into 50k blocks per tenant, and the start time dropped from 2 hours to 5 minutes (sharing a screenshot). Compute consumption also decreased. Previously the compactor was stuck processing one tenant, which was …
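For anyone following along, reads and writes in a multi-tenant Mimir setup are scoped by the X-Scope-OrgID header. A minimal sketch of querying a single tenant follows; the host, port, and tenant ID are assumptions:

```go
// Sketch: run an instant query against one tenant, selected via the
// X-Scope-OrgID header. Host, port, and tenant ID are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	endpoint := "http://localhost:8080/prometheus/api/v1/query"
	params := url.Values{"query": {"up"}}

	req, err := http.NewRequest(http.MethodGet, endpoint+"?"+params.Encode(), nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("X-Scope-OrgID", "tenant-01") // hypothetical tenant ID

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```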
-
What is the bug?
I recently uploaded around 90 GB of TSDB data directly to S3; after that, my store-gateway is stuck in the starting phase. I have enabled debug logs but don't see any errors (sharing a screenshot). I have used this approach more than 7 times before, but now it is causing this problem. [Context: doing an InfluxDB to Mimir migration, using promtool to generate the TSDB blocks.]
How to reproduce it?
Push TSDB blocks to S3 and query the data using Grafana for timestamps older than 24 hours.
What did you think would happen?
I don't know why it is taking so long to load the TSDB blocks. It was working previously.
What was your environment?
Deployment was done using Puppet on a VM. Currently running a single instance on one VM.
Any additional context to share?
No response