Add data recovery steps in runbook for block builder alerts #10751

Merged
merged 5 commits into from
Mar 5, 2025
23 changes: 23 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1724,6 +1724,25 @@ How to **investigate**:

- Check the block-builder logs to see what its pods have been busy with. The block-builder logs `start consuming` and `done consuming` messages, which mark per-partition consume cycles. These log records include details about the cycle, such as the Kafka topic's offsets. Troubleshoot based on that.

Data recovery / temporary mitigation:

While the block-builder is still maturing, ingesters are likely still uploading blocks to a backup bucket. For example, if your main bucket is named `*-blocks`, the backup bucket could be `*-ingester-blocks`.

If the block-builder permanently missed consuming some portion of a partition, or an ongoing issue prevents it from consuming, try the following.

If there is a backup `*-ingester-blocks` bucket:

1. First, reconfigure the ingesters to upload blocks to the main `*-blocks` bucket.
2. Use the `copyblocks` tool from Mimir to copy blocks from the backup `*-ingester-blocks` bucket to the main `*-blocks` bucket for the period during which the block-builder did not produce blocks. Start the copy range a few hours (for example, 4 hours) before the issue started.
3. Fix the issue in the block-builder and let it catch up with the backlog. Once it has caught up, you can switch the ingesters back to the backup bucket.
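
The copy range in step 2 can be worked out ahead of time. The following is a minimal Python sketch, not part of Mimir or of `copyblocks`: the function names, the incident timestamps, and the 4-hour default margin are assumptions for illustration. It only computes the time window and checks whether a block's time range falls inside it; how the window is passed to `copyblocks` is not shown.

```python
from datetime import datetime, timedelta, timezone

# Safety margin before the incident start; the runbook suggests roughly 4 hours.
SAFETY_MARGIN = timedelta(hours=4)


def copy_window(issue_start, issue_end, margin=SAFETY_MARGIN):
    """Return the (start, end) time range of blocks to copy (hypothetical helper)."""
    if issue_end < issue_start:
        raise ValueError("issue_end must not be before issue_start")
    return issue_start - margin, issue_end


def overlaps(block_min, block_max, window):
    """True if a block covering [block_min, block_max) intersects the window."""
    start, end = window
    return block_min < end and block_max > start


# Hypothetical incident times, for illustration only.
window = copy_window(
    datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2025, 3, 1, 20, 0, tzinfo=timezone.utc),
)
print(window[0].isoformat(), window[1].isoformat())
```

Any block from the backup bucket whose time range overlaps this window is a candidate for copying; blocks that end before the window starts can be skipped.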

If there is no backup bucket that ingesters upload to:

- Increase the retention of the Kafka topic to buy some time.
- Spin up a new set of block-builders running an older version, with a **new Kafka topic name** so that they do not conflict with the existing block-builders. Choose the last version where no issue was seen; in most cases this is the version before the one that caused the alert.
- If `block-builder.lookback-on-no-commit` does not cover the time when the issue started, set it long enough that the new block-builders start far enough back to cover the missing data.
- If corrupt data in the partition is hard-failing the block-builder, this solution will not work. Keep increasing the Kafka topic retention until the issue is resolved and the block-builder has caught up.
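
To choose a value for `block-builder.lookback-on-no-commit`, it can help to compute how far back the issue started. The sketch below is only an illustration: the helper names, the timestamps, and the 1-hour margin are assumptions, and the `31h` output format assumes the setting accepts an hour-based duration string, which you should confirm against your Mimir version.

```python
import math
from datetime import datetime, timedelta, timezone


def required_lookback(issue_start, now, margin=timedelta(hours=1)):
    """Return how far back consumption must start to cover issue_start."""
    if issue_start > now:
        raise ValueError("issue_start is in the future")
    return (now - issue_start) + margin


def as_hours(duration):
    """Format a timedelta as whole hours, rounded up."""
    return f"{math.ceil(duration.total_seconds() / 3600)}h"


# Hypothetical timestamps, for illustration only.
lookback = required_lookback(
    issue_start=datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc),
    now=datetime(2025, 3, 2, 17, 30, tzinfo=timezone.utc),
)
print(as_hours(lookback))  # prints "31h"
```

The Kafka topic retention has to be at least as long as this lookback, otherwise the records the new block-builders need will already have been deleted.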

### MimirBlockBuilderLagging

This alert fires when the block-builder instances report a large number of unprocessed records in the Kafka partitions.
@@ -1740,6 +1759,8 @@ How to **investigate**:
- Check if the per-partition lag, reported by the `cortex_blockbuilder_consumer_lag_records` metric, has been growing over the past hours.
- Explore the block-builder logs for any errors reported while it processed the partition.
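
One rough way to judge whether lag "has been growing" is to look at a series of samples for a single partition. The following Python sketch is an assumption-laden illustration, not a Mimir feature: it takes lag values you have already exported from `cortex_blockbuilder_consumer_lag_records` (how you fetch them is not shown), and the function name and the majority-of-rises heuristic are hypothetical.

```python
def lag_is_growing(samples, min_increase=0):
    """Heuristic check on lag samples for one partition, ordered oldest to newest.

    Returns True when the net change exceeds min_increase and more than half
    of the consecutive steps are increases.
    """
    if len(samples) < 2:
        return False
    rises = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    net = samples[-1] - samples[0]
    return net > min_increase and rises > (len(samples) - 1) / 2


# Hypothetical hourly samples, for illustration only.
print(lag_is_growing([100, 250, 240, 400, 900]))  # prints True
print(lag_is_growing([900, 400, 450, 100]))       # prints False
```

A single dip between consume cycles does not flip the result, which matters because lag normally drops each time a cycle completes.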

Data recovery / temporary mitigation: Refer to the runbook for `MimirBlockBuilderNoCycleProcessing` above.

### MimirBlockBuilderCompactAndUploadFailed

How it **works**:
@@ -1752,6 +1773,8 @@ How to **investigate**:

- Explore the block-builder logs to check what errors are there.

Data recovery / temporary mitigation: Refer to the runbook for `MimirBlockBuilderNoCycleProcessing` above.

## Errors catalog

Mimir has some codified error IDs that you might see in HTTP responses or logs.