Bug: Samples loss in ingester after rebalance/exit read-only mode #10582

Open
alexweav opened this issue Feb 4, 2025 · 6 comments

@alexweav
Contributor

alexweav commented Feb 4, 2025

During ingester scale-up, two things are likely to happen:

  • Ingesters that are currently in read-only mode exit read-only mode and begin accepting writes again
  • Series are rebalanced among existing ingesters

The series churn on a TSDB instance that has existed for a while, coupled with leaving read-only mode, is extremely likely to trigger multiple head GCs and compactions in a row (logs)

This seems to trigger a known race condition in TSDB between writes and the head GC: prometheus/prometheus#8365

This manifests as samples going missing in an ingester during scale-up. The trace below, for a request affected by this bug, shows a full success for the expected 9 samples:

[Trace screenshot: the write request reports success for all 9 samples]

But when the ingester is queried directly (verifiable with the grpcurl-query-ingesters tool), only 8 samples were present for the expected time range.

If this happens when we only barely have quorum, it can result in sample loss in queries. For example, with RF=3 but only 2 available replicas (due to restarts or read-only mode), if the bug occurs on one of the 2 remaining ingesters, a sample ends up present on only 1 of the 3 ingesters. The roughly 1/3 of queries that don't hit that ingester will not include a sample that the database accepted.

@alexweav
Contributor Author

alexweav commented Feb 4, 2025

My current understanding of the prometheus bug:

  • A writer can fetch a *memSeries it is preparing to append to. This briefly takes the read lock for that stripe, but releases it and returns the reference
  • Then GC can happen, which deletes that *memSeries entirely. Whatever previously fetched the memSeries is now holding on to something that has been deleted from stripeSeries's maps
  • That holder can still call append() on the zombie memSeries, and the appended samples are lost (see the sketch below)
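
To make the sequence concrete, here is a minimal, self-contained toy model of the race. This is not the actual TSDB code; the types and method bodies are simplified stand-ins for stripeSeries, memSeries, getByID and gc:

package main

import (
	"fmt"
	"sync"
)

type memSeries struct {
	mu      sync.Mutex
	samples []float64
}

type stripeSeries struct {
	mu     sync.RWMutex
	series map[uint64]*memSeries
}

// getByID mirrors the problematic pattern: take the stripe read lock,
// release it, and only then hand the bare pointer back to the caller.
func (s *stripeSeries) getByID(id uint64) *memSeries {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.series[id]
}

// gc deletes the series from the map; anyone still holding the pointer
// now has a "zombie" series that is unreachable from the head.
func (s *stripeSeries) gc(id uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.series, id)
}

func main() {
	ss := &stripeSeries{series: map[uint64]*memSeries{1: {}}}

	ms := ss.getByID(1) // writer fetches the series; the stripe lock is already released
	ss.gc(1)            // GC runs in between and drops the series from the map

	ms.mu.Lock()
	ms.samples = append(ms.samples, 42) // the append "succeeds" on the zombie series
	ms.mu.Unlock()

	fmt.Println("series still in head:", ss.series[1] != nil) // false: the sample is lost
}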

Proposals for fixing:

  • From the prior comments, it looks like we don't want to force writers to lock when taking a memSeries and hold the lock all the way through the append. That would probably solve the problem, but maintainers suggested it would cause a perf impact (perhaps this is worth measuring)
  • Alternatively, the GC shouldn't cull a memSeries if something's holding onto it. Maybe the memSeries pointer can be wrapped with something that's ref-counted, and GC will skip deletion if there are outstanding references?
  • When the GC deletes a memSeries, it can set a flag on that memSeries marking it as "invalid." This flag would prevent appends from succeeding and indicate to the holder that it needs to re-acquire the series. If there are no outstanding references to that memSeries, this is a no-op. (A rough sketch follows this list.)
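
A rough sketch of that third option. Every name here, including the deleted field, errSeriesDeleted and the helper functions, is invented for illustration and is not the actual memSeries API:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// memSeries is a stand-in; the extra `deleted` flag is the only point here.
type memSeries struct {
	sync.Mutex
	deleted bool // set by GC under the series lock
	samples []float64
}

var errSeriesDeleted = errors.New("series was garbage-collected, re-acquire it")

// gcSeries marks the series as deleted while holding its lock, so an
// in-flight appender detects the deletion instead of writing into a zombie.
func gcSeries(s *memSeries) {
	s.Lock()
	s.deleted = true
	s.Unlock()
}

// appendSample refuses to append to a series that GC has already dropped,
// signalling the caller to look the series up (or re-create it) again.
func appendSample(s *memSeries, v float64) error {
	s.Lock()
	defer s.Unlock()
	if s.deleted {
		return errSeriesDeleted
	}
	s.samples = append(s.samples, v)
	return nil
}

func main() {
	s := &memSeries{}
	gcSeries(s)                      // GC drops the series and flags it
	fmt.Println(appendSample(s, 42)) // caller sees the error and re-acquires instead of losing the sample
}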

Alternatively, we could avoid the GC entirely by forcibly preventing GCs for a while during a rebalance. This does not fix the core TSDB bug; it only sidesteps the problem under the conditions in which we detected it. The downside is that we don't know whether this bug is reachable under other conditions, so this approach might leave other vectors for the problem lying around. Since it's less ideal, I did less research down this path for now. (A hypothetical sketch of such a guard follows the bullet below.)

  • Based on logs, we can confirm that the offending GC is the one called by truncateSeriesAndChunkDiskMapper through truncateMemory and then Truncate() -> does this need to be called in the first place in our situation?
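
For completeness, a hypothetical sketch of what such a rebalance guard could look like on the ingester side. The package, type, field and method names below are invented for illustration and are not actual Mimir code:

package ingester

import "sync"

// headTruncationGuard is a hypothetical guard around head truncation.
type headTruncationGuard struct {
	mu          sync.Mutex
	rebalancing bool // set while the ingester is leaving read-only mode / rebalancing
}

// maybeTruncate skips head truncation (and therefore the GC it triggers)
// while a rebalance is in progress, deferring it to the next truncation tick.
func (g *headTruncationGuard) maybeTruncate(truncate func() error) error {
	g.mu.Lock()
	skip := g.rebalancing
	g.mu.Unlock()
	if skip {
		return nil // nothing truncated; retry on the next tick
	}
	return truncate()
}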

@alexweav
Contributor Author

alexweav commented Feb 4, 2025

I've written a test: prometheus/prometheus@6ea3c2d that seems to reproduce the TSDB bug.

@colega
Contributor

colega commented Feb 12, 2025

Uh, this is an interesting one.

I think this part of your test isn't right:

		// The below logic is taken from Append().
		// It's just hand-repeating what Append() does internally. This test does the same thing as the previous test.

There's one thing missing in your dissection, the statement:

		s.pendingCommit = true

This is the statement that should prevent it from being gc-ed:

		if len(series.mmappedChunks) > 0 || series.headChunks != nil || series.pendingCommit ||
			(series.ooo != nil && (len(series.ooo.oooMmappedChunks) > 0 || series.ooo.oooHeadChunk != nil)) {
			seriesMint := series.minTime()
			if seriesMint < actualMint {
				actualMint = seriesMint
			}
			return
		}
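
In other words, the hand-rolled dissection would need to set that flag under the series lock, the same way Append() and Commit() do. A minimal sketch, reusing the variable name from the snippets above:

	s.Lock()
	s.pendingCommit = true // keeps Head.gc() from dropping the series mid-append
	s.Unlock()

	// ... and after the sample has been committed (or rolled back):
	s.Lock()
	s.pendingCommit = false
	s.Unlock()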

But still, good catch 👏 , I understand that the race is happening here:

	s := a.head.series.getByID(chunks.HeadSeriesRef(ref))
	if s == nil {
		var err error
		s, _, err = a.getOrCreate(lset)
		if err != nil {
			return 0, err
		}
	}

	s.Lock()

As you mentioned, we unlock the stripe and then we lock the series.
We should lock the series before we unlock the stripe in order to prevent this from happening.

I think this could be solved by adding an extra arg to getByID, getOrCreate and getByHash

func (s *stripeSeries) getByID(id chunks.HeadSeriesRef, locked bool) *memSeries {

WDYT?
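
A minimal sketch of how that variant could look, assuming the extra argument means "return the series with its lock already held". The body below is illustrative and may differ from the real getByID in detail:

func (s *stripeSeries) getByID(id chunks.HeadSeriesRef, locked bool) *memSeries {
	i := uint64(id) & uint64(s.size-1)

	s.locks[i].RLock()
	defer s.locks[i].RUnlock()

	series := s.series[i][id]
	if series != nil && locked {
		// Take the series lock while the stripe lock is still held, so a
		// concurrent gc() cannot delete the series before the caller owns it.
		series.Lock()
	}
	return series
}

The caller (e.g. Append()) would then skip its own s.Lock() when it requested a locked series, and keep holding the lock at least until it has set s.pendingCommit.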

@colega
Contributor

colega commented Feb 17, 2025

@alexweav do you want to take care of this or would you like me to send a fix for the bug?

@alexweav
Contributor Author

@colega I appreciate the second pair of eyes! I will try to re-work the test taking that into account.

This issue was a bit on the back burner last week, but I'm still interested in submitting the fix, if anything to gain more familiarity with this component. I should be able to devote some more time to it this week.

@colega
Contributor

colega commented Feb 20, 2025

I'll let you send the fix then.

I wouldn't worry about the test so much, though. It's really hard to test this, especially in a way that won't quickly become outdated. I think the important thing is that you found it and you can explain it in words. Concurrency is hard.
