Bug: Samples loss in ingester after rebalance/exit read-only mode #10582

Open
alexweav opened this issue Feb 4, 2025 · 6 comments

@alexweav
Contributor

alexweav commented Feb 4, 2025

During ingester scale-up, two things are likely to happen:

  • Ingesters that are currently in read-only mode exit read-only mode and begin accepting writes again
  • Series are rebalanced among existing ingesters

The series churn on a TSDB instance that has existed for a while, coupled with leaving read-only mode, is extremely likely to trigger multiple head GCs and compactions in a row (logs)

This seems to trigger a known race condition in TSDB between writes and the head GC: prometheus/prometheus#8365

This manifests as samples going missing in an ingester during scale-up. The trace below, for a request affected by this bug, shows a full success for the expected 9 samples:

[Trace screenshot: the write request reports success for all 9 samples]

But when the ingester is queried directly (verifiable with the grpcurl-query-ingesters tool), only 8 samples were present for the expected time range.

If this happens when we only barely have quorum, it can result in sample loss in queries. For example, with RF=3 but only 2 available replicas (due to restarts or read-only mode), if the bug occurs on one of the 2 remaining ingesters, a sample ends up present on only 1 of the 3 ingesters. The roughly 1/3 of queries that don't hit that ingester will not include a sample that the database accepted.

@alexweav
Contributor Author

alexweav commented Feb 4, 2025

My current understanding of the prometheus bug:

  • A writer can fetch a *memSeries it is preparing to append to. This briefly takes the read lock for that stripe, but releases it and returns the reference
  • Then GC can happen, which deletes that *memSeries entirely. Whatever previously fetched the memSeries is now holding on to something that has been deleted from stripeSeries's maps
  • That holder can still call append() on the zombie memSeries, and the appended samples are lost (see the sketch below)
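
To make the sequence concrete, here is a minimal, self-contained toy model of the race. This is not the actual TSDB code; the types and method bodies are simplified stand-ins for stripeSeries, memSeries, getByID and gc:

package main

import (
	"fmt"
	"sync"
)

type memSeries struct {
	mu      sync.Mutex
	samples []float64
}

type stripeSeries struct {
	mu     sync.RWMutex
	series map[uint64]*memSeries
}

// getByID mirrors the problematic pattern: take the stripe read lock,
// release it, and only then hand the bare pointer back to the caller.
func (s *stripeSeries) getByID(id uint64) *memSeries {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.series[id]
}

// gc deletes the series from the map; anyone still holding the pointer
// now has a "zombie" series that is unreachable from the head.
func (s *stripeSeries) gc(id uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.series, id)
}

func main() {
	ss := &stripeSeries{series: map[uint64]*memSeries{1: {}}}

	ms := ss.getByID(1) // writer fetches the series; the stripe lock is already released
	ss.gc(1)            // GC runs in between and drops the series from the map

	ms.mu.Lock()
	ms.samples = append(ms.samples, 42) // the append "succeeds" on the zombie series
	ms.mu.Unlock()

	fmt.Println("series still in head:", ss.series[1] != nil) // false: the sample is lost
}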

Proposals for fixing:

  • From the prior comments, it looks like we don't want to force writers to lock when taking a memSeries and hold the lock all the way through the append. That would probably solve the problem, but maintainers suggested it would cause a perf impact (perhaps this is worth measuring)
  • Alternatively, the GC shouldn't cull a memSeries if something's holding onto it. Maybe the memSeries pointer can be wrapped with something that's ref-counted, and GC will skip deletion if there are outstanding references?
  • When the GC deletes a memSeries, it can set a flag on that memSeries marking it as "invalid." This flag would prevent appends from succeeding and indicate to the holder that it needs to re-acquire the series. If there are no outstanding references to that memSeries, this is a no-op. (A rough sketch follows this list.)
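
A rough sketch of that third option. Every name here, including the deleted field, errSeriesDeleted and the helper functions, is invented for illustration and is not the actual memSeries API:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// memSeries is a stand-in; the extra `deleted` flag is the only point here.
type memSeries struct {
	sync.Mutex
	deleted bool // set by GC under the series lock
	samples []float64
}

var errSeriesDeleted = errors.New("series was garbage-collected, re-acquire it")

// gcSeries marks the series as deleted while holding its lock, so an
// in-flight appender detects the deletion instead of writing into a zombie.
func gcSeries(s *memSeries) {
	s.Lock()
	s.deleted = true
	s.Unlock()
}

// appendSample refuses to append to a series that GC has already dropped,
// signalling the caller to look the series up (or re-create it) again.
func appendSample(s *memSeries, v float64) error {
	s.Lock()
	defer s.Unlock()
	if s.deleted {
		return errSeriesDeleted
	}
	s.samples = append(s.samples, v)
	return nil
}

func main() {
	s := &memSeries{}
	gcSeries(s)                      // GC drops the series and flags it
	fmt.Println(appendSample(s, 42)) // caller sees the error and re-acquires instead of losing the sample
}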

Alternatively, we could avoid the GC entirely by forcibly preventing GCs for a while during a rebalance. This does not fix the core TSDB bug; it only sidesteps the problem under the conditions in which we detected it. The downside is that we don't know whether this bug is reachable under other conditions, so this approach might leave other vectors for the problem lying around. Since it's less ideal, I did less research down this path for now. (A hypothetical sketch of such a guard follows the bullet below.)

  • Based on logs, we can confirm that the offending GC is the one called by truncateSeriesAndChunkDiskMapper through truncateMemory and then Truncate() -> does this need to be called in the first place in our situation?
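
For completeness, a hypothetical sketch of what such a rebalance guard could look like on the ingester side. The package, type, field and method names below are invented for illustration and are not actual Mimir code:

package ingester

import "sync"

// headTruncationGuard is a hypothetical guard around head truncation.
type headTruncationGuard struct {
	mu          sync.Mutex
	rebalancing bool // set while the ingester is leaving read-only mode / rebalancing
}

// maybeTruncate skips head truncation (and therefore the GC it triggers)
// while a rebalance is in progress, deferring it to the next truncation tick.
func (g *headTruncationGuard) maybeTruncate(truncate func() error) error {
	g.mu.Lock()
	skip := g.rebalancing
	g.mu.Unlock()
	if skip {
		return nil // nothing truncated; retry on the next tick
	}
	return truncate()
}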

@alexweav
Contributor Author

alexweav commented Feb 4, 2025

I've written a test: prometheus/prometheus@6ea3c2d that seems to reproduce the TSDB bug.

@colega
Contributor

colega commented Feb 12, 2025

Uh, this is an interesting one.

I think this part of your test isn't right:

		// The below logic is taken from Append().
		// It's just hand-repeating what Append() does internally. This test does the same thing as the previous test.

There's one thing missing in your dissection, the statement:

		s.pendingCommit = true

This is the statement that should prevent it from being gc-ed:

		if len(series.mmappedChunks) > 0 || series.headChunks != nil || series.pendingCommit ||
			(series.ooo != nil && (len(series.ooo.oooMmappedChunks) > 0 || series.ooo.oooHeadChunk != nil)) {
			seriesMint := series.minTime()
			if seriesMint < actualMint {
				actualMint = seriesMint
			}
			return
		}
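
In other words, the hand-rolled dissection would need to set that flag under the series lock, the same way Append() and Commit() do. A minimal sketch, reusing the variable name from the snippets above:

	s.Lock()
	s.pendingCommit = true // keeps Head.gc() from dropping the series mid-append
	s.Unlock()

	// ... and after the sample has been committed (or rolled back):
	s.Lock()
	s.pendingCommit = false
	s.Unlock()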

But still, good catch 👏 , I understand that the race is happening here:

	s := a.head.series.getByID(chunks.HeadSeriesRef(ref))
	if s == nil {
		var err error
		s, _, err = a.getOrCreate(lset)
		if err != nil {
			return 0, err
		}
	}

	s.Lock()

As you mentioned, we unlock the stripe and then we lock the series.
We should lock the series before we unlock the stripe in order to prevent this from happening.

I think this could be solved by adding an extra arg to getByID, getOrCreate and getByHash

func (s *stripeSeries) getByID(id chunks.HeadSeriesRef, locked bool) *memSeries {

WDYT?
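
A minimal sketch of how that variant could look, assuming the extra argument means "return the series with its lock already held". The body below is illustrative and may differ from the real getByID in detail:

func (s *stripeSeries) getByID(id chunks.HeadSeriesRef, locked bool) *memSeries {
	i := uint64(id) & uint64(s.size-1)

	s.locks[i].RLock()
	defer s.locks[i].RUnlock()

	series := s.series[i][id]
	if series != nil && locked {
		// Take the series lock while the stripe lock is still held, so a
		// concurrent gc() cannot delete the series before the caller owns it.
		series.Lock()
	}
	return series
}

The caller (e.g. Append()) would then skip its own s.Lock() when it requested a locked series, and keep holding the lock at least until it has set s.pendingCommit.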

@colega
Contributor

colega commented Feb 17, 2025

@alexweav do you want to take care of this or would you like me to send a fix for the bug?

@alexweav
Contributor Author

@colega I appreciate the second pair of eyes! I will try to re-work the test taking that into account.

This issue was a bit on the back burner last week, but I'm still interested in submitting the fix, if anything to gain more familiarity with this component. I should be able to devote some more time to it this week.

@colega
Contributor

colega commented Feb 20, 2025

I'll let you send the fix then.

I wouldn't worry about the test so much, though. It's really hard to test this, especially in a way that won't quickly become outdated. I think the important thing is that you found it and you can explain it in words. Concurrency is hard.
