Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor freelist management & add more dedicated tests #789

Open
5 of 6 tasks
ahrtr opened this issue Jul 8, 2024 · 7 comments
Open
5 of 6 tasks

Refactor freelist management & add more dedicated tests #789

ahrtr opened this issue Jul 8, 2024 · 7 comments

Comments

@ahrtr
Copy link
Member

ahrtr commented Jul 8, 2024

Backgroud

There are some long standing data corruption issues, which indicate that there might be potential bug(s) in freelist management.

I am not satisfied with the bbolt freelist management for a long time. It's hard to understand and also tightly coupled with bbolt. I have been thinking to refactor it to improve the understandability & testability.

Refactor

The high level idea is to

  • simplify the implementation to improve understandability;
  • and introduce interface and decouple it with the bbolt TXN workflow to improve testability.

What we have done and are going to do:

We also need to continue to refactor & simplify the interface from user (bbolt) perspective, in other words, we should have a clear understanding on how the interface will & should be used by bbolt. The motivation is to improve testablity.

The freelist management is the most sensitive & important part. So let's do it step by step.

Test

Any unit tests are welcome.

But more importantly, we should add dedicated randomized test case to simulate concurrent (multiple) read TXNs and (single) writing TXN.

  • We need to record all the requests or operations sent to the freelist module;
  • We need to set up a list of expected behavior or invariable properties, and verify all the invariable properties are not broken during the test.
@ahrtr
Copy link
Member Author

ahrtr commented Jul 8, 2024

A good news is that I kept the concurrent test case running (and randomly kill & restart the test) for 22 days and no any issue occurred.
Read #769 (comment)

@tjungblu
Copy link
Contributor

tjungblu commented Jul 8, 2024

Here's one obvious addition I also wrote down during #775.

ReadWriter

Read / Write / EstimatedWritePageSize implementations are already well decoupled through interface functions and thus a very low hanging fruit to refactor.

I propose to split this into another struct, allowing us to embed/compose it in the freelist itself.

Nice side-effect: Copyall can go in there as a private implementation function, removed entirely from the interface.
tx_checker.go requires only a list of pending and free pages, which can be handled with other functions + sorting.

Benefit: four exported functions from the freelist interface less. Test cases become very simple serialization/deserialization of integer slices. We can improve this later on more easily, e.g. with vint compression to reduce the serialized size. We should definitely add a checksum hash there too...


decouple it with the bbolt TXN workflow

Yep, I think this is the biggest one. The main responsibility we still need to rethink seems to track when a page was allocated and when it was freed again (when in terms of transaction ID) and having strong assertions around those. The semantic leaking of "pending" pages also irks me a bit, this should be an implementation detail.

The remaining parts that care about different allocation styles and span merging seem OK to me. Even though there seems to be a lot of double-accounting and ensuring all data structures stay consistent at all times.

@tjungblu
Copy link
Contributor

tjungblu commented Jul 8, 2024

As for the testing, we might also want to consider digging up the lazyfs work in #567. I didn't get very far with getting meaningful tests up last year, but torn writes scare me a lot more now :)

@ahrtr
Copy link
Member Author

ahrtr commented Jul 18, 2024

There is a flaw on the freelist management.

The allocs is a mapping of Pgid -> Txid, see below,

allocs map[common.Pgid]common.Txid // mapping of Txid that allocated a pgid.

It only records the page IDs which are allocated from the freelist. But if the freelist can't meet the allocation requirement, then bbolt will try to resize the db file and allocate more disk space. In such case, the new allocated page IDs aren't recorded in the allocs.

bbolt/db.go

Lines 1159 to 1165 in efc3eb6

p.SetId(db.freelist.Allocate(txid, count))
if p.Id() != 0 {
return p, nil
}
// Resize mmap() if we're at the end.
p.SetId(db.rwtx.meta.Pgid())

Usually it isn't a big problem. Because in such case, bbolt needs to re-map the db file, so it needs to wait for all the existing readonly TXNs to finish. But there is still a very small windows, in which new readonly TXNs may be created after re-mapping but before the writing TXN finishes committing.

The new allocated pages in such case are actually invisible to such readonly TXNs. But since the Txid (allocating the page ID) isn't recorded in such case, when the pages are freed in following writing TXNs, they can't be released as long as the readonly TXNs created in the small windows are still running.

It isn't a big problem,

  • it's low possibility to run into such case as mentioned above.
  • even when it happens, the only side effect is that the freed pages can't be released in time, accordingly can't be reused in time.

@ahrtr ahrtr pinned this issue Jul 18, 2024
@ahrtr
Copy link
Member Author

ahrtr commented Jul 18, 2024

I am planning to simplify & update the logic related to the pending released free pages, please take a look at doc below and feedback if anyone is interested,
https://docs.google.com/spreadsheets/d/1T7ORCbv3mguq1hQoZkUg0XEzuUXFrNFFj1sUfQysZ2k/edit?gid=0#gid=0

@tjungblu
Copy link
Contributor

tjungblu commented Jul 22, 2024

It isn't a big problem,

I think the more general problem here is that we have tracking of transactions and their pages, but the allocation is also somewhere else entirely. I don't want to open another can of worms, but maybe we can also find a good way to incoperate the transaction and page allocation management along with it.

I am planning to simplify & update the logic related to the pending released free pages

the doc looks good, I see the code has some short-cuts around the txids to avoid looping more than necessary, we should start to simplify first though.

@ahrtr
Copy link
Member Author

ahrtr commented Oct 28, 2024

@tjungblu @fuweid @ivanvc @Elbehery since the refactoring may have a potential big impact, I propose to do it in bbolt 2.0 instead of 1.4.0. Note that etcd 3.6 will depend on bbolt 1.4, so that the refactor will not impact etcd 3.6. I am thinking to stablize the current bbolt main branch, and cut release-1.4 in a proper timing. The near / next plan is to release/tag 1.4.0-alpha.1 or 1.4.0-beta.1 as mentioned in #553 (comment) cc @jmhbnz @serathius @spzala

Noter we also need more review on #789 (comment), Of course target for 2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants