-
Notifications
You must be signed in to change notification settings - Fork 642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor freelist management & add more dedicated tests #789
Comments
A good news is that I kept the concurrent test case running (and randomly kill & restart the test) for 22 days and no any issue occurred. |
Here's one obvious addition I also wrote down during #775. ReadWriterRead / Write / EstimatedWritePageSize implementations are already well decoupled through interface functions and thus a very low hanging fruit to refactor. I propose to split this into another struct, allowing us to embed/compose it in the freelist itself. Nice side-effect: Benefit: four exported functions from the freelist interface less. Test cases become very simple serialization/deserialization of integer slices. We can improve this later on more easily, e.g. with vint compression to reduce the serialized size. We should definitely add a checksum hash there too...
Yep, I think this is the biggest one. The main responsibility we still need to rethink seems to track when a page was allocated and when it was freed again (when in terms of transaction ID) and having strong assertions around those. The semantic leaking of "pending" pages also irks me a bit, this should be an implementation detail. The remaining parts that care about different allocation styles and span merging seem OK to me. Even though there seems to be a lot of double-accounting and ensuring all data structures stay consistent at all times. |
As for the testing, we might also want to consider digging up the lazyfs work in #567. I didn't get very far with getting meaningful tests up last year, but torn writes scare me a lot more now :) |
There is a flaw on the freelist management. The bbolt/internal/freelist/shared.go Line 22 in efc3eb6
It only records the page IDs which are allocated from the freelist. But if the freelist can't meet the allocation requirement, then bbolt will try to resize the db file and allocate more disk space. In such case, the new allocated page IDs aren't recorded in the Lines 1159 to 1165 in efc3eb6
Usually it isn't a big problem. Because in such case, bbolt needs to re-map the db file, so it needs to wait for all the existing readonly TXNs to finish. But there is still a very small windows, in which new readonly TXNs may be created after re-mapping but before the writing TXN finishes committing. The new allocated pages in such case are actually invisible to such readonly TXNs. But since the Txid (allocating the page ID) isn't recorded in such case, when the pages are freed in following writing TXNs, they can't be released as long as the readonly TXNs created in the small windows are still running. It isn't a big problem,
|
I am planning to simplify & update the logic related to the pending released free pages, please take a look at doc below and feedback if anyone is interested, |
I think the more general problem here is that we have tracking of transactions and their pages, but the allocation is also somewhere else entirely. I don't want to open another can of worms, but maybe we can also find a good way to incoperate the transaction and page allocation management along with it.
the doc looks good, I see the code has some short-cuts around the txids to avoid looping more than necessary, we should start to simplify first though. |
@tjungblu @fuweid @ivanvc @Elbehery since the refactoring may have a potential big impact, I propose to do it in bbolt 2.0 instead of 1.4.0. Note that etcd 3.6 will depend on bbolt 1.4, so that the refactor will not impact etcd 3.6. I am thinking to stablize the current bbolt main branch, and cut release-1.4 in a proper timing. The near / next plan is to release/tag Noter we also need more review on #789 (comment), Of course target for 2.0. |
Backgroud
There are some long standing data corruption issues, which indicate that there might be potential bug(s) in freelist management.
I am not satisfied with the bbolt freelist management for a long time. It's hard to understand and also tightly coupled with bbolt. I have been thinking to refactor it to improve the understandability & testability.
Refactor
The high level idea is to
What we have done and are going to do:
We also need to continue to refactor & simplify the interface from user (bbolt) perspective, in other words, we should have a clear understanding on how the interface will & should be used by bbolt. The motivation is to improve testablity.
The freelist management is the most sensitive & important part. So let's do it step by step.
Test
Any unit tests are welcome.
But more importantly, we should add dedicated randomized test case to simulate concurrent (multiple) read TXNs and (single) writing TXN.
The text was updated successfully, but these errors were encountered: