[HUDI-8939] Fixing concurrency handling during upgrade #12737
I have yet to review the code, but I am wondering whether we need the added complexity of a new lock provider and new configs. I have a few high-level questions:
- Why do we necessarily need a new NoopLockProvider if we are removing the lock configs during upgrade? Shouldn't the txn manager for the upgrade write client understand based on its write config that lock is not required? Conceptually, just removing lock configs and disabling auto adjustment should be enough.
- Why do we need a new config to decide whether or not to reuse time generator w/ or w/o lock? TimeGenerator API takes a flag to indicate whether locking is required or not. So, if the existing configs are being propagated properly and all callers of TimeGenerator API are passing the flag based on the config, then I don't think there is a need for another config.
- I think the goal was to identify the malicious caller, as we discussed, but we still don't know that right?
Discussed offline, and as such, the patch is good to unblock 1.0.1. But, there are still some open questions:
- Does the TimeGenerator API always need a lock provider, even when there is no real lock requirement (say single writer, all inline table service)?
- Is upgrade (esp. rollbackFailedWritesAndCompact) the only path where this issue happens? For COW tables with an explicit InProcessLockProvider configured, I have noticed that testPartitionFieldsWithUpgrade fails due to an NPE after upgrade. This patch has somehow fixed it, but I don't have a good understanding of what exactly was causing that NPE. For ref, a draft patch based off of current master that repros the COW issue - [DO NOT MERGE] Investigate COW failure for null lock provider #12739
Let's revisit the above soon.
* minor fixes to upgrade path
* Fixes for concurrency handling during upgrade
* fix build failure
---------
Co-authored-by: Sagar Sumit <[email protected]>
Change Logs
Problem scenario:
The root cause is re-entrant locking.
We are making 3 fixes in this patch. With all of the fixes, the control flow is as follows:
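To illustrate the failure mode, here is a minimal sketch of what re-entrant locking looks like with a lock that does not allow the holder to acquire it again. NonReentrantLock and ReentrantLockingDemo are hypothetical names for illustration only; they are not Hudi classes, and the Semaphore is just a stand-in for a lock provider that rejects a second acquire.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for a lock that is NOT re-entrant: a second acquire
// by the caller that already holds the lock cannot succeed.
class NonReentrantLock {
  private final Semaphore permit = new Semaphore(1);

  // Non-blocking attempt; returns false if the lock is already held.
  boolean tryLock() {
    return permit.tryAcquire();
  }

  void unlock() {
    permit.release();
  }
}

class ReentrantLockingDemo {
  // Simulates an outer operation (for example, an upgrade) that holds the lock
  // and then runs an inner step that tries to take the same lock again.
  // Returns whether the inner acquire succeeded.
  static boolean nestedAcquire(NonReentrantLock lock) {
    if (!lock.tryLock()) {
      throw new IllegalStateException("outer acquire failed");
    }
    try {
      boolean inner = lock.tryLock();
      if (inner) {
        lock.unlock();
      }
      return inner;
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    // prints "inner acquire succeeded: false"
    System.out.println("inner acquire succeeded: " + nestedAcquire(new NonReentrantLock()));
  }
}
```

The inner step fails even though there is only one writer, which is why the nested call has to either skip locking or run against a lock that always grants.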
Dissecting each fix:
After this fix, the control flow is as follows:
Again, we were still hitting the exception.
TransactionManager was explicitly setting the lock provider to InProcessLockProvider when no LockProvider was configured. If the user configures one explicitly, txnManager re-uses it.
Introducing NoopLockProvider, which simply allows anyone to acquire the lock (equivalent to a single writer). So, for UpgradeHandler code blocks, we override the lock provider to use NoopLockProvider.
After this fix, the control flow is as follows:
Even with the above fix, we were still hitting re-entrant locks.
So, with all 3 of the above fixes, our solution is as follows:
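A rough sketch of the idea, assuming a plain java.util.concurrent.locks.Lock rather than Hudi's actual LockProvider interface. NoopLock and LockSelection are hypothetical names, the ReentrantLock fallback only stands in for the in-process default described above, and the insideUpgrade flag is an assumption about how the override could be expressed; this is not the code added by the patch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical no-op lock: every acquire succeeds immediately and release
// does nothing, so nested acquires can never block or fail.
class NoopLock implements Lock {
  @Override public void lock() {}
  @Override public void lockInterruptibly() {}
  @Override public boolean tryLock() { return true; }
  @Override public boolean tryLock(long time, TimeUnit unit) { return true; }
  @Override public void unlock() {}
  @Override public Condition newCondition() {
    throw new UnsupportedOperationException("conditions are not supported");
  }
}

class LockSelection {
  // Assumed selection rule: inside the upgrade path always use the no-op lock;
  // otherwise re-use the configured lock, or fall back to an in-process default.
  static Lock choose(Lock configured, boolean insideUpgrade) {
    if (insideUpgrade) {
      return new NoopLock();
    }
    return configured != null ? configured : new ReentrantLock();
  }

  public static void main(String[] args) {
    Lock lock = LockSelection.choose(null, true);
    boolean outer = lock.tryLock();
    boolean inner = lock.tryLock(); // nested acquire while "held"
    // prints "outer=true inner=true"
    System.out.println("outer=" + outer + " inner=" + inner);
    lock.unlock();
    lock.unlock();
  }
}
```

A no-op lock gives no mutual exclusion at all, so this only makes sense where a single writer is guaranteed, as in the upgrade code blocks.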
Impact
Seamless upgrade irrespective of lock provider used.
Risk level (write none, low medium or high below)
medium
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".