Fix [MQB]: mqbc::StorageMgr: Transition to available only when all primary active #416
80f71b6 to e5d59b2
Couple questions...
pinfo.setPrimaryStatus(value);
if (bmqp_ctrlmsg::PrimaryStatus::E_ACTIVE == value) {
    d_fileStores[partitionId]->setPrimary(pinfo.primary(),
Why is this in setPrimaryStatusForPartition and not in setPrimaryForPartition?
FileStore should set the primary only after the primary becomes active, because the FileStore is unable to work with a passive primary. I will rename the function to FileStore::setActivePrimary to make the point clear.
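To illustrate the point above, here is a minimal sketch of the gating. Every type in it (`PrimaryStatus`, `PartitionInfo`, `FileStore`, `StorageSketch`) is a simplified, hypothetical stand-in and not the real mqbc/mqbs class; in particular, starting a newly assigned primary as passive is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the broker types; not the real BlazingMQ classes.
enum class PrimaryStatus { e_UNDEFINED, e_PASSIVE, e_ACTIVE };

struct PartitionInfo {
    std::string   primary;      // name of the assigned primary node
    unsigned int  leaseId = 0;  // lease id of the assignment
    PrimaryStatus status  = PrimaryStatus::e_UNDEFINED;
};

struct FileStore {
    std::string  activePrimary;  // stays empty until a primary is active
    unsigned int activeLeaseId = 0;

    void setActivePrimary(const std::string& node, unsigned int leaseId)
    {
        activePrimary = node;
        activeLeaseId = leaseId;
    }
};

struct StorageSketch {
    std::vector<PartitionInfo> partitions;
    std::vector<FileStore>     fileStores;

    explicit StorageSketch(std::size_t numPartitions)
    : partitions(numPartitions)
    , fileStores(numPartitions)
    {
    }

    // Assigning a primary only records it; the FileStore is not touched.
    // (Assumption: a newly assigned primary starts out passive.)
    void setPrimaryForPartition(std::size_t        partitionId,
                                const std::string& node,
                                unsigned int       leaseId)
    {
        assert(partitionId < partitions.size());
        PartitionInfo& pinfo = partitions[partitionId];
        pinfo.primary        = node;
        pinfo.leaseId        = leaseId;
        pinfo.status         = PrimaryStatus::e_PASSIVE;
    }

    // Only an ACTIVE status hands the primary over to the FileStore.
    void setPrimaryStatusForPartition(std::size_t   partitionId,
                                      PrimaryStatus value)
    {
        assert(partitionId < partitions.size());
        PartitionInfo& pinfo = partitions[partitionId];
        pinfo.status         = value;
        if (PrimaryStatus::e_ACTIVE == value) {
            fileStores[partitionId].setActivePrimary(pinfo.primary,
                                                     pinfo.leaseId);
        }
    }
};
```

In this sketch the FileStore never sees a primary until the status flips to active, which is the invariant the rename to setActivePrimary is meant to make explicit.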
<< " primary, this advisory could "
<< "be from the true one. Will"
<< " buffer the advisory for now.";
d_storageManager_p->bufferPrimaryStatusAdvisory(primaryAdv,
Question: does pinfo.primaryNode() get assigned upon PrimaryStatusAdvisory, or is there another trigger in the FSM flow? If we receive a PrimaryStatusAdvisory and do not have a partition primary, why not assign the primary then?
In FSM mode, the source of truth for partition assignments is the cluster state snapshot in the CSL file. As part of healing, a new leader assigns partitions and then applies the assignments in its first CSL advisory. Primary status advisories can be stale; that's why we have a lot of checks in this function in the first place. My original idea was to simply ignore all primary status advisories and rely purely on the FSM for partition assignments. However, the FSM can heal a replica but neglect to set a primary as active. Thus, I came up with the idea of buffering primary status advisories. If an advisory is not stale (i.e., its primary node and leaseId match the partition's), then we trust the advisory.
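A rough sketch of the buffering idea described above, under stated assumptions: `Advisory`, `BufferedAdvisory`, `HealingReplicaSketch` and `processBufferedAdvisories` are hypothetical names invented for this example, and the real code deals with more cases than the single staleness check shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified types; not the real bmqp_ctrlmsg/mqbc classes.
enum class PrimaryStatus { e_PASSIVE, e_ACTIVE };

struct Advisory {
    std::size_t   partitionId;
    unsigned int  leaseId;
    PrimaryStatus status;
};

struct BufferedAdvisory {
    Advisory    advisory;
    std::string source;  // node that sent the advisory
};

struct PartitionInfo {
    std::string   primary;  // empty until the FSM assigns a primary
    unsigned int  leaseId = 0;
    PrimaryStatus status  = PrimaryStatus::e_PASSIVE;
};

class HealingReplicaSketch {
    std::vector<PartitionInfo>    d_partitions;
    std::vector<BufferedAdvisory> d_buffered;

  public:
    explicit HealingReplicaSketch(std::size_t numPartitions)
    : d_partitions(numPartitions)
    {
    }

    // While healing, advisories are stored instead of being applied.
    void bufferPrimaryStatusAdvisory(const Advisory&    adv,
                                     const std::string& source)
    {
        d_buffered.push_back(BufferedAdvisory{adv, source});
    }

    // Partition assignment comes from the FSM / CSL side, not advisories.
    void setPrimaryForPartition(std::size_t        partitionId,
                                const std::string& node,
                                unsigned int       leaseId)
    {
        assert(partitionId < d_partitions.size());
        d_partitions[partitionId].primary = node;
        d_partitions[partitionId].leaseId = leaseId;
    }

    // Once healed, apply the buffered advisories that are not stale and
    // drop the rest.  Returns the number of advisories applied.
    std::size_t processBufferedAdvisories()
    {
        std::size_t applied = 0;
        for (const BufferedAdvisory& item : d_buffered) {
            const Advisory& adv = item.advisory;
            if (adv.partitionId >= d_partitions.size()) {
                continue;  // unknown partition
            }
            PartitionInfo& pinfo = d_partitions[adv.partitionId];
            if (item.source != pinfo.primary ||
                adv.leaseId != pinfo.leaseId) {
                continue;  // stale: primary node or leaseId mismatch
            }
            pinfo.status = adv.status;
            ++applied;
        }
        d_buffered.clear();
        return applied;
    }

    const PartitionInfo& partition(std::size_t partitionId) const
    {
        return d_partitions.at(partitionId);
    }
};
```

The sketch keeps the two sources separate: assignments only come from setPrimaryForPartition, while a buffered advisory can only change the status of a primary that already matches on node and leaseId.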
Build 250 of commit e5d59b2 has completed with FAILURE
835ba43 to 0116bdf
Build 251 of commit 0116bdf has completed with FAILURE
@dorjesinpo Back to you
06c92b2 to 337db61
Build 256 of commit 337db61 has completed with FAILURE
337db61 to 8059ba6
Build 266 of commit 8059ba6 has completed with FAILURE
one question
}
pinfo.setPrimaryStatus(cit->first.status());
if (bmqp_ctrlmsg::PrimaryStatus::E_ACTIVE == cit->first.status()) {
    d_fileStores[partitionId]->setActivePrimary(
Do we need to
if (allPartitionsAvailable()) {
    d_recoveryStatusCb(0);
here, or is it done implicitly? If it is implicit, maybe a comment should explain that.
Will add a comment explaining how d_recoveryStatusCb is called a bit later.
           << pinfo.primaryLeaseId() << "]";
    continue;  // CONTINUE
}
pinfo.setPrimaryStatus(cit->first.status());
What if you get a series of ACTIVE, then PASSIVE, then ACTIVE buffered advisories? How are you going to process them?
Currently, we do not do anything special when a primary goes ACTIVE -> PASSIVE -> ACTIVE, so we do nothing special here either. Arguably, we can improve the logic, but it's beyond the scope of this PR.
8059ba6 to 9a258b4
Build 297 of commit 9a258b4 has completed with FAILURE
Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
9a258b4 to f6f34d4
Build 301 of commit f6f34d4 has completed with FAILURE
…imary active (bloomberg#416)
* mqbc::StorageMgr: Ban 'processPrimaryStatusAdvisory' in non-FSM mode
  Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
* mqbc::StorageMgr: Transition to available only when all primary active
  Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
* mqbc::StorageMgr: clang-format
  Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
* mqbc::StorageMgr: Healing replica buffers primary status advisories
  Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
* mqbs::FileStore: Rename setPrimary -> setActivePrimary
  Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
* mqbc::StorageMgr: Comment about check if all partitions available
  Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
---------
Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Christopher Beard <[email protected]> fixing Solaris build (bloomberg#434) Signed-off-by: dorjesinpo <[email protected]> Remove `-DBMQ_ENABLE_MSG_GROUPID` from the build system We do not ever want to build with this flag when releasing, and users often manage to enable this flag accidentally. Because message group IDs are not fully implemented, we remove this temporary definition. It can be added in later if we ever come back to this feature. Signed-off-by: Patrick M. Niedzielski <[email protected]> Make unit tests pass without `BMQ_ENABLE_MSG_GROUPID` The unit tests currently assume that message group IDs are enabled, and since have updated our build system to no longer enable this feature, the unit tests now fail in CI. This patch guards the message group ID tests with preprocessor conditionals, disabling the parts of tests that try to set and check message group IDs. When `BMQ_ENABLE_MSG_GROUPID` is set, these parts of the unit tests run again. Signed-off-by: Patrick M. Niedzielski <[email protected]> Fix mqbstat doc formatting (bloomberg#438) Signed-off-by: Christopher Beard <[email protected]> Fix[bmqeval]: limit expression length to avoid stack overflow (bloomberg#441) Signed-off-by: Evgeny Malygin <[email protected]> Fix Solaris unit tests (bloomberg#440) Signed-off-by: Anton Pryakhin <[email protected]> Docs[BMQ]: Use `.dox` files rather than `.md` files Package group documentation in `libbmq` was converted to Markdown files named `README.md`, and which was tied to the directory containing the code for the package group using Doxygen `@dir` commands. However, when generating the documentation, this left several empty pages in the documentation named `README`, which we were not able to remove. The solution for this that this patch uses is to switch from `.md` files to `.dox` files, which contain a single Doxygen-style C++ comment containing the `@dir` command. 
Unlike `.md` files, these do not automatically create pages, so there is no empty `README` page created for each package group. The cost of this is that `.dox` files cannot be simple Markdown files, but instead need to be wrapped in a C++ comment. Signed-off-by: Patrick M. Niedzielski <[email protected]> Docs[BMQ] bde -> doxygen conversion fixes (bloomberg#443) * Doc[BMQT] minor bde -> doxygen docs * Doc[BMQA] minor bde -> doxygen docs * Doc[BMQA] re-wrap data member comments * Doc[BMQT] re-wrap data member comments * Apply suggestions from code review --------- Signed-off-by: Christopher Beard <[email protected]> Signed-off-by: Chris Beard <[email protected]> Co-authored-by: Evgeny Malygin <[email protected]> Feat: track queue depth per appId (bloomberg#320) Signed-off-by: Evgeny Malygin <[email protected]> configurator, bmqit: mode protos (bloomberg#447) Signed-off-by: Jean-Louis Leroy <[email protected]> Revert "configurator, bmqit: mode protos (bloomberg#447)" (bloomberg#449) This reverts commit a4b20db. 
Fix[mqbs_virtualstoragecatalog.cpp]: fix Solaris build (bloomberg#450) Signed-off-by: Evgeny Malygin <[email protected]> fix: configurator: apply app ids (bloomberg#452) Signed-off-by: Jean-Louis Leroy <[email protected]> Fix [MQB]: mqbc::StorageMgr: Transition to available only when all primary active (bloomberg#416) * mqbc::StorageMgr: Ban 'processPrimaryStatusAdvisory' in non-FSM mode Signed-off-by: Yuan Jing Vincent Yan <[email protected]> * mqbc::StorageMgr: Transition to available only when all primary active Signed-off-by: Yuan Jing Vincent Yan <[email protected]> * mqbc::StorageMgr: clang-format Signed-off-by: Yuan Jing Vincent Yan <[email protected]> * mqbc::StorageMgr: Healing replica buffers primary status advisories Signed-off-by: Yuan Jing Vincent Yan <[email protected]> * mqbs::FileStore: Rename setPrimary -> setActivePrimary Signed-off-by: Yuan Jing Vincent Yan <[email protected]> * mqbc::StorageMgr: Comment about check if all partitions available Signed-off-by: Yuan Jing Vincent Yan <[email protected]> --------- Signed-off-by: Yuan Jing Vincent Yan <[email protected]> Fix some compiler warnings in mqb (bloomberg#455) * -Wunused-parameter * -Wshadow * -Wswitch-enum Signed-off-by: Christopher Beard <[email protected]> It: Include full path for admin stat it test failures (bloomberg#453) * It: Include full path for admin stat it test failures This patch makes it a little easier to debug the metric & operation that causes an integration test for stats to fail. 
Signed-off-by: Christopher Beard <[email protected]> * Update src/integration-tests/test_admin_client.py Co-authored-by: Evgeny Malygin <[email protected]> Signed-off-by: Chris Beard <[email protected]> --------- Signed-off-by: Christopher Beard <[email protected]> Signed-off-by: Chris Beard <[email protected]> Co-authored-by: Evgeny Malygin <[email protected]> Feat: Add queue history size metric (bloomberg#436) * [WIP] Feat: Add queue history size metric This adds a new queue metric that counts the number of GUIDs in that queue's history. This is useful for identifying excessive memory utilization from history and potential history garbage collection issues (where history is filled up faster than it's cleaned up). Signed-off-by: Christopher Beard <[email protected]> * It: Extend admin it for history size stat Signed-off-by: Christopher Beard <[email protected]> --------- Signed-off-by: Christopher Beard <[email protected]> Feat[plugins]: report queue depth per appId to prometheus (bloomberg#446) Signed-off-by: Evgeny Malygin <[email protected]> [Fix] m_bmqstoragetool::FileManagerImpl: Asserts not have side effects (bloomberg#461) Signed-off-by: Yuan Jing Vincent Yan <[email protected]> Feat[MQB]: Enhance queue consumption monitor alarm log with additional details (bloomberg#420) Enhance filebackedstorage alarm log Signed-off-by: Aleksandr Ivanov <[email protected]> Cleanup Signed-off-by: Aleksandr Ivanov <[email protected]> Add test to mqbu_capacitymeter.t Signed-off-by: Aleksandr Ivanov <[email protected]> mqbc::StorageUtil, mqbi::StorageMgr: updateQueue -> updateQueuePrimary (bloomberg#466) Signed-off-by: Yuan Jing Vincent Yan <[email protected]> Fix[MQB]: misc warnings (bloomberg#464) Allow dots in subscription property names Message properties allow arbitrary strings for property names, but our subscription expression language is more limited, requiring an initial alphabetic character followed by any number of alphanumeric characters and underscores. 
Some producers have begun using hierarchical message property names, separated by dots (“.”), and are unable to use subscriptions to filter or route according to these message properties. This patch extends the expression language grammar to enable matching on subscription property names with dots in them. This change is a pure extension: the language recognized by the subscription expression grammar after this patch is a strict superset of the language recognized by the subscription expression grammar before this patch. This patch also extends the unit test for the lexer to ensure this is a strict superset. Signed-off-by: Patrick M. Niedzielski <[email protected]> fix: clean app subscriptions on reconfigure Signed-off-by: dorjesinpo <[email protected]> Fix[mqbstat_domainstats.cpp]: empty tier StringRef (bloomberg#431) Signed-off-by: Evgeny Malygin <[email protected]> Fix Solaris build, it does not support ctor delegation Signed-off-by: Aleksandr Ivanov <[email protected]> Doc: Document app subscriptions (bloomberg#463) * Docs upgrade jekyll -> 4.3.3 Signed-off-by: Christopher Beard <[email protected]> * Docs: Document app subscriptions Signed-off-by: Christopher Beard <[email protected]> * Expand on difference in subscriptions Signed-off-by: Christopher Beard <[email protected]> * Minor subscription doc clarifications Signed-off-by: Christopher Beard <[email protected]> * Elaborate on subscription details Signed-off-by: Christopher Beard <[email protected]> * Clarify consumer subscription on broker Signed-off-by: Christopher Beard <[email protected]> --------- Signed-off-by: Christopher Beard <[email protected]> fix: enhanced detection of duplciate PUSHes (bloomberg#472) Signed-off-by: dorjesinpo <[email protected]> Fix ntf-core version in build_darwin.sh Signed-off-by: Aleksandr Ivanov <[email protected]> Add logAppsSubscriptionInfoCb into InMemoryStorage Signed-off-by: Aleksandr Ivanov <[email protected]> Add IT for capacity meter enhanced log Signed-off-by: 
Aleksandr Ivanov <[email protected]> Fix comments Signed-off-by: Aleksandr Ivanov <[email protected]> Fix [CI] ntf-core version for macosx build (bloomberg#473) Merge mwc into bmq MWC or "MiddleWare Core" was a package group developed to support a myriad of applications at Bloomberg. It's been useful to share common middleware components between similar technologies, but doesn't make much sense to support as its own open source library. Moving forward we are merging it into the BMQ package group to better maintain it for the BlazingMQ project. Signed-off-by: Taylor Foxhall <[email protected]> Fix conflict Signed-off-by: Aleksandr Ivanov <[email protected]> Fix conflict Signed-off-by: Aleksandr Ivanov <[email protected]> Fix mwctst Signed-off-by: Aleksandr Ivanov <[email protected]>
Fixes two flaky integration tests in FSM mode:

In test_basic of test_restart.py, there was an issue where a replica could advertise availability before all primaries were active; a proxy could then reopen a queue and post messages to no avail. The fix is to transition to available only when all primaries are active.

In test_kill_post_start of test_strong_consistency.py, there was an issue where replicas were not issuing receipts to the primary after restart. This was because healing replicas in FSM mode were not buffering primary status advisories to process later, and thus were not setting the correct primary in the FileStore. After I added the buffering logic, the tests pass.
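The first fix can be pictured with a small sketch. It reuses the names allPartitionsAvailable and d_recoveryStatusCb from the review discussion, but the class `AvailabilitySketch`, its constructor, and the notify-once flag are hypothetical simplifications and not the actual mqbc::StorageMgr code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in; the real enum lives in bmqp_ctrlmsg.
enum class PrimaryStatus { e_UNDEFINED, e_PASSIVE, e_ACTIVE };

class AvailabilitySketch {
    std::vector<PrimaryStatus> d_status;            // one entry per partition
    std::function<void(int)>   d_recoveryStatusCb;  // invoked with rc 0
    bool                       d_notified;          // assumption: fire once

  public:
    AvailabilitySketch(std::size_t              numPartitions,
                       std::function<void(int)> recoveryStatusCb)
    : d_status(numPartitions, PrimaryStatus::e_UNDEFINED)
    , d_recoveryStatusCb(std::move(recoveryStatusCb))
    , d_notified(false)
    {
    }

    // Available only when every partition has an ACTIVE primary.
    bool allPartitionsAvailable() const
    {
        return std::all_of(d_status.begin(),
                           d_status.end(),
                           [](PrimaryStatus s) {
                               return s == PrimaryStatus::e_ACTIVE;
                           });
    }

    void setPrimaryStatus(std::size_t partitionId, PrimaryStatus status)
    {
        assert(partitionId < d_status.size());
        d_status[partitionId] = status;
        // Advertise availability only after the last primary turns active.
        if (!d_notified && allPartitionsAvailable()) {
            d_notified = true;
            d_recoveryStatusCb(0);
        }
    }
};
```

With two partitions, the callback in this sketch stays silent while either primary is passive or undefined, and fires once when the second one becomes active.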