Fix [MQB]: mqbc::StorageMgr: Transition to available only when all primary active #416

Merged

Conversation

kaikulimu
Collaborator

@kaikulimu kaikulimu commented Sep 4, 2024

Fixes two flaky integration tests in FSM mode:

  1. In test_basic of test_restart.py, a replica could advertise availability before all primaries were active; a proxy could then reopen a queue and post messages to no avail. The fix is to transition to available only when all primaries are active.

  2. In test_kill_post_start of test_strong_consistency.py, replicas were not issuing receipts to the primary after restart. This was because healing replicas in FSM mode were not buffering primary status advisories to process later, and thus were not setting the correct primary in the FileStore. With the buffering logic added, the tests pass.
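To illustrate the first fix, here is a minimal sketch of an "available only when all primaries are active" check. It is not the code from this PR: `PartitionInfo`, `PrimaryStatus` and `allPrimariesActive` are hypothetical stand-ins for the real `mqbc`/`bmqp_ctrlmsg` types, and treating an empty partition list as "not available" is an assumption of the sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins, not the real bmqp_ctrlmsg / mqbc types.
enum class PrimaryStatus { e_UNDEFINED, e_PASSIVE, e_ACTIVE };

struct PartitionInfo {
    bool          hasPrimary = false;
    PrimaryStatus status     = PrimaryStatus::e_UNDEFINED;
};

// Return true only if every partition has a primary and that primary is
// active.  An empty partition list is treated as "not available"
// (an assumption made for this sketch).
bool allPrimariesActive(const std::vector<PartitionInfo>& partitions)
{
    for (const PartitionInfo& p : partitions) {
        if (!p.hasPrimary || p.status != PrimaryStatus::e_ACTIVE) {
            return false;
        }
    }
    return !partitions.empty();
}
```

A node following this rule would advertise availability only once the check returns true, so a partition with a passive or missing primary keeps the whole node unavailable.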

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch 3 times, most recently from 80f71b6 to e5d59b2 Compare September 10, 2024 21:54
@kaikulimu kaikulimu marked this pull request as ready for review September 11, 2024 13:59
@kaikulimu kaikulimu requested a review from a team as a code owner September 11, 2024 13:59
Collaborator

@dorjesinpo dorjesinpo left a comment

Couple questions...


pinfo.setPrimaryStatus(value);
if (bmqp_ctrlmsg::PrimaryStatus::E_ACTIVE == value) {
d_fileStores[partitionId]->setPrimary(pinfo.primary(),
Collaborator

why is this in setPrimaryStatusForPartition and not in setPrimaryForPartition?

Collaborator Author

FileStore should only set primary after the primary becomes active. The FileStore is unable to work with a passive primary. I will rename the function as FileStore::setActivePrimary to make the point clear.
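A rough sketch of that rule, for readers without the diff open: the partition records every status change, but the file store only learns about the primary once it is active. `FileStoreSketch`, `PartitionInfoSketch` and `setPrimaryStatus` are hypothetical names for illustration and do not reproduce the real `mqbs::FileStore` interface.

```cpp
#include <cassert>
#include <string>

// All names below are hypothetical stand-ins, not the real mqbs/mqbc types.
enum class PrimaryStatus { e_PASSIVE, e_ACTIVE };

struct FileStoreSketch {
    std::string d_primary;      // empty until an active primary is set
    unsigned    d_leaseId = 0;

    // Record the primary.  Callers must only invoke this once the primary
    // is active; the store cannot work with a passive primary.
    void setActivePrimary(const std::string& primary, unsigned leaseId)
    {
        assert(!primary.empty());
        d_primary = primary;
        d_leaseId = leaseId;
    }
};

struct PartitionInfoSketch {
    std::string   d_primary;
    unsigned      d_leaseId = 0;
    PrimaryStatus d_status  = PrimaryStatus::e_PASSIVE;
};

// Always record the new status on the partition, but forward the primary
// to the file store only when the status is ACTIVE.
void setPrimaryStatus(PartitionInfoSketch& pinfo,
                      FileStoreSketch&     fs,
                      PrimaryStatus        value)
{
    pinfo.d_status = value;
    if (PrimaryStatus::e_ACTIVE == value) {
        fs.setActivePrimary(pinfo.d_primary, pinfo.d_leaseId);
    }
}
```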

<< " primary, this advisory could "
<< "be from the true one. Will"
<< " buffer the advisory for now.";
d_storageManager_p->bufferPrimaryStatusAdvisory(primaryAdv,
Collaborator

Question: does pinfo.primaryNode() get assigned upon PrimaryStatusAdvisory, or is there another trigger in the FSM flow? If we receive a PrimaryStatusAdvisory and do not have a partition primary, why not assign the primary then?

Collaborator Author

@kaikulimu kaikulimu Sep 11, 2024

In FSM mode, the source of truth for partition assignments is the cluster state snapshot of the CSL file. As part of healing, a new leader assigns partitions and then applies the assignments in its first CSL advisory. Primary status advisories can be stale; that's why we have a lot of checks in this function in the first place. My original idea was to simply ignore all primary status advisories and rely purely upon the FSM for partition assignments. However, the FSM can heal a replica but neglect to set a primary as active. Thus, I came up with the idea of buffering primary status advisories. If an advisory is not stale (i.e. it matches the primary node and leaseId), then we trust the availability advisory.
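The buffering idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `Advisory`, `PartitionState` and `AdvisoryBuffer` are hypothetical names, and the staleness test is reduced to the two fields mentioned above (primary node and leaseId).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the real bmqp_ctrlmsg / mqbc types.
enum class PrimaryStatus { e_PASSIVE, e_ACTIVE };

struct Advisory {
    int           partitionId;
    std::string   primaryNode;
    unsigned      leaseId;
    PrimaryStatus status;
};

struct PartitionState {
    std::string   primaryNode;  // assigned by the FSM, not by advisories
    unsigned      leaseId = 0;
    PrimaryStatus status  = PrimaryStatus::e_PASSIVE;
};

class AdvisoryBuffer {
    std::vector<Advisory> d_buffered;

  public:
    // Hold an advisory that arrived while the replica was still healing.
    void buffer(const Advisory& adv) { d_buffered.push_back(adv); }

    // Once healed, apply the buffered advisories in arrival order.  An
    // advisory is skipped as stale unless both its primary node and its
    // leaseId match what the FSM assigned.  Return the number applied.
    int flush(std::vector<PartitionState>& partitions)
    {
        int applied = 0;
        for (const Advisory& adv : d_buffered) {
            if (adv.partitionId < 0 ||
                adv.partitionId >= static_cast<int>(partitions.size())) {
                continue;  // unknown partition
            }
            PartitionState& pinfo = partitions[adv.partitionId];
            if (adv.primaryNode != pinfo.primaryNode ||
                adv.leaseId != pinfo.leaseId) {
                continue;  // stale advisory
            }
            pinfo.status = adv.status;
            ++applied;
        }
        d_buffered.clear();
        return applied;
    }
};
```

In this sketch the advisory never assigns the primary itself; it only updates the status of a primary that the FSM already assigned.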

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 250 of commit e5d59b2 has completed with FAILURE

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch from 835ba43 to 0116bdf Compare September 11, 2024 22:22
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 251 of commit 0116bdf has completed with FAILURE

@kaikulimu
Collaborator Author

@dorjesinpo Back to you

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch 2 times, most recently from 06c92b2 to 337db61 Compare September 17, 2024 10:29
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 256 of commit 337db61 has completed with FAILURE

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch from 337db61 to 8059ba6 Compare September 19, 2024 14:31
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 266 of commit 8059ba6 has completed with FAILURE

Collaborator

@dorjesinpo dorjesinpo left a comment

one question

}
pinfo.setPrimaryStatus(cit->first.status());
if (bmqp_ctrlmsg::PrimaryStatus::E_ACTIVE == cit->first.status()) {
d_fileStores[partitionId]->setActivePrimary(
Collaborator

Do we need to

  if(allParitionsAvailable()) {
            d_recoveryStatusCb(0);

or is it done implicitly and if so, maybe comment should explain that

Collaborator Author

Will add a comment explaining how d_recoveryStatusCb is called a bit later.

<< pinfo.primaryLeaseId() << "]";
continue; // CONTINUE
}
pinfo.setPrimaryStatus(cit->first.status());
Collaborator Author

What if you get a series of ACTIVE, then PASSIVE, then ACTIVE buffered advisories? How are you going to process them?

Collaborator Author

Currently, we do not do anything special when a primary goes ACTIVE -> PASSIVE -> ACTIVE. Thus, we do nothing special here either. Arguably, we can improve the logic, but it's beyond the scope of this PR.

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch from 8059ba6 to 9a258b4 Compare October 8, 2024 21:57
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 297 of commit 9a258b4 has completed with FAILURE

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch from 9a258b4 to f6f34d4 Compare October 9, 2024 17:51
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 301 of commit f6f34d4 has completed with FAILURE

@kaikulimu kaikulimu merged commit 060cd9f into bloomberg:main Oct 10, 2024
34 of 35 checks passed
@kaikulimu kaikulimu deleted the fsm-test--available-only-active-primary branch October 10, 2024 14:59
alexander-e1off pushed a commit to alexander-e1off/blazingmq that referenced this pull request Oct 24, 2024
…imary active (bloomberg#416)

* mqbc::StorageMgr: Ban 'processPrimaryStatusAdvisory' in non-FSM mode

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: Transition to available only when all primary active

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: clang-format

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: Healing replica buffers primary status advisories

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbs::FileStore: Rename setPrimary -> setActivePrimary

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: Comment about check if all partitions available

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

---------

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
alexander-e1off pushed a commit to alexander-e1off/blazingmq that referenced this pull request Oct 24, 2024
Signed-off-by: Christopher Beard <[email protected]>

fixing Solaris build (bloomberg#434)

Signed-off-by: dorjesinpo <[email protected]>

Remove `-DBMQ_ENABLE_MSG_GROUPID` from the build system

We do not ever want to build with this flag when releasing, and users
often manage to enable this flag accidentally.  Because message group
IDs are not fully implemented, we remove this temporary definition.  It
can be added in later if we ever come back to this feature.

Signed-off-by: Patrick M. Niedzielski <[email protected]>

Make unit tests pass without `BMQ_ENABLE_MSG_GROUPID`

The unit tests currently assume that message group IDs are enabled, and
since we have updated our build system to no longer enable this feature,
the unit tests now fail in CI.  This patch guards the message group ID
tests with preprocessor conditionals, disabling the parts of tests that
try to set and check message group IDs.  When `BMQ_ENABLE_MSG_GROUPID`
is set, these parts of the unit tests run again.

Signed-off-by: Patrick M. Niedzielski <[email protected]>

Fix mqbstat doc formatting (bloomberg#438)

Signed-off-by: Christopher Beard <[email protected]>

Fix[bmqeval]: limit expression length to avoid stack overflow (bloomberg#441)

Signed-off-by: Evgeny Malygin <[email protected]>

Fix Solaris unit tests (bloomberg#440)

Signed-off-by: Anton Pryakhin <[email protected]>

Docs[BMQ]: Use `.dox` files rather than `.md` files

Package group documentation in `libbmq` was converted to Markdown files
named `README.md`, and which was tied to the directory containing the
code for the package group using Doxygen `@dir` commands.  However, when
generating the documentation, this left several empty pages in the
documentation named `README`, which we were not able to remove.

The solution for this that this patch uses is to switch from `.md` files
to `.dox` files, which contain a single Doxygen-style C++ comment
containing the `@dir` command.  Unlike `.md` files, these do not
automatically create pages, so there is no empty `README` page created
for each package group.  The cost of this is that `.dox` files cannot be
simple Markdown files, but instead need to be wrapped in a C++ comment.

Signed-off-by: Patrick M. Niedzielski <[email protected]>

Docs[BMQ] bde -> doxygen conversion fixes (bloomberg#443)

* Doc[BMQT] minor bde -> doxygen docs

* Doc[BMQA] minor bde -> doxygen docs

* Doc[BMQA] re-wrap data member comments

* Doc[BMQT] re-wrap data member comments

* Apply suggestions from code review

---------

Signed-off-by: Christopher Beard <[email protected]>
Signed-off-by: Chris Beard <[email protected]>
Co-authored-by: Evgeny Malygin <[email protected]>

Feat: track queue depth per appId (bloomberg#320)

Signed-off-by: Evgeny Malygin <[email protected]>

configurator, bmqit: mode protos (bloomberg#447)

Signed-off-by: Jean-Louis Leroy <[email protected]>

Revert "configurator, bmqit: mode protos (bloomberg#447)" (bloomberg#449)

This reverts commit a4b20db.

Fix[mqbs_virtualstoragecatalog.cpp]: fix Solaris build (bloomberg#450)

Signed-off-by: Evgeny Malygin <[email protected]>

fix: configurator: apply app ids (bloomberg#452)

Signed-off-by: Jean-Louis Leroy <[email protected]>

Fix [MQB]: mqbc::StorageMgr: Transition to available only when all primary active (bloomberg#416)

* mqbc::StorageMgr: Ban 'processPrimaryStatusAdvisory' in non-FSM mode

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: Transition to available only when all primary active

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: clang-format

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: Healing replica buffers primary status advisories

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbs::FileStore: Rename setPrimary -> setActivePrimary

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

* mqbc::StorageMgr: Comment about check if all partitions available

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

---------

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

Fix some compiler warnings in mqb (bloomberg#455)

* -Wunused-parameter
* -Wshadow
* -Wswitch-enum

Signed-off-by: Christopher Beard <[email protected]>

It: Include full path for admin stat it test failures (bloomberg#453)

* It: Include full path for admin stat it test failures

This patch makes it a little easier to debug the metric & operation that
causes an integration test for stats to fail.

Signed-off-by: Christopher Beard <[email protected]>

* Update src/integration-tests/test_admin_client.py

Co-authored-by: Evgeny Malygin <[email protected]>
Signed-off-by: Chris Beard <[email protected]>

---------

Signed-off-by: Christopher Beard <[email protected]>
Signed-off-by: Chris Beard <[email protected]>
Co-authored-by: Evgeny Malygin <[email protected]>

Feat: Add queue history size metric (bloomberg#436)

* [WIP] Feat: Add queue history size metric

This adds a new queue metric that counts the number of GUIDs in that
queue's history. This is useful for identifying excessive memory
utilization from history and potential history garbage collection issues
(where history is filled up faster than it's cleaned up).

Signed-off-by: Christopher Beard <[email protected]>

* It: Extend admin it for history size stat

Signed-off-by: Christopher Beard <[email protected]>

---------

Signed-off-by: Christopher Beard <[email protected]>

Feat[plugins]: report queue depth per appId to prometheus (bloomberg#446)

Signed-off-by: Evgeny Malygin <[email protected]>

[Fix] m_bmqstoragetool::FileManagerImpl: Asserts not have side effects (bloomberg#461)

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

Feat[MQB]: Enhance queue consumption monitor alarm log with additional details (bloomberg#420)

Enhance filebackedstorage alarm log

Signed-off-by: Aleksandr Ivanov <[email protected]>

Cleanup

Signed-off-by: Aleksandr Ivanov <[email protected]>

Add test to mqbu_capacitymeter.t

Signed-off-by: Aleksandr Ivanov <[email protected]>

mqbc::StorageUtil, mqbi::StorageMgr: updateQueue -> updateQueuePrimary (bloomberg#466)

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>

Fix[MQB]: misc warnings (bloomberg#464)

Allow dots in subscription property names

Message properties allow arbitrary strings for property names, but our
subscription expression language is more limited, requiring an initial
alphabetic character followed by any number of alphanumeric characters
and underscores.  Some producers have begun using hierarchical message
property names, separated by dots (“.”), and are unable to use
subscriptions to filter or route according to these message properties.

This patch extends the expression language grammar to enable matching on
subscription property names with dots in them.  This change is a pure
extension: the language recognized by the subscription expression grammar
after this patch is a strict superset of the language recognized by the
subscription expression grammar before this patch.  This patch also
extends the unit test for the lexer to ensure this is a strict superset.

Signed-off-by: Patrick M. Niedzielski <[email protected]>

fix: clean app subscriptions on reconfigure

Signed-off-by: dorjesinpo <[email protected]>

Fix[mqbstat_domainstats.cpp]: empty tier StringRef (bloomberg#431)

Signed-off-by: Evgeny Malygin <[email protected]>

Fix Solaris build, it does not support ctor delegation

Signed-off-by: Aleksandr Ivanov <[email protected]>

Doc: Document app subscriptions (bloomberg#463)

* Docs upgrade jekyll -> 4.3.3

Signed-off-by: Christopher Beard <[email protected]>

* Docs: Document app subscriptions

Signed-off-by: Christopher Beard <[email protected]>

* Expand on difference in subscriptions

Signed-off-by: Christopher Beard <[email protected]>

* Minor subscription doc clarifications

Signed-off-by: Christopher Beard <[email protected]>

* Elaborate on subscription details

Signed-off-by: Christopher Beard <[email protected]>

* Clarify consumer subscription on broker

Signed-off-by: Christopher Beard <[email protected]>

---------

Signed-off-by: Christopher Beard <[email protected]>

fix: enhanced detection of duplicate PUSHes (bloomberg#472)

Signed-off-by: dorjesinpo <[email protected]>

Fix ntf-core version in build_darwin.sh

Signed-off-by: Aleksandr Ivanov <[email protected]>

Add logAppsSubscriptionInfoCb into InMemoryStorage

Signed-off-by: Aleksandr Ivanov <[email protected]>

Add IT for capacity meter enhanced log

Signed-off-by: Aleksandr Ivanov <[email protected]>

Fix comments

Signed-off-by: Aleksandr Ivanov <[email protected]>

Fix [CI] ntf-core version for macosx build (bloomberg#473)

Merge mwc into bmq

MWC or "MiddleWare Core" was a package group developed to support
a myriad of applications at Bloomberg. It's been useful to share
common middleware components between similar technologies, but doesn't
make much sense to support as its own open source library. Moving
forward we are merging it into the BMQ package group to better maintain
it for the BlazingMQ project.

Signed-off-by: Taylor Foxhall <[email protected]>

Fix conflict

Signed-off-by: Aleksandr Ivanov <[email protected]>

Fix conflict

Signed-off-by: Aleksandr Ivanov <[email protected]>

Fix mwctst

Signed-off-by: Aleksandr Ivanov <[email protected]>