
Fix for race condition in node-join/node-left loop #15521

Open
wants to merge 33 commits into
base: main

Conversation

@rahulkarajgikar (Contributor) commented Aug 30, 2024

Description

Fix for race condition in node-join/node-left loop.

Scenario where race condition can happen:

  • Suppose a node disconnects from the cluster due to some normal reason.
  • This queues a node-left task on the cluster manager thread.
  • The cluster manager then computes the new cluster state based on the node-left task.
  • The cluster manager now tries to send the new state to all the nodes and waits for all nodes to ack back.
  • Suppose this takes a long time due to lagging nodes or slow applying of the state or any other reason.
  • While this is happening, the node that just left sends a join request to the cluster manager to rejoin the cluster. (This happens in a loop at the transport layer, with the frequency controlled by the discovery.find_peers_interval setting.)
  • The role of this join request is to re-establish any required connections and do some pre-validations before queuing a new task.
  • After join request is validated by cluster manager node, cluster manager queues a node-join task into its thread.
  • This node-join task only starts after the node-left task completes, since the cluster manager applies tasks on a single thread.
  • Now suppose the node-left task has completed publication and has started to apply the new state on the cluster manager.
  • As part of applying the cluster state of node-left task, cluster manager wipes out the connection to the leaving node.
  • The node-left task then completes and the node-join task begins. This task assumes that, because the earlier join request succeeded, the connection to the joining node is still present.
  • So then the cluster manager computes the new state.
  • Then it tells the FollowersChecker thread to add this new node.
  • Then it tries to publish the new state to all the nodes.
  • However, at this point the FollowersChecker thread fails with NodeNotConnectedException because the connection was wiped, and it triggers a new node-left task.
  • If the new node-left task also takes time, we end up in an infinite loop of node-left and node-join tasks.
  • Even if the FollowersChecker is modified to handle this NodeNotConnectedException gracefully without triggering a node-left task, the state publication to the joining node still fails because the connection was wiped. So the node-join task never completes on the joining node, and it remains in candidate phase forever.

To summarise: if a node-join task enters the queue before the node-left task has disconnected from the node, the race condition occurs.
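The ordering problem above can be sketched in a few lines. This is an illustrative toy model, not actual OpenSearch code: the connection map, method names, and string results are invented for the example. It shows why join pre-validation succeeding is not enough — the node-left task runs first on the single-threaded queue and wipes the connection before the node-join task executes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race (names are illustrative, not the OpenSearch API).
public class NodeJoinRaceSketch {
    // Stands in for the cluster manager's connection book-keeping.
    static final Map<String, Boolean> connections = new ConcurrentHashMap<>();

    // Join pre-validation: sees the connection left over from the old membership.
    static boolean validateJoinRequest(String node) {
        return connections.containsKey(node);
    }

    // Applying the node-left cluster state wipes the connection to the leaving node.
    static void runNodeLeftTask(String node) {
        connections.remove(node);
    }

    // The node-join task assumes the connection validated earlier still exists.
    static String runNodeJoinTask(String node) {
        if (!connections.containsKey(node)) {
            return "NodeNotConnectedException"; // would trigger a fresh node-left task
        }
        return "joined";
    }

    public static void main(String[] args) {
        connections.put("node-1", true);
        boolean validated = validateJoinRequest("node-1"); // true: node-join task is queued
        runNodeLeftTask("node-1");                         // runs first, wipes the connection
        String result = runNodeJoinTask("node-1");         // then the join runs and fails
        System.out.println(validated + " " + result);      // prints: true NodeNotConnectedException
    }
}
```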

Fix:

As part of the fix, we now reject the initial join request from a node that has an ongoing node-left task.
The join request only succeeds after the node-left task finishes committing its state on the cluster manager, so the connection created during the join request is not wiped out and the node-join task does not fail.

This is done by marking nodes as pending disconnect right before the node-left task publishes its state. We mark the disconnect as completed after the node-left task commits its state, or on re-election of the cluster manager.

If the cluster manager tries to open a connection to a node marked pending disconnect during this window, the connection request is rejected. This blocks join requests, because during a join request the cluster manager tries to connect to the node attempting to join.

The join request will keep retrying, and once the node-left succeeds, the join request will be able to make a connection and succeed.
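The pending-disconnect book-keeping described above might look roughly like the following. This is a minimal sketch under assumptions: the class and method names are invented for illustration and do not match the actual ClusterConnectionManager API; the real change also coordinates with publish/commit callbacks and re-election handling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of the fix (illustrative names, not the OpenSearch API):
// track nodes with an in-flight node-left and reject connection attempts
// to them until the node-left state is committed.
public class PendingDisconnectSketch {
    private final Set<String> pendingDisconnects = ConcurrentHashMap.newKeySet();

    // Called just before the node-left task publishes the new state.
    void markPendingDisconnect(String nodeId) {
        pendingDisconnects.add(nodeId);
    }

    // Called after the node-left state commits, or on cluster manager re-election.
    void clearPendingDisconnect(String nodeId) {
        pendingDisconnects.remove(nodeId);
    }

    // Join requests make the cluster manager connect to the joining node;
    // reject while a disconnect for that node is still in flight, so the
    // joining node's retry loop tries again after the node-left completes.
    boolean tryConnect(String nodeId) {
        if (pendingDisconnects.contains(nodeId)) {
            return false; // connection (and hence the join request) is rejected
        }
        return true; // safe: no node-left in flight for this node
    }
}
```

Because the joining node retries on the discovery.find_peers_interval loop, a rejected connection is not fatal: once the node-left commit clears the pending-disconnect mark, the next retry connects and the join proceeds.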

Main classes:

  • Coordinator - where publication begins
  • ClusterConnectionManager - Low-level connection class with connection logic and book-keeping. Used by both TransportService and NodeConnectionsService. Core changes are made here.
  • ClusterApplierService - entry point of node connects/disconnects in cluster state commit flow.
  • NodeConnectionsService - abstraction used only by ClusterApplierService to handle connections/disconnections.
  • TransportService - called by Coordinator and NodeConnectionsService to connect/disconnect.
  • NodeJoinLeftIT - Integration test that simulates the issue.

Related Issues

Resolves #4874

Check List

  • Functionality includes testing.
  • [N/A] API changes companion pull request created, if applicable.
  • [N/A] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented:

❌ Gradle check result for 9496aa1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


@rahulkarajgikar (Contributor, Author) commented:

rebased main


@rahulkarajgikar (Contributor, Author) commented:

force pushed to rebase from main


Rahul Karajgikar added 25 commits September 19, 2024 22:44
Signed-off-by: Rahul Karajgikar <[email protected]>
github-actions bot commented:

❕ Gradle check result for a107371: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Labels: bug (Something isn't working), Cluster Manager

Successfully merging this pull request may close these issues.

[BUG] Race in node-left and node-join can prevent node from joining the cluster indefinitely