
Fix for race condition in node-join/node-left loop #15521

Open
wants to merge 33 commits into
base: main

Conversation

@rahulkarajgikar (Contributor) commented Aug 30, 2024

Description

Fix for race condition in node-join/node-left loop.

Scenario where race condition can happen:

  • Suppose a node disconnects from the cluster due to some normal reason.
  • This queues a node-left task on the cluster manager thread.
  • The cluster manager then computes the new cluster state based on the node-left task.
  • The cluster manager now tries to send the new state to all the nodes and waits for all nodes to ack back.
  • Suppose this takes a long time due to lagging nodes or slow applying of the state or any other reason.
  • While this is happening, the node that just left sends a join request to the cluster manager to rejoin the cluster. (This happens in a loop at the transport layer, with the frequency controlled by the discovery.find_peers_interval setting.)
  • The role of this join request is to re-establish any required connections and do some pre-validations before queuing a new task.
  • After join request is validated by cluster manager node, cluster manager queues a node-join task into its thread.
  • This node-join task only starts after the node-left task completes, since the cluster manager applies tasks on a single thread.
  • Now suppose the node-left task has completed publication and has started to apply the new state on the cluster manager.
  • As part of applying the cluster state of node-left task, cluster manager wipes out the connection to the leaving node.
  • The node-left task then completes and the node-join task begins. This task assumes that, because the earlier join request succeeded, the connection to the joining node is still present.
  • So then the cluster manager computes the new state.
  • Then it tells the FollowersChecker thread to add this new node.
  • Then it tries to publish the new state to all the nodes.
  • However, at this point the FollowersChecker thread fails with NodeNotConnectedException because the connection was wiped, and it triggers a new node-left task.
  • If the new node-left task also takes time, we end up in an infinite loop of node-left and node-join tasks.
  • Even if the FollowersChecker is modified to handle this NodeNotConnectedException gracefully without triggering a node-left task, the state publication to the joining node still fails because the connection was wiped. So the node-join task never completes on the joining node, and it remains in candidate phase forever.

To summarise: if a node-join task enters the queue before the node-left task has disconnected from the node, the race condition occurs.
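The ordering problem above can be sketched in a few lines. This is an illustrative toy model, not actual OpenSearch code: the connection map, method names, and string results are invented for the example. It shows why join pre-validation succeeding is not enough — the node-left task runs first on the single-threaded queue and wipes the connection before the node-join task executes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race (names are illustrative, not the OpenSearch API).
public class NodeJoinRaceSketch {
    // Stands in for the cluster manager's connection book-keeping.
    static final Map<String, Boolean> connections = new ConcurrentHashMap<>();

    // Join pre-validation: sees the connection left over from the old membership.
    static boolean validateJoinRequest(String node) {
        return connections.containsKey(node);
    }

    // Applying the node-left cluster state wipes the connection to the leaving node.
    static void runNodeLeftTask(String node) {
        connections.remove(node);
    }

    // The node-join task assumes the connection validated earlier still exists.
    static String runNodeJoinTask(String node) {
        if (!connections.containsKey(node)) {
            return "NodeNotConnectedException"; // would trigger a fresh node-left task
        }
        return "joined";
    }

    public static void main(String[] args) {
        connections.put("node-1", true);
        boolean validated = validateJoinRequest("node-1"); // true: node-join task is queued
        runNodeLeftTask("node-1");                         // runs first, wipes the connection
        String result = runNodeJoinTask("node-1");         // then the join runs and fails
        System.out.println(validated + " " + result);      // prints: true NodeNotConnectedException
    }
}
```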

Fix:

As part of the fix, we now reject the initial join request from a node that has an ongoing node-left task.
The join request only succeeds after the node-left task finishes committing its state on the cluster manager, so the connection created during the join request is not wiped out and the node-join task does not fail.

This is done by marking nodes as pending disconnect right before the node-left task publishes its state. We mark the disconnect as completed after the node-left task commits its state, or on re-election of the cluster manager.

If the cluster manager tries to open a connection to a node marked pending disconnect during this window, the connection request is rejected. This blocks join requests, because during a join request the cluster manager tries to connect to the node attempting to join.

The join request will keep retrying, and once the node-left succeeds, the join request will be able to make a connection and succeed.
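The pending-disconnect book-keeping described above might look roughly like the following. This is a minimal sketch under assumptions: the class and method names are invented for illustration and do not match the actual ClusterConnectionManager API; the real change also coordinates with publish/commit callbacks and re-election handling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of the fix (illustrative names, not the OpenSearch API):
// track nodes with an in-flight node-left and reject connection attempts
// to them until the node-left state is committed.
public class PendingDisconnectSketch {
    private final Set<String> pendingDisconnects = ConcurrentHashMap.newKeySet();

    // Called just before the node-left task publishes the new state.
    void markPendingDisconnect(String nodeId) {
        pendingDisconnects.add(nodeId);
    }

    // Called after the node-left state commits, or on cluster manager re-election.
    void clearPendingDisconnect(String nodeId) {
        pendingDisconnects.remove(nodeId);
    }

    // Join requests make the cluster manager connect to the joining node;
    // reject while a disconnect for that node is still in flight, so the
    // joining node's retry loop tries again after the node-left completes.
    boolean tryConnect(String nodeId) {
        if (pendingDisconnects.contains(nodeId)) {
            return false; // connection (and hence the join request) is rejected
        }
        return true; // safe: no node-left in flight for this node
    }
}
```

Because the joining node retries on the discovery.find_peers_interval loop, a rejected connection is not fatal: once the node-left commit clears the pending-disconnect mark, the next retry connects and the join proceeds.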

Main classes:

  • Coordinator - where publication begins
  • ClusterConnectionManager - Low-level connection class with connection logic and book-keeping. Used by both TransportService and NodeConnectionsService. Core changes are made here.
  • ClusterApplierService - entry point of node connects/disconnects in cluster state commit flow.
  • NodeConnectionsService - abstraction used only by ClusterApplierService to handle connections/disconnections.
  • TransportService - called by Coordinator and NodeConnectionsService to connect/disconnect.
  • NodeJoinLeftIT - Integration test that simulates the issue.

Related Issues

Resolves #4874

Check List

  • Functionality includes testing.
  • [N/A] API changes companion pull request created, if applicable.
  • [N/A] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented:

❌ Gradle check result for 9496aa1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


@rahulkarajgikar (Contributor, Author) commented:

rebased main


@rahulkarajgikar (Contributor, Author) commented:

force pushed to rebase from main


Rahul Karajgikar added 25 commits September 19, 2024 22:44
Signed-off-by: Rahul Karajgikar <[email protected]>
github-actions bot commented:

❕ Gradle check result for a107371: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Labels: bug (Something isn't working), Cluster Manager

Successfully merging this pull request may close these issues.

[BUG] Race in node-left and node-join can prevent node from joining the cluster indefinitely