Fair redistribution RATIS_THREE pipeline mechanism #6248

Montura · 2024-02-21T11:05:02Z

There is a design doc that says:
"For example, with 5 [datanodes] and replication factor of 3, you would end up with 2 datanode not being a part of any pipeline."

This prevents pipelines from being distributed fairly among the datanodes in the cluster, especially when datanodes become ready one by one.

I'm thinking of an approach for the cluster startup scenario only:

Ratis pipelines are created greedily on some datanode triplet. When heartbeats from all 3 datanodes are received, the pipeline becomes OPEN. There is then a window in which some OPEN pipelines exist but clients have not yet started writing data.
I assume we could close these OPEN pipelines (the ones not yet involved in data writing) so that BackgroundPipelineCreator can create pipelines again and include the datanodes that started later (I've talked about these specific nodes in the previous message).

Some thoughts:

Maybe PipelineManagerImpl::scrubPipelines, which collects pipelines that stay ALLOCATED or CLOSED for too long, could be tuned or given an option (like OZONE_SCM_PIPELINE_ALLOCATED_TIMEOUT) so that it also collects OPEN pipelines without containers?

Example: a cluster of 5 datanodes where the first 3 start up fast and the other 2 start later, with OZONE_DATANODE_PIPELINE_LIMIT = 10:

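Current pipeline distribution (across DNs 1-2-3):

| Datanode | Pipeline Count |
|----------|----------------|
| 1 | 10 |
| 2 | 10 |
| 3 | 10 |
| 4 | 0 |
| 5 | 0 |

| DN set | Pipeline Count | Total Pipelines (cumulative) |
|--------|----------------|------------------------------|
| 1-2-3 | 10 | 10 |
| 1-2-4 | 0 | 10 |
| 1-2-5 | 0 | 10 |
| 1-3-4 | 0 | 10 |
| 1-3-5 | 0 | 10 |
| 1-4-5 | 0 | 10 |
| 2-3-4 | 0 | 10 |
| 2-3-5 | 0 | 10 |
| 2-4-5 | 0 | 10 |
| 3-4-5 | 0 | 10 |

Expected pipeline distribution (across DNs 1-2-3-4-5):

| Datanode | Pipeline Count |
|----------|----------------|
| 1 | 10 |
| 2 | 10 |
| 3 | 10 |
| 4 | 10 |
| 5 | 8 |

| DN set | Pipeline Count | Total Pipelines (cumulative) |
|--------|----------------|------------------------------|
| 1-2-3 | 2 | 2 |
| 1-2-4 | 2 | 4 |
| 1-2-5 | 2 | 6 |
| 1-3-4 | 2 | 8 |
| 1-3-5 | 1 | 9 |
| 1-4-5 | 1 | 10 |
| 2-3-4 | 2 | 12 |
| 2-3-5 | 1 | 13 |
| 2-4-5 | 1 | 14 |
| 3-4-5 | 2 | 16 |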

The total pipeline count could be 16 instead of 10, and all 5 datanodes would be utilized as much as possible.
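
The figure of 16 can be checked with a little arithmetic. The sketch below is a hypothetical helper, not part of Ozone: `PipelineCapacity`, `maxPipelines`, `perDatanode` and `expectedTable` are names made up for illustration. It assumes every datanode has the same pipeline limit and that a pipeline takes one slot on each of its 3 members, so at most floor(5 * 10 / 3) = 16 pipelines fit. It also sums the expected DN-set table back into per-datanode counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for checking the arithmetic above; not part of Ozone.
class PipelineCapacity {

  // Upper bound: each pipeline takes one of the `limit` slots on each of its
  // `factor` member datanodes, so at most floor(datanodes * limit / factor) fit.
  static int maxPipelines(int datanodes, int limit, int factor) {
    if (datanodes < factor) {
      return 0; // too few nodes to form even one pipeline
    }
    return datanodes * limit / factor;
  }

  // Sums a "DN set -> pipeline count" table into per-datanode counts.
  // Index 0 is unused because the datanodes are numbered from 1.
  static int[] perDatanode(Map<String, Integer> pipelinesPerSet, int datanodes) {
    int[] counts = new int[datanodes + 1];
    for (Map.Entry<String, Integer> entry : pipelinesPerSet.entrySet()) {
      for (String dn : entry.getKey().split("-")) {
        counts[Integer.parseInt(dn)] += entry.getValue();
      }
    }
    return counts;
  }

  // The "expected" DN-set table from this post.
  static Map<String, Integer> expectedTable() {
    Map<String, Integer> table = new LinkedHashMap<>();
    table.put("1-2-3", 2);
    table.put("1-2-4", 2);
    table.put("1-2-5", 2);
    table.put("1-3-4", 2);
    table.put("1-3-5", 1);
    table.put("1-4-5", 1);
    table.put("2-3-4", 2);
    table.put("2-3-5", 1);
    table.put("2-4-5", 1);
    table.put("3-4-5", 2);
    return table;
  }

  public static void main(String[] args) {
    System.out.println("upper bound: " + maxPipelines(5, 10, 3));
    int[] counts = perDatanode(expectedTable(), 5);
    for (int dn = 1; dn <= 5; dn++) {
      System.out.println("DN " + dn + ": " + counts[dn]);
    }
  }
}
```

The per-datanode sums come out as 10, 10, 10, 10 and 8, which matches the expected table and adds up to 48 slots, i.e. 16 pipelines of 3 members each.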

Montura (Author) · 2024-02-21T11:10:32Z

There is a test org.apache.hadoop.ozone.TestMiniOzoneClusterFairPipelineDistribution

It starts a MiniOzoneCluster with different numbers of DNs and verifies that:

- The difference between the pipeline counts of any two datanodes is at most 3 (because of the RATIS_THREE replication factor).
- If some datanodes differ from the others by more than 3 pipelines, there must be exactly 1 such datanode (cluster_datanode_count % 3 == 1) or 2 such datanodes (cluster_datanode_count % 3 == 2).
- I expect the number of pipelines on these datanodes to be greater than zero.

For now the test passes, so that the CI build succeeds.

To reproduce the problem, uncomment the following lines (one, two) in the org.apache.hadoop.hdds.scm.pipeline.TestRatisThreePipelineDistribution#verifyPipelineCountForNodes method.

P.S. verifyPipelineCountForNodes verifies that the difference in pipeline count between any two nodes doesn't exceed 3. There may be 1 or 2 nodes with a different pipeline count, and that count must be > 0:

    case 1:
      // todo: Uncomment to make TestMiniOzoneClusterFairPipelineDistribution fail
//      int nodeIdx = nodeIndexesArr[0];
//      assertThat(pipelinesCountPerDn[nodeIdx]).isGreaterThan(0);

and

    case 2:
      int firstNodeIdx = nodeIndexesArr[0];
      int firstNodePipelineCount = pipelinesCountPerDn[firstNodeIdx];
      int secondNodeIndex = nodeIndexesArr[1];
      int secondNodePipelineCount = pipelinesCountPerDn[secondNodeIndex];
      assertEquals(firstNodePipelineCount, secondNodePipelineCount);

      // todo: Uncomment to make TestMiniOzoneClusterFairPipelineDistribution fail
//      assertThat(firstNodePipelineCount).isGreaterThan(0);
//      assertThat(secondNodePipelineCount).isGreaterThan(0);
      break;
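
The core of the check described above can be sketched in plain Java. This is a simplified stand-in, not the code of verifyPipelineCountForNodes: `FairnessCheck`, `isFair` and `allUsed` are hypothetical names, and the handling of the 1 or 2 outlier nodes is left out. It only tests that the largest and smallest per-datanode pipeline counts differ by at most a given amount, and that no datanode is left without pipelines.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for the fairness assertions; not Ozone code.
class FairnessCheck {

  // True when every pair of datanodes differs by at most maxDiff pipelines,
  // which is the same as max - min <= maxDiff.
  static boolean isFair(int[] pipelinesPerDatanode, int maxDiff) {
    int min = Arrays.stream(pipelinesPerDatanode).min().orElse(0);
    int max = Arrays.stream(pipelinesPerDatanode).max().orElse(0);
    return max - min <= maxDiff;
  }

  // True when every datanode takes part in at least one pipeline.
  static boolean allUsed(int[] pipelinesPerDatanode) {
    return Arrays.stream(pipelinesPerDatanode).allMatch(count -> count > 0);
  }

  public static void main(String[] args) {
    int[] current = {10, 10, 10, 0, 0};   // two late datanodes left out
    int[] expected = {10, 10, 10, 10, 8}; // distribution proposed in this thread
    System.out.println("current fair: " + isFair(current, 3)
        + ", all used: " + allUsed(current));
    System.out.println("expected fair: " + isFair(expected, 3)
        + ", all used: " + allUsed(expected));
  }
}
```

With the numbers from the first post, the current distribution fails both checks and the expected distribution passes both.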

Montura (Author) · 2024-02-21T11:10:52Z

@siddhantsangwan said:

Basically, this is because of long-lived Ratis pipelines, and there are some open Jiras about this. Ratis pipelines are "long-lived", meaning that they aren't closed regularly once created.
Right now, you can try manually closing pipelines through the CLI or increase the configured limit of pipelines per Datanode.
We already close pipelines that stay ALLOCATED for too long.
EC pipelines won't face this issue. They're not pre-created and should get closed regularly as the containers they're associated with get full and close.

More thoughts about reopening pipelines that don't take part in data writing on cluster startup:

One problem I see is that moving pipelines from OPEN -> CLOSE when they were just created will end up wasting the work Ratis just did to create the pipeline.
We try to pre-create pipelines so that the cluster will have them available for good write performance. Multiple pipelines can be mapped to multiple disks in a DN, which increases overall disk utilization.
But as you said, in your situation there is a time slot where the pipelines are OPEN but no clients are writing yet, and you don't need all pipelines available immediately. So, how about slowing down the rate of pipeline creation using ozone.scm.pipeline.creation.interval? The default is 120 seconds. This will allow time for all your DNs to come up.

Montura Feb 21, 2024
Author

One more question about pipeline creation:

There is a PipelinePlacementPolicy, and HDDS-4710 changed it from using the first node to using a random one.

The problem described there looks the same: "I was trying to configure a four-node cluster to test SCM HA, after configuring the ozone.datanode.pipeline.limit to 3 and 5 separately, the pipeline existed only chose the first three data nodes, the fourth node was never chosen.(They are equal nodes)"

So, if I increase ozone.scm.pipeline.creation.interval, might I run into the problem that a random datanode is selected rather than the first one (DNs are selected from a list sorted in ascending order of pipeline count)?

I mean that the 1 or 2 extra DNs (those without any pipelines in my current scenario) will be chosen randomly for a new pipeline instead of being picked one by one from the sorted list of DNs.

Or am I wrong?

Montura Feb 28, 2024
Author

@siddhantsangwan, what about this question? Could you comment, please?

Montura (Author) · 2024-02-21T11:12:55Z

@nandakumar131 said:

Currently we don't have logic to redistribute pipelines if new datanodes (fewer than 3) register while the existing datanodes are already at pipeline capacity.
We want to address this; we have Jiras for it:
- https://issues.apache.org/jira/browse/HDDS-4689
- https://issues.apache.org/jira/browse/HDDS-6225

How to fix the current issue for now:

Currently, you can delay the pipeline creation in your cluster to address this issue. We can change the config such that the pipeline creation is started after all the datanodes are registered (hopefully):
- Setting hdds.scm.safemode.pipeline.creation to false will make sure that pipelines are not created until the SCM is out of safemode.
- hdds.scm.wait.time.after.safemode.exit can be set to a higher value; this will delay the pipeline creation even if SCM is out of safemode. The default is 5m.

Once you give the datanodes enough time to register before the pipeline creation process starts, the pipelines will be distributed.
This might affect the cluster readiness time after restart. Writes might be blocked until there is at least one pipeline ready.

siddhantsangwan (Collaborator) · 2024-02-22T07:35:48Z

@Montura if you choose to go with these configurations suggested by Nanda:

- Setting hdds.scm.safemode.pipeline.creation to false will make sure that pipelines are not created until the SCM is out of safemode.
- hdds.scm.wait.time.after.safemode.exit can be set to a higher value; this will delay the pipeline creation even if SCM is out of safemode. The default is 5m.

All your Datanodes will have enough time to get registered and be part of pipelines. Then, if you have multiple racks, the default PipelinePlacementPolicy uses the first node from the list of Datanodes sorted by pipeline load as the anchor node.

If you don't have multiple racks, Datanodes are picked randomly, which is eventually expected to lead to fair pipeline distribution.
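
To illustrate why waiting for all Datanodes matters, here is a toy model of load-based selection. It is not the PipelinePlacementPolicy implementation: `GreedyPipelineModel` and `createPipelines` are hypothetical names, and the model ignores racks, node health and the anchor-node logic. It assumes that each new pipeline simply takes the 3 least-loaded Datanodes that are still below the per-Datanode limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model of least-loaded selection; an assumption for illustration,
// not how Ozone's PipelinePlacementPolicy is actually implemented.
class GreedyPipelineModel {

  // Keeps creating pipelines on the `factor` least-loaded datanodes until
  // fewer than `factor` datanodes are below `limit`. Returns the final load.
  static int[] createPipelines(int datanodes, int limit, int factor) {
    int[] load = new int[datanodes];
    while (true) {
      List<Integer> candidates = new ArrayList<>();
      for (int dn = 0; dn < datanodes; dn++) {
        if (load[dn] < limit) {
          candidates.add(dn);
        }
      }
      if (candidates.size() < factor) {
        return load; // no room for another pipeline
      }
      // Stable sort: ties between equally loaded nodes keep index order.
      candidates.sort(Comparator.comparingInt(dn -> load[dn]));
      for (int i = 0; i < factor; i++) {
        load[candidates.get(i)]++;
      }
    }
  }

  public static void main(String[] args) {
    // Only 3 datanodes registered when creation starts: they fill up alone.
    System.out.println(Arrays.toString(createPipelines(3, 10, 3)));
    // All 5 datanodes registered first: the load spreads over every node.
    System.out.println(Arrays.toString(createPipelines(5, 10, 3)));
  }
}
```

In this model, 3 registered Datanodes end at 10 pipelines each, so two late Datanodes would find no partner with free capacity. With all 5 registered up front, the loads end at 10, 10, 10, 9 and 9, which is 48 slots or 16 pipelines.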

Montura Feb 22, 2024
Author

@siddhantsangwan, I've tried setting:

- hdds.scm.safemode.pipeline.creation to false
- hdds.scm.wait.time.after.safemode.exit to 30 seconds and more

It doesn't work, because SCM leaves safe mode as soon as at least 3 datanodes are ready (this is the value set for HDDS_SCM_SAFEMODE_MIN_DATANODE in MiniOzoneCluster here).

So when BackgroundPipelineCreator starts, it sets oneShot = true and begins creating pipelines. It doesn't wait for hdds.scm.wait.time.after.safemode.exit on its first run; it waits for it only in the following iterations:

1. BackgroundPipelineCreator is notified by NewNodeHandler with the NEW_NODE_HANDLER_TRIGGERED event
2. BackgroundPipelineCreator is notified by SCMSafeModeManager with the PRE_CHECK_COMPLETED event
3. BackgroundPipelineCreator sets the oneShot field to true
4. BackgroundPipelineCreator starts creating pipelines immediately after SCM exits safe mode

So, when I change the value of HDDS_SCM_SAFEMODE_MIN_DATANODE to the number of datanodes in the cluster, everything works.

I don't even need to set hdds.scm.safemode.pipeline.creation or hdds.scm.wait.time.after.safemode.exit. I just don't leave SCM safe mode until all datanodes in the cluster are healthy.

Is this a valid approach too?

And yes, I understand that this way there won't be any pipelines (for reads or writes) until all DNs are ready.

siddhantsangwan Feb 29, 2024
Collaborator

OK, sounds like you've got it working! This approach seems valid.
