Weight-Aware Globally-Even Distribute Rebalancer (WAGED Rebalancer)
We propose a new weight-aware, globally-even distribute rebalancer to better meet applications' requirements. Compared with the existing CRUSH-based Full-Auto rebalancers, the WAGED rebalancer has the following major improvements.
- Partition weight-aware
The WAGED rebalancer calculates the partition assignment according to each partition's resource usage. Helix users can configure multi-dimensional weights for a partition's resource usage. In addition, they set up the corresponding instance capacities.
- Globally-even distribution
The WAGED rebalancer optimizes the partition assignment in terms of total partition weight across all the instances. This means all the partitions (replicas) in different Helix resources are rebalanced together. Since these partitions share the same physical resources, the rebalancer should assign partitions based on a global resource usage view.
- The trade-off between partition movements and even distribution
The WAGED rebalancer minimizes unnecessary partition movements. In addition, it allows Helix users to specify a preference between evenness and partition movement. In short, the rebalancer can enforce strictly even distribution at the cost of more partition movements, or it can reduce partition movements at the cost of a less even distribution. In a production environment, a stateless system would prefer evenness, since the cost of moving a partition is low. On the other hand, a stateful system would prefer to keep the current data location even if that hurts evenness. Moreover, the new rebalancer supports cluster reconfiguration gracefully: it minimizes the effort required for migration by remembering the rebalance state and adjusting the assignment incrementally.
- Flexibility
The WAGED rebalancer is extensible. It can be configured or extended to fit different rebalance requirements. Its interfaces are more generic, so developers can even swap in a different rebalance algorithm without modifying the rebalancer's main logic.
As the name suggests, "Weight-Aware" and "Globally-Even" are the most important goals of the new rebalancer. Firstly, partition weight-awareness enables the rebalancer to understand the real system workload. The partition weight should be customizable and multi-dimensional. Helix users need to specify per-partition weights based on the expected system resource usage, such as CPU, memory, or storage usage. In addition, corresponding to the weights, the users are required to specify the instance capacities. As an example, the weight/capacity configurations would look like the following.
Capacity / weight example
Instance Capacity:
- CPU: 1000
- MEMORY: 5000
- STORAGE: 500
DB Partition Weight:
- CPU: 50
- MEMORY: 100
- STORAGE: 10
As for globally-even distribution, we expect all the instances in the cluster to have a uniform workload distribution. Global evenness becomes increasingly important when the rebalancer calculates based on multi-dimensional usage. For example, if two Helix resources have very different usage profiles, the rebalancer can co-locate their partitions so that overall system utilization improves.
We want a sticky and consistent partition assignment, which for Helix means minimal management cost. However, strictly forbidding movement is not optimal either. First of all, the partitions on deactivated instances have to be moved out. Secondly, the rebalancer might need to move some partitions to keep the distribution even.
To be more precise, the goal is to minimize EXTRA partition movement. By "extra", we mean partition movements between two unchanged and already load-balanced instances. In theory, these movements are unnecessary.
Note the difference between extra movement and optional movement. For example, once a resource is removed, should the rebalancer trigger partition movements to fill the instances that become idle? These partition movements are optional but might benefit the even distribution of the workload. The WAGED rebalancer avoids extra partition movements, but it allows optional movements if evenness requires them.
It is very important that the WAGED rebalancer keeps the same assumptions and capabilities as the existing rebalancers, for example rack awareness, delayed rebalancing, and partition movement throttling. If any of these features were missing, applications might need to change their logic for the migration.
Moreover, besides the rebalance logic, the WAGED rebalancer should calculate the result as fast as the existing ones. If an emergency happens, the rebalancer needs to calculate the new mapping within 100ms or less. Otherwise, there is not enough time to proceed with the Best Possible assignment and send the state transition messages. Considering that the WAGED rebalancer needs to process more information, our goal is not to shorten the calculation time but to keep it the same as it is now.
Last but not least, the WAGED rebalancer should be easy to use and fully integrated with the Helix Full-Auto rebalance mode. It should require only minor changes to user application logic, and the additional steps required to activate the new rebalancer happen only once, during the initial configuration.
We propose to implement the WAGED rebalancer based on a constraint-based rebalance algorithm. The basic idea of the constraint-based algorithm is to evaluate each possible partition allocation against a set of constraints. The partition is then assigned to the instance with the highest evaluated score. One reason to choose the constraint-based rebalance algorithm is its speed: the time complexity is O(number_constraints * number_total_replicas * number_instances), which helps Helix rebalance fast. In addition, the constraint-based algorithm provides a flexible and extensible framework. For example, to make the rebalancer aware of partition weight, we just need to add a new constraint for partition weight.
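To make the evaluation flow concrete, below is a minimal sketch of the greedy loop. The HardConstraint and SoftConstraint interfaces and all names here are illustrative stand-ins, not the final Helix API.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the constraint-based greedy assignment loop.
// All names here are illustrative stand-ins, not the final Helix API.
interface HardConstraint {
  boolean isSatisfied(String instance, String replica);
}

interface SoftConstraint {
  double score(String instance, String replica); // normalized score
}

class GreedyAssigner {
  // O(number_constraints * number_total_replicas * number_instances):
  // every replica is evaluated against every instance by every constraint.
  Map<String, String> assign(List<String> replicas, List<String> instances,
      List<HardConstraint> hardConstraints, Map<SoftConstraint, Double> softWithCif) {
    Map<String, String> assignment = new HashMap<>();
    for (String replica : replicas) {
      String bestInstance = null;
      double bestScore = -1;
      for (String instance : instances) {
        // Hard constraints act as filters: any failure rejects the proposal.
        if (hardConstraints.stream().anyMatch(h -> !h.isSatisfied(instance, replica))) {
          continue;
        }
        // Soft constraint scores are aggregated, weighted by their CIFs.
        double score = softWithCif.entrySet().stream()
            .mapToDouble(e -> e.getKey().score(instance, replica) * e.getValue())
            .sum();
        if (score > bestScore) {
          bestScore = score;
          bestInstance = instance;
        }
      }
      if (bestInstance != null) {
        assignment.put(replica, bestInstance); // highest-scoring instance wins
      }
    }
    return assignment;
  }
}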
However, the constraint-based rebalance algorithm alone is not enough. Our current rebalancer's workflow has several limitations. For example, the pipeline triggers a rebalance for every single resource repeatedly and separately. Although the input of the rebalance call contains the complete Cluster Data Cache, the cache object does not track the ongoing partition assignment, so globally-even rebalancing is not feasible. As another example, the rebalance pipeline is executed on every type of cluster change event. Many of those events, such as Current State changes, should not impact the assignment. However, since the rebalancer is triggered anyway, it performs unnecessary rebalancing. This extra rebalancing not only wastes computing resources but also makes the rebalancer more conservative about adjusting the assignment; in short, a deterministic algorithm becomes a must. Because of these concerns, we propose to implement a new rebalancer instead of developing a new Full-Auto rebalance strategy. The Helix event processing pipeline will also be refactored accordingly.
The minimal set of steps to activate the WAGED rebalancer is as follows (a combined configuration sketch follows the list).
- Configuring the capacity keys in the Cluster Config. For example, DISK_USAGE_GB, etc.
ClusterConfig.setInstanceCapacityKeys(List<String> capacityKeys)
Note that Helix does not interpret the meaning of these capacity keys; they are only used to match the instance capacity against the partition weight.
- Configuring the instance capacity. For example, DISK_USAGE_GB = 1000. A cluster-wide default is set in the Cluster Config and can be overridden per instance in the Instance Config.
ClusterConfig.setDefaultInstanceCapacityMap(Map<String, Integer> capacityDataMap)
- Configuring the partition weight. For example, DISK_USAGE_GB = 20. A cluster-wide default is set in the Cluster Config and can be overridden per resource in the Resource Config.
ClusterConfig.setDefaultPartitionWeightMap(Map<String, Integer> weightDataMap)
- Modifying the Helix resource IdealStates to set the WAGED rebalancer class name.
For example,
{ "id" : "DBName_1", "simpleFields" : { "IDEAL_STATE_MODE" : "AUTO_REBALANCE", "MIN_ACTIVE_REPLICAS" : "2", "NUM_PARTITIONS" : "1024", "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", "REBALANCE_MODE" : "FULL_AUTO", "REPLICAS" : "3", "STATE_MODEL_DEF_REF" : "LeaderStandby" }, "mapFields" : {}, "listFields" : {} }
For a high-level understanding of how the WAGED rebalancer works, please check the following workflow diagram. Note that this workflow combines the functionality of multiple components.
The Rebalance Coordinator controls the workflow of the rebalance process. It depends on multiple independent modules to calculate the Best Possible assignment. As shown in the architecture diagram, after the BestPossibleState Calculation Stage makes a rebalance call, the coordinator initializes and calls the related components.
Rebalance Workflow
The following flowchart demonstrates how the Rebalance Coordinator conducts a rebalance process. As shown in the diagram, there are two branches that both calculate a partition assignment. This design achieves fast rebalance speed while still ensuring that the assignment eventually converges to the optimized one.
Global Baseline Calculation
Calculates a globally optimized partition assignment. The Global Baseline Calculation does not consider any temporary status, such as participants being offline or disabled, which reduces the randomness of the partition assignment. The calculation result is called the Baseline assignment, and it is used as the anchor of partition allocation.
The Global Baseline Calculation is only triggered when a substantial permanent cluster change happens, for example when the cluster topology changes or new resources are added. For each calculation, the algorithm re-assigns all the partitions. Note that this does not imply that all the existing partitions will be shuffled: the Global Baseline Calculation takes the previous Baseline assignment as an input, and the intention is only to give the algorithm a chance to re-allocate partitions if necessary. For example, once a new instance joins the cluster, most of the existing partitions keep their previous allocation unless the distribution becomes uneven. It is the algorithm that makes the decision about partition movement.
We have also discussed leveraging a more advanced rebalance algorithm for the Baseline calculation, for example one based on machine learning. In that case, the Baseline calculation is likely to be delayed, and the WAGED rebalancer should rely on the Partial Rebalance to calculate an intermediate result before the Baseline is ready. The rebalancer can always update the Best Possible assignment to the optimal one once the Baseline calculation is done. Overall, the main obstacle is the advanced rebalance algorithm itself; once we have a good candidate, it would be easy to plug it into the workflow.
The Baseline is not directly propagated to the final output. It is consumed by the Partial Rebalance as an important parameter.
Partial Rebalance
Calculates the Best Possible assignment output based on the Baseline and the previous Best Possible assignment.
The Partial Rebalance is triggered on all substantial cluster changes, including cluster topology changes, resource config changes, instance state changes, and Baseline assignment changes. For other trivial changes, such as Current State changes, the Best Possible assignment should stay the same, so there is no need to run the rebalance algorithm; the rebalancer directly returns the previously calculated result. We propose to leverage the constraint-based rebalance algorithm to reassign the partitions greedily for the sake of rebalancing speed.
As the name suggests, the Partial Rebalance is done within a certain rebalance scope. The coordinator compares the previous Best Possible assignment with the current cluster state so as to derive a minimal rebalance scope. In short, the rebalance scope contains only the following two types of partitions (see the sketch after this list).
- Partitions whose current assignment has become invalid.
- Partitions that have new assignments in the Baseline which do not exist in the current assignment.
Note that since multiple changes can happen during the rebalance interval, the rebalancer merges the corresponding rebalance scopes and finishes the calculation in one rebalance process.
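A sketch of the scope derivation under simplified, assumed inputs: each partition maps to a single instance, and a plain set of active instances stands in for the full assignment validity checks.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A sketch of deriving the minimal rebalance scope. For simplicity each
// partition maps to a single instance, and an active-instance set stands
// in for the full assignment validity checks.
class RebalanceScope {
  static Set<String> derive(Map<String, String> previousBestPossible,
      Map<String, String> baseline, Set<String> activeInstances) {
    Set<String> scope = new HashSet<>();
    // Type 1: the partition's current assignment has become invalid.
    previousBestPossible.forEach((partition, instance) -> {
      if (!activeInstances.contains(instance)) {
        scope.add(partition);
      }
    });
    // Type 2: the Baseline contains assignments that do not exist in the
    // current assignment, e.g. newly added partitions.
    for (String partition : baseline.keySet()) {
      if (!previousBestPossible.containsKey(partition)) {
        scope.add(partition);
      }
    }
    return scope;
  }
}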
About Persisting Assignments
One concern about persisting the assignments is the additional latency and extra ZooKeeper throughput. However, considering that persisting is only required when a new Best Possible assignment has been calculated, the extra cost should be minor. Moreover, we plan to further optimize the persisting mechanism with compression.
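As a rough illustration of the compression idea (not the actual Helix serializer), a serialized assignment record could be GZIP-compressed before the ZooKeeper write:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// A rough illustration of compressing a serialized assignment record
// before persisting it to ZooKeeper. Not the actual Helix serializer.
class AssignmentRecordCompressor {
  static byte[] compress(byte[] serializedAssignment) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(serializedAssignment);
    }
    return buffer.toByteArray(); // write this to the assignment ZNode
  }
}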
The Cluster Change Detector is responsible for determining what has changed. The result is used to choose the correct rebalance approach. In general, the detector keeps a snapshot of the previous Cluster Data Cache and compares the old snapshot with the new one to find the difference. The comparison is done based on ZooKeeper node versions, as demonstrated in the following diagram. A content-based comparison might be necessary for some frequently modified ZNodes.
Input
The Cluster Data Cache. Optionally, the paths of interest can also be provided when initializing the detector.
Output
The ZNode paths that have been updated since the previous detect call.
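A simplified sketch of the version-based comparison; here the Cluster Data Cache snapshot is reduced to a hypothetical map from ZNode path to node version.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A simplified sketch of version-based change detection. The snapshot is
// reduced to a hypothetical map from ZNode path to node version; deletions
// and content-based comparison are omitted for brevity.
class ClusterChangeDetector {
  private Map<String, Integer> previousVersions = new HashMap<>();

  // Returns the paths that were created or updated since the last call.
  Set<String> detect(Map<String, Integer> currentVersions) {
    Set<String> changedPaths = new HashSet<>();
    for (Map.Entry<String, Integer> entry : currentVersions.entrySet()) {
      if (!entry.getValue().equals(previousVersions.get(entry.getKey()))) {
        changedPaths.add(entry.getKey());
      }
    }
    previousVersions = new HashMap<>(currentVersions);
    return changedPaths;
  }
}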
Note that during the very first rebalance after a Controller acquires leadership, the detector does not have an old snapshot, so the rebalancer always triggers a global rebalance. One potential problem is whether the controller leadership switch would cause large-scale partition shuffling. To prevent this, the Baseline assignment and the previous Best Possible assignment need to be persisted in ZooKeeper. Moreover, the constraint-based rebalance algorithm only moves partitions when necessary.
The Cluster Data Provider generates a Cluster Model based on the Cluster Data Cache. The main reason we cannot use the cache directly is that we need a runtime object to track the pending assignment changes. Moreover, most of the information in the Cluster Data Cache is irrelevant to the rebalancer. Besides the necessary cluster status information, the Cluster Model contains additional transient state for optimizing the algorithm. For example, it keeps track of the partitions in each fault zone, so when the rebalance algorithm searches for potential fault zone conflicts, it does not need to iterate over all the instances' partition lists. In addition, even the cluster information in the Cluster Model might be altered according to the rebalancer's requirements. For instance, the delayed rebalance logic impacts the instances' state in the Cluster Model.
Input
Cluster Data Cache.
Output
A Cluster Model, which contains the following objects (a structural sketch follows the list).
- Assignable Node
Each active instance is recorded as an assignable node. An assignable node contains information such as the instance Domain and capacity.
- Assignable Replica
Each replica of a partition is considered an assignable object. The replica object contains the weight of the partition; we assume all the replicas of one partition have the same weight. The weight fields should match the capacity fields of the Assignable Node. Note that unlike the existing rebalancers, the WAGED rebalancer generates the partition assignment with the state already assigned.
- Cluster Context
Besides node and replica information, we also need to record some global information, for example per-fault-zone partition lists. All such global state goes into the Cluster Context.
- Assignment history
This includes the Baseline assignment and the previous Best Possible assignment. Note that unlike the other records, these two states are not available from the Cluster Data Cache. The rebalancer uses the Assignment Metadata Datastore to access the assignments.
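A structural sketch of the Cluster Model described above; the field names are illustrative and the real classes carry more state.

import java.util.List;
import java.util.Map;
import java.util.Set;

// A structural sketch of the Cluster Model. Field names are illustrative;
// the real classes carry more state.
class AssignableNode {
  String instanceName;
  String faultZone;              // derived from the instance Domain
  Map<String, Integer> capacity; // e.g. {CPU: 1000, MEMORY: 5000, STORAGE: 500}
}

class AssignableReplica {
  String resourceName;
  String partitionName;
  String state;                  // the replica state is assigned up front
  Map<String, Integer> weight;   // keys must match the node capacity keys
}

class ClusterContext {
  // Global transient state, e.g. partitions tracked per fault zone so the
  // algorithm can detect fault zone conflicts without scanning every node.
  Map<String, Set<String>> partitionsByFaultZone;
}

class ClusterModel {
  List<AssignableNode> assignableNodes;
  List<AssignableReplica> assignableReplicas;
  ClusterContext clusterContext;
  // Assignment history, loaded via the Assignment Metadata Datastore.
  Map<String, Map<String, String>> baselineAssignment;
  Map<String, Map<String, String>> bestPossibleAssignment;
}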
Conceptually, the Assignment Metadata Datastore is a write-through cache that persists the Baseline assignment and the Best Possible assignment. For a large cluster, the persisted data size might be large, so we need to evaluate the performance impact and optimize the datastore.
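The write-through behavior could look like the following sketch, where every update is persisted to the backing store before the in-memory copy is replaced; the persist call is a placeholder, not a real Helix or ZooKeeper API.

import java.util.HashMap;
import java.util.Map;

// A sketch of the write-through cache semantics. Reads are served from
// memory; every update is persisted before the in-memory copy is replaced.
// persist() is a placeholder for the actual ZooKeeper write.
class AssignmentMetadataStore {
  private Map<String, Map<String, String>> baseline = new HashMap<>();
  private Map<String, Map<String, String>> bestPossible = new HashMap<>();

  Map<String, Map<String, String>> getBaseline() {
    return baseline; // no ZooKeeper read on the hot path
  }

  void updateBaseline(Map<String, Map<String, String>> newBaseline) {
    persist("BASELINE", newBaseline); // write through first
    baseline = newBaseline;
  }

  void updateBestPossible(Map<String, Map<String, String>> newAssignment) {
    persist("BEST_POSSIBLE", newAssignment);
    bestPossible = newAssignment;
  }

  private void persist(String recordName, Map<String, Map<String, String>> record) {
    // Placeholder: serialize, optionally compress, and write to ZooKeeper.
  }
}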
The Rebalance Coordinator calls the rebalance algorithm through a generic interface. Our goal is to decouple the rebalancer from the algorithm details. Based on this interface, an adapter helps the rebalancer use different algorithms in different scenarios.
Input of the rebalance interface
The Cluster Model.
Output of the rebalance interface
The partition assignment.
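In that spirit, the generic interface could be as small as the following sketch, reusing the illustrative ClusterModel type from the earlier sketch; the output shape is one assumption of how an assignment could be represented.

import java.util.Map;

// A sketch of the generic algorithm interface. The output maps
// resource -> partition -> (instance -> state); this shape is an
// illustrative assumption.
interface RebalanceAlgorithm {
  Map<String, Map<String, Map<String, String>>> calculate(ClusterModel clusterModel);
}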
The constraint-based rebalance algorithm is a greedy algorithm. The basic idea is to search all the possible assignments for a good enough one by applying a set of constraints.
What is a Constraint?
Hard Constraint - Evaluates a partition allocation and returns YES or NO. The hard constraints are used as filters: any proposal that fails one or more hard constraints is rejected.
Soft Constraint - Evaluates a partition allocation and returns a score within a normalized range.
Constraint Importance Factor (CIF) - The rebalance algorithm aggregates the soft constraint results according to their CIFs. A CIF is basically the weight of a constraint; we name it CIF to avoid confusion between the constraint weight and the partition weight. Note that CIF is not applicable to the hard constraints, because their results are either 0 or 1.
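The following sketch shows one way the CIF-weighted aggregation could work; the constraint names and CIF values are made up for illustration.

import java.util.Map;

// A sketch of CIF-based aggregation of soft constraint scores. The
// constraint names and CIF values below are made up for illustration.
class SoftConstraintAggregator {
  // Weighted average of normalized soft constraint scores.
  static double aggregate(Map<String, Double> scores, Map<String, Double> cifs) {
    double weightedSum = 0, cifSum = 0;
    for (Map.Entry<String, Double> entry : scores.entrySet()) {
      double cif = cifs.getOrDefault(entry.getKey(), 0.0);
      weightedSum += entry.getValue() * cif;
      cifSum += cif;
    }
    return cifSum == 0 ? 0 : weightedSum / cifSum;
  }

  public static void main(String[] args) {
    // Example: evenness is three times as important as movement reduction.
    Map<String, Double> scores = Map.of("Evenness", 0.8, "LessMovement", 0.4);
    Map<String, Double> cifs = Map.of("Evenness", 3.0, "LessMovement", 1.0);
    System.out.println(aggregate(scores, cifs)); // (0.8*3 + 0.4*1) / 4 ≈ 0.7
  }
}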