
Virtual Topology Group for Cloud Environment


Abstract

As an increasing number of companies move toward cloud platforms like AWS and Azure, the storage and media units must be able to adapt to the requirements imposed by those cloud providers. One of the main components we are going to use in Azure is the Virtual Machine Scale Set (VMSS). A VMSS allows you to create and manage a group of load-balanced VMs. These VMs are spread across fault domains, which we discuss in this document. We introduce the concept of the virtual topology group, or virtual fault zone, as a virtualization layer on top of the physical fault domain. REST APIs are provided for users to set up or adjust the number of virtual zones, and application rolling upgrade will respect the virtual zone topology. While this virtualization layer does not give us more fault tolerance, it allows more control over fault zones in cloud environments where direct change or override may not be supported.


This design is not restricted to Azure; ideally it is universal to all cloud environments. Below we use Azure-specific terms, but note that most have close equivalents elsewhere.

Background & Motivation

Cloud environments (like Azure) have different assumptions and limitations compared to on-premises environments. Helix has to adapt to these changes and continue to support applications on Azure and other cloud platforms.

Virtual Machine Scale Set (VMSS)

One of the limitations comes with VMSS. In VMSS, there are two concepts related to fault domains: the soft fault domain (SFD) and the hard fault domain (HFD).

Soft Fault Domain (SFD)

Each VMSS contains a maximum of 20 SFDs, which are leveraged so that the VMs inside the VMSS are chosen from 20 racks/hosts. However, Helix and applications do not know about these SFDs; they are only respected for VM freeze and live migration.

Hard Fault Domain (HFD)

The HFD is the same concept as a zone or domain, and it is visible to users and applications. SFDs are not respected for application deployment (rolling upgrade); in contrast, HFDs are used for application deployment and rolling upgrades. In Azure, each VMSS has 5 HFDs by default.

Some facts about SFD and HFD:

  • Each VMSS does provide 20 FDs, but they are “soft” FDs: VMs are selected from 20 racks/hosts, but users may not know about it. Explicitly, we can still only see 5 “hard” FDs.
  • VM freeze and live migration respect the 20 “soft” FDs, which guarantees that no more than 5% of VMs are frozen at the same time.
  • SFDs are transparent to the application during rolling upgrade or deployment, so deployment is limited to the 5 HFDs.

For the rest of this document, we use FD instead of HFD for simplicity. Since the SFD is completely transparent to us, it is not covered in this design.

Limited Fault Domain

Helix currently assigns at most one replica of each resource partition per fault zone, since zone-aware assignment tries to guarantee system availability through independent hardware (different zones). If the HFD count is 5, applications can only have up to 5 replicas; otherwise the rule is violated by the pigeonhole principle, and Helix will not generate any assignment, since zone awareness is currently a hard constraint. We could always change the logic to support this case, but even if we allowed it, replicas might be distributed unevenly across fault zones, causing a drop in availability during application rolling upgrade. Please also see the section Alternative approach.

Request Scattering

From the user's perspective, one big challenge is to keep the fanout size constrained when scaling the storage cluster horizontally. If we blindly increase the total number of storage nodes without any special partition assignment, the fanout size increases accordingly, which decreases cluster capacity significantly and makes the cluster not horizontally scalable. There is a tight relationship between fanout size and cluster capacity: as the fanout size increases, the chance of hitting long-tail latency increases, which eventually leads to more retries and more resource usage per request, so capacity does not grow along with the cluster expansion.

To increase the capacity of the storage system, we can keep increasing the replication factor and the number of fault zones to scale capacity linearly. In these large-fanout clusters, we enforce that the replication factor equals the total number of fault domains/zones, so when Helix assigns replicas of the same partition to different fault zones, each zone is guaranteed to take exactly one replica. Eventually each fault zone holds a full copy of a given database, which bounds the fanout size when serving a large batch-get request.

It is in the user's interest to continue this pattern to scale clusters horizontally, even though the VMSS HFD count cannot grow linearly the way it does for on-premise cases, and to align this capability between on-prem and Azure.

Problem Statement

With the upper bound and inflexibility of VMSS HFDs and the invisibility of SFDs to applications, we need a way to override or change the Azure topology configuration.

We need a solution for Helix to continue supporting an increasing replication factor without the environment-specific physical or hardware limitations explained above.

Proposal

We propose a virtualization layer on top of the physical fault zones so that we have control over the number of virtual FDs no matter how the physical FDs are set up. Application teams may need to adjust it from time to time for scaling purposes.

This is not to increase fault tolerance, which is still bound by physical FDs.

We’ll need a REST API for control, and it is a manual process whenever:

  1. The API is called for a cluster for the first time
  2. More nodes are added to the virtual zones
  3. More virtual zones are added
  4. The cluster shrinks

Helix will assign one replica per virtual group. Application rolling upgrade will respect the virtual group.

Alternative approach

Theoretically, as an alternative, we could allow the rebalancer to assign more than one replica per physical fault zone to ensure request scattering. This is an orthogonal approach that overcomes the HFD limitation on the rebalancer side: multiple replicas may live in one fault zone, while the rebalancer computes the ideal assignment for each resource partition. How the Helix rebalancer should handle this case requires further evaluation of the impact. But even with this, application rolling upgrades can only respect the HFD (because the SFD is invisible to the app), which means more than one replica may go down at a time during deployment.

We conclude this approach is not applicable because the SFD is invisible to applications and the HFD has an upper bound. Helix itself currently does not block us from taking this approach.

Scope

The solution is designed to unblock the immediate issue of onboarding Azure, but it should be universal to all "cloud environments"; it does not imply a specific cloud platform or even a public cloud. As long as the cluster nodes can provide the necessary information, the environment supports the required cloud features from Helix's perspective. This feature is NOT designed for and NOT supported in on-prem use cases, where fault zone assignment is done manually and rack and host information is not automatically populated; there, the replication factor and fault zones can be kept in sync because SREs and sysops have control over the fault zone setup.

Architecture/Implementation

Related Helix Configs

Instance config

The topology information of each instance lives inside its instance config. As shown in the following code, the Helix controller determines which fault domain each instance belongs to by looking at the “DOMAIN” field of the instance config. In our example, this instance belongs to zone_E.

{
    "id": "app1",
    "listFields": {},
    "mapFields": {},
    "simpleFields": {
        "DOMAIN": "zone=zone_E,instance=Participant_E_160",
        "HELIX_ENABLED": "true",
        "HELIX_HOST": "app1",
        "HELIX_PORT": "1234"
    }
}

Cluster config

The simple fields in the cluster config contain two fields relevant to partition and replica placement. One is “TOPOLOGY”, which defines the order in which the instance config domain is sorted; for example, in the code below, the instance config domain is sorted as /zone/instance. The other important field is "FAULT_ZONE_TYPE", which tells rebalancers to use the "zone" information provided in the instance config to place the replicas of the partitions. So in the following example, each "zone" can have a maximum of one replica per partition.

"simpleFields": {
        ...
        "FAULT_ZONE_TYPE": "zone",
        "MAX_OFFLINE_INSTANCES_ALLOWED": "3",
        "NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT": "2",
        "PERSIST_BEST_POSSIBLE_ASSIGNMENT": "TRUE",
        "TOPOLOGY": "/zone/instance",
        "TOPOLOGY_AWARE_ENABLED": "TRUE",
    }

CloudConfig

CloudConfig is associated with a cluster; it records whether the cluster is inside a cloud environment, the cloud provider, etc. We’ll need this CloudConfig to validate the current environment.

{
  "id" : "CloudConfig",
  "simpleFields" : {
    "CLOUD_ENABLED" : "true",
    "CLOUD_PROVIDER" : "AZURE",
    "CLOUD_ID" : "..."
  },
  "mapFields" : { },
  "listFields" : { }
}

System diagram

The system diagram is shown below; we will walk through some key steps.

Validation

We check against CloudConfig and only proceed if we are in a cloud environment. The number of virtual groups to assign must not exceed the number of instances in the cluster, and cluster topology-aware rebalance must be enabled.
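
A minimal sketch of these checks, assuming the relevant flags and counts have already been read from CloudConfig and the cluster config (the class and method names here are illustrative, not the actual implementation):

// Sketch only: class and parameter names are illustrative placeholders.
public class VirtualGroupValidation {
    static void validate(boolean cloudEnabled, boolean topologyAwareEnabled,
                         int numInstances, int numVirtualGroups) {
        if (!cloudEnabled) {
            // Mirrors the CloudConfig CLOUD_ENABLED check.
            throw new IllegalStateException("Virtual topology group requires a cloud environment");
        }
        if (!topologyAwareEnabled) {
            // Mirrors the cluster config TOPOLOGY_AWARE_ENABLED check.
            throw new IllegalStateException("Topology-aware rebalance must be enabled");
        }
        if (numVirtualGroups > numInstances) {
            throw new IllegalArgumentException("Number of virtual groups exceeds number of instances");
        }
    }
}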

Maintenance mode

We want to make sure the cluster is in maintenance mode while doing this critical configuration change. Controlled by a boolean parameter, we provide two modes for this:

  1. Auto maintenance mode – the cluster enters and exits maintenance mode itself within the service
  2. User control – users control the maintenance mode either manually or automatically, while the service does not modify it

In either case, we ensure the cluster maintenance mode is on before making any config changes, to avoid unexpected intervention. If mode 1 is used and the cluster is already in maintenance mode, an exception will be thrown.
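
As an illustration of this control flow, a minimal sketch is given below; ClusterOps and its methods are hypothetical placeholders rather than real Helix APIs.

// Sketch of the maintenance-mode handling described above; ClusterOps and its
// methods are hypothetical placeholders, not real Helix APIs.
interface ClusterOps {
    boolean isInMaintenanceMode();
    void enterMaintenanceMode();
    void exitMaintenanceMode();
    void applyVirtualTopologyConfigChange();
}

class MaintenanceGuard {
    static void updateWithGuard(boolean autoMaintenanceModeDisabled, ClusterOps ops) {
        if (!autoMaintenanceModeDisabled) {
            // Mode 1 (auto): the service manages maintenance mode itself.
            if (ops.isInMaintenanceMode()) {
                // Fail fast if the cluster is already in maintenance mode.
                throw new IllegalStateException("Cluster is already in maintenance mode");
            }
            ops.enterMaintenanceMode();
            try {
                ops.applyVirtualTopologyConfigChange();
            } finally {
                ops.exitMaintenanceMode();
            }
        } else {
            // Mode 2 (user control): only verify maintenance mode is already on.
            if (!ops.isInMaintenanceMode()) {
                throw new IllegalStateException("Cluster must be in maintenance mode before the config change");
            }
            ops.applyVirtualTopologyConfigChange();
        }
    }
}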

Configs update

Once we have the virtual zone mapping assignment, we effectively want to change TOPOLOGY and FAULT_ZONE_TYPE in the cluster config, and DOMAIN in the instance config. There are a few options with different tradeoffs. Below, one instance assigned virtualZone=vzone_1 is taken as an example.

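For illustration only, one possible shape of the updated configs is sketched below; since the design weighs several options with different tradeoffs, this exact key layout is an assumption rather than the chosen approach.

Cluster config (sketch):

"simpleFields": {
        ...
        "FAULT_ZONE_TYPE": "virtualZone",
        "TOPOLOGY": "/virtualZone/instance",
        ...
    }

Instance config (sketch, for the instance assigned virtualZone=vzone_1):

"DOMAIN": "virtualZone=vzone_1,zone=zone_E,instance=Participant_E_160"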

API

Once this API is called, the virtual zone info will be updated in the cluster config as discussed above. A calculation is then done to decide the virtual grouping assignment and update all instance configs accordingly.

Java API for using virtual grouping

public void addVirtualTopologyGroup(String clusterName, Map<String, String> customFields)
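
A minimal usage sketch is shown below, assuming the customFields map carries the same parameters as the REST call that follows; the enclosing admin interface and the exact key names are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class AddVirtualGroupExample {
    // Placeholder for whichever admin class exposes addVirtualTopologyGroup.
    interface VirtualGroupAdmin {
        void addVirtualTopologyGroup(String clusterName, Map<String, String> customFields);
    }

    static void run(VirtualGroupAdmin admin) {
        Map<String, String> fields = new HashMap<>();
        fields.put("virtualTopologyGroupName", "virtual_zone");   // assumed keys, mirroring
        fields.put("virtualTopologyGroupNumber", "20");           // the REST parameters below
        fields.put("autoMaintenanceModeDisabled", "true");
        admin.addVirtualTopologyGroup("myCluster", fields);
    }
}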

REST API for using virtual grouping

curl -X POST -H "Content-Type: application/json" http://localhost:12954/admin/v2/clusters/myCluster?command=addVirtualTopologyGroup -d '{"virtualTopologyGroupName": "virtual_zone", "virtualTopologyGroupNumber": 20, "autoMaintenanceModeDisabled": true}'

Parameters

  • virtualTopologyGroupName

Required, the name of the virtual topology group. Each virtual group will be named {virtualTopologyGroupName}_0, {virtualTopologyGroupName}_1, etc.

  • virtualTopologyGroupNumber

Required, the number of virtual topology groups.

  • autoMaintenanceModeDisabled

Optional, default false. Controls whether the cluster automatically enters and exits maintenance mode. If set to true, the service will NOT change the maintenance mode status during the API call, which means it is the caller's responsibility to have the cluster enter and exit maintenance mode. In either case, the cluster must be in maintenance mode prior to the critical config change during the API invocation, otherwise the API will fail.

Virtual topology group mapping

In this step, we take inputs of physical zone mapping and number of virtual groups, then compute the assignment of virtual groups for each instance. The implementation details and analysis will be discussed in the next section. We made a few decisions for the mapping framework and algorithm:

  1. The mapping algorithm is stateless; any stateful implementation is outside the scope of this RFC
  2. The mapping only supports the case where VFD > HFD in this RFC. VMSS allows us to change the HFD count dynamically within its upper bound, so for smaller VFD counts virtual grouping is not needed at all, since we can use the physical fault zones directly.

Algorithms

The goal of the algorithm is to re-assign the instance mapping from #HFD groups to #VFD groups while maintaining reasonable fault tolerance and reducing rolling upgrade impact. There are two scenarios. The first is application rolling upgrade and deployment: because applications respect the replica and virtual zone mapping, only one virtual zone goes down at a time, so we don’t have to worry about it. The second is Azure VM freeze or maintenance, where one HFD goes down (or a rack/switch fails); here we want to make sure the algorithm isolates and minimizes the impact conveyed to the virtual zone layer.

Metrics

If we map 5 HFDs to 7 VFDs, it is inevitable that at least one HFD maps to multiple VFDs. That means in the worst-case scenario, when that HFD goes down, more than one VFD and replica could be down. There is no way we can prevent this, as “hard” fault tolerance is bounded by the underlying physical fault zones. We therefore define a few metrics to evaluate the mapping algorithms.

Evenness

This measures a property of each virtual group: each virtual group should be as similar as possible in terms of the number of instances and the underlying physical zones it covers. (The example diagram here maps physical zones A and B to several virtual groups.)

In the above example, the max instances per group is 2 and the min is 1, so rangeGroupInstances == 1. Also notice that virtual group V2 covers both A and B, while every other group covers only one physical zone, so rangeCoveredZones == 1 and maxCoveredZones == 2.

Fault Tolerance

We want to measure the impact on the virtual zones if one or more physical zones go down. In the above example, whichever physical zone goes down, two virtual zones are impacted, and the mapping is quite symmetric. To formalize, we want to minimize two criteria:

  1. The number of virtual zones one physical zone maps to, measured by maxImpactVirtualZones
  2. The range of impacted virtual zones across physical zones (max - min), measured by rangeImpactVirtualZones

Another view of fault tolerance is the number of fault zone failures the system can keep running through with at least one replica alive. It is bounded by #HFD: if #HFD is 5, the number of failures the system can survive is between 0 and 4, depending on #HFD, #VFD and the grouping. Heuristically, it can be optimized by setting #VFD == k * #HFD and having each VFD cover only one HFD.
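
A sketch of how the evenness and fault tolerance metrics above could be computed from a finished assignment, assuming the mapping is available as instance-to-virtual-zone and instance-to-physical-zone maps (all names here are illustrative):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MappingMetrics {
    // Computes rangeGroupInstances, maxCoveredZones, rangeCoveredZones,
    // maxImpactVirtualZones and rangeImpactVirtualZones for a non-empty assignment.
    static Map<String, Integer> evaluate(Map<String, String> instanceToVirtualZone,
                                         Map<String, String> instanceToPhysicalZone) {
        Map<String, Integer> groupSizes = new HashMap<>();
        Map<String, Set<String>> coveredZones = new HashMap<>();   // virtual zone -> physical zones
        Map<String, Set<String>> impactedZones = new HashMap<>();  // physical zone -> virtual zones

        for (Map.Entry<String, String> e : instanceToVirtualZone.entrySet()) {
            String vZone = e.getValue();
            String pZone = instanceToPhysicalZone.get(e.getKey());
            groupSizes.merge(vZone, 1, Integer::sum);
            coveredZones.computeIfAbsent(vZone, k -> new HashSet<>()).add(pZone);
            impactedZones.computeIfAbsent(pZone, k -> new HashSet<>()).add(vZone);
        }

        Map<String, Integer> metrics = new HashMap<>();
        metrics.put("rangeGroupInstances", max(groupSizes.values()) - min(groupSizes.values()));
        metrics.put("maxCoveredZones", max(sizes(coveredZones)));
        metrics.put("rangeCoveredZones", max(sizes(coveredZones)) - min(sizes(coveredZones)));
        metrics.put("maxImpactVirtualZones", max(sizes(impactedZones)));
        metrics.put("rangeImpactVirtualZones", max(sizes(impactedZones)) - min(sizes(impactedZones)));
        return metrics;
    }

    private static List<Integer> sizes(Map<String, Set<String>> m) {
        List<Integer> result = new ArrayList<>();
        for (Set<String> zones : m.values()) {
            result.add(zones.size());
        }
        return result;
    }

    private static int max(Collection<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) best = Math.max(best, v);
        return best;
    }

    private static int min(Collection<Integer> values) {
        int best = Integer.MAX_VALUE;
        for (int v : values) best = Math.min(best, v);
        return best;
    }
}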

Deterministic

We want the algorithm to be deterministic: the same input always yields the same assignment.

Stability

We also measure the impact on the virtual group assignment when we add or remove hosts from an HFD: we compare two assignments with the same number of virtual groups and count how many hosts are assigned a different virtual zone id.

Implementations

FIFO (optimal)

An algorithm that iterates over the sorted instance list and assigns one virtual group at a time, only starting to fill virtual group_N+1 once virtual group_N is filled to its capacity. Given that instances.size = instancesPerGroup * numGroups + residuals, we spread the residuals over the first few groups; as a result, each virtual group will have either instancesPerGroup or instancesPerGroup + 1 instances.
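
A minimal sketch of this FIFO scheme, assuming the instance list is already sorted (e.g. by HFD and then by name); the class and method names are illustrative, not the actual implementation:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FifoVirtualGroupAssignment {
    // Fills virtual groups one at a time; the first (n % numGroups) groups get one extra
    // instance, so every group ends up with instancesPerGroup or instancesPerGroup + 1.
    static Map<String, List<String>> assign(List<String> sortedInstances,
                                            String groupName, int numGroups) {
        Map<String, List<String>> assignment = new HashMap<>();
        int n = sortedInstances.size();
        int instancesPerGroup = n / numGroups;
        int residuals = n % numGroups;
        int index = 0;
        for (int g = 0; g < numGroups; g++) {
            int size = instancesPerGroup + (g < residuals ? 1 : 0);
            List<String> members = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                members.add(sortedInstances.get(index++));
            }
            assignment.put(groupName + "_" + g, members);
        }
        return assignment;
    }
}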

Round robin

We sort instances by name under each HFD and concatenate them into one array, then assign virtual groups based on the instance index modulo the number of groups. This round-robin assignment divides each HFD evenly across all VFDs, which gives poor fault tolerance.
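
A corresponding sketch of the round-robin scheme, under the same assumptions as the FIFO sketch above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinVirtualGroupAssignment {
    // Assigns each instance to group (index % numGroups) over the concatenated,
    // per-HFD-sorted instance list, spreading every HFD across all virtual groups.
    static Map<String, List<String>> assign(List<String> sortedInstances,
                                            String groupName, int numGroups) {
        Map<String, List<String>> assignment = new HashMap<>();
        for (int i = 0; i < sortedInstances.size(); i++) {
            String group = groupName + "_" + (i % numGroups);
            assignment.computeIfAbsent(group, k -> new ArrayList<>()).add(sortedInstances.get(i));
        }
        return assignment;
    }
}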

Combined

This combines the above two schemes; the difference is in the assignment of residuals. The scheme first follows FIFO to assign instancesPerGroup instances to each virtual group, then uses the round-robin strategy for the remaining residuals.

Analysis

We conducted benchmarks, and the results show that the FIFO scheme features the best overall fault tolerance and stability.

As a side note, we don’t want to over-complicate the algorithm, for a few reasons. First, the impact of many factors is hard to measure, e.g. uneven HFD allocation, or the tradeoff between balancing virtual groups and minimizing the physical zones each group covers; there can hardly be a single “best” algorithm. Second, the ROI of more advanced algorithms or frameworks is unclear: different rebalancers react differently to topology change, so optimizing virtual group assignment alone won’t necessarily reduce partition shuffling.

Potential Impact

Fault tolerance

Helix will assign one replica per virtual group, and application rolling upgrade will respect the virtual topology. As a result, only one replica goes down at a time.

However, the “hard” fault tolerance is still bound by HFDs: Helix DOES NOT provide more fault tolerance with the virtual topology group. In the case of Azure VM freeze or maintenance, one HFD may be partially down and more than one replica on that HFD may be affected.

If there are 5 HFDs, 10 virtual zones, and a resource with 2 replicas, it is guaranteed that the 2 replicas live in different virtual zones, but not necessarily on different HFDs. It is recommended to use the same value for the number of virtual zones and the replication factor for all resources on the cluster.

Rebalancer

Now we discuss the potential impact of using the above API in an Azure environment. We have done some experiments on the Helix side to help identify whether there is any risk.

  • First time the API is called: When the API is used in a cluster for the first time, a reshuffle will happen in the cluster, which is not a big concern as it is a one-time effort.
  • Add more nodes to existing virtual zones: The new nodes will have their Azure fault zone auto-populated, and we then need to call the above API for virtual zone assignment. Note that this call will also cause a reshuffle in the cluster. The degree of movement cannot be guaranteed; however, Helix can guarantee that the minimum replica count is satisfied at all times.
  • Add more virtual zones: If more virtual zones are added, we also need to call the above API to reassign nodes to different virtual zones. This call will also cause a reshuffle in the cluster. The degree of movement cannot be guaranteed; however, Helix can guarantee that the minimum replica count is satisfied at all times.
  • Cluster shrink: When nodes leave the cluster for various reasons, the above API also needs to be called to reassign nodes, which will cause a reshuffle across zones. Still, Helix can guarantee that the minimum replica count is satisfied at all times.