
Feature Request: Support Multiple Strategies for ClusterSet Failover #449

Open
kahirokunn opened this issue Jan 23, 2025 · 13 comments

@kahirokunn
Contributor

Background

There is a request to support multiple strategies regarding the timing and method of removing (uninstalling) ClusterProfiles from unhealthy clusters during ClusterSet failover operations.
Specifically, three strategies are being considered:

  1. CreateBeforeDestroy
    Install the ClusterProfile on the new cluster first, then uninstall it from the unhealthy cluster once the installation is complete.

  2. DestroyBeforeCreate
    Uninstall the ClusterProfile from the unhealthy cluster first, then install it on the new cluster once the uninstallation is complete.

  3. CreateOnly
    After installing the ClusterProfile on the new cluster, leave it as-is on the unhealthy cluster without uninstalling.

Currently, these strategies are not explicitly defined in the documentation or specification, and there appears to be no way to control this behavior.

Request Details

We would like to support multiple failover strategies, similar to the update strategies of StatefulSet and Deployment, allowing users to select one as needed.

  • The main request is to support these three control methods:

    1. CreateBeforeDestroy
    2. DestroyBeforeCreate
    3. CreateOnly
  • Ideally, being able to apply a different strategy ("CreateBeforeDestroy", "DestroyBeforeCreate", "CreateOnly") per ClusterProfile for unhealthy clusters (for example, via a labelSelector) would allow flexible responses to various use cases.

Use Cases

  • Want to perform cluster-level Blue/Green deployments.
  • Cases where workloads need to be deliberately left on unhealthy clusters (e.g., for logging or backup purposes) while migrating to new clusters.
  • Some operations consider "complete failover" only after uninstallation is complete, while others prefer to gradually remove old clusters after getting new clusters operational.
  • In development environments, we would like to set up HealthChecks like the following so that certain resources are stopped outside working hours, reducing operational costs. (This can also be implemented through other mechanisms.)
---
apiVersion: lib.projectsveltos.io/v1beta1
kind: HealthCheck
metadata:
  name: weekday-working-hours-healthcheck
spec:
  resourceSelectors:
  - group: ""
    version: v1
    kind: ConfigMap
  evaluateHealth: |
    function evaluate()
      statuses = {}

      -- Default status and message
      local status = "Degraded"
      local message = "Outside of healthy working hours"

      -- Get the current day and hour
      local currentTime = os.date("*t")
      local currentHour = currentTime.hour
      local currentDay = currentTime.wday -- Lua's weekdays start from 1 (Sunday)

      -- Set status to Healthy for Monday (2) to Friday (6) between 6 AM and 9 PM
      if currentDay >= 2 and currentDay <= 6 and currentHour >= 6 and currentHour < 21 then
        status = "Healthy"
        message = "Within healthy working hours"
      end

      -- Assign the status to each resource
      for _, resource in ipairs(resources) do
        table.insert(statuses, {resource = resource, status = status, message = message})
      end

      -- Construct the overall health status
      local hs = {}
      if #statuses > 0 then
        hs.resources = statuses
      end

      return hs
    end

Expected Benefits

  • Operations can be flexibly customized per ClusterProfile according to user and organizational policies.
  • Leads to minimized downtime and reduced operational costs.
  • Easier to meet various operational requirements such as Business Continuity Planning (BCP) and audit requirements.
  • Documentation makes it easier to compare different strategies and share best practices.

Proposal

  • Add parameters to ClusterProfile/ClusterSet that allow a failover strategy to be selected (e.g., a "failoverStrategy" field with options like "CreateBeforeDestroy", "DestroyBeforeCreate", and "CreateOnly"); see the sketch after this list.
  • Enable control at the CR (spec) level, allowing strategies to be switched per ClusterProfile.
  • Clearly document the benefits, considerations, and expected use cases for each strategy in the official documentation to help users select appropriate strategies.
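
As a rough illustration, a ClusterSet spec with such a parameter might look like the sketch below. The failoverStrategy and strategyOverrides field names are hypothetical and used only to convey the idea; they are not existing Sveltos API fields.

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: prod
spec:
  clusterSelector:
    matchLabels:
      env: prod
  maxReplicas: 1
  # Hypothetical field: default strategy applied when the active cluster is replaced.
  failoverStrategy: CreateBeforeDestroy
  # Hypothetical field: per-ClusterProfile overrides selected via a labelSelector.
  strategyOverrides:
    - profileSelector:
        matchLabels:
          sveltos-role: logging
      strategy: CreateOnly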

Conclusion

This is a request to enable selection from multiple strategies for ClusterProfile uninstallation procedures and timing during cluster failover.
In particular, we hope for a design that can accommodate various use cases, such as strategies that allow migration to new environments without uninstallation.
We appreciate your consideration of this request.

@gianlucam76
Member

Thank you @kahirokunn

Good idea. I see one problem though. If Sveltos cannot access the unhealthy cluster anymore, it cannot remove add-ons from there, which, with this approach, could block Sveltos from moving forward.

How do you suggest we cover this scenario in this proposal?

Thank you!

@kahirokunn
Contributor Author

For reference, may I first ask about the current failover behavior? (I don't think I can find it described anywhere.) Thx 🙏

@gianlucam76
Member

@kahirokunn
Contributor Author

kahirokunn commented Jan 28, 2025

@gianlucam76 Thank you!
I've actually already read it, but I don't think it mentions whether a ClusterProfile that is already installed on the now-unhealthy cluster is left as it is or uninstalled. 👀

@gianlucam76
Member

@kahirokunn when you create a ClusterSet, it selects matching clusters. Usually you create a ClusterSet to match only two clusters, and in the ClusterSet you specify how many clusters must be active. Usually that is just one cluster.

So the ClusterSet selects, for instance, Cluster A and Cluster B, and out of those two clusters it decides that Cluster A is active and Cluster B is the backup.

As long as Cluster A is healthy (its apiserver is reachable), all add-ons and applications are deployed there.
When Cluster A becomes unhealthy (its apiserver is no longer reachable), the ClusterSet promotes Cluster B to active.

At that point all add-ons and applications are deployed to Cluster B.

Nothing happens to Cluster A, since its apiserver is not reachable, so Sveltos does not even try to remove add-ons and applications from it.
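
For illustration, a minimal ClusterSet of this kind might look like the following sketch (the API version, clusterSelector, and maxReplicas fields are the same ones used in the PoC later in this thread):

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: failover-pair
spec:
  # Matches Cluster A and Cluster B.
  clusterSelector:
    matchLabels:
      set: failover-pair
  # Only one of the matching clusters is active at a time; the other is the backup.
  maxReplicas: 1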

@kahirokunn
Contributor Author

@gianlucam76
Thank you for your reply!
It's very helpful!

I have one more question. Could you please explain the Unhealthy status mentioned in the following document in more detail?
Does Unhealthy apply only when the apiserver cannot be reached, or does it also take the status of the HealthCheck custom resource into account?

Thx again 🙏

@gianlucam76
Member

gianlucam76 commented Jan 28, 2025

Thanks @kahirokunn

The SveltosCluster controller periodically (every minute) queries the apiserver of every registered cluster. When the query fails a configurable number of consecutive times, Sveltos marks the cluster as unhealthy because its apiserver is down.

Currently no other resources in the managed cluster are looked at, but extending this to also check resources in the cluster (assuming, of course, that the apiserver is reachable) would be very easy to achieve.

@kahirokunn
Contributor Author

Thank you!
After discussing with my team, I think there are some areas where the content I originally suggested can be improved, so I would like to add the following proposal, even though it may be a bit more to review. 🙏

Feature Request: Implementation of Flexible Failover Control Using Prometheus Metrics

Overview

Currently, there is a risk that all ClusterProfiles may automatically fail over even when there are only temporary, intermittent communication issues between the management cluster (mgmt cluster) and the managed cluster's apiserver. In particular, a network failure or apiserver downtime does not necessarily mean a loss of application-level availability. To cover such cases, we propose incorporating a metrics-based mechanism such as Prometheus, rather than relying solely on apiserver accessibility as the failover trigger, enabling more flexible and appropriate failover control.


Background and Issues

  1. Risk of automatic failover occurring just because the mgmt cluster temporarily loses connection to the managed cluster's apiserver.

    • Applications running on the managed cluster may continue functioning normally even during temporary network failures or authentication infrastructure issues.
    • In such situations, the current failover mechanism's behavior of "immediately installing all ClusterProfiles to a new cluster" may actually increase risk.
  2. HealthCheck resources alone may not be sufficient to ensure safety.

    • An apiserver or authentication-infrastructure failure may make it difficult to retrieve HealthCheck resources.
    • This raises concerns about incorrect failovers occurring.

Additional Proposal: Prometheus Metrics-Based Failover Decision Making

Based on these issues, we propose adding a metrics-based mechanism using Prometheus as follows:

  1. Metrics-Based Failover Triggers

    • Comprehensively evaluate cluster health using various metrics available from Prometheus, such as network latency, Pod count, and error rates.
    • Make failover decisions based on trends in application-level required metrics rather than simply checking apiserver accessibility.
  2. Custom Resources Modeled on Other OSS

    Projects such as Flagger, Keptn, and KEDA use metrics and custom resources to implement canary releases and scaling control, and their implementations and architectures can serve as references for failover control.

  3. Native Support in Sveltos

    • Implementing Prometheus queries in Go is relatively straightforward.

    • By incorporating metrics-based decision logic as Sveltos custom resources, referencing the designs of the OSS above, more appropriate failover decisions from an application perspective become possible.

      • Personally, I think Keptn's KeptnMetric custom resource could serve as a good reference: multiple metrics are defined with labels, and if all metrics matching certain labels are unhealthy, failover is triggered for the corresponding ClusterProfile or ClusterSet. For example (the provider it references is sketched after the example):
       apiVersion: metrics.keptn.sh/v1
       kind: KeptnMetric
       metadata:
         name: good-metric
       spec:
         provider:
           name: my-provider
         query: "sum(kube_pod_container_resource_limits{resource='cpu'})"
         fetchIntervalSeconds: 10
         range:
           interval: "3m"
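
       For completeness, the my-provider referenced above would be a KeptnMetricsProvider pointing at Prometheus. A sketch, assuming the Keptn metrics-operator's KeptnMetricsProvider API (the targetServer URL is a hypothetical in-cluster Prometheus endpoint):

       apiVersion: metrics.keptn.sh/v1
       kind: KeptnMetricsProvider
       metadata:
         name: my-provider
       spec:
         type: prometheus
         # Hypothetical in-cluster Prometheus endpoint; replace with the real one.
         targetServer: "http://prometheus-server.monitoring.svc.cluster.local:9090"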

Implementation and Operational Benefits

  • Reduction in incorrect failovers, minimizing service interruptions
  • More accurate assessment of application health through metrics-based failover
  • Flexible and extensible operations through standardized custom resources following other OSS examples

Summary

Implementing failover based solely on apiserver accessibility could lead to unnecessary failovers during temporary network failures or authentication-infrastructure issues, which poses a significant risk. We therefore propose enabling failover decisions based on the application's operational status using Prometheus metrics. We believe more flexible and safer failover control can be achieved by incorporating a native implementation into Sveltos while referencing the metrics mechanisms of existing OSS such as Flagger, Keptn, and KEDA.

@gianlucam76
Member

gianlucam76 commented Jan 28, 2025

Thank you @kahirokunn

Let me see if I got it. Metrics are pulled from the managed cluster to the management cluster, or collected and exposed via a reachable IP (which is outside the scope of Sveltos and this request).

Sveltos is also instructed to fetch and check those metrics (the scope of this request), and it decides whether a failover is needed using those metrics as well.

Is that correct understanding of this proposal?

@kahirokunn
Contributor Author

Yes, that's right.

@kahirokunn
Contributor Author

Thank you for sharing this! Through our discussion, I was able to better organize my thoughts. Upon further review, I've discovered some interesting points about metrics implementation that I'd like to share.

It appears that representing metrics as custom resources is already possible, and there's an existing operator for this purpose in the Keptn Lifecycle Toolkit:

https://github.com/keptn/lifecycle-toolkit/tree/main/metrics-operator

Building on this, I believe we could enhance the failover functionality in ClusterSet/Set by leveraging the status of KeptnMetric custom resources as triggers. This approach could be particularly powerful because:

  1. It would effectively utilize CNCF capabilities
  2. It could significantly reduce implementation costs

I've created a proof of concept showing how ClusterSet could perform failover based on metrics. Here's the example:

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: prod
spec:
  clusterSelector:
    matchLabels:
      env: prod
  maxReplicas: 1
  templateResourceRefs:
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-http-success-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: httpSuccessRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubelet-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeletErrorRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubeApiserver-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeApiserverErrorRate
  validateHealths:
    - script: |
        function evaluate()
          local hs = {
              healthy = true,
              message = "cluster is healthy"
          }

          local httpSuccessRate       = tonumber(ctx.httpSuccessRate.status.value) or 0
          local kubeletErrorRate      = tonumber(ctx.kubeletErrorRate.status.value) or 0
          local kubeApiserverErrorRate = tonumber(ctx.kubeApiserverErrorRate.status.value) or 0

          -- Unhealthy when the HTTP success rate has collapsed and the
          -- combined kubelet/apiserver error rate is high.
          if httpSuccessRate <= 0.1 and (kubeletErrorRate + kubeApiserverErrorRate) >= 1.0 then
              hs.healthy = false
              hs.message = "too many errors"
          end

          return hs
        end

Thank you 🙏

@gianlucam76
Member

Thank you. This is cool and very much in line with the rest of Sveltos.

I think it makes sense to deliver this. Let's file an enhancement request for the addon controller.

@kahirokunn
Contributor Author

Thank you!
I've submitted it.
projectsveltos/addon-controller#981
