
Feature Request: Support Multiple Strategies for ClusterSet Failover #449

Open
kahirokunn opened this issue Jan 23, 2025 · 13 comments

@kahirokunn
Contributor

Background

There is a request to support multiple strategies regarding the timing and method of removing (uninstalling) ClusterProfiles from unhealthy clusters during ClusterSet failover operations.
Specifically, three strategies are being considered:

  1. CreateBeforeDestroy
    Install the ClusterProfile on the new cluster first, then uninstall it from the unhealthy cluster once the installation is complete.

  2. DestroyBeforeCreate
    Uninstall the ClusterProfile from the unhealthy cluster first, then install it on the new cluster once the uninstallation is complete.

  3. CreateOnly
    After installing the ClusterProfile on the new cluster, leave it as-is on the unhealthy cluster without uninstalling.

Currently, these strategies are not explicitly defined in the documentation or specification, and there appears to be no way to control this behavior.

Request Details

We would like to support multiple failover strategies, similar to the update strategies of StatefulSet and Deployment, allowing users to select one as needed.

  • The main request is to support these three control methods:

    1. CreateBeforeDestroy
    2. DestroyBeforeCreate
    3. CreateOnly
  • Ideally, being able to apply a different strategy ("CreateBeforeDestroy", "DestroyBeforeCreate", "CreateOnly") per ClusterProfile for unhealthy clusters (for example, via a labelSelector) would allow flexible responses to various use cases.

Use Cases

  • Want to perform cluster-level Blue/Green deployments.
  • Cases where workloads need to be deliberately left on unhealthy clusters (e.g., for logging or backup purposes) while migrating to new clusters.
  • Some operations consider "complete failover" only after uninstallation is complete, while others prefer to gradually remove old clusters after getting new clusters operational.
  • In development environments, we would like to set up HealthChecks like the following so that certain resources are stopped outside working hours, reducing operational costs. (This can also be implemented through other mechanisms.)
---
apiVersion: lib.projectsveltos.io/v1beta1
kind: HealthCheck
metadata:
  name: weekday-working-hours-healthcheck
spec:
  resourceSelectors:
  - group: ""
    version: v1
    kind: ConfigMap
  evaluateHealth: |
    function evaluate()
      statuses = {}

      -- Default status and message
      local status = "Degraded"
      local message = "Outside of healthy working hours"

      -- Get the current day and hour
      local currentTime = os.date("*t")
      local currentHour = currentTime.hour
      local currentDay = currentTime.wday -- Lua's weekdays start from 1 (Sunday)

      -- Set status to Healthy for Monday (2) to Friday (6) between 6 AM and 9 PM
      if currentDay >= 2 and currentDay <= 6 and currentHour >= 6 and currentHour < 21 then
        status = "Healthy"
        message = "Within healthy working hours"
      end

      -- Assign the status to each resource
      for _, resource in ipairs(resources) do
        table.insert(statuses, {resource = resource, status = status, message = message})
      end

      -- Construct the overall health status
      local hs = {}
      if #statuses > 0 then
        hs.resources = statuses
      end

      return hs
    end

Expected Benefits

  • Operations can be flexibly customized per ClusterProfile according to user and organizational policies.
  • Leads to minimized downtime and reduced operational costs.
  • Easier to meet various operational requirements such as Business Continuity Planning (BCP) and audit requirements.
  • Documentation makes it easier to compare different strategies and share best practices.

Proposal

  • Add parameters to ClusterProfile/ClusterSet that allow a failover strategy to be selected (e.g., a "failoverStrategy" field with options like "CreateBeforeDestroy", "DestroyBeforeCreate", and "CreateOnly"); see the sketch after this list.
  • Enable control at the CR (spec) level, allowing strategies to be switched per ClusterProfile.
  • Clearly document the benefits, considerations, and expected use cases for each strategy in the official documentation to help users select appropriate strategies.
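
As a rough illustration, a ClusterSet spec with such a parameter might look like the sketch below. The failoverStrategy and strategyOverrides field names are hypothetical and used only to convey the idea; they are not existing Sveltos API fields.

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: prod
spec:
  clusterSelector:
    matchLabels:
      env: prod
  maxReplicas: 1
  # Hypothetical field: default strategy applied when the active cluster is replaced.
  failoverStrategy: CreateBeforeDestroy
  # Hypothetical field: per-ClusterProfile overrides selected via a labelSelector.
  strategyOverrides:
    - profileSelector:
        matchLabels:
          sveltos-role: logging
      strategy: CreateOnly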

Conclusion

This is a request to enable selection from multiple strategies for ClusterProfile uninstallation procedures and timing during cluster failover.
In particular, we hope for a design that can accommodate various use cases, such as strategies that allow migration to new environments without uninstallation.
We appreciate your consideration of this request.

@gianlucam76
Member

Thank you @kahirokunn

Good idea. I see one problem though. If Sveltos cannot access the unhealthy cluster anymore, it cannot remove add-ons from there, which, with this approach, could block Sveltos from moving forward.

How do you suggest we cover this scenario in this proposal?

Thank you!

@kahirokunn
Contributor Author

For reference, may I first ask about the current failover behavior? (I don't think I can find it described anywhere.) Thx 🙏

@gianlucam76
Member

@kahirokunn
Contributor Author

kahirokunn commented Jan 28, 2025

@gianlucam76 Thank you!
I've actually already read it, but I don't think it mentions whether a ClusterProfile that is already installed on the now-unhealthy cluster is left as it is or uninstalled. 👀

@gianlucam76
Member

@kahirokunn when you create a ClusterSet, it selects matching clusters. Usually you create a ClusterSet to match only two clusters, and in the ClusterSet you specify how many clusters must be active. Usually that is just one cluster.

So the ClusterSet selects, for instance, Cluster A and Cluster B, and out of those two clusters it decides that Cluster A is active and Cluster B is the backup.

As long as Cluster A is healthy (its apiserver is reachable), all add-ons and applications are deployed there.
When Cluster A becomes unhealthy (its apiserver is no longer reachable), the ClusterSet promotes Cluster B to active.

At that point all add-ons and applications are deployed to Cluster B.

Nothing happens to Cluster A, since its apiserver is not reachable, so Sveltos does not even try to remove add-ons and applications from it.
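
For illustration, a minimal ClusterSet of this kind might look like the following sketch (the API version, clusterSelector, and maxReplicas fields are the same ones used in the PoC later in this thread):

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: failover-pair
spec:
  # Matches Cluster A and Cluster B.
  clusterSelector:
    matchLabels:
      set: failover-pair
  # Only one of the matching clusters is active at a time; the other is the backup.
  maxReplicas: 1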

@kahirokunn
Contributor Author

@gianlucam76
Thank you for your reply!
It's very helpful!

I have one more question. Could you please explain the Unhealthy status mentioned in the following document in more detail?
Does Unhealthy apply only when the apiserver cannot be reached, or does it also take the status of the HealthCheck custom resource into account?

Thx again 🙏

@gianlucam76
Member

gianlucam76 commented Jan 28, 2025

Thanks @kahirokunn

The SveltosCluster controller periodically (every minute) queries the apiserver of every registered cluster. When the query fails a configurable number of consecutive times, Sveltos marks the cluster as unhealthy because its apiserver is down.

Currently no other resources in the managed cluster are looked at, but extending this to also check resources in the cluster (assuming, of course, that the apiserver is reachable) would be very easy to achieve.

@kahirokunn
Contributor Author

Thank you!
After discussing with my team, I think there are some areas where the content I originally suggested can be improved, so I would like to add the following proposal, even though it may be a bit more to review. 🙏

Feature Request: Implementation of Flexible Failover Control Using Prometheus Metrics

Overview

Currently, there is a risk that all ClusterProfiles may automatically fail over even when there are only temporary, intermittent communication issues between the management cluster (mgmt cluster) and the managed cluster's apiserver. In particular, a network failure or apiserver downtime does not necessarily mean a loss of application-level availability. To cover such cases, we propose incorporating a metrics-based mechanism such as Prometheus, rather than relying solely on apiserver accessibility as the failover trigger, enabling more flexible and appropriate failover control.


Background and Issues

  1. Risk of automatic failover occurring just because the mgmt cluster temporarily loses connection to the managed cluster's apiserver.

    • Applications running on the managed cluster may continue functioning normally even during temporary network failures or authentication infrastructure issues.
    • In such situations, the current failover mechanism's behavior of "immediately installing all ClusterProfiles to a new cluster" may actually increase risk.
  2. HealthCheck resources alone may not be sufficient to ensure safety.

    • An apiserver or authentication-infrastructure failure may make it difficult to retrieve HealthCheck resources.
    • This raises concerns about incorrect failovers occurring.

Additional Proposal: Prometheus Metrics-Based Failover Decision Making

Based on these issues, we propose adding a metrics-based mechanism using Prometheus as follows:

  1. Metrics-Based Failover Triggers

    • Comprehensively evaluate cluster health using various metrics available from Prometheus, such as network latency, Pod count, and error rates.
    • Make failover decisions based on trends in application-level required metrics rather than simply checking apiserver accessibility.
  2. Custom Resources Modeled on Other OSS

    Projects such as Flagger, Keptn, and KEDA use metrics and custom resources to implement canary releases and scaling control, and their implementations and architectures can serve as references for failover control.

  3. Native Support in Sveltos

    • Implementing Prometheus queries in Go is relatively straightforward.

    • By incorporating metrics-based decision logic as Sveltos custom resources, referencing the designs of the OSS above, more appropriate failover decisions from an application perspective become possible.

      • Personally, I think Keptn's KeptnMetric custom resource could serve as a good reference: multiple metrics are defined with labels, and if all metrics matching certain labels are unhealthy, failover is triggered for the corresponding ClusterProfile or ClusterSet. For example (the provider it references is sketched after the example):
       apiVersion: metrics.keptn.sh/v1
       kind: KeptnMetric
       metadata:
         name: good-metric
       spec:
         provider:
           name: my-provider
         query: "sum(kube_pod_container_resource_limits{resource='cpu'})"
         fetchIntervalSeconds: 10
         range:
           interval: "3m"
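
       For completeness, the my-provider referenced above would be a KeptnMetricsProvider pointing at Prometheus. A sketch, assuming the Keptn metrics-operator's KeptnMetricsProvider API (the targetServer URL is a hypothetical in-cluster Prometheus endpoint):

       apiVersion: metrics.keptn.sh/v1
       kind: KeptnMetricsProvider
       metadata:
         name: my-provider
       spec:
         type: prometheus
         # Hypothetical in-cluster Prometheus endpoint; replace with the real one.
         targetServer: "http://prometheus-server.monitoring.svc.cluster.local:9090"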

Implementation and Operational Benefits

  • Reduction in incorrect failovers, minimizing service interruptions
  • More accurate assessment of application health through metrics-based failover
  • Flexible and extensible operations through standardized custom resources following other OSS examples

Summary

Implementing failover based solely on apiserver accessibility could lead to unnecessary failovers during temporary network failures or authentication-infrastructure issues, which poses a significant risk. We therefore propose enabling failover decisions based on the application's operational status using Prometheus metrics. We believe more flexible and safer failover control can be achieved by incorporating a native implementation into Sveltos while referencing the metrics mechanisms of existing OSS such as Flagger, Keptn, and KEDA.

@gianlucam76
Member

gianlucam76 commented Jan 28, 2025

Thank you @kahirokunn

Let me see if I got it. Metrics are pulled from the managed cluster to the management cluster, or collected and exposed via a reachable IP (which is outside the scope of Sveltos and this request).

Sveltos is also instructed to fetch and check those metrics (the scope of this request), and it decides whether a failover is needed using those metrics as well.

Is that correct understanding of this proposal?

@kahirokunn
Contributor Author

Yes, that's right.

@kahirokunn
Contributor Author

Thank you for sharing this! Through our discussion, I was able to better organize my thoughts. Upon further review, I've discovered some interesting points about metrics implementation that I'd like to share.

It appears that representing metrics as custom resources is already possible, and there's an existing operator for this purpose in the Keptn Lifecycle Toolkit:

https://github.com/keptn/lifecycle-toolkit/tree/main/metrics-operator

Building on this, I believe we could enhance the failover functionality in ClusterSet/Set by leveraging the status of KeptnMetric custom resources as triggers. This approach could be particularly powerful because:

  1. It would effectively utilize CNCF capabilities
  2. It could significantly reduce implementation costs

I've created a proof of concept showing how ClusterSet could perform failover based on metrics. Here's the example:

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: prod
spec:
  clusterSelector:
    matchLabels:
      env: prod
  maxReplicas: 1
  templateResourceRefs:
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-http-success-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: httpSuccessRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubelet-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeletErrorRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubeApiserver-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeApiserverErrorRate
  validateHealths:
    - script: |
        function evaluate()
          local hs = {
              healthy = true,
              message = "cluster is healthy"
          }

          local httpSuccessRate       = tonumber(ctx.httpSuccessRate.status.value) or 0
          local kubeletErrorRate      = tonumber(ctx.kubeletErrorRate.status.value) or 0
          local kubeApiserverErrorRate = tonumber(ctx.kubeApiserverErrorRate.status.value) or 0

          -- Unhealthy when the HTTP success rate has collapsed and the
          -- combined kubelet/apiserver error rate is high.
          if httpSuccessRate <= 0.1 and (kubeletErrorRate + kubeApiserverErrorRate) >= 1.0 then
              hs.healthy = false
              hs.message = "too many errors"
          end

          return hs
        end

Thank you 🙏

@gianlucam76
Member

Thank you. This is cool and very much in line with the rest of Sveltos.

I think it makes sense to deliver this. Let's file an enhancement request for the addon controller.

@kahirokunn
Contributor Author

Thank you!
I've submitted it.
projectsveltos/addon-controller#981
