Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design:add outlier detection design doc #5460

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
275 changes: 275 additions & 0 deletions design/outlier-detection-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,275 @@
# Design for Supporting Outlier detection in Contour

Status: Draft

## Abstract
This document recommends adding passive health checks to services to enhance routing reliability.

## Background

Envoy supports two types of health checks, namely active health check and [passive health check][1]. Passive and active health checks can be enabled simultaneously or independently, and form the basis of the overall upstream health check solution. Contour has implemented active health checks. If Passive health checks are configured to better monitor service status and route traffic to healthy instances.

## Goals
- Add outlier detection related configuration detection for services.
- Support global configuration of OutlierDetection, service level can be overridden or disabled.
- Support configuration based on route and TCPRoute.
- Support configuration for consecutive-5 xx and consecutive-local-origin-failure.

## Non-Goals
- Not supported Outlier Detection configuration for consecutive-gateway-failure, success-rate, and failure-percentage.

## Detailed Design

### Global Configuration
The `OutlierDetection` section of the Contour config file will look like:

```yaml
outlierDetection:
# ConsecutiveServerErrors defines The number of consecutive server-side error responses before a consecutive 5xx ejection occurs.
# When the backend host encounters consecutive
# errors greater than or equal to ConsecutiveServerErrors, it will be
# ejected from the load balancing pool.
# for HTTP services, a 5xx counts as an error and for TCP services
# connection failures and connection timeouts count as an error.
# It can be disabled by setting the value to 0.
# Defaults to 5.
consecutiveServerErrors: 5
# Interval is the interval at which host status is evaluated.
# Defaults to 10s.
interval: 10s
# BaseEjectionTime is the base time that a host is ejected for.
# A host will remain ejected for a period of time equal to the
# product of the ejection base duration and the number of times the host has been ejected.
# Defaults to 30s.
baseEjectionTime: 30s
# MaxEjectionTime is the maximum time a host will be ejected for.
# After this amount of time, a host will be returned to normal operation.
# If not specified, the default value (300s) or BaseEjectionTime value is applied, whatever is larger.
# Defaults to 300s.
maxEjectionTime: 300s
# SplitExternalLocalOriginErrors defines whether to split the local origin errors from the external origin errors.
# Defaults to false.
splitExternalLocalOriginErrors: false
# ConsecutiveLocalOriginFailure defines the number of consecutive local origin failures before a consecutive local origin ejection occurs.
# Parameters take effect only when SplitExternalLocalOriginErrors is true.
# Defaults to 5.
consecutiveLocalOriginFailure: 5
# MaxEjectionPercent is the max percentage of hosts in the load balancing pool for the upstream service that can be ejected.
# But will eject at least one host regardless of the value here.
# Defaults to 10%.
maxEjectionPercent: 10
# MaxEjectionTimeJitter is The maximum amount of jitter to add to the ejection time,
# in order to prevent a ‘thundering herd’ effect where all proxies try to reconnect to host at the same time.
# Defaults to 0s.
maxEjectionTimeJitter: 0s
```


### Httpproxy Configuration
Add a new field in the CRD of httpproxy for configuring Outlier Detection of the service. The field is defined as follows:
```go
// OutlierDetection defines the configuration for outlier detection on a service.
// If not specified, global configuration will be used.
// If specified, it will override the global configuration.
type OutlierDetection struct {

// Disabled defines whether to disable outlier detection for the service.
// Defaults to false.
Disabled *bool `json:"disabled,omitempty"`
// ConsecutiveServerErrors defines The number of consecutive server-side error responses before a consecutive 5xx ejection occurs.
// When the backend host encounters consecutive
// errors greater than or equal to ConsecutiveServerErrors, it will be
// ejected from the load balancing pool.
// for HTTP services, a 5xx counts as an error and for TCP services
// connection failures and connection timeouts count as an error.
// It can be disabled by setting the value to 0.
// Defaults to 5.
// +optional
ConsecutiveServerErrors *uint32 `json:"consecutiveServerErrors,omitempty"`

// Interval is the interval at which host status is evaluated.
// Defaults to 10s.
// +optional
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?h)|(\d*(\.\d*)?m)|(\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$`
Interval *string `json:"interval,omitempty"`

// BaseEjectionTime is the base time that a host is ejected for.
// A host will remain ejected for a period of time equal to the
// product of the ejection base duration and the number of times the host has been ejected.
// Defaults to 30s.
// +optional
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?h)|(\d*(\.\d*)?m)|(\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$`
BaseEjectionTime *string `json:"baseEjectionTime,omitempty"`

// MaxEjectionTime is the maximum time a host will be ejected for.
// After this amount of time, a host will be returned to normal operation.
// If not specified, the default value (300s) or BaseEjectionTime value is applied, whatever is larger.
// Defaults to 300s.
// +optional
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?h)|(\d*(\.\d*)?m)|(\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$`
MaxEjectionTime *string `json:"maxEjectionTime,omitempty"`

// SplitExternalLocalOriginErrors defines whether to split the local origin errors from the external origin errors.
// Defaults to false.
// +optional
// +kubebuilder:default=false
SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nice advantage of this passive healthcheck config is we can get more granular failure details like this vs. our existing active http healthcheck configuration, I would err on the side of always enabling this to separate server errors from possible network level errors, but curious to see what others think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune consecutive_gateway_failure and consecutive_local_origin_failure; otherwise, this toggle by itself feels too rigid.

We should also make sure to set sane defaults and that the accompanying documentation is clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your review.The setting of consecutive_local_origin_failure is already in the design document. Consecutive_gateway_failure was not intended to be configured at the beginning. Of course, if this is a good design, I will update this design document to support consecutive_gateway_failure.
@clayton-gonsalves @sunjayBhatia

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my 2 cents here is to remove the SplitExternalLocalOriginErrors field and set it to always true in the generated Envoy config

in addition, we should choose one of the 5xx error codes or gateway errors to expose as a configurable parameter to users for upstream originated error control, my 2 cents here again is that the consecutive_5xx (existing ConsecutiveServerErrors field in this config) is sufficient, I don't see too much utility in limiting the classified errors to 502, 503, and 504


// ConsecutiveLocalOriginFailure defines the number of consecutive local origin failures before a consecutive local origin ejection occurs.
// Parameters take effect only when SplitExternalLocalOriginErrors is true.
// Defaults to 5.
ConsecutiveLocalOriginFailure *uint32 `json:"consecutiveLocalOriginFailure,omitempty"`

// MaxEjectionPercent is the max percentage of hosts in the load balancing pool for the upstream service that can be ejected.
// But will eject at least one host regardless of the value here.
// Defaults to 10%.
// +optional
// +kubebuilder:validation:Maximum=100
MaxEjectionPercent *uint32 `json:"maxEjectionPercent,omitempty"`

// MaxEjectionTimeJitter is The maximum amount of jitter to add to the ejection time,
// in order to prevent a ‘thundering herd’ effect where all proxies try to reconnect to host at the same time.
// Defaults to 0s.
// +optional
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$`
MaxEjectionTimeJitter *string `json:"maxEjectionTimeJitter,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just set a default for this and not expose this field to start with, this config is pretty big so limiting what users need to think about would be great

}
yangyy93 marked this conversation as resolved.
Show resolved Hide resolved
```

## Example

If the global configuration is not configured,Below are some configuration examples for outlier detection.

### Example 1
```yaml
apiVersion: projectcontour.io/v1alpha1
kind: HTTPProxy
metadata:
name: simple-5xx
namespace: projectcontour
spec:
virtualhost:
fqdn: outlierDetection.projectcontour.io
routes:
- conditions:
- prefix: /
services:
- name: s1
port: 80
outlierDetection:
consecutiveServerErrors: 5
maxEjectionPercent: 100
```
In this example, when s1 service experiences 5 consecutive 5xx errors (including locally originated and externally originated (transaction) errors), s1 service will be ejected, with a maximum ejection rate of 100%. Panic is disabled, which is also the default value.

Below is the generated envoy configuration:
```yaml
cluster:
common_lb_config:
healthy_panic_threshold:
value: 0.0
outlier_detection:
enforcing_success_rate: 0
consecutive_5xx: 5
enforcing_consecutive_5xx: 100
max_ejection_percent: 100
enforcing_consecutive_gateway_failure: 0
```

### Example 2
```yaml
apiVersion: projectcontour.io/v1alpha1
kind: HTTPProxy
metadata:
name: only-local-external-origin-errors
namespace: projectcontour
spec:
virtualhost:
fqdn: outlierDetection.projectcontour.io
routes:
- conditions:
- prefix: /
services:
- name: s1
port: 80
outlierDetection:
consecutiveServerErrors: 0
maxEjectionPercent: 100
splitExternalLocalOriginErrors: true
consecutiveLocalOriginFailure: 5
```
In this example, when the s1 service has 5 consecutive local origin errors, the s1 service will be ejected, and the 5 xx errors returned by the upstream service will not be counted in the error statistics. The maximum eviction ratio is 100%. Panic is disabled, which is also the default value.

Below is the generated envoy configuration:
```yaml
cluster:
common_lb_config:
healthy_panic_threshold:
value: 0.0
outlier_detection:
enforcing_success_rate: 0
enforcing_consecutive_5xx: 0
max_ejection_percent: 100
split_external_local_origin_errors: true
consecutive_local_origin_failure: 5
#Default to 100% if not specified
#enforcing_consecutive_local_origin_failure: 100
enforcing_local_origin_success_rate: 0
enforcing_consecutive_gateway_failure: 0
```

### Example 3
```yaml
apiVersion: projectcontour.io/v1alpha1
kind: HTTPProxy
metadata:
name: local-external-origin-errors-and-consecutive-5xx
namespace: projectcontour
spec:
virtualhost:
fqdn: outlierDetection.projectcontour.io
routes:
- conditions:
- prefix: /
services:
- name: s1
port: 80
outlierDetection:
consecutiveServerErrors: 10
maxEjectionPercent: 100
splitExternalLocalOriginErrors: true
consecutiveLocalOriginFailure: 5
```
In this example, when the s1 service continuously encounters 5 local origin errors or 10 external errors, the s1 service will be ejected. The maximum eviction rate is 100%. Panic is disabled, which is also the default value.

Below is the generated envoy configuration:
```yaml
cluster:
common_lb_config:
healthy_panic_threshold:
value: 0.0
outlier_detection:
enforcing_success_rate: 0
consecutive_5xx: 10
enforcing_consecutive_5xx: 100
max_ejection_percent: 100
split_external_local_origin_errors: true
consecutive_local_origin_failure: 5
#Default to 100% if not specified
#enforcing_consecutive_local_origin_failure: 100
enforcing_local_origin_success_rate: 0
enforcing_consecutive_gateway_failure: 0
```

Notes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably add some discussion/documentation of the successful_active_health_check_uneject_host field, users should be advised that this defaults to true so if active health checking is also enabled this could cause some hosts flip-floopping, esp. if the active health check is configured to a different endpoint than the exposed route

- If consecutiveServerErrors is specified as 0, and splitExternalLocalOriginErrors is true, then local errors will be ignored.This is especially useful when the upstream service explicitly returns a 5xx for some requests and you want to ignore those responses from upstream service while determining the outlier detection status of a host.
- When accessing the upstream host through an opaque TCP connection, connection timeouts, connection errors, and request failures are all considered 5xx errors, and therefore these events are included in the 5xx error statistics.
- Please refer to the [official documentation of Envoy][1] for more instructions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also include some discussion/documentation on the different behavior between http router and tcp filter when this is in the mix


## Open Issues
- Whether the existing supported configuration options meet the needs of the user.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious for people who have used this in the past, particularly at scale, whether the "consecutive errors" mechanism for configuring outlier detection is actually desirable vs. the failure percentage mechanism

one could foresee a scenario in which a host in a cluster should be detected as an outlier however it does not consistently error consecutively but does still have a high failure percentage (say the consecutive errors was configured at 5, and you consistently get 3 out of 5 requests that fail), you may not detect that such a host should be ejected as quickly and also may not consistently eject it for as long as it needs to be ejected: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm

- Should a global switch be provided to record [ejection event logs][2]?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of adding some discussion and exploration on this, I think with our active healthchecking configuration, it isn't ideal as-is but (IMO) people find it generally easier to explain and understand why their upstream is unhealthy (it returned an unhealthy status code from their healthcheck endpoint)

with passive healthchecking I can foresee some support requests etc. where users are confused as to why their services are deemed unhealthy, so having logs etc. to explain things would be great

we could also take advantage of the active healthchecking event log configuration as well

I think a great outcome of implementing this feature could be a debugging guide, including some mention of what stats to monitor (e.g. from https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#config-cluster-manager-cluster-stats-outlier-detection)

we would have to likely add a bootstrap flag to change the bootstrap ClusterManager configuration to enable logging

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, adding a bootstrap flag to handle whether to enable logging can better help people confirm the cause of the failure. If it is considered feasible, I will create a new PR to implement this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also add an active health check flag to record health check failure events


[1]: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier
[2]: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#envoy-v3-api-field-config-bootstrap-v3-clustermanager-outlier-detection