
design: add outlier detection design doc #5460

Open
wants to merge 3 commits into main

Conversation

yangyy93
Member

@yangyy93 yangyy93 commented Jun 9, 2023

Adds passive health checks to services to enhance routing reliability.
Closes #5317

@yangyy93 yangyy93 requested a review from a team as a code owner June 9, 2023 06:23
@yangyy93 yangyy93 requested review from stevesloka and sunjayBhatia and removed request for a team June 9, 2023 06:23
@yangyy93 yangyy93 added kind/design Categorizes issue or PR as related to design. release-note/none-required Marks a PR as not requiring a release note. Should only be used for very small changes. labels Jun 9, 2023
@codecov

codecov bot commented Jun 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.53%. Comparing base (d53f2a3) to head (00f09f7).
Report is 416 commits behind head on main.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #5460   +/-   ##
=======================================
  Coverage   78.53%   78.53%           
=======================================
  Files         138      138           
  Lines       19327    19327           
=======================================
  Hits        15179    15179           
  Misses       3856     3856           
  Partials      292      292           

@github-actions

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2023
@yangyy93
Member Author

@skriss @sunjayBhatia
Do you have any suggestions for this design PR?

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2023
Member

@sunjayBhatia sunjayBhatia left a comment


Thanks for the design! Sorry to take so long in reviewing, but I left some comments from a first pass.

// +optional
// +kubebuilder:validation:Maximum=100
// +kubebuilder:default=0
MinHealthPercent *uint32 `json:"minHealthPercent,omitempty"`
Member


I don't know if we need to offer configurability for the panic threshold in this case, since we have the "max ejection %" config above and we are not utilizing the panic threshold's ability to "fail" traffic when in panic mode. Is there a specific reason you have for adding this?
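For context, a minimal sketch of the Envoy setting this field would map to; the value is illustrative only, and removing the field keeps whatever Contour already configures here:

```yaml
# Sketch only: the cluster-level Envoy knob MinHealthPercent corresponds to.
# The value is illustrative; dropping the field keeps Contour's existing
# behavior for the panic threshold.
common_lb_config:
  healthy_panic_threshold:
    value: 50.0
```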

Member Author


This is a reference to some of Istio's design concepts. After careful consideration, I think we can remove this field and keep Contour's default value. Thanks for your comment.


## Open Issues
- Whether the existing supported configuration options meet the needs of the user.
- Should a global switch be provided to record [ejection event logs][2]?
Member


I'm in favor of adding some discussion and exploration on this. I think our active health checking configuration isn't ideal as-is, but (IMO) people generally find it easier to explain and understand why their upstream is unhealthy (it returned an unhealthy status code from their health check endpoint).

With passive health checking, I can foresee support requests where users are confused about why their services are deemed unhealthy, so having logs etc. to explain things would be great.

We could also take advantage of the active health checking event log configuration as well.

I think a great outcome of implementing this feature could be a debugging guide, including some mention of which stats to monitor (e.g. from https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#config-cluster-manager-cluster-stats-outlier-detection).

We would likely have to add a bootstrap flag to change the bootstrap ClusterManager configuration to enable logging.
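For reference, a minimal sketch of the bootstrap fragment such a flag might generate, assuming it only needs to set the ClusterManager outlier detection event log path (the path is illustrative):

```yaml
# Sketch only: Envoy bootstrap fragment a hypothetical Contour flag could
# emit to enable outlier detection event logging. The path is illustrative.
cluster_manager:
  outlier_detection:
    # Each ejection/unejection event is written here as a JSON line,
    # which is the kind of output a debugging guide could point users at.
    event_log_path: /dev/stdout
```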

Member Author


Yes, adding a bootstrap flag to control whether logging is enabled would better help people confirm the cause of a failure. If it is considered feasible, I will create a new PR to implement this function.

Member Author


We can also add an active health check flag to record health check failure events.

// Defaults to false.
// +optional
// +kubebuilder:default=false
SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"`
Member


A nice advantage of this passive health check config is that we can get more granular failure details like this vs. our existing active HTTP health check configuration. I would err on the side of always enabling this to separate server errors from possible network-level errors, but I'm curious to see what others think.

Contributor


+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune consecutive_gateway_failure and consecutive_local_origin_failure; otherwise, this toggle by itself feels too rigid.

We should also make sure to set sane defaults and that the accompanying documentation is clear.

Member Author


Thanks for your review. The setting for consecutive_local_origin_failure is already in the design document. consecutive_gateway_failure was not intended to be configurable at the beginning; of course, if supporting it is a good design, I will update this design document to support consecutive_gateway_failure.
@clayton-gonsalves @sunjayBhatia

Member


My 2 cents here is to remove the SplitExternalLocalOriginErrors field and set it to always true in the generated Envoy config.

In addition, we should choose one of the 5xx error codes or gateway errors to expose as a configurable parameter for controlling upstream-originated errors. My 2 cents here again is that consecutive_5xx (the existing ConsecutiveServerErrors field in this config) is sufficient; I don't see much utility in limiting the classified errors to 502, 503, and 504.
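To illustrate, a sketch of the cluster-level outlier_detection block Envoy could end up with under that suggestion; the numeric values are placeholders rather than proposed defaults:

```yaml
# Sketch only: generated Envoy outlier_detection if splitting external vs.
# local-origin errors is hard-coded on and consecutive_5xx (the existing
# ConsecutiveServerErrors field) is the user-facing knob for upstream errors.
# Numeric values are placeholders, not proposed defaults.
outlier_detection:
  split_external_local_origin_errors: true
  consecutive_5xx: 5                        # driven by ConsecutiveServerErrors
  consecutive_local_origin_failure: 5       # driven by the local-origin failure setting
  enforcing_consecutive_gateway_failure: 0  # gateway-failure detector left disabled
  max_ejection_percent: 10
```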


@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2023
@yangyy93
Member Author

I will soon submit a PR with the detailed code implementation.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2023
@davinci26
Contributor

cc @clayton-gonsalves, this would interest us as well. Do you want to do a review (or reassign it internally)?

Contributor

@clayton-gonsalves clayton-gonsalves left a comment


Thanks for this design; I am looking forward to it.

Here are a few comments from my side and a question:

I like the granularity of having outlier detection per service, as it allows service owners to fine-tune it to their requirements.

Did you also explore the possibility of defining a config at the vhost level or the global level?
This would also align with the patterns we have in place for rate limiting and extauth, for example.

I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune them as per their needs if required.


@yangyy93
Member Author

Did you also explore the possibility of defining a config at the vhost level or the global level? This would also align with the patterns we have in place for rate limiting and extauth for example.

I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

The global OutlierDetection configuration supports configuring the log path for abnormal events. I have not found vhost-level or global-level configuration in the relevant Envoy documentation.

@clayton-gonsalves
Contributor

Did you also explore the possibility of defining a config at the vhost level or the global level? This would also align with the patterns we have in place for rate limiting and extauth for example.
I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

The global OutlierDetection configuration supports configuring the log path of abnormal events. I have not found the related configuration of vhost level and global level in the relevant documents of envoy.

Apologies for the bad link. Yeah, I don't think Envoy supports it. We would need to implement it ourselves, similar to how global auth is implemented.

@yangyy93
Member Author

yangyy93 commented Aug 3, 2023

Did you also explore the possibility of defining a config at the vhost level or the global level? This would also align with the patterns we have in place for rate limiting and extauth for example.
I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

The global OutlierDetection configuration supports configuring the log path of abnormal events. I have not found the related configuration of vhost level and global level in the relevant documents of envoy.

Apologies for the bad link. Yeah, I don't think envoy supports it. We would need to implement it ourselves like how global auth is implemented.

Yes, it is possible to implement it the same way as global authentication. This is a good suggestion.
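If this route is pursued, here is a purely hypothetical sketch of a cluster-wide default in the Contour configuration file; none of the field names below are part of this design, they only illustrate the global-plus-override idea:

```yaml
# Hypothetical only: field names are illustrative and not part of this
# design. The idea mirrors global auth: Contour would apply this default
# to every generated cluster, and a per-service outlier detection policy
# would override it.
globalOutlierDetection:
  consecutiveServerErrors: 5
  interval: 10s
  baseEjectionTime: 30s
  maxEjectionPercent: 10
```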


@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2023
outlier-detection-design.md
Signed-off-by: yy <[email protected]>

update outlier-detection-design.md
Signed-off-by: yy <[email protected]>

update design
Signed-off-by: yy <[email protected]>

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024

@github-actions github-actions bot closed this Feb 25, 2024
@sunjayBhatia sunjayBhatia reopened this May 29, 2024
@sunjayBhatia sunjayBhatia removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2024
@sunjayBhatia sunjayBhatia requested review from a team, rajatvig and davinci26 and removed request for a team May 29, 2024 18:43
@sunjayBhatia
Member

Reopening this to restart some conversation here.


// Defaults to 0s.
// +optional
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$`
MaxEjectionTimeJitter *string `json:"maxEjectionTimeJitter,omitempty"`
Member


I think we can just set a default for this and not expose the field to start with. This config is pretty big, so limiting what users need to think about would be great.



## Open Issues
- Whether the existing supported configuration options meet the needs of the user.
Member


I'm curious, for people who have used this in the past, particularly at scale, whether the "consecutive errors" mechanism for configuring outlier detection is actually desirable vs. the failure percentage mechanism.

One could foresee a scenario in which a host in a cluster should be detected as an outlier, but it does not error consecutively while still having a high failure percentage (say consecutive errors is configured at 5, and you consistently get 3 out of 5 requests that fail). You may not detect that such a host should be ejected as quickly, and you also may not consistently eject it for as long as it needs to be ejected: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm
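For comparison, a sketch of Envoy's failure-percentage detector, which would catch the 3-out-of-5 scenario above; values are illustrative rather than proposed defaults:

```yaml
# Sketch only: failure-percentage-based outlier detection in Envoy, which
# ejects hosts with a high error rate even when the errors are not strictly
# consecutive. Values are illustrative.
outlier_detection:
  failure_percentage_threshold: 50        # eject when >= 50% of requests fail
  enforcing_failure_percentage: 100       # always enforce this detector
  failure_percentage_minimum_hosts: 3     # only evaluate once the cluster has 3+ hosts
  failure_percentage_request_volume: 50   # require 50+ requests in the interval
  enforcing_consecutive_5xx: 0            # consecutive-errors detector disabled in this sketch
```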

enforcing_consecutive_gateway_failure: 0
```

Notes:
Member


We should probably add some discussion/documentation of the successful_active_health_check_uneject_host field. Users should be advised that this defaults to true, so if active health checking is also enabled it could cause some hosts to flip-flop, especially if the active health check is configured against a different endpoint than the exposed route.
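To make the interaction concrete, a sketch of the knob in question; true is Envoy's default, and whether Contour exposes or pins it is part of this discussion:

```yaml
# Sketch only: with Envoy's default of true, a passing active health check
# un-ejects a host that passive outlier detection just ejected, which can
# cause flip-flopping when the health check endpoint differs from the real
# route. Setting it to false lets an ejection stand until its time expires.
outlier_detection:
  consecutive_5xx: 5
  successful_active_health_check_uneject_host: true
```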

- If consecutiveServerErrors is specified as 0 and splitExternalLocalOriginErrors is true, then local errors will be ignored. This is especially useful when the upstream service explicitly returns a 5xx for some requests and you want to ignore those responses from the upstream service while determining the outlier detection status of a host.
- When accessing the upstream host through an opaque TCP connection, connection timeouts, connection errors, and request failures are all considered 5xx errors, and therefore these events are included in the 5xx error statistics.
- Please refer to the [official documentation of Envoy][1] for more instructions.

Member


We should also include some discussion/documentation on the different behavior between the HTTP router and the TCP proxy filter when this is in the mix.


@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 14, 2024

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2024
@izturn izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 2, 2024

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2024
@izturn izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2024

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 3, 2024

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024
@izturn izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2024

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024
@izturn izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2024
Labels
kind/design Categorizes issue or PR as related to design. release-note/none-required Marks a PR as not requiring a release note. Should only be used for very small changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Outlier Detection
5 participants