design:add outlier detection design doc #5460

yangyy93 · 2023-06-09T06:23:05Z

adding passive health checks to services to enhance routing reliability.
close #5317

codecov · 2023-06-09T06:27:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.53%. Comparing base (d53f2a3) to head (00f09f7).
Report is 522 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5460   +/-   ##
=======================================
  Coverage   78.53%   78.53%           
=======================================
  Files         138      138           
  Lines       19327    19327           
=======================================
  Hits        15179    15179           
  Misses       3856     3856           
  Partials      292      292

github-actions · 2023-06-24T00:24:51Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

yangyy93 · 2023-06-25T02:32:37Z

@skriss @sunjayBhatia
Do you have any suggestions for this design pr?

sunjayBhatia

thanks for the design! sorry to take so long in reviewing but left some comments from a first pass

design/outlier-detection-design.md

sunjayBhatia · 2023-06-26T22:01:33Z

design/outlier-detection-design.md

+    // +optional
+    // +kubebuilder:validation:Maximum=100
+    // +kubebuilder:default=0
+    MinHealthPercent *uint32 `json:"minHealthPercent,omitempty"`


I don't know if we need to offer configurability for panic threshold in this case since we have the above "max ejection %" config and we are not utilizing the panic threshold ability to "fail" traffic when in panic mode, is there a specific reason you have for adding this?

Here is a reference to some of the design concepts of istio. After careful consideration, I think we can remove this field and keep the default value of contour.Thanks for your comment

sunjayBhatia · 2023-06-26T22:20:52Z

design/outlier-detection-design.md

+
+## Open Issues
+- Whether the existing supported configuration options meet the needs of the user.
+- Should a global switch be provided to record [ejection event logs][2]?


I'm in favor of adding some discussion and exploration on this, I think with our active healthchecking configuration, it isn't ideal as-is but (IMO) people find it generally easier to explain and understand why their upstream is unhealthy (it returned an unhealthy status code from their healthcheck endpoint)

with passive healthchecking I can foresee some support requests etc. where users are confused as to why their services are deemed unhealthy, so having logs etc. to explain things would be great

we could also take advantage of the active healthchecking event log configuration as well

I think a great outcome of implementing this feature could be a debugging guide, including some mention of what stats to monitor (e.g. from https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#config-cluster-manager-cluster-stats-outlier-detection)

we would have to likely add a bootstrap flag to change the bootstrap ClusterManager configuration to enable logging

Yes, adding a bootstrap flag to handle whether to enable logging can better help people confirm the cause of the failure. If it is considered feasible, I will create a new PR to implement this function.

can also add an active health check flag to record health check failure events

sunjayBhatia · 2023-06-26T22:30:11Z

design/outlier-detection-design.md

+    // Defaults to false.
+    // +optional
+    // +kubebuilder:default=false
+    SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"`


a nice advantage of this passive healthcheck config is we can get more granular failure details like this vs. our existing active http healthcheck configuration, I would err on the side of always enabling this to separate server errors from possible network level errors, but curious to see what others think

+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune consecutive_gateway_failure and consecutive_local_origin_failure; otherwise, this toggle by itself feels too rigid.

We should also make sure to set sane defaults and that the accompanying documentation is clear.

thanks for your review.The setting of consecutive_local_origin_failure is already in the design document. Consecutive_gateway_failure was not intended to be configured at the beginning. Of course, if this is a good design, I will update this design document to support consecutive_gateway_failure.
@clayton-gonsalves @sunjayBhatia

my 2 cents here is to remove the SplitExternalLocalOriginErrors field and set it to always true in the generated Envoy config

in addition, we should choose one of the 5xx error codes or gateway errors to expose as a configurable parameter to users for upstream originated error control, my 2 cents here again is that the consecutive_5xx (existing ConsecutiveServerErrors field in this config) is sufficient, I don't see too much utility in limiting the classified errors to 502, 503, and 504

github-actions · 2023-07-19T00:33:41Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

yangyy93 · 2023-07-19T08:05:08Z

Soon I will submit a PR with detailed code implementation.

davinci26 · 2023-07-25T13:19:03Z

cc @clayton-gonsalves that would also interest us as well, do you want to do a review (or reassign it internally)?

clayton-gonsalves

Thanks for this Design; I am looking forward to this.

Here are a few comments from my side and a question;

I like the granularity of having the outlier detection per service as it allows the service owners to fine-tune the as per their requirements.

Did you also explore the possibility of defining a config at the vhost level or the global level?
This would also align with the patterns we have in place for rate limiting and extauth for example.

I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

design/outlier-detection-design.md

clayton-gonsalves · 2023-07-26T07:22:24Z

design/outlier-detection-design.md

+    // Defaults to false.
+    // +optional
+    // +kubebuilder:default=false
+    SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"`


+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune consecutive_gateway_failure and consecutive_local_origin_failure; otherwise, this toggle by itself feels too rigid.

We should also make sure to set sane defaults and that the accompanying documentation is clear.

yangyy93 · 2023-07-26T10:08:59Z

Did you also explore the possibility of defining a config at the vhost level or the global level? This would also align with the patterns we have in place for rate limiting and extauth for example.

I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

The global OutlierDetection configuration supports configuring the log path of abnormal events. I have not found the related configuration of vhost level and global level in the relevant documents of envoy.

clayton-gonsalves · 2023-07-26T10:13:34Z

Did you also explore the possibility of defining a config at the vhost level or the global level? This would also align with the patterns we have in place for rate limiting and extauth for example.
I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

The global OutlierDetection configuration supports configuring the log path of abnormal events. I have not found the related configuration of vhost level and global level in the relevant documents of envoy.

Apologies for the bad link. Yeah, I don't think envoy supports it. We would need to implement it ourselves like how global auth is implemented.

yangyy93 · 2023-08-03T06:34:51Z

Did you also explore the possibility of defining a config at the vhost level or the global level? This would also align with the patterns we have in place for rate limiting and extauth for example.
I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.

The global OutlierDetection configuration supports configuring the log path of abnormal events. I have not found the related configuration of vhost level and global level in the relevant documents of envoy.

Apologies for the bad link. Yeah, I don't think envoy supports it. We would need to implement it ourselves like how global auth is implemented.

Yes, it is possible to implement it the same way using global authentication. this is a good suggestion.

github-actions · 2023-08-18T00:17:25Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

Signed-off-by: yy <[email protected]> outlier-detection-design.md Signed-off-by: yy <[email protected]> update outlier-detection-design.md Signed-off-by: yy <[email protected]> update design Signed-off-by: yy <[email protected]>

Signed-off-by: yy <[email protected]>

github-actions · 2024-07-19T00:43:48Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-08-03T00:21:53Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-08-18T00:24:49Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-09-11T00:23:43Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-10-01T00:28:37Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-10-21T00:26:56Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-11-07T00:25:46Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2024-11-22T00:27:33Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2025-01-03T00:25:47Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions · 2025-01-19T00:27:44Z

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 14d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
Mark this PR as fresh by commenting or pushing a commit
Close this PR
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

yangyy93 requested a review from a team as a code owner June 9, 2023 06:23

yangyy93 requested review from stevesloka and sunjayBhatia and removed request for a team June 9, 2023 06:23

yangyy93 added kind/design Categorizes issue or PR as related to design. release-note/none-required Marks a PR as not requiring a release note. Should only be used for very small changes. labels Jun 9, 2023

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2023

github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2023

sunjayBhatia requested review from tsaarni and skriss June 26, 2023 20:58

sunjayBhatia reviewed Jun 26, 2023

View reviewed changes

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2023

github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2023

yangyy93 mentioned this pull request Jul 24, 2023

HTTPProxy: add cluster outlierDetection #5575

Open

skriss requested a review from clayton-gonsalves July 24, 2023 15:57

clayton-gonsalves suggested changes Jul 26, 2023

View reviewed changes

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2023

yangyy93 added 2 commits August 22, 2023 15:19

add outlier-detection-design.md

d88de14

Signed-off-by: yy <[email protected]> outlier-detection-design.md Signed-off-by: yy <[email protected]> update outlier-detection-design.md Signed-off-by: yy <[email protected]> update design Signed-off-by: yy <[email protected]>

update outlier-detection-design.md

c579c38

Signed-off-by: yy <[email protected]>

yangyy93 force-pushed the outlier-detection-design branch from 98cc6dc to c579c38 Compare August 22, 2023 07:41

izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 2, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2024

izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2024

github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 3, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024

izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024

izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2024

github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 1, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2024

izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2024

github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 7, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2024

izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2024

github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 3, 2025

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

design:add outlier detection design doc #5460

design:add outlier detection design doc #5460

yangyy93 commented Jun 9, 2023

codecov bot commented Jun 9, 2023 •

edited

Loading

github-actions bot commented Jun 24, 2023

yangyy93 commented Jun 25, 2023

sunjayBhatia left a comment

sunjayBhatia Jun 26, 2023

yangyy93 Jun 27, 2023

sunjayBhatia Jun 26, 2023

yangyy93 Jun 27, 2023

yangyy93 Jun 27, 2023

sunjayBhatia Jun 26, 2023

clayton-gonsalves Jul 26, 2023

yangyy93 Jul 26, 2023

sunjayBhatia May 29, 2024

github-actions bot commented Jul 19, 2023

yangyy93 commented Jul 19, 2023

davinci26 commented Jul 25, 2023

clayton-gonsalves left a comment

clayton-gonsalves Jul 26, 2023

yangyy93 commented Jul 26, 2023

clayton-gonsalves commented Jul 26, 2023

yangyy93 commented Aug 3, 2023

github-actions bot commented Aug 18, 2023

github-actions bot commented Jul 19, 2024

github-actions bot commented Aug 3, 2024

github-actions bot commented Aug 18, 2024

github-actions bot commented Sep 11, 2024

github-actions bot commented Oct 1, 2024

github-actions bot commented Oct 21, 2024

github-actions bot commented Nov 7, 2024

github-actions bot commented Nov 22, 2024

github-actions bot commented Jan 3, 2025

github-actions bot commented Jan 19, 2025

design:add outlier detection design doc #5460

Are you sure you want to change the base?

design:add outlier detection design doc #5460

Conversation

yangyy93 commented Jun 9, 2023

codecov bot commented Jun 9, 2023 • edited Loading

Codecov Report

github-actions bot commented Jun 24, 2023

yangyy93 commented Jun 25, 2023

sunjayBhatia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jul 19, 2023

yangyy93 commented Jul 19, 2023

davinci26 commented Jul 25, 2023

clayton-gonsalves left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yangyy93 commented Jul 26, 2023

clayton-gonsalves commented Jul 26, 2023

yangyy93 commented Aug 3, 2023

github-actions bot commented Aug 18, 2023

github-actions bot commented Jul 19, 2024

github-actions bot commented Aug 3, 2024

github-actions bot commented Aug 18, 2024

github-actions bot commented Sep 11, 2024

github-actions bot commented Oct 1, 2024

github-actions bot commented Oct 21, 2024

github-actions bot commented Nov 7, 2024

github-actions bot commented Nov 22, 2024

github-actions bot commented Jan 3, 2025

github-actions bot commented Jan 19, 2025

codecov bot commented Jun 9, 2023 •

edited

Loading