Mistake in a BackendTrafficPolicy causes all routes to return 404 #5147

Closed
dghubble opened this issue Jan 25, 2025 · 5 comments · Fixed by #5176


dghubble commented Jan 25, 2025

Description:

A colleague and I found that a subtle mistake in a single BackendTrafficPolicy can make envoy proxy instances return 404s for ALL routes.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: hellogo
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: hello
  retry:
    numRetries: 1
    perRetry:
      backOff:
        baseInterval: 0s      # 0s breaks everything, 1s is ok

Repro steps:

Create a BackendTrafficPolicy as shown above. Nothing stops a developer from setting baseInterval: 0s.

At first, nothing is wrong. Then, if you restart envoy proxies, you'll find ALL httproutes return 404s immediately. Logs show route_not_found for all requests but no mention of why or which resource causes this. Inspecting the raw envoy config via the admin portal, the dynamic_route_configs section is never generated (usually it's populated).

To find the offending resource, we had to delete resources until discovering the problematic thing was this one BackendTrafficPolicy and this one value within it. Pretty scary to us. Questions:

  • What values should be allowed in baseInterval?
  • What validations can be done to stop misconfigurations like this?
  • Supposing there are other (perhaps future) resource misconfigs / validation issues, how can those be scoped to avoid breaking all routes?
  • How can a user identify the problematic resources, either in envoy-gateway or the envoy proxy? Here we had to guess and test


Environment:

envoy-gateway: v1.2.5

@arkodg arkodg added kind/bug Something isn't working area/xds-translator and removed triage labels Jan 25, 2025
@arkodg arkodg added this to the v1.3.0 milestone Jan 25, 2025
@arkodg arkodg added the help wanted Extra attention is needed label Jan 25, 2025
Contributor

arkodg commented Jan 25, 2025

looks like this one managed to escape all the checks, here's the error from the envoy proxy

[2025-01-25 03:01:44.281][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138] gRPC config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: Proto constraint validation failed (RouteConfigurationValidationError.VirtualHosts[0]: embedded message failed validation | caused by VirtualHostValidationError.Routes[0]: embedded message failed validation | caused by RouteValidationError.Route: embedded message failed validation | caused by RouteActionValidationError.RetryPolicy: embedded message failed validation | caused by RetryPolicyValidationError.RetryBackOff: embedded message failed validation | caused by RetryBackOffValidationError.BaseInterval: **value must be greater than 0s**): name: "default/eg/http"
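
The rejection message above nests one "caused by" segment per proto message level, so the actionable part is the last segment. As a rough sketch of how a user might pull that out of a log line (the helper name `innermostCause` is hypothetical, and the `" | caused by "` separator is taken only from the log line shown here, not from any documented format):

```go
package main

import (
	"fmt"
	"strings"
)

// innermostCause returns the last segment of a chained validation error.
// It assumes segments are joined by " | caused by ", as in the log above.
func innermostCause(msg string) string {
	parts := strings.Split(msg, " | caused by ")
	return strings.TrimSpace(parts[len(parts)-1])
}

func main() {
	// Shortened copy of the chain from the Envoy log line.
	msg := "RouteConfigurationValidationError.VirtualHosts[0]: embedded message failed validation" +
		" | caused by RetryPolicyValidationError.RetryBackOff: embedded message failed validation" +
		" | caused by RetryBackOffValidationError.BaseInterval: value must be greater than 0s"
	fmt.Println(innermostCause(msg))
}
```

This only narrows the failure down to the Envoy field (`RetryBackOff.BaseInterval`); it does not say which Kubernetes resource produced it.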

We have 3 levels of validation

  1. Apply Time
    • Validated by the Kube API Server against the CRD OpenAPI schema, which is generated from Kubebuilder and CEL tags. Config that fails validation is rejected.
  2. Runtime
    • Validated by the gateway-api runner in Envoy Gateway
    • More complex validations happen here, and if any validation fails:
      • A negative status is added to the Policy resource
      • A 500 Direct Response is attached to the targeted Route, implementing a fail-closed state, so requests targeting this route will fail and users should be able to debug these faster
  3. xDS Translation
    • We run Validate on the xDS resource, which executes the proto validations for the Envoy Proxy defined resources. If this fails, it should be logged in envoy-gateway. We plan on bubbling this up as a status too in the future.
    • Because we didn't add the "must be greater than 0s" validation anywhere else, the mistake should have been caught here. It wasn't, and the config was pushed to Envoy Proxy, which rejected the entire route config.

Contributor

arkodg commented Jan 25, 2025

@zhaohuabing any idea why the xDS validate didn't kick in ?

this issue can be fixed by adding a CEL validation for this case
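
For illustration, the check that such a validation would express is small. The sketch below is plain Go rather than CEL, and it is not the change that was eventually merged: `validateBaseInterval` is a hypothetical helper, and it assumes the value parses with Go's `time.ParseDuration`, which accepts strings such as `0s`, `1s` and `100ms`.

```go
package main

import (
	"fmt"
	"time"
)

// validateBaseInterval rejects back-off base intervals that Envoy would
// refuse, i.e. anything that is not strictly greater than 0s.
func validateBaseInterval(s string) error {
	d, err := time.ParseDuration(s)
	if err != nil {
		return fmt.Errorf("baseInterval %q is not a valid duration: %w", s, err)
	}
	if d <= 0 {
		return fmt.Errorf("baseInterval %q must be greater than 0s", s)
	}
	return nil
}

func main() {
	for _, v := range []string{"0s", "1s", "100ms"} {
		fmt.Printf("%s: %v\n", v, validateBaseInterval(v))
	}
}
```

Expressed as a CRD-level rule, the same condition would be rejected at apply time instead of surfacing only after a proxy restart.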

Member

zhaohuabing commented Jan 25, 2025

@arkodg the validation is done in the ResourceVersionTable.AddXdsResource function, but Routes are added to the RouteConfiguration after tCtx.AddXdsResource(resourcev3.RouteType, xdsRouteCfg) is called. So the validations for RouteConfiguration are skipped.

if xdsRouteCfg == nil {
	xdsRouteCfg = &routev3.RouteConfiguration{
		IgnorePortInHostMatching: true,
		Name:                     httpListener.Name,
	}
	// Validation runs inside AddXdsResource, while the RouteConfiguration
	// is still empty.
	if err = tCtx.AddXdsResource(resourcev3.RouteType, xdsRouteCfg); err != nil {
		errs = errors.Join(errs, err)
	}
}
// Generate xDS virtual hosts and routes for the given HTTPListener,
// and add them to the xDS route config.
if err = t.addRouteToRouteConfig(tCtx, xdsRouteCfg, httpListener, metrics, http3Enabled); err != nil {
	errs = errors.Join(errs, err)
}

This may happen in other xDS validation as well. I'm going to send a PR to fix it.

@zhaohuabing zhaohuabing self-assigned this Jan 25, 2025
@zhaohuabing zhaohuabing removed the help wanted Extra attention is needed label Jan 25, 2025
@zhaohuabing zhaohuabing removed their assignment Jan 25, 2025
@zhaohuabing zhaohuabing added the help wanted Extra attention is needed label Jan 25, 2025
@zhaohuabing
Member

Created #5148 to add missing validations. The CEL validation/Gateway API translator validation for baseInterval can be addressed in a separate PR.

@arkodg arkodg self-assigned this Jan 27, 2025
@arkodg arkodg removed the help wanted Extra attention is needed label Jan 27, 2025
arkodg added a commit to arkodg/gateway that referenced this issue Jan 30, 2025
arkodg added a commit to arkodg/gateway that referenced this issue Jan 30, 2025
@arkodg arkodg closed this as completed in 4844d9a Jan 30, 2025
guydc pushed a commit to guydc/gateway that referenced this issue Jan 31, 2025
* fail validation if baseInterval is 0s

Fixes: envoyproxy#5147

Signed-off-by: Arko Dasgupta <[email protected]>

* more validations

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
guydc pushed a commit to guydc/gateway that referenced this issue Jan 31, 2025
* fail validation if baseInterval is 0s

Fixes: envoyproxy#5147

Signed-off-by: Arko Dasgupta <[email protected]>

* more validations

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
(cherry picked from commit 4844d9a)
Signed-off-by: Guy Daich <[email protected]>
guydc added a commit that referenced this issue Jan 31, 2025
* fail validation if baseInterval is 0s (#5176)

* fail validation if baseInterval is 0s

Fixes: #5147

Signed-off-by: Arko Dasgupta <[email protected]>

* more validations

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
(cherry picked from commit 4844d9a)
Signed-off-by: Guy Daich <[email protected]>


dda104 commented Jan 31, 2025

Same problem: all routes return 404 when using filters

---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: something
spec:
  parentRefs:
    - name: eg
      namespace: something
      sectionName: something
  hostnames:
    - something.com
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: something
          port: 123
          weight: 1
      filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            add:
              - name: Host
                value: something.com
              - name: X-Forwarded-Proto
                value: https
              - name: X-Forwarded-Host
                value: something.com
      matches:
        - path:
            type: PathPrefix
            value: /

I understand that filters are not needed here and maybe they are written incorrectly. I'm just reporting that a problem in one HTTPRoute affects all HTTPRoutes in the cluster.
