Cluster DNS resolution fails after ambassador container restarts #5785

Closed
fs185143 opened this issue Sep 19, 2024 · 5 comments
Labels
t:bug Something isn't working

Comments


fs185143 commented Sep 19, 2024

Describe the bug

If the ambassador container restarts (e.g., due to an OOMKill), cluster DNS resolution fails and the ext_authz request returns a 503, which is surfaced to the client as a 403.

This can be resolved via:

  • k rollout restart -n emissary deployment/emissary-ingress; or
  • k exec -n emissary deployment/emissary-ingress -c ambassador -- sh -c 'echo "<service-cluster-ip> authserver.authserver" >> /etc/hosts'

Predictably, setting failure_mode_allow: true on the AuthService also "resolves" it, but only by letting requests bypass the auth server, which is not acceptable.

To Reproduce
Steps to reproduce the behavior:

  • Create an AuthService that points to your Service (see the sketch after this list)
  • Observe this listed in k exec -n emissary deployment/emissary-ingress -c ambassador -- cat /ambassador/clustermap.json, e.g., as
  "cluster_extauth_authserver_authserver_8080_emissary": {
    "kind": "KubernetesServiceResolver",
    "namespace": "authserver",
    "port": 8080,
    "service": "authserver"
  },
  • Force ambassador container to restart with k exec -n emissary deployment/emissary-ingress -c ambassador -- curl -X POST localhost:8001/quitquitquit
  • Try to access some endpoint that relies on the AuthService
  • Observe DNS resolution failure in ambassador debug logs:
    [2024-08-07 11:02:16.419][38][debug][http] [source/common/http/conn_manager_impl.cc:1149] [Tags: "ConnectionId":"100339","StreamId":"2631911976081457373"] request end stream
    [2024-08-07 11:02:16.419][38][debug][connection] [./source/common/network/connection_impl.h:98] [C100339] current connecting state: false
    [2024-08-07 11:02:16.420][38][debug][router] [source/common/router/router.cc:478] [Tags: "ConnectionId":"0","StreamId":"9760236291867409354"] cluster 'cluster_extauth_authserver_authserver_8080_emissary' match for URL '<path>'
    [2024-08-07 11:02:16.420][38][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1669] no healthy host for HTTP connection pool
    [2024-08-07 11:02:16.420][38][debug][http] [source/common/http/async_client_impl.cc:123] async http request response headers (end_stream=false):
    ':status', '503'
    'content-length', '19'
    'content-type', 'text/plain'
    
    [2024-08-07 11:02:16.420][38][debug][http] [source/common/http/filter_manager.cc:946] [Tags: "ConnectionId":"100339","StreamId":"2631911976081457373"] Preparing local reply with details ext_authz_error
    [2024-08-07 11:02:16.420][38][debug][http] [source/common/http/filter_manager.cc:988] [Tags: "ConnectionId":"100339","StreamId":"2631911976081457373"] Executing sending local reply.
    [2024-08-07 11:02:16.420][38][debug][http] [source/common/http/conn_manager_impl.cc:1820] [Tags: "ConnectionId":"100339","StreamId":"2631911976081457373"] encoding headers via codec (end_stream=false):
    ':status', '403'
    'content-length', '3984'
    'content-type', 'text/html'
    'date', 'Wed, 07 Aug 2024 11:02:15 GMT'
    'server', 'envoy'
    ...
    [2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
    [2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
    [2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
    [2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms
    
  • Also observe this returned as a 403 to the client via ambassador logs:
    ACCESS [2024-07-19T09:59:29.493Z] "GET <path> HTTP/1.1" 403 UAEX 0 3984 0 - "<ip>, <ip>,<ip>, <ip>, <ip>,<ip>" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0" "<uuid>" "<uuid>.<domain>" "-"
    
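For reference, an AuthService along these lines produces the clustermap entry shown in step 2 (a minimal sketch: the resource name and proto are assumptions, while the service, namespace, and port come from clustermap.json):

    # authservice.yaml -- apply with: kubectl apply -f authservice.yaml
    apiVersion: getambassador.io/v3alpha1
    kind: AuthService
    metadata:
      name: authserver            # hypothetical resource name
      namespace: emissary
    spec:
      auth_service: "authserver.authserver:8080"  # <service>.<namespace>:<port> per clustermap.json
      proto: http
      # failure_mode_allow: true  # would only mask the failure by bypassing the auth server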

Expected behavior
DNS resolution should continue to work after container restarts.

Versions (please complete the following information):

  • Ambassador: 3.9.1
  • Kubernetes environment: GKE
  • Version: v1.29.7-gke.1104000

Additional context
I also noticed that regardless of the state of ambassador, running k exec -n emissary deployment/emissary-ingress -c ambassador -- nslookup authserver.authserver.svc.cluster.local fails. It only works if I do k exec -n emissary deployment/emissary-ingress -c ambassador -- nslookup authserver.authserver.svc.cluster.local <dns-ip>, with <dns-ip> being the ClusterIP of the kube-system/kube-dns Service.


fs185143 commented Sep 19, 2024

Looking at another container on our cluster, /etc/resolv.conf contains

search <namespace>.svc.cluster.local svc.cluster.local cluster.local google.internal
nameserver 10.8.0.10
options ndots:5

Perhaps this is an issue with the container being based on busybox?

I got it to work by changing the /etc/resolv.conf in the ambassador container to

search authserver.svc.cluster.local emissary.svc.cluster.local svc.cluster.local cluster.local google.internal .
nameserver 10.8.0.10
options ndots:5
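
The same edit can be applied in one shot (a sketch; this assumes /etc/resolv.conf is writable by the ambassador user and that 10.8.0.10 is the kube-dns ClusterIP in your cluster):

    kubectl exec -n emissary deployment/emissary-ingress -c ambassador -- sh -c \
      'printf "%s\n" \
        "search authserver.svc.cluster.local emissary.svc.cluster.local svc.cluster.local cluster.local google.internal ." \
        "nameserver 10.8.0.10" \
        "options ndots:5" > /etc/resolv.conf'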

fs185143 commented:

It seems like busybox v1.28.4 has the correct /etc/resolv.conf, but v1.36.1 (what ambassador uses) does not.

fs185143 commented:

Works with the BusyBox v1.36.1 (2023-05-18 22:34:17 UTC) build, but not with the BusyBox v1.36.1 (2023-11-06 11:32:24 UTC) build (ambassador uses the latter).
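
To check which build a given image ships, running the busybox binary with no arguments prints the version banner on its first line (a quick sketch):

    kubectl exec -n emissary deployment/emissary-ingress -c ambassador -- sh -c 'busybox 2>&1 | head -n 1'
    # e.g. BusyBox v1.36.1 (2023-11-06 11:32:24 UTC) multi-call binary.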

fs185143 commented:

After further inspection, it seems like /etc/resolv.conf is correct for a few seconds after the container starts, but shortly afterwards it gets overwritten to:

# Generated by resolvconf
nameserver 10.64.0.10

The only processes running are

PID   USER     TIME  COMMAND
    1 ambassad  0:02 busyambassador entrypoint
   28 ambassad  0:00 {diagd} /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
   29 ambassad  0:00 {diagd} /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
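
A crude way to catch the rewrite as it happens is to poll the file from outside (a sketch; interrupt with Ctrl-C):

    kubectl exec -n emissary deployment/emissary-ingress -c ambassador -- sh -c \
      'while true; do date; cat /etc/resolv.conf; echo ---; sleep 1; done'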

fs185143 commented:

This was caused by another container on the same pod.
