Graceful shutdown drain not happening due to prometheus #4125

Open
luvk1412 opened this issue Aug 29, 2024 · 7 comments · May be fixed by #4288
Labels: kind/bug (Something isn't working), kind/decision (A record of a decision made by the community)

Comments

@luvk1412
Contributor

luvk1412 commented Aug 29, 2024

Description

While testing Envoy graceful shutdown in my staging env, I am facing an issue where most connections close within 5-10 seconds, but one single connection remains active for 5 minutes (shown by the log line envoy/shutdown_manager.go:224 total connections: 1 in the shutdown-manager logs, which keeps appearing for 5m). Because of this, during deployments/restarts of the Envoy proxy pods, new pods come up and become ready, but the old pods take 5m to terminate. Full Shutdown Manager Logs

While debugging this, it was pointed out by @arkodg in the following Slack thread that it could be due to Prometheus. After removing the ServiceMonitor, everything worked fine. So the one remaining connection comes from Prometheus scraping and is not being closed automatically. My guess is that the 5m is the idle timeout of the Go HTTP library used by Prometheus, but I am not sure about this - Source.

I need suggestions on how to fix this, as I don't see any configurable timeout in Prometheus for the connections used for scraping. Possible solutions I can think of:

  • Decrease shutdown.drainTimeout in the EnvoyProxy to reduce the wait, but this doesn't seem to be an ideal solution.
  • Add a timeout to the envoy-gateway-proxy-ready-0.0.0.0-19001 listener using ProxyBootstrap; I haven't tried this yet, but it could work (see the sketch after this list).
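
As a rough illustration of the second option, here is a minimal sketch of the Envoy-level change it would need: setting an idle timeout on the HTTP connection manager of the ready/metrics listener. The listener name is taken from the logs above, but the stat prefix, route name, and exact structure are illustrative rather than the actual generated bootstrap, and how the override is injected via ProxyBootstrap is not shown.

```yaml
# Sketch only, not the actual generated bootstrap. The idea is to set
# common_http_protocol_options.idle_timeout on the ready/metrics listener
# so idle scrape connections are closed well before drainTimeout expires.
- name: envoy-gateway-proxy-ready-0.0.0.0-19001
  address:
    socket_address: { address: 0.0.0.0, port_value: 19001 }
  filter_chains:
    - filters:
        - name: envoy.filters.network.http_connection_manager
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
            stat_prefix: eg-ready-http        # illustrative
            common_http_protocol_options:
              idle_timeout: 30s               # close idle Prometheus scrape connections after 30s
            route_config:
              name: local_route               # illustrative
            http_filters:
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```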

Setup info:

  • Prometheus global config:
global:
  scrape_interval: 30s
  scrape_timeout: 15s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 1m
  external_labels:
    prometheus: monitoring/kps-prometheus
    prometheus_replica: prometheus-kps-prometheus-0
  • Envoy's graceful shutdown settings are at default.
@arkodg arkodg added the kind/bug and kind/decision labels and removed the triage label on Aug 29, 2024
@arkodg
Contributor

arkodg commented Aug 29, 2024

thanks for raising this issue @luvk1412, I agree with your suggestions; reiterating them here in order of preference:

  1. Reduce the default drainTimeout to 60s.
  2. Enhance Envoy proxy to support draining a specific listener (envoy-gateway-proxy-ready-0.0.0.0-19001) using the admin endpoint https://www.envoyproxy.io/docs/envoy/latest/operations/admin.html#post--drain_listeners
    e.g. POST /drain_listeners?name=envoy-gateway-proxy-ready-0.0.0.0-19001, so we can perform immediate draining for the Prometheus listener and graceful draining for the other listeners. This however means that we will lose some metrics between the time shutdown starts and ends.
  3. Set the downstream HTTP idle timeout to a value less than 60s, e.g. 30s: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-httpprotocoloptions. This may solve the shutdown issue but will have a slight impact on the efficiency and performance of Prometheus metrics collection (it will now have to set up a connection every time it pulls metrics).

for anyone else hitting this issue, you can circumvent it by setting shutdown.drainTimeout in the EnvoyProxy to 60s (or a similar value):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config
  namespace: envoy-gateway-system
spec:
  shutdown:
    drainTimeout: 60s
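
For completeness, here is a sketch of how that EnvoyProxy resource gets picked up, assuming it is referenced from the GatewayClass; the GatewayClass name eg below is illustrative, the other names match the example above.

```yaml
# Sketch: attach the EnvoyProxy config to the GatewayClass via parametersRef.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg                      # illustrative GatewayClass name
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: proxy-config
    namespace: envoy-gateway-system
```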

@arkodg
Contributor

arkodg commented Aug 29, 2024

ptal @envoyproxy/gateway-maintainers, we may need to cherry-pick the solution back to previous releases

@arkodg
Contributor

arkodg commented Sep 13, 2024

hey @luvk1412 can you retry the test with v0.0.0-latest? We recently added #4230 into main, which sends out connection: close immediately and may help here

@zirain
Contributor

zirain commented Sep 13, 2024

will istio/istio#52971 help with this?

@luvk1412
Contributor Author

sure @arkodg, will try this. cc @ncsham.

@ncsham
Contributor

ncsham commented Sep 16, 2024

Hey @arkodg, I have tried the latest Helm chart with the gateway image "docker.io/envoyproxy/gateway-dev:latest", i.e. without pinning an image, by removing .global.images.envoyGateway.image from values.yaml.

After applying, when I fetch the Envoy config via
egctl c envoy-proxy -n namespace all pod_name > translated-conf.json

I don't see drainType = MODIFY_ONLY anymore, but a rolling restart or pod deletion/eviction of the Envoy proxies still takes around 5 minutes, with that one Prometheus connection still there.

Is there something I am missing?

@arkodg
Contributor

arkodg commented Sep 16, 2024

thanks for retesting @ncsham, the solution here is probably to reduce the default drain timeout to 60s

arkodg added a commit to arkodg/gateway that referenced this issue Sep 19, 2024
* Set the default minDrainDuration to `10s`. Since the default
  `readinessProbe.periodSeconds` is `5s`, this gives any LB controller
  `5s` to update its endpoint pool if it is basing it on the k8s API
  server.
* Set the default `drainTimeout` to `60s`. This ensures that persistent
  client connections are closed sooner.
  Fixes: envoyproxy#4125
* Update the default `terminationGracePeriodSeconds` to `360s`,
  which is `300s` more than the default drain timeout.

Signed-off-by: Arko Dasgupta <[email protected]>
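
As a sketch, the defaults described in this commit would correspond to an EnvoyProxy shutdown block like the one below. The values are taken from the commit message; the field names assume the v1alpha1 shutdown API, and users would not need to set them explicitly once the new defaults land.

```yaml
# Sketch of the new defaults from the commit message above.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config
  namespace: envoy-gateway-system
spec:
  shutdown:
    minDrainDuration: 10s   # gives the LB controller time beyond readinessProbe.periodSeconds (5s)
    drainTimeout: 60s       # persistent connections are closed after at most 60s
```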
@arkodg arkodg linked a pull request Sep 19, 2024 that will close this issue