Graceful shutdown drain not happening due to Prometheus #4125
Comments
thanks for raising this issue @luvk1412, agree with your suggestions, reiterating them here in order of preference
for anyone else hitting this issue, you can circumvent it by setting a lower `shutdown.drainTimeout` in the `EnvoyProxy` resource
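A minimal sketch of that workaround, assuming the `shutdown.drainTimeout` field of the `EnvoyProxy` resource (`gateway.envoyproxy.io/v1alpha1`) mentioned in the description below; the resource name and namespace are illustrative:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config   # hypothetical name
  namespace: envoy-gateway-system
spec:
  shutdown:
    # Force-close lingering persistent connections (such as the Prometheus
    # scrape connection) after 60s instead of waiting out the full default drain.
    drainTimeout: 60s
```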
ptal @envoyproxy/gateway-maintainers, we may need to cherry-pick the solution back to previous releases
will istio/istio#52971 help this?
Hey @arkodg, I have tried the latest helm chart with the gateway image `docker.io/envoyproxy/gateway-dev:latest` (i.e. without pinning it, by removing `.global.images.envoyGateway.image` in values.yaml). After the apply, when I fetch the envoy config I no longer see `drainType = MODIFY_ONLY`, but rolling restarts and pod deletion/eviction of the envoy proxies still take around 5 mins, with that one Prometheus connection still there. Is there something I am missing?
thanks for retesting @ncsham, the solution here is probably to reduce the default drain timeout to 60s
* Set default `minDrainDuration` to `10s`. Since the default `readinessProbe.periodSeconds` is `5s`, this gives any LB controller `5s` to update its endpoint pool if it's basing it off the k8s API server.
* Set default `drainTimeout` to `60s`. This ensures clients holding persistent connections can be closed sooner.
* Update the default `terminationGracePeriodSeconds` to `360s`, which is `300s` more than the default drain timeout.

Fixes: envoyproxy#4125

Signed-off-by: Arko Dasgupta <[email protected]>
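For reference, a sketch of what these new defaults would look like if set explicitly on an `EnvoyProxy` resource (field names assumed from the same `shutdown` API as in the workaround above; `terminationGracePeriodSeconds` is the standard Kubernetes pod-spec field, noted here only as a comment):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config   # hypothetical name, as above
spec:
  shutdown:
    minDrainDuration: 10s   # 2x the default 5s readinessProbe period, so LBs can update endpoints
    drainTimeout: 60s       # persistent connections are closed after at most 60s
  # terminationGracePeriodSeconds: 360 goes on the proxy pod spec itself,
  # 300s beyond the drain timeout, so the pod isn't SIGKILLed mid-drain.
```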
Description
While testing envoy graceful shutdown in my staging env, I am facing an issue where all connections close within mostly 5-10 seconds, but one single connection remains active for 5 minutes (shown in the log line `envoy/shutdown_manager.go:224 total connections: 1` in the `shutdown-manager` logs; this log line keeps appearing for 5m). Due to this, during deployments/restarts of envoy proxy pods, new pods come up and get ready but old pods take 5m to terminate. (Full Shutdown Manager Logs)

While debugging this, it was pointed out by @arkodg in the following slack thread that it could be due to Prometheus. On removing the `ServiceMonitor`, everything worked fine. So basically the one connection comes from Prometheus scraping and is not getting closed automatically. My guess is that the 5m is the idle timeout of the Go HTTP library used by Prometheus, but I am not sure about this (Source).

I need suggestions on how to fix this, as I don't see any sort of configurable timeout in Prometheus for the connections used for scraping. Possible solutions I can think of:
* Set `shutdown.drainTimeout` in `EnvoyProxy` to decrease the time, but this doesn't seem to be an ideal solution.
* Add some sort of idle timeout to the `envoy-gateway-proxy-ready-0.0.0.0-19001` listener using `ProxyBootstrap`; I haven't tried this yet, but it could work (see the sketch after this list).
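A rough, untested sketch of that second option, assuming the `bootstrap` field of the `EnvoyProxy` resource (`type: Merge` with a partial bootstrap) and Envoy's `common_http_protocol_options.idle_timeout` on the ready listener's HTTP connection manager; whether a partial merge can target an existing listener like this, or whether a full `Replace` bootstrap is needed, would have to be verified:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config   # hypothetical name
spec:
  bootstrap:
    type: Merge               # list-valued fields may not merge; Replace may be required
    value: |
      static_resources:
        listeners:
        - name: envoy-gateway-proxy-ready-0.0.0.0-19001
          filter_chains:
          - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                # Close idle keep-alive downstream connections (e.g. the Prometheus
                # scrape connection) well before the ~5m idle timeout guessed above.
                common_http_protocol_options:
                  idle_timeout: 60s
```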
Setup info: