Bug: Panic in Distributor #10724

Open
lasermoth opened this issue Feb 24, 2025 · 1 comment
Labels
bug Something isn't working component/distributor

Comments

@lasermoth
Contributor

What is the bug?

During a rollout of distributor pods, I noticed that some pods panicked during shutdown.

1740358044474	{"caller":"signals.go:62","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2025-02-24T00:47:24.455676056Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"active-groups-cleanup-service","msg":"module stopped","ts":"2025-02-24T00:47:29.463218373Z"}
1740358049474	{"caller":"basic_lifecycler.go:238","level":"info","msg":"ring lifecycler is shutting down","ring":"distributor","ts":"2025-02-24T00:47:29.463913053Z"}
1740358049474	{"caller":"basic_lifecycler.go:403","level":"info","msg":"unregistering instance from ring","ring":"distributor","ts":"2025-02-24T00:47:29.464202522Z"}
1740358049474	{"caller":"basic_lifecycler.go:278","level":"info","msg":"instance removed from the ring","ring":"distributor","ts":"2025-02-24T00:47:29.464347533Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"distributor-service","msg":"module stopped","ts":"2025-02-24T00:47:29.464872722Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"ingester-ring","msg":"module stopped","ts":"2025-02-24T00:47:29.464974382Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"runtime-config","msg":"module stopped","ts":"2025-02-24T00:47:29.465042882Z"}
1740358049474	{"caller":"memberlist_client.go:720","level":"info","msg":"leaving memberlist cluster","ts":"2025-02-24T00:47:29.465085602Z"}
1740358049489	2025/02/24 00:47:29 http: panic serving 10.252.42.5:47292: send on closed channel
1740358049489	goroutine 399553 [running]:
1740358049489	net/http.(*conn).serve.func1()
1740358049489		/usr/local/go/src/net/http/server.go:1903 +0xbe
1740358049489	panic({0x2706d20?, 0x3539f10?})
1740358049489		/usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489	github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5.1()
1740358049489		/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:155 +0x175
1740358049489	panic({0x2706d20?, 0x3539f10?})
1740358049489		/usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489	github.com/grafana/dskit/concurrency.(*ReusableGoroutinesPool).Go(0x29c75c0?, 0xc01d3c18c0)
1740358049489		/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/worker.go:28 +0x25
1740358049489	github.com/grafana/dskit/ring.DoBatchWithOptions({0x356b828, 0xc01d3c1860}, 0x1, {0x35602d0, 0xc000c9a908}, {0xc01d3ca000, 0x7d0, 0x2872ec0?}, 0xc01d3c1890, {0xc01d3c2b70, ...})
1740358049489		/__w/mimir/mimir/vendor/github.com/grafana/dskit/ring/batch.go:180 +0x722
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToIngesters(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0x35602d0, 0xc000c9a908}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1579 +0x136
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToBackends(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0xc0253d9745, 0xb}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1536 +0x90a
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).push(0xc001c9c808, {0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1485 +0x65b
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x19535758676?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushValidationMiddleware-fm.(*Distributor).prePushValidationMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1131 +0xd78
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc0253d9745?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushSortAndFilterMiddleware-fm.(*Distributor).prePushSortAndFilterMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:988 +0x21d
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x107283d3b67288c4?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushRelabelMiddleware-fm.(*Distributor).prePushRelabelMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:943 +0x4d6
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xb?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushHaDedupeMiddleware-fm.(*Distributor).prePushHaDedupeMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:886 +0x752
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc001dc8f10?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).metricsMiddleware-fm.(*Distributor).metricsMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1177 +0x3e7
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc01ceedd40?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).limitsMiddleware-fm.(*Distributor).limitsMiddleware.func1({0x356b828?, 0xc01ceedd40?}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1360 +0x237
1740358049489	github.com/grafana/mimir/pkg/api.(*API).RegisterDistributor.Handler.handler.func2({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049489		/__w/mimir/mimir/pkg/distributor/push.go:159 +0x25a
1740358049489	net/http.HandlerFunc.ServeHTTP(0x0?, {0x3566d60?, 0xc01ceead00?}, 0x412005?)
1740358049489		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/mimir/pkg/api.(*API).newRoute.ConsistencyMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049490		/__w/mimir/mimir/pkg/querier/api/consistency.go:58 +0xab
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced9320?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc92a8?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/mimir/pkg/api.New.newTenantValidationMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9320)
1740358049490		/__w/mimir/mimir/pkg/api/tenant.go:43 +0x174
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced9200?, {0x3566d60?, 0xc01ceead00?}, 0x3530a01?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.init.func2.1({0x3566d60, 0xc01ceead00}, 0xc01ced9200)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/http_auth.go:21 +0x108
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced90e0?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc9420?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/gorilla/mux.(*Router).ServeHTTP(0xc000000480, {0x3566d60, 0xc01ceead00}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/gorilla/mux/mux.go:212 +0x1e2
1740358049490	github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1.2({0x3566d60?, 0xc01ceead00?})
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:89 +0x33
1740358049490	github.com/felixge/httpsnoop.(*Metrics).CaptureMetrics(0xc0273e0eb8, {0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490		/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:84 +0x1e5
1740358049490	github.com/felixge/httpsnoop.CaptureMetricsFn({0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490		/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:39 +0x4e
1740358049490	github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1({0x7c386eb2ee70, 0xc051b57080}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:88 +0x2dd
1740358049490	net/http.HandlerFunc.ServeHTTP(0x3562a50?, {0x7c386eb2ee70?, 0xc051b57080?}, 0xc01ceedc50?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.(*Log).Wrap.Log.Wrap.func1({0x3562a50, 0xc051b57020}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/logging.go:90 +0x26f
1740358049490	net/http.HandlerFunc.ServeHTTP(0x412005?, {0x3562a50?, 0xc051b57020?}, 0xc000ca3901?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x355f9a0, 0xc024c26c40}, 0xc01ced8c60)
1740358049490		/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:159 +0x4d6
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced8b40?, {0x355f9a0?, 0xc024c26c40?}, 0x7c386eaec6e0?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.(*RouteInjector).Wrap.RouteInjector.Wrap.func1({0x355f9a0, 0xc024c26c40}, 0xc01ced8b40)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/route_injector.go:24 +0x72
1740358049490	net/http.HandlerFunc.ServeHTTP(0x412005?, {0x355f9a0?, 0xc024c26c40?}, 0xc024c26c01?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	net/http.serverHandler.ServeHTTP({0x3557af8?}, {0x355f9a0?, 0xc024c26c40?}, 0x6?)
1740358049490		/usr/local/go/src/net/http/server.go:3142 +0x8e
1740358049490	net/http.(*conn).serve(0xc01c0f19e0, {0x356b828, 0xc001f20120})
1740358049490		/usr/local/go/src/net/http/server.go:2044 +0x5e8
1740358049490	created by net/http.(*Server).Serve in goroutine 301
1740358049490		/usr/local/go/src/net/http/server.go:3290 +0x4b4
1740358049504	2025/02/24 00:47:29 http: panic serving 10.252.95.26:32878: send on closed channel
1740358049504	goroutine 399493 [running]:
1740358049504	net/http.(*conn).serve.func1()
1740358049504		/usr/local/go/src/net/http/server.go:1903 +0xbe
......
......
# This continues for other goroutines

It does not happen on all pods, and seems rather random.
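
For readers unfamiliar with the `send on closed channel` message in the trace above, here is a minimal sketch of how such a panic can arise when a worker pool is closed while callers are still submitting work. The `pool` type, `newPool`, and `trySubmit` are hypothetical names invented for illustration; this is not the dskit `ReusableGoroutinesPool` implementation, and it only assumes that the pool hands jobs to workers over a channel that is closed on shutdown.

```go
package main

import "fmt"

// pool is a hypothetical, simplified worker pool used only for illustration.
// It is NOT the dskit implementation.
type pool struct {
	jobs chan func()
}

// newPool starts a fixed number of workers that run jobs until the channel is closed.
func newPool(workers int) *pool {
	p := &pool{jobs: make(chan func())}
	for i := 0; i < workers; i++ {
		go func() {
			for f := range p.jobs {
				f()
			}
		}()
	}
	return p
}

// Go hands f to an idle worker, or runs it in a new goroutine if none is free.
func (p *pool) Go(f func()) {
	select {
	case p.jobs <- f: // panics if p.jobs has already been closed
	default:
		go f()
	}
}

// Close stops the workers by closing the jobs channel.
func (p *pool) Close() { close(p.jobs) }

// trySubmit calls p.Go and converts a panic into an error so the demo can print it.
func trySubmit(p *pool, f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	p.Go(f)
	return nil
}

func main() {
	p := newPool(2)
	p.Close() // simulate shutdown while a request is still in flight
	fmt.Println(trySubmit(p, func() {}))
	// prints "recovered panic: send on closed channel"
}
```

If the real cause is similar, the race would be between a shutdown path closing the pool and an HTTP handler that is still pushing samples, which would be consistent with the panic appearing only on some pods.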

I was able to reproduce this by terminating a single pod, and weirdly this also caused CPU to drop on all other distributors.

Image

I did note some 503s from Prometheus at the time, which is likely why this happened, but it is interesting that a single pod caused this behaviour.

Likely the most critical thing to note is the change to the gRPC config, made to try to align with the GKE graceful termination period of 15 seconds for spot instances.

   - -server.grpc.keepalive.max-connection-age=10s
   - -server.grpc.keepalive.max-connection-age-grace=5s
   - -server.grpc.keepalive.max-connection-idle=10s
   - -shutdown-delay=5s

These settings could possibly be the factor here, in which case running on spot instances would not be possible.

The CPU drop across all pods is confusing, however, as I would expect Prometheus to just retry if a single distributor pod had terminated the connection unexpectedly.

How to reproduce it?

Unsure.

Possibly with the gRPC changes:

   - -server.grpc.keepalive.max-connection-age=10s
   - -server.grpc.keepalive.max-connection-age-grace=5s
   - -server.grpc.keepalive.max-connection-idle=10s
   - -shutdown-delay=5s

What did you think would happen?

Graceful shutdown of the distributor pod.

What was your environment?

Mimir 2.13
GKE 1.31.5-gke.1068000

Any additional context to share?

I could not see anything related in the release notes through to 2.15, but am happy to upgrade.

@lasermoth lasermoth added the bug Something isn't working label Feb 24, 2025
@narqo
Contributor

narqo commented Mar 7, 2025

What was your environment?
Mimir 2.13

Thank you for reporting this. If possible, could you check whether you can reproduce it in 2.15? Looking through the panic's traceback, it seems that it came through dskit. Given that we don't maintain the changelog for regular dependency updates, it's hard to tell whether this was already resolved in the latest Mimir release.
