Bug: Panic in Distributor #10724

lasermoth · 2025-02-24T06:24:47Z

What is the bug?

During a rollout of distributor pods I noticed some pods panic during shutdown.

1740358044474	{"caller":"signals.go:62","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2025-02-24T00:47:24.455676056Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"active-groups-cleanup-service","msg":"module stopped","ts":"2025-02-24T00:47:29.463218373Z"}
1740358049474	{"caller":"basic_lifecycler.go:238","level":"info","msg":"ring lifecycler is shutting down","ring":"distributor","ts":"2025-02-24T00:47:29.463913053Z"}
1740358049474	{"caller":"basic_lifecycler.go:403","level":"info","msg":"unregistering instance from ring","ring":"distributor","ts":"2025-02-24T00:47:29.464202522Z"}
1740358049474	{"caller":"basic_lifecycler.go:278","level":"info","msg":"instance removed from the ring","ring":"distributor","ts":"2025-02-24T00:47:29.464347533Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"distributor-service","msg":"module stopped","ts":"2025-02-24T00:47:29.464872722Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"ingester-ring","msg":"module stopped","ts":"2025-02-24T00:47:29.464974382Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"runtime-config","msg":"module stopped","ts":"2025-02-24T00:47:29.465042882Z"}
1740358049474	{"caller":"memberlist_client.go:720","level":"info","msg":"leaving memberlist cluster","ts":"2025-02-24T00:47:29.465085602Z"}
1740358049489	2025/02/24 00:47:29 http: panic serving 10.252.42.5:47292: send on closed channel
1740358049489	goroutine 399553 [running]:
1740358049489	net/http.(*conn).serve.func1()
1740358049489		/usr/local/go/src/net/http/server.go:1903 +0xbe
1740358049489	panic({0x2706d20?, 0x3539f10?})
1740358049489		/usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489	github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5.1()
1740358049489		/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:155 +0x175
1740358049489	panic({0x2706d20?, 0x3539f10?})
1740358049489		/usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489	github.com/grafana/dskit/concurrency.(*ReusableGoroutinesPool).Go(0x29c75c0?, 0xc01d3c18c0)
1740358049489		/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/worker.go:28 +0x25
1740358049489	github.com/grafana/dskit/ring.DoBatchWithOptions({0x356b828, 0xc01d3c1860}, 0x1, {0x35602d0, 0xc000c9a908}, {0xc01d3ca000, 0x7d0, 0x2872ec0?}, 0xc01d3c1890, {0xc01d3c2b70, ...})
1740358049489		/__w/mimir/mimir/vendor/github.com/grafana/dskit/ring/batch.go:180 +0x722
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToIngesters(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0x35602d0, 0xc000c9a908}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1579 +0x136
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToBackends(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0xc0253d9745, 0xb}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1536 +0x90a
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).push(0xc001c9c808, {0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1485 +0x65b
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x19535758676?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushValidationMiddleware-fm.(*Distributor).prePushValidationMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1131 +0xd78
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc0253d9745?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushSortAndFilterMiddleware-fm.(*Distributor).prePushSortAndFilterMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:988 +0x21d
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x107283d3b67288c4?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushRelabelMiddleware-fm.(*Distributor).prePushRelabelMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:943 +0x4d6
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xb?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushHaDedupeMiddleware-fm.(*Distributor).prePushHaDedupeMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:886 +0x752
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc001dc8f10?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).metricsMiddleware-fm.(*Distributor).metricsMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1177 +0x3e7
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc01ceedd40?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).limitsMiddleware-fm.(*Distributor).limitsMiddleware.func1({0x356b828?, 0xc01ceedd40?}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1360 +0x237
1740358049489	github.com/grafana/mimir/pkg/api.(*API).RegisterDistributor.Handler.handler.func2({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049489		/__w/mimir/mimir/pkg/distributor/push.go:159 +0x25a
1740358049489	net/http.HandlerFunc.ServeHTTP(0x0?, {0x3566d60?, 0xc01ceead00?}, 0x412005?)
1740358049489		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/mimir/pkg/api.(*API).newRoute.ConsistencyMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049490		/__w/mimir/mimir/pkg/querier/api/consistency.go:58 +0xab
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced9320?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc92a8?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/mimir/pkg/api.New.newTenantValidationMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9320)
1740358049490		/__w/mimir/mimir/pkg/api/tenant.go:43 +0x174
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced9200?, {0x3566d60?, 0xc01ceead00?}, 0x3530a01?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.init.func2.1({0x3566d60, 0xc01ceead00}, 0xc01ced9200)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/http_auth.go:21 +0x108
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced90e0?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc9420?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/gorilla/mux.(*Router).ServeHTTP(0xc000000480, {0x3566d60, 0xc01ceead00}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/gorilla/mux/mux.go:212 +0x1e2
1740358049490	github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1.2({0x3566d60?, 0xc01ceead00?})
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:89 +0x33
1740358049490	github.com/felixge/httpsnoop.(*Metrics).CaptureMetrics(0xc0273e0eb8, {0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490		/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:84 +0x1e5
1740358049490	github.com/felixge/httpsnoop.CaptureMetricsFn({0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490		/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:39 +0x4e
1740358049490	github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1({0x7c386eb2ee70, 0xc051b57080}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:88 +0x2dd
1740358049490	net/http.HandlerFunc.ServeHTTP(0x3562a50?, {0x7c386eb2ee70?, 0xc051b57080?}, 0xc01ceedc50?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.(*Log).Wrap.Log.Wrap.func1({0x3562a50, 0xc051b57020}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/logging.go:90 +0x26f
1740358049490	net/http.HandlerFunc.ServeHTTP(0x412005?, {0x3562a50?, 0xc051b57020?}, 0xc000ca3901?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x355f9a0, 0xc024c26c40}, 0xc01ced8c60)
1740358049490		/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:159 +0x4d6
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced8b40?, {0x355f9a0?, 0xc024c26c40?}, 0x7c386eaec6e0?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.(*RouteInjector).Wrap.RouteInjector.Wrap.func1({0x355f9a0, 0xc024c26c40}, 0xc01ced8b40)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/route_injector.go:24 +0x72
1740358049490	net/http.HandlerFunc.ServeHTTP(0x412005?, {0x355f9a0?, 0xc024c26c40?}, 0xc024c26c01?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	net/http.serverHandler.ServeHTTP({0x3557af8?}, {0x355f9a0?, 0xc024c26c40?}, 0x6?)
1740358049490		/usr/local/go/src/net/http/server.go:3142 +0x8e
1740358049490	net/http.(*conn).serve(0xc01c0f19e0, {0x356b828, 0xc001f20120})
1740358049490		/usr/local/go/src/net/http/server.go:2044 +0x5e8
1740358049490	created by net/http.(*Server).Serve in goroutine 301
1740358049490		/usr/local/go/src/net/http/server.go:3290 +0x4b4
1740358049504	2025/02/24 00:47:29 http: panic serving 10.252.95.26:32878: send on closed channel
1740358049504	goroutine 399493 [running]:
1740358049504	net/http.(*conn).serve.func1()
1740358049504		/usr/local/go/src/net/http/server.go:1903 +0xbe
......
......
# This continues for other goroutines

It does not happen on all pods, and seems rather random.
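For context on the panic message itself: in Go, sending on a channel that has already been closed always panics with `send on closed channel`. The trace is consistent with a request still being dispatched to a worker pool after shutdown had closed the pool's job channel, which would also explain why it only hits some pods. The sketch below reproduces that failure mode in isolation and shows one possible guard. `guardedPool` and `sendAfterClose` are hypothetical names for illustration; this is not the dskit implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedPool is a hypothetical worker pool used only to illustrate the race.
// It is NOT the dskit ReusableGoroutinesPool.
type guardedPool struct {
	mu     sync.RWMutex
	closed bool
	jobs   chan func()
}

func newGuardedPool(workers int) *guardedPool {
	p := &guardedPool{jobs: make(chan func())}
	for i := 0; i < workers; i++ {
		go func() {
			// Workers exit once the jobs channel is closed.
			for f := range p.jobs {
				f()
			}
		}()
	}
	return p
}

// Go hands f to an idle worker. If the pool is closed or no worker is
// free, it falls back to a fresh goroutine instead of touching the channel.
func (p *guardedPool) Go(f func()) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	if p.closed {
		go f()
		return
	}
	select {
	case p.jobs <- f:
	default:
		go f()
	}
}

// Close marks the pool closed under the write lock, so no Go call can be
// in the middle of a send when the channel is closed.
func (p *guardedPool) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.closed {
		p.closed = true
		close(p.jobs)
	}
}

// sendAfterClose reproduces the unguarded failure mode and returns the
// recovered panic message.
func sendAfterClose() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	jobs := make(chan func(), 1)
	close(jobs)
	jobs <- func() {} // panics: the channel is already closed
	return ""
}

func main() {
	fmt.Println("unguarded:", sendAfterClose())

	p := newGuardedPool(2)
	p.Close()
	done := make(chan struct{})
	p.Go(func() { close(done) }) // runs in a fresh goroutine, no panic
	<-done
	fmt.Println("guarded: job ran after Close without panicking")
}
```

The read/write lock is the design choice that matters here: checking a `closed` flag without holding a lock across the send would still leave a window in which `Close` could run between the check and the send.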

I was able to reproduce this by terminating a single pod, and weirdly this also caused CPU to drop on all other distributors.

I did note some 503s from Prometheus at the time, which is likely related, but it is interesting that terminating a single pod caused this behaviour.

Likely the most relevant detail is that I changed the gRPC config to try to align with the GKE graceful termination period of 15 seconds for spot instances.

   - -server.grpc.keepalive.max-connection-age=10s
   - -server.grpc.keepalive.max-connection-age-grace=5s
   - -server.grpc.keepalive.max-connection-idle=10s
   - -shutdown-delay=5s

These settings could be the cause, in which case running distributors on spot instances would not be viable.

The CPU drop across all pods is confusing, however, as I would expect Prometheus to simply retry if a single distributor pod terminated the connection unexpectedly.

How to reproduce it?

Unsure.

Possibly with the gRPC changes:

   - -server.grpc.keepalive.max-connection-age=10s
   - -server.grpc.keepalive.max-connection-age-grace=5s
   - -server.grpc.keepalive.max-connection-idle=10s
   - -shutdown-delay=5s

What did you think would happen?

Graceful shutdown of the distributor pod.

What was your environment?

Mimir 2.13
GKE 1.31.5-gke.1068000

Any additional context to share?

I could not see anything related in the release notes through to 2.15, but am happy to upgrade.

narqo · 2025-03-07T08:32:17Z

What was your environment?
Mimir 2.13

Thank you for reporting this. If possible, could you check whether you can reproduce it in 2.15? Looking through the panic's traceback, it seems that it came through dskit. Given that we don't maintain a changelog for routine dependency updates, it's hard to tell whether this has already been resolved in the latest Mimir release.

lasermoth added the bug Something isn't working label Feb 24, 2025

narqo added the component/distributor label Mar 7, 2025
