Error: at least 2 live replicas required across different availability zones, could only find 0 - unhealthy instances #5552

girishms-sentient · 2023-07-19T18:07:08Z

When I try to push metrics to Mimir via Prometheus remote write, I get the error below.

err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required across different availability zones, could only find 0 - unhealthy instances

ts=2023-07-19T17:53:46.991Z caller=dedupe.go:112 component=remote level=info remote_name=430cbf url=http://k8s-xxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Done replaying WAL" duration=4.283554275s
ts=2023-07-19T17:53:47.078Z caller=dedupe.go:112 component=remote level=warn remote_name=430cbf url=http://k8s-xxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required across different availability zones, could only find 0 - unhealthy instances: 7.0.11.88:9095,7.0.20.11:9095,7.0.4.40:9095"
ts=2023-07-19T17:54:42.754Z caller=dedupe.go:112 component=remote level=warn remote_name=430cbf url=http://k8s-xxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required across different availability zones, could only find 0 - unhealthy instances: 7.0.4.40:9095,7.0.20.11:9095,7.0.11.88:9095"
ts=2023-07-19T17:54:49.838Z caller=dedupe.go:112 component=remote level=warn remote_name=430cbf url=http://k8s-xxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required across different availability zones, could only find 0 - unhealthy instances: 7.0.11.88:9095,7.0.20.11:9095,7.0.4.40:9095"
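
The addresses listed after "unhealthy instances" can be extracted and compared against the ingester pod IPs. A minimal Python sketch, assuming the message format shown in the log lines above; the function name `unhealthy_instances` is hypothetical:

```python
import re


def unhealthy_instances(message: str) -> list[str]:
    """Return the sorted host:port entries listed after 'unhealthy instances:'."""
    match = re.search(r"unhealthy instances: ([0-9.:,]+)", message)
    if match is None:
        return []
    return sorted(match.group(1).split(","))


# Error text copied from the Prometheus log lines above.
line = (
    'err="server returned HTTP status 500 Internal Server Error: at least 2 live '
    "replicas required across different availability zones, could only find 0 - "
    'unhealthy instances: 7.0.11.88:9095,7.0.20.11:9095,7.0.4.40:9095"'
)
print(unhealthy_instances(line))
```

Sorting the result makes it easy to see that the three retries report the same three instances, only in a different order.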

Helm Config:

        ################ MIMIR CONFIGURATION #####################
        mimir:
          structuredConfig:
            ingester:
              ring:
                final_sleep: 0s
                num_tokens: 512
                tokens_file_path: /data/tokens
                heartbeat_period: 2s
                heartbeat_timeout: 10s
                unregister_on_shutdown: true
                kvstore:
                  store: memberlist
                replication_factor: 3
                zone_awareness_enabled: true

            memberlist:
              abort_if_cluster_join_fails: false
              compression_enabled: false
              join_members:
                - mimir-gossip-ring:7946

            limits:
              compactor_blocks_retention_period: 604800s
              ingestion_rate: 500000
              max_global_series_per_metric: 9000000
              max_global_series_per_user: 9000000
              max_label_names_per_series: 60

            distributor:
              instance_limits:
                max_ingestion_rate: 0
                max_inflight_push_requests: 0
              remote_timeout: 30s
              ring:
                kvstore:
                  store: memberlist

            compactor:
              data_dir: /data/compactor
              sharding_ring:
                heartbeat_period: 2s
                heartbeat_timeout: 10s  
                kvstore:
                  store: memberlist

            store_gateway:
              sharding_ring:
                heartbeat_period: 2s
                heartbeat_timeout: 10s
                zone_awareness_enabled: true 
                kvstore:
                  store: memberlist  

            ruler:
              rule_path: /data/ruler
              poll_interval: 2s
              ring:
                heartbeat_period: 2s
                heartbeat_timeout: 10s
                kvstore:
                  store: memberlist
          
          ############## MIMIR STORAGE ################
            blocks_storage:
              backend: s3
              bucket_store:
                max_chunk_pool_bytes: 12884901888 # 12GiB
              s3:
                endpoint: s3.us-west-2.amazonaws.com
                bucket_name: central-logging-mimir-bucket-block-storage
                insecure: true
              tsdb:
                dir: /data/tsdb

            alertmanager_storage:
              backend: s3
              s3:
                endpoint: s3.us-west-2.amazonaws.com
                bucket_name: central-logging-mimir-bucket-alertmanager-storage

            ruler_storage:
              backend: s3
              s3:
                endpoint: s3.us-west-2.amazonaws.com
                bucket_name: central-logging-mimir-bucket-ruler-storage

        #############################################################

        extraEnv:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP

        serviceAccount:
          create: true
          name: mimir
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/central-monitoring-mimir
                                        
        alertmanager:
          persistentVolume:
            enabled: true
            storageClass: ebs-sc
          replicas: 2
          resources:
            limits:
              memory: 1.4Gi
            requests:
              cpu: 1
              memory: 1Gi
          statefulSet:
            enabled: true
          tolerations:
          podAnnotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/central-monitoring-mimir
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        compactor:
          persistentVolume:
            size: 5Gi
            storageClass: ebs-sc
          podAnnotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/central-monitoring-mimir
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        distributor:
          replicas: 2
          podAnnotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/central-monitoring-mimir
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        ingester:
          persistentVolume:
            size: 50Gi
            storageClass: ebs-sc
          replicas: 3
          podAnnotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/central-monitoring-mimir
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        ruler:
          replicas: 1
          podAnnotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/central-monitoring-mimir
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        store_gateway:
          persistentVolume:
            size: 10Gi
            storageClass: ebs-sc
          replicas: 3
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        querier:
          replicas: 1
          extraArgs:
            memberlist.bind-addr: ${MY_POD_IP}


        admin-cache:
          enabled: true
          replicas: 2

        chunks-cache:
          enabled: true
          replicas: 2

        index-cache:
          enabled: true
          replicas: 1

        metadata-cache:
          enabled: true

        results-cache:
          enabled: true
          replicas: 2

        minio:
          enabled: false

        overrides_exporter:
          replicas: 1
          resources:
            limits:
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi

        query_frontend:
          replicas: 1

        nginx:
          service:
            type: LoadBalancer
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-ip-address-type: ipv4
              service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
              service.beta.kubernetes.io/aws-load-balancer-subnets: "xxxxxxxxxxxxxxx"
              service.beta.kubernetes.io/aws-load-balancer-type: external
          replicas: 1
          resources:
            limits:
              memory: 731Mi
            requests:
              cpu: 1
              memory: 512Mi

        # Grafana Enterprise Metrics feature related
        admin_api:
          replicas: 1
          resources:
            limits:
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 64Mi

        gateway:
          replicas: 1
          resources:
            limits:
              memory: 731Mi
            requests:
              cpu: 1
              memory: 512Mi

Despite keeping replication_factor: 3 as suggested, I am still getting this error.


girishms-sentient · 2023-07-19T18:09:55Z

girishms-sentient
Jul 19, 2023
Author

@pstibrany @pracucci @DylanGuedes I'd appreciate your help on this.

pstibrany · 2023-07-20T08:12:26Z

pstibrany
Jul 20, 2023
Maintainer

Are your ingesters running properly? Are they updating their heartbeat in the ring (see the /distributor/ring endpoint on a distributor pod)? The error indicates that while there are ring entries for the ingesters, they are "unhealthy", i.e. they have an old last-updated timestamp. Try to focus your investigation on why the ingesters don't update their timestamp in the ring.

Update: I just noticed very tight timeouts for the heartbeat:

                heartbeat_period: 2s
                heartbeat_timeout: 10s

You may want to start with the default values instead (15s heartbeat period, 1m timeout).
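
To illustrate the reasoning, here is a minimal Python sketch of how a heartbeat timeout turns stale ring entries into "unhealthy" instances. It assumes an instance counts as healthy while its last heartbeat is no older than `heartbeat_timeout`, and that the required number of live replicas is `replication_factor // 2 + 1` (which gives the "at least 2" in the error for a replication factor of 3). It is not Mimir's actual code; the function names and the timestamps are hypothetical.

```python
def min_live_replicas(replication_factor: int) -> int:
    """Quorum size: more than half of the replicas must be live."""
    return replication_factor // 2 + 1


def is_healthy(last_heartbeat: float, heartbeat_timeout: float, now: float) -> bool:
    """An instance is healthy while its last heartbeat is recent enough."""
    return now - last_heartbeat <= heartbeat_timeout


def live_count(heartbeats: dict[str, float], heartbeat_timeout: float, now: float) -> int:
    """Count the ring entries whose heartbeat is within the timeout."""
    return sum(is_healthy(ts, heartbeat_timeout, now) for ts in heartbeats.values())


# Hypothetical ring state: every ingester last updated its entry 45 seconds ago.
now = 1_000.0
heartbeats = {
    "7.0.4.40:9095": now - 45,
    "7.0.20.11:9095": now - 45,
    "7.0.11.88:9095": now - 45,
}

print(min_live_replicas(3))             # 2
print(live_count(heartbeats, 10, now))  # 0 live with a 10s timeout
print(live_count(heartbeats, 60, now))  # 3 live with a 1m timeout
```

A longer timeout only helps when heartbeats arrive late; if the ring updates never reach the distributor at all, the entries stay unhealthy regardless of the timeout.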

girishms-sentient · 2023-07-20T09:34:20Z

girishms-sentient
Jul 20, 2023
Author

@pstibrany Thanks for the quick response.

Despite making the changes you suggested, I am still getting the same error.

Prometheus logs:

ts=2023-07-20T09:11:37.845Z caller=dedupe.go:112 component=remote level=warn remote_name=5bb1ba url=http://k8s-xxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required, could only find 0 - unhealthy instances: 7.0.4.17:9095,7.0.19.220:9095,7.0.11.154:9095"

ts=2023-07-20T09:12:22.332Z caller=dedupe.go:112 component=remote level=warn remote_name=5bb1ba url=http://k8s-cxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required, could only find 0 - unhealthy instances: 7.0.11.154:9095,7.0.19.220:9095,7.0.4.17:9095"

ts=2023-07-20T09:12:37.922Z caller=dedupe.go:112 component=remote level=warn remote_name=5bb1ba url=http://k8s-xxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: at least 2 live replicas required, could only find 0 - unhealthy instances: 7.0.4.17:9095,7.0.19.220:9095,7.0.11.154:9095"

Mimir Distributor logs:

ts=2023-07-20T09:18:37.777227367Z caller=logging.go:86 level=warn traceID=77b4e204d49e765d msg="POST /api/v1/push (500) 544.828µs Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 7.0.11.154:9095,7.0.19.220:9095,7.0.4.17:9095\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 18286; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.42.0; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: anonymous; "

ts=2023-07-20T09:18:38.41301915Z caller=logging.go:86 level=warn traceID=0ce7164052ad4f8e msg="POST /api/v1/push (500) 1.437841ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 7.0.4.17:9095,7.0.19.220:9095,7.0.11.154:9095\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 17892; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.42.0; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: anonymous; "

ts=2023-07-20T09:18:46.932644939Z caller=logging.go:86 level=warn traceID=68268a84dc8bd41e msg="POST /api/v1/push (500) 1.261222ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 7.0.19.220:9095,7.0.4.17:9095,7.0.11.154:9095\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 20759; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.42.0; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: anonymous; "

Furthermore, I am seeing "Got ping for unexpected node" warnings in the logs of the ingester, compactor, distributor, store-gateway, and querier:

ts=2023-07-20T09:01:16.779205853Z caller=ingester.go:1831 level=info msg="opening existing TSDBs"
ts=2023-07-20T09:01:16.779279844Z caller=ingester.go:1927 level=info msg="successfully opened existing TSDBs"
ts=2023-07-20T09:01:16.779395603Z caller=lifecycler.go:595 level=info msg="instance not found in ring, adding with no tokens" ring=ingester
ts=2023-07-20T09:01:16.779394836Z caller=mimir.go:792 level=info msg="Application started"
ts=2023-07-20T09:01:16.779516943Z caller=lifecycler.go:435 level=info msg="auto-joining cluster after timeout" ring=ingester

ts=2023-07-20T09:01:21.57352153Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ruler-7c7bcc4bc4-9sjch-5dd2805f' from=[::]:7946"
ts=2023-07-20T09:01:23.575013961Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ruler-7c7bcc4bc4-9sjch-5dd2805f' from=[::]:7946"
ts=2023-07-20T09:01:23.575062692Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ruler-7c7bcc4bc4-9sjch-5dd2805f' from=[::]:7946"
ts=2023-07-20T09:01:23.57508945Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ruler-7c7bcc4bc4-9sjch-5dd2805f' from=[::]:7946"
ts=2023-07-20T09:01:26.574094166Z caller=log.go:194 level=info msg="Suspect mimir-ruler-7c7bcc4bc4-9sjch-5dd2805f has failed, no acks received"
ts=2023-07-20T09:01:26.575018121Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-alertmanager-0-489ece3a' from=[::]:7946"
ts=2023-07-20T09:01:28.57669268Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-alertmanager-0-489ece3a' from=[::]:7946"
ts=2023-07-20T09:01:28.576980858Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-alertmanager-0-489ece3a' from=[::]:7946"
ts=2023-07-20T09:01:28.577022324Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-alertmanager-0-489ece3a' from=[::]:7946"
ts=2023-07-20T09:01:31.574702719Z caller=log.go:194 level=info msg="Suspect mimir-alertmanager-0-489ece3a has failed, no acks received"
ts=2023-07-20T09:01:31.575166277Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-2-3bed4645' from=[::]:7946"
ts=2023-07-20T09:01:33.576403599Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-2-3bed4645' from=[::]:7946"
ts=2023-07-20T09:01:33.576636082Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-2-3bed4645' from=[::]:7946"
ts=2023-07-20T09:01:33.576676923Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-2-3bed4645' from=[::]:7946"
ts=2023-07-20T09:01:36.575949366Z caller=log.go:194 level=info msg="Suspect mimir-ingester-2-3bed4645 has failed, no acks received"

Do you have any suggestions on how to resolve this issue?

Also, I'm trying to port-forward to see the member list, but I'm not getting any response.

[rocky@ip-10-220-0-250 ~]$ kubectl get svc -n central-monitoring
NAME                             TYPE           CLUSTER-IP       EXTERNAL-IP                                                                     PORT(S)                      AGE
mimir-alertmanager               ClusterIP      172.20.246.95    <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-alertmanager-headless      ClusterIP      None             <none>                                                                          8080/TCP,9095/TCP,9094/TCP   8m12s
mimir-chunks-cache               ClusterIP      None             <none>                                                                          11211/TCP,9150/TCP           8m12s
mimir-compactor                  ClusterIP      172.20.196.85    <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-distributor                ClusterIP      172.20.100.139   <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-distributor-headless       ClusterIP      None             <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-gossip-ring                ClusterIP      None             <none>                                                                          7946/TCP                     8m12s
mimir-index-cache                ClusterIP      None             <none>                                                                          11211/TCP,9150/TCP           8m12s
mimir-ingester                   ClusterIP      172.20.178.214   <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-ingester-headless          ClusterIP      None             <none>                                                                          9095/TCP                     8m12s
mimir-metadata-cache             ClusterIP      None             <none>                                                                          11211/TCP,9150/TCP           8m12s
mimir-nginx                      LoadBalancer   172.20.149.85    k8s-xxxxxxxxxxxxxx.elb.us-west-2.amazonaws.com   80:31327/TCP                 8m12s
mimir-overrides-exporter         ClusterIP      172.20.217.12    <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-querier                    ClusterIP      172.20.5.94      <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-query-frontend             ClusterIP      172.20.185.129   <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-query-scheduler            ClusterIP      172.20.247.173   <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-query-scheduler-headless   ClusterIP      None             <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-results-cache              ClusterIP      None             <none>                                                                          11211/TCP,9150/TCP           8m12s
mimir-ruler                      ClusterIP      172.20.117.134   <none>                                                                          8080/TCP                     8m12s
mimir-store-gateway              ClusterIP      172.20.85.25     <none>                                                                          8080/TCP,9095/TCP            8m12s
mimir-store-gateway-headless     ClusterIP      None             <none>                                                                          9095/TCP                     8m12s

[rocky@ip-10-220-0-250 ~]$ kubectl port-forward -n central-monitoring service/mimir-gossip-ring 7946:7946 --address="0.0.0.0"
Forwarding from 0.0.0.0:7946 -> 7946
Handling connection for 7946
Handling connection for 7946

Is there any way to see the memberlist?
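
Port 7946 carries the gossip protocol itself rather than HTTP, so forwarding it does not yield a readable page. Mimir serves the member list as the `/memberlist` admin page on a component's HTTP port, which appears as 8080 in the service listing above. A minimal Python sketch, assuming a port-forward such as `kubectl port-forward -n central-monitoring service/mimir-distributor 8080:8080` is already running; the helper names `admin_url` and `fetch_page` are hypothetical:

```python
from urllib.request import urlopen


def admin_url(path: str, host: str = "localhost", port: int = 8080) -> str:
    """Build the URL of an admin page exposed on a component's HTTP port."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch_page(path: str, timeout: float = 5.0) -> str:
    """Fetch an admin page through the local port-forward and return its body."""
    with urlopen(admin_url(path), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example calls (they require the port-forward to be active):
#   print(fetch_page("/memberlist"))
#   print(fetch_page("/distributor/ring"))
```

The same helper can fetch `/distributor/ring`, the ring page mentioned earlier in this thread.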

4 replies

girishms-sentient Jul 21, 2023
Author

@dimitarvdimitrov @pstibrany @pracucci @DylanGuedes I'd appreciate your help on this.

DylanGuedes Jul 21, 2023
Collaborator

I don't have much experience with the Mimir codebase, but here are a few suggestions:

- Access the /config page and make sure all the different ring stores are using "memberlist" (and not inmemory).
- Access the /memberlist page and double-check that it looks good.
- Run the system with -log.level=debug and see if anything interesting pops up.
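
The first check can be automated once the `/config` output has been parsed into a dictionary (for example with a YAML parser, not shown here). A minimal Python sketch under that assumption; the function names and the sample configuration, including its misconfigured ruler, are hypothetical:

```python
def ring_stores(config: dict, path: str = "") -> dict[str, str]:
    """Collect every `kvstore.store` value, keyed by its dotted config path."""
    found = {}
    for key, value in config.items():
        here = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            if key == "kvstore" and "store" in value:
                found[here] = value["store"]
            found.update(ring_stores(value, here))
    return found


def non_memberlist(config: dict) -> list[str]:
    """Return the config paths whose ring store is not 'memberlist'."""
    return sorted(p for p, store in ring_stores(config).items() if store != "memberlist")


# Hypothetical parsed config; the ruler entry is deliberately wrong.
config = {
    "ingester": {"ring": {"kvstore": {"store": "memberlist"}}},
    "distributor": {"ring": {"kvstore": {"store": "memberlist"}}},
    "ruler": {"ring": {"kvstore": {"store": "inmemory"}}},
}
print(non_memberlist(config))  # ['ruler.ring.kvstore']
```

An empty list means every ring found in the parsed config uses memberlist.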

girishms-sentient Jul 27, 2023
Author

This issue has been resolved by making the following changes to the memberlist configuration:

            memberlist:
              abort_if_cluster_join_fails: false
              compression_enabled: false
              join_members:
                - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}
              advertise_addr: ${MY_POD_IP}
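
The `dns+` prefix in `join_members` is understood here to request a DNS lookup of the headless gossip-ring service, which returns one address per pod so that each pod can discover its peers. The Python sketch below illustrates that kind of lookup with the standard resolver. It is not Mimir's code: the function name is hypothetical, and the service name in the usage comment assumes a release called `mimir` in the `central-monitoring` namespace with a `cluster.local` cluster domain.

```python
import socket


def resolve_members(host: str, port: int) -> list[str]:
    """Resolve a hostname to all of its IPv4 addresses as sorted 'ip:port' strings."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({f"{info[4][0]}:{info[4][1]}" for info in infos})


# Inside the cluster (assumed names), a headless service yields one entry per pod:
#   resolve_members("mimir-gossip-ring.central-monitoring.svc.cluster.local", 7946)
print(resolve_members("127.0.0.1", 7946))  # ['127.0.0.1:7946']
```

Running the in-cluster call from a debug pod is one way to confirm that the gossip-ring service name resolves to the expected pod IPs.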

Answer selected by girishms-sentient

girishms-sentient Jul 27, 2023
Author

Thanks @DylanGuedes @pstibrany for your help.
