
[BUG] 2.16.0 Auto-expand replicas causes cluster yellow state when cluster nodes are above the low watermark #15919

Open
sandervandegeijn opened this issue Sep 12, 2024 · 2 comments
Labels: bug (Something isn't working), ShardManagement:Routing

Comments

@sandervandegeijn

sandervandegeijn commented Sep 12, 2024

Describe the bug

We have encountered this bug multiple times, including before 2.16.0.

When cluster nodes are already above the low watermark, causing new indices to be distributed to other nodes, the cluster can end up in a yellow state. The cause seems to be the default policy on system indices of auto_expand_replicas: "1-all": the cluster tries to allocate replicas to nodes that cannot accept more data because of the watermark situation.

This seems to happen when Kubernetes reschedules OpenSearch nodes onto different k8s compute nodes.
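
For reference, the auto-expand policy on the security system index can be inspected with an index settings request; the one below is only a sketch (the filter_path parameter is just a convenience, not required):

# check the auto-expand policy on the security system index
GET .opendistro_security/_settings?filter_path=*.settings.index.auto_expand_replicas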

Cluster state:

{
  "cluster_name": "xxxxx",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 17,
  "number_of_data_nodes": 12,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 2631,
  "active_shards": 3140,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.90454979319122
}
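
For reference, the per-shard explanation further below comes from the allocation explain API; a request along these lines (index/shard values copied from that output) is a sketch of how to retrieve it:

GET _cluster/allocation/explain
{
  "index": ".opendistro_security",
  "shard": 0,
  "primary": false
}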

It tries to allocate the replicas:

{
  "index": ".opendistro_security",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2024-09-12T14:45:27.211Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "02CeBVQKTa2lD1Qx0GAS3Q",
      "node_name": "opensearch-data-nodes-hot-6",
      "transport_address": "10.244.33.33:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [8.175061087167675%]"
        }
      ]
    },
    {
      "node_id": "Balhhxf2T2uNpUP6rq88Ag",
      "node_name": "opensearch-data-nodes-hot-2",
      "transport_address": "10.244.86.36:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [9.615515861288957%]"
        }
      ]
    },
    {
      "node_id": "DppvPjxgR0u8CVQVyAX0UA",
      "node_name": "opensearch-data-nodes-hot-7",
      "transport_address": "10.244.97.29:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[DppvPjxgR0u8CVQVyAX0UA], [R], s[STARTED], a[id=Q9PoLV1wRGumidM22EKveQ]]"
        },
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [12.463841799195983%]"
        }
      ]
    },
    {
      "node_id": "LQSYXzHbTfqowAOj3nrU3w",
      "node_name": "opensearch-data-nodes-hot-4",
      "transport_address": "10.244.70.30:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [7.916677463242952%]"
        }
      ]
    },
    {
      "node_id": "Ls8ptyo7ROGtFeO8hY5c5Q",
      "node_name": "opensearch-data-nodes-hot-9",
      "transport_address": "10.244.54.37:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[Ls8ptyo7ROGtFeO8hY5c5Q], [R], s[STARTED], a[id=j_FrjkN7R0aCEokKa4tjCA]]"
        }
      ]
    },
    {
      "node_id": "O_CCkTbmRtiuJU3cV93EaA",
      "node_name": "opensearch-data-nodes-hot-1",
      "transport_address": "10.244.83.46:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [8.445263138130201%]"
        }
      ]
    },
    {
      "node_id": "OfBmEaQsSsuJtJ4TKadLnQ",
      "node_name": "opensearch-data-nodes-hot-10",
      "transport_address": "10.244.37.46:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [11.538695394244522%]"
        }
      ]
    },
    {
      "node_id": "RC5KMwpWRMCVrGaF_7oGBA",
      "node_name": "opensearch-data-nodes-hot-0",
      "transport_address": "10.244.99.67:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [12.185368398769644%]"
        }
      ]
    },
    {
      "node_id": "S_fk2yqhQQuby8HM4hJXVA",
      "node_name": "opensearch-data-nodes-hot-8",
      "transport_address": "10.244.45.64:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [10.432421573093784%]"
        }
      ]
    },
    {
      "node_id": "_vxbOtloQmapzz0DbXBsjA",
      "node_name": "opensearch-data-nodes-hot-5",
      "transport_address": "10.244.79.58:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[_vxbOtloQmapzz0DbXBsjA], [P], s[STARTED], a[id=hY9WcHR-S_6TN3kTj4NZJA]]"
        }
      ]
    },
    {
      "node_id": "pP5muAyTSA2Z45yO8Ws0VA",
      "node_name": "opensearch-data-nodes-hot-3",
      "transport_address": "10.244.101.66:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [9.424146099675534%]"
        }
      ]
    },
    {
      "node_id": "zRdO9ndKSbuJ97t77-OLLw",
      "node_name": "opensearch-data-nodes-hot-11",
      "transport_address": "10.244.113.26:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[zRdO9ndKSbuJ97t77-OLLw], [R], s[STARTED], a[id=O7z4RvkiQXGMcfhRSPm8lQ]]"
        },
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [11.883587901703455%]"
        }
      ]
    }
  ]
}

So with 12 data nodes, it tries to allocate 11 replicas on the restart of the node, but that seems to fail because several nodes are above the low watermark (why not distribute the free space more evenly?). The only solutions seem to be to lower the auto-expand setting or to manually redistribute shards across the nodes to even out disk space usage.
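
As a sketch of the first workaround, the auto-expand bound on the affected system index can be lowered via the index settings API. The "1-5" value below is purely illustrative, and updating a protected system index may require admin-level credentials depending on the security plugin configuration:

PUT .opendistro_security/_settings
{
  "index": {
    "auto_expand_replicas": "1-5"
  }
}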

Cluster storage state:

n                            id   v      r rp      dt      du   dup hp load_1m load_5m load_15m
opensearch-master-nodes-0    twM5 2.16.0 m 60   9.5gb 518.1mb  5.32 56    1.74    1.38     1.17
opensearch-data-nodes-hot-5  _vxb 2.16.0 d 96 960.1gb 649.6gb 67.66 41    1.14    1.14     1.10
opensearch-master-nodes-2    nQD7 2.16.0 m 59   9.5gb 518.1mb  5.32 37    1.15    1.06     1.09
opensearch-data-nodes-hot-11 zRdO 2.16.0 d 92 960.1gb   859gb 89.47 31    2.33    3.13     3.62
opensearch-data-nodes-hot-6  02Ce 2.16.0 d 90 960.1gb 848.5gb 88.38 62    1.40    1.40     1.60
opensearch-data-nodes-hot-4  LQSY 2.16.0 d 95 960.1gb 886.5gb 92.33 35    2.33    2.40     2.56
opensearch-data-nodes-hot-10 OfBm 2.16.0 d 96 960.1gb 861.7gb 89.75 58    3.69    4.27     4.21
opensearch-ingest-nodes-0    bx4Z 2.16.0 i 65    19gb  1016mb  5.21 73    2.31    2.60     2.54
opensearch-data-nodes-hot-3  pP5m 2.16.0 d 61 960.1gb 869.6gb 90.58 35    1.71    1.64     1.89
opensearch-data-nodes-hot-9  Ls8p 2.16.0 d 95 960.1gb 643.2gb 66.99 27    0.72    1.00     1.02
opensearch-data-nodes-hot-7  Dppv 2.16.0 d 91 960.1gb 842.4gb 87.74 53    1.29    1.87     1.74
opensearch-data-nodes-hot-2  Balh 2.16.0 d 63 960.1gb 867.8gb 90.38 31    1.93    1.73     1.45
opensearch-data-nodes-hot-8  S_fk 2.16.0 d 64 960.1gb 859.9gb 89.57 42    0.66    0.66     0.71
opensearch-data-nodes-hot-1  O_CC 2.16.0 d 89 960.1gb 884.9gb 92.17 11    1.53    1.48     1.33
opensearch-data-nodes-hot-0  RC5K 2.16.0 d 85 960.1gb 844.8gb 87.99 62    0.77    0.90     1.10
opensearch-master-nodes-1    r70_ 2.16.0 m 58   9.5gb 518.1mb  5.32 58    0.76    0.88     1.05
opensearch-ingest-nodes-1    NX1N 2.16.0 i 61    19gb  1016mb  5.21 17    0.49    1.12     1.77
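
(The table above looks like _cat/nodes output; a request such as the one below, with the header list inferred from the column names, should reproduce it. The dup column is disk.used_percent, which is what the disk_threshold decider compares against the 87% low watermark.)

GET _cat/nodes?v&h=n,id,v,r,rp,dt,du,dup,hp,load_1m,load_5m,load_15m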

Related component

Storage

To Reproduce

1. Cluster is nearing capacity (good from a storage cost perspective)
2. Cluster gets rebooted or individual nodes get rebooted
3. Cluster goes to yellow state

Expected behavior

Rebalance shards proactively based on the storage usage of nodes.
System indices might take priority, ignoring the low/high watermark, until cluster disk usage really becomes critical.
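
For context, as far as I can tell there is no per-index override for the watermark today; the closest existing knob is the cluster-wide disk threshold toggle, shown below only as an illustration (disabling it is a blunt, risky stop-gap, not a recommendation):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": false
  }
}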

Additional Details

Plugins
Default

Screenshots
N/A

Host/Environment (please complete the following information):
Default 2.16.0 docker images

Additional context
N/A

@sandervandegeijn sandervandegeijn added bug Something isn't working untriaged labels Sep 12, 2024
@github-actions github-actions bot added the Storage Issues and PRs relating to data and metadata storage label Sep 12, 2024
@ashking94
Member

@sandervandegeijn Thanks for filing this issue; please feel free to submit a pull request.

@ashking94 ashking94 added ShardManagement:Routing and removed Storage Issues and PRs relating to data and metadata storage labels Sep 19, 2024
@dblock dblock removed the untriaged label Sep 30, 2024
@dblock
Member

dblock commented Sep 30, 2024

[Catch All Triage - 1, 2, 3, 4]
