Describe the bug
The SmartScaler in the OpenSearch Kubernetes Operator intermittently skips the draining step when scaling down data nodes. According to the logs, the expected flow is: exclude a node from allocation, wait for it to drain, confirm the drain, then remove it. For some nodes, however, the operator skips the waiting step and removes the node immediately, which can disrupt the cluster.
To Reproduce
Steps to reproduce the behaviour:
1. Trigger a scale-down event for a data node group (a minimal example follows this list).
2. Monitor the operator logs for node exclusion, draining, and removal.
3. Observe that some nodes follow the expected exclusion → draining → removal sequence, while others are removed without waiting for a drain.
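A minimal sketch of step 1, assuming an OpenSearchCluster resource named my-cluster in namespace opensearch whose first node pool is the data group (all names, the pool index, and the replica count are illustrative, not taken from the affected cluster):

# Hypothetical names; adjust cluster, namespace, node pool index, and replica count.
# Shrinking the data pool by one replica triggers the SmartScaler scale-down path.
kubectl patch opensearchcluster my-cluster -n opensearch --type=json \
  -p '[{"op": "replace", "path": "/spec/nodePools/0/replicas", "value": 13}]'

# Then follow the operator logs for the exclusion / drain / removal messages
# (the operator deployment name may differ in your install).
kubectl logs -n opensearch deploy/opensearch-operator-controller-manager -f | grep 'Group: data'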
Expected behaviour
Every node removed during a scale-down should first be excluded from allocation and fully drained (no shards left on it), so that the cluster remains stable.
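For reference, a rough way to confirm from the OpenSearch side whether a node was actually drained before removal, assuming the cluster API is reachable (e.g. via a port-forward to localhost:9200; auth and TLS flags omitted):

# While draining, the operator excludes the node from shard allocation; the node
# should appear under the exclude setting (as I understand the drain logic).
curl -s 'localhost:9200/_cluster/settings?flat_settings=true' | grep 'exclude._name'

# A fully drained node should no longer hold any shards.
curl -s 'localhost:9200/_cat/shards?v' | grep 'opensearch-data-13'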
Operator Logs
{"level":"info","ts":"2025-02-20T12:38:29.164Z","msg":"Group: data, Excluded node: opensearch-data-14",...}
...
{"level":"info","ts":"2025-02-20T12:44:00.612Z","msg":"Group: data, Waiting for node opensearch-data-14 to drain",...}
...
{"level":"info","ts":"2025-02-20T12:49:28.491Z","msg":"Group: data, Node opensearch-data-14 is drained",...}
{"level":"info","ts":"2025-02-20T12:49:28.828Z","msg":"Group: data, Removed node opensearch-data-14",...}
{"level":"info","ts":"2025-02-20T12:49:29.120Z","msg":"Group: data, Removed node opensearch-data-13",...} <-- No drain step for data-13
{"level":"info","ts":"2025-02-20T12:49:44.805Z","msg":"Group: data, Excluded node: opensearch-data-12",...}
{"level":"info","ts":"2025-02-20T12:49:45.423Z","msg":"Group: data, Waiting for node opensearch-data-12 to drain",...}
Issue Breakdown
• opensearch-data-14 follows the expected sequence: Excluded → Drained → Removed
• opensearch-data-13 is removed with no drain step logged
• opensearch-data-12 resumes the expected sequence
Impact
• Potential risk of data loss or increased cluster instability
• Unexpected scaling behaviour causing uneven shard distribution
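As a follow-up check on the second point, shard and disk balance across the data nodes can be compared after the scale-down (again assuming local API access; this is a sanity check, not part of the operator):

# Shards and disk usage per node; shards from a node removed without draining are
# reallocated abruptly and can land unevenly on the remaining nodes.
curl -s 'localhost:9200/_cat/allocation?v&s=node'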
Environment
• OpenSearch Operator version: 2.6.0
• OpenSearch version: 2.15.0
full log: