[BUG] SmartScaler intermittently skipping draining of pods #961

Open
Omarimcblack opened this issue Feb 20, 2025 · 0 comments
Labels: bug (Something isn't working), untriaged (Issues that have not yet been triaged)

Describe the bug

The SmartScaler in the OpenSearch Kubernetes Operator intermittently skips the draining step when scaling down data nodes. According to the logs, the normal sequence is: exclude the node, wait for it to drain, confirm the drain, then remove it. For some nodes, however, the operator logs no exclusion or drain wait and removes the node directly, which can disrupt the cluster.

To Reproduce
Steps to reproduce the behaviour:
1. Trigger a scale-down event for a data node group.
2. Monitor the operator logs for node exclusion, draining, and removal.
3. Observe that some nodes follow the expected exclusion → draining → removal sequence, while others are removed without waiting for a drain.

Expected behaviour
Every node undergoing scale-down should be properly drained before removal, ensuring cluster stability.
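
One way to make "properly drained" concrete is to check that no shard copies are still located on the node before it is removed. The sketch below is not the operator's implementation. It assumes the output of `GET _cat/shards?format=json` has already been fetched and parsed into a list of dicts, and it only reads the `index`, `shard`, `prirep`, and `node` columns. The function names and the sample rows (index names, node `opensearch-data-2`) are made up for illustration.

```python
def shards_on_node(cat_shards, node_name):
    """Return (index, shard, prirep) for every shard copy located on node_name.

    cat_shards: parsed JSON rows from `GET _cat/shards?format=json` (assumed input).
    """
    remaining = []
    for row in cat_shards:
        # The node column is null for unassigned shards.
        node_field = row.get("node") or ""
        # Assumption: the first whitespace-separated token is the node that
        # currently holds the shard copy, even if extra relocation info follows.
        tokens = node_field.split()
        if tokens and tokens[0] == node_name:
            remaining.append((row["index"], row["shard"], row["prirep"]))
    return remaining


def is_drained(cat_shards, node_name):
    """A node counts as drained here when no shard copy is located on it."""
    return not shards_on_node(cat_shards, node_name)


# Illustrative rows only, not taken from the affected cluster.
SAMPLE_ROWS = [
    {"index": "logs-000001", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "opensearch-data-13"},
    {"index": "logs-000001", "shard": "0", "prirep": "r",
     "state": "STARTED", "node": "opensearch-data-2"},
    {"index": "logs-000002", "shard": "1", "prirep": "r",
     "state": "UNASSIGNED", "node": None},
]
```

With these sample rows, `is_drained(SAMPLE_ROWS, "opensearch-data-13")` is false because one primary is still on that node, so removing it at that point would take a shard copy offline.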

Operator Logs

{"level":"info","ts":"2025-02-20T12:38:29.164Z","msg":"Group: data, Excluded node: opensearch-data-14",...}
...
{"level":"info","ts":"2025-02-20T12:44:00.612Z","msg":"Group: data, Waiting for node opensearch-data-14 to drain",...}
...
{"level":"info","ts":"2025-02-20T12:49:28.491Z","msg":"Group: data, Node opensearch-data-14 is drained",...}
{"level":"info","ts":"2025-02-20T12:49:28.828Z","msg":"Group: data, Removed node opensearch-data-14",...}
{"level":"info","ts":"2025-02-20T12:49:29.120Z","msg":"Group: data, Removed node opensearch-data-13",...}  <-- No drain step for data-13
{"level":"info","ts":"2025-02-20T12:49:44.805Z","msg":"Group: data, Excluded node: opensearch-data-12",...}
{"level":"info","ts":"2025-02-20T12:49:45.423Z","msg":"Group: data, Waiting for node opensearch-data-12 to drain",...}

Issue Breakdown
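• opensearch-data-14 follows the correct process: Excluded → Waiting to drain → Drained → Removed
• opensearch-data-13 is removed at 12:49:29.120Z, about 0.3 s after opensearch-data-14, with no "Excluded", "Waiting ... to drain", or "is drained" message for it in the log lines shown
• opensearch-data-12 resumes the correct behaviour, starting with exclusion at 12:49:44.805Z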
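
The breakdown above can be checked mechanically. The following is a minimal sketch, assuming the operator log is available as one JSON object per line and that the message wording matches the lines quoted in this issue. The function name and the `SAMPLE` list are illustrative; the sample entries are the log lines from this issue trimmed to their `ts` and `msg` fields.

```python
import json
import re
from collections import defaultdict

# Message patterns copied from the operator log lines quoted in this issue.
PATTERNS = {
    "excluded": re.compile(r"Excluded node: (\S+)"),
    "waiting": re.compile(r"Waiting for node (\S+) to drain"),
    "drained": re.compile(r"Node (\S+) is drained"),
    "removed": re.compile(r"Removed node (\S+)"),
}


def find_undrained_removals(lines):
    """Return (ts, node) for each 'Removed' message with no earlier 'is drained'."""
    seen = defaultdict(set)  # node name -> events logged so far
    flagged = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip "..." separators and other non-JSON lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = entry.get("msg", "")
        for event, pattern in PATTERNS.items():
            match = pattern.search(msg)
            if not match:
                continue
            node = match.group(1)
            if event == "removed" and "drained" not in seen[node]:
                flagged.append((entry.get("ts"), node))
            seen[node].add(event)
            break
    return flagged


SAMPLE = [
    '{"ts":"2025-02-20T12:38:29.164Z","msg":"Group: data, Excluded node: opensearch-data-14"}',
    '...',
    '{"ts":"2025-02-20T12:44:00.612Z","msg":"Group: data, Waiting for node opensearch-data-14 to drain"}',
    '{"ts":"2025-02-20T12:49:28.491Z","msg":"Group: data, Node opensearch-data-14 is drained"}',
    '{"ts":"2025-02-20T12:49:28.828Z","msg":"Group: data, Removed node opensearch-data-14"}',
    '{"ts":"2025-02-20T12:49:29.120Z","msg":"Group: data, Removed node opensearch-data-13"}',
    '{"ts":"2025-02-20T12:49:44.805Z","msg":"Group: data, Excluded node: opensearch-data-12"}',
    '{"ts":"2025-02-20T12:49:45.423Z","msg":"Group: data, Waiting for node opensearch-data-12 to drain"}',
]

for ts, node in find_undrained_removals(SAMPLE):
    print(f"{ts} {node} removed without a prior 'is drained' message")
```

On the sample lines, only opensearch-data-13 is flagged. Because the quoted log has elided sections, a real run should use the complete operator log so that an exclusion or drain message hidden in an elided part is not missed.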

Impact
• Risk of data loss (if the removed node still holds shard copies that exist nowhere else) or cluster instability
• Unexpected scaling behaviour, which can leave shards unevenly distributed

Environment
• OpenSearch Operator version: 2.6.0
• OpenSearch version: 2.15.0

Full log:

{"level":"info","ts":"2025-02-20T12:38:29.164Z","msg":"Group: data, Excluded node: opensearch-data-14","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-resident-telem-logs-opensearch"},"namespace":"dedicated-resident-telem-logs-opensearch","name":"opensearch","reconcileID":"ba60913f-9fd0-4c0c-ad94-65775a1dde06"}
...
{"level":"info","ts":"2025-02-20T12:44:00.612Z","msg":"Group: data, Waiting for node opensearch-data-14 to drain","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-resident-telem-logs-opensearch"},"namespace":"dedicated-resident-telem-logs-opensearch","name":"opensearch","reconcileID":"04eef43a-fcbe-4f7e-b6a5-0077910c29ce"}
...
{"level":"info","ts":"2025-02-20T12:49:28.491Z","msg":"Group: data, Node opensearch-data-14 is drained","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"c9d0a63c-addd-4fc6-ba1e-9e947147aee2"}
{"level":"info","ts":"2025-02-20T12:49:28.502Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"0f523f48-591a-44dc-a616-2392fc3f57d7","cluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"}}
{"level":"info","ts":"2025-02-20T12:49:28.745Z","msg":"Group: data, Node opensearch-data-14 is drained","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"0f523f48-591a-44dc-a616-2392fc3f57d7"}
{"level":"info","ts":"2025-02-20T12:49:28.756Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"7651335d-1f89-4cb2-b9e5-a353b3709545","cluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"}}
{"level":"info","ts":"2025-02-20T12:49:28.828Z","logger":"KubeAPIWarningLogger","msg":"would violate PodSecurity \"restricted:latest\": privileged (container \"init-sysctl\" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (containers \"init\", \"init-sysctl\", \"opensearch\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"init\", \"init-sysctl\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"init\", \"init-sysctl\" must set securityContext.runAsNonRoot=true), runAsUser=0 (container \"init\" must not set runAsUser=0), seccompProfile (pod or containers \"init\", \"init-sysctl\", \"opensearch\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"}
{"level":"info","ts":"2025-02-20T12:49:28.828Z","msg":"Group: data, Removed node opensearch-data-14","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"7651335d-1f89-4cb2-b9e5-a353b3709545"}
{"level":"info","ts":"2025-02-20T12:49:28.966Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"a1c16fc4-a62b-4b9f-9527-4e5c7ae791b9","cluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"}}
{"level":"info","ts":"2025-02-20T12:49:29.120Z","msg":"Group: data, Removed node opensearch-data-13","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"a1c16fc4-a62b-4b9f-9527-4e5c7ae791b9"}
{"level":"info","ts":"2025-02-20T12:49:29.259Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"5c25b6bb-d333-4c4a-9a9c-e2ca10f1435e","cluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"}}
{"level":"info","ts":"2025-02-20T12:49:44.805Z","msg":"Group: data, Excluded node: opensearch-data-12","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"5c25b6bb-d333-4c4a-9a9c-e2ca10f1435e"}
{"level":"info","ts":"2025-02-20T12:49:44.872Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"170ac32b-d312-49ec-a630-55074602f047","cluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"}}
{"level":"info","ts":"2025-02-20T12:49:45.423Z","msg":"Group: data, Waiting for node opensearch-data-12 to drain","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch","namespace":"dedicated-logs-opensearch"},"namespace":"dedicated-logs-opensearch","name":"opensearch","reconcileID":"170ac32b-d312-49ec-a630-55074602f047"}
Omarimcblack added the bug and untriaged labels on Feb 20, 2025