Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize NodeIndicesStats output behind flag #14454

Conversation

Pranshu-S
Copy link
Contributor

@Pranshu-S Pranshu-S commented Jun 19, 2024

Description:

As of today the APIs which fetch Node Stats with indices stats currently returns all shard statistics regardless of the requested level (node, level, or indices) to the coordinator node (node where-in or from which the rest request was created). This unnecessary iteration on the coordinator node leads to overhead and potential health issues. Furthermore, the coordinator node repeats index-level statistics calculation for each node response even when the requested level is indices.

This pull request proposes optimizations to the NodeStats API to pre-compute and return only the necessary information to generate the response. The APIs would now perform pre-computation of shard or index-level statistics on remote nodes (individual nodes which respond to the coordinator node) based on the level parameter in the REST request - nodes, indices or shards. Only for level = shards does the response contains the shard level stats.

The optimisation hides behind a flag transport level flag optimizeNodeIndicesStatsOnLevel part of CommonStatsFlag. This is enabled in RestNodeStatsAction as well as RestNodeAction API calls as we do not require it during these code paths.

Pre-computation on remote nodes minimizes data transfer and processing on the coordinator node, leading to significant performance gains, especially in large clusters. Reduced load on the coordinator node contributes to a healthier state by minimizing GC pauses and CPU spikes.

Testing:

Extensive testing with a 20k shards cluster has validated the effectiveness of these optimizations.

Stats Without Optimisation With Optimisation
Average 0.857280048 0.4961762
P90 1.1514673 0.7631599
Max 1.78542 1.423496
Min 0.444184 0.212655

Related Issues

Resolves #13340

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❌ Gradle check result for 59a0f5d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Pranshu-S Pranshu-S force-pushed the NodeStatsOptimisation-ClusterStats-NodeIndicesStats3 branch from 59a0f5d to f6d3b58 Compare July 1, 2024 08:39
Copy link
Contributor

github-actions bot commented Jul 1, 2024

❌ Gradle check result for f6d3b58: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for b741868: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Pranshu-S Pranshu-S force-pushed the NodeStatsOptimisation-ClusterStats-NodeIndicesStats3 branch from b741868 to a267aef Compare July 10, 2024 10:15
Copy link
Contributor

❌ Gradle check result for a267aef: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 517baa5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…-ClusterStats-NodeIndicesStats3

Signed-off-by: Pranshu Shukla <[email protected]>
@Pranshu-S Pranshu-S force-pushed the NodeStatsOptimisation-ClusterStats-NodeIndicesStats3 branch from 517baa5 to e009a84 Compare July 11, 2024 11:01
Copy link
Contributor

❌ Gradle check result for e009a84: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 857e7e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <[email protected]>
Copy link
Contributor

❌ Gradle check result for 9876c13: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <[email protected]>
Copy link
Contributor

❌ Gradle check result for 61bf2e0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 73de188: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Pranshu-S
Copy link
Contributor Author

[Test Result](https://build.ci.opensearch.org/job/gradle-check/42419/testReport/) (3 failures / -5)
[org.opensearch.discovery.ClusterDisruptionIT.classMethod](https://build.ci.opensearch.org/job/gradle-check/42419/testReport/junit/org.opensearch.discovery/ClusterDisruptionIT/classMethod/)
[org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate](https://build.ci.opensearch.org/job/gradle-check/42419/testReport/junit/org.opensearch.index/ShardIndexingPressureSettingsIT/testShardIndexingPressureLastSuccessfulSettingsUpdate/)
[org.opensearch.indices.replication.SegmentReplicationIT.testReplicaAlreadyAtCheckpoint](https://build.ci.opensearch.org/job/gradle-check/42419/testReport/junit/org.opensearch.indices.replication/SegmentReplicationIT/testReplicaAlreadyAtCheckpoint/)

For org.opensearch.discovery.ClusterDisruptionIT.classMethod

java.lang.AssertionError: SpanData validation failed for validator org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly
MockSpanData{spanID='aa1534630900e459', parentSpanID='f8c436a5e84c915f', traceID='d7fd33dd4c374128', spanName='dispatchedShardOperationOnPrimary', startEpochNanos=2792950755631, endEpochNanos=0, hasEnded=false, attributes={index=restart_while_indexing, refresh_policy=false, thread.name=opensearch[node_t1][write][T#1], bulk_request_items=1, shard_id=0, node_id=uRqO2n91TTCp5ocLhE5MQA}, stackTrace=null}

Similar to failures in Flaky Test Issue - #14308


For org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate

java.lang.AssertionError: expected:<0> but was:<2>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.opensearch.index.ShardIndexingPressureSettingsIT.lambda$waitForTwoOutstandingRequests$13(ShardIndexingPressureSettingsIT.java:939)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1128)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1101)
	at org.opensearch.index.ShardIndexingPressureSettingsIT.waitForTwoOutstandingRequests(ShardIndexingPressureSettingsIT.java:938)
	at 

Similar failures in Flaky Test Issue - #14331


For org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate
Flaky Test - #14328

Signed-off-by: Pranshu Shukla <[email protected]>
Signed-off-by: Pranshu Shukla <[email protected]>
Copy link
Contributor

✅ Gradle check result for fa2373f: SUCCESS

Copy link

codecov bot commented Jul 15, 2024

Codecov Report

Attention: Patch coverage is 81.63265% with 18 lines in your changes missing coverage. Please review.

Project coverage is 71.83%. Comparing base (acee2ae) to head (40e0fc9).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
.../java/org/opensearch/indices/NodeIndicesStats.java 85.71% 7 Missing and 5 partials ⚠️
...in/java/org/opensearch/indices/IndicesService.java 25.00% 2 Missing and 1 partial ⚠️
...h/action/admin/indices/stats/CommonStatsFlags.java 75.00% 0 Missing and 2 partials ⚠️
...rg/opensearch/rest/action/cat/RestNodesAction.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14454      +/-   ##
============================================
- Coverage     71.87%   71.83%   -0.04%     
- Complexity    63318    63415      +97     
============================================
  Files          5231     5244      +13     
  Lines        296521   296864     +343     
  Branches      42832    42868      +36     
============================================
+ Hits         213113   213250     +137     
- Misses        65948    66097     +149     
- Partials      17460    17517      +57     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Pranshu Shukla <[email protected]>
Copy link
Contributor

✅ Gradle check result for 15f890d: SUCCESS

Copy link
Contributor

❌ Gradle check result for 2f4187f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 40e0fc9: SUCCESS

@shwetathareja shwetathareja merged commit e146f13 into opensearch-project:main Aug 29, 2024
34 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 29, 2024
* Optimize NodeIndicesStats output behind flag

Signed-off-by: Pranshu Shukla <[email protected]>
(cherry picked from commit e146f13)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Pranshu-S pushed a commit to Pranshu-S/OpenSearch that referenced this pull request Sep 4, 2024
* Optimize NodeIndicesStats output behind flag

Signed-off-by: Pranshu Shukla <[email protected]>
(cherry picked from commit e146f13)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shwetathareja pushed a commit that referenced this pull request Sep 4, 2024
* Optimize NodeIndicesStats output behind flag (#14454)

Signed-off-by: Pranshu Shukla <[email protected]>
(cherry picked from commit e146f13)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shwetathareja pushed a commit that referenced this pull request Sep 5, 2024
* Optimize NodeIndicesStats output behind flag (#14454)

Signed-off-by: Pranshu Shukla <[email protected]>
(cherry picked from commit e146f13)
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024
* Optimize NodeIndicesStats output behind flag

Signed-off-by: Pranshu Shukla <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch bug Something isn't working Cluster Manager Other Stats v2.16.0 Issues and PRs related to version 2.16.0
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[BUG] Level aggregations for node stats API should be processed on data nodes
6 participants