[Feature Request] Optimize the api _cat/nodes #14746

Open
kkewwei opened this issue Jul 14, 2024 · 7 comments · May be fixed by #14853
Labels
Cluster Manager, enhancement (Enhancement or improvement to existing feature or request)

Comments

@kkewwei
Contributor

kkewwei commented Jul 14, 2024

Is your feature request related to a problem? Please describe

Currently the implementation looks like this:

    return channel -> client.admin().cluster().state(clusterStateRequest, new RestActionListener<>(channel) {
        ......
        nodesInfoRequest.timeout(request.param("timeout"));
        client.admin().cluster().nodesInfo(nodesInfoRequest, new RestActionListener<NodesInfoResponse>(channel) {
            ......
            // only runs after every node has responded to nodesInfo
            nodesStatsRequest.timeout(request.param("timeout"));
            client.admin().cluster().nodesStats(nodesStatsRequest, new RestResponseListener<NodesStatsResponse>(channel) {
                ......
            });
        });
    });

It seems to have two problems (a worst-case timeline follows the list):

  1. cluster().nodesInfo() and cluster().nodesStats() each use a separate timeout. If the client's timeout is 30s, then even without counting cluster().state(), the overall time can reach 60s, twice what the client expects.
  2. cluster().nodesStats() is only called once every node has returned a NodeInfoResponse to cluster().nodesInfo(). Slow nodes (for example, one stuck in a full GC) are normal in large clusters, so if some nodes are blocked in cluster().nodesInfo(), the whole API is blocked.
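
A rough worst-case timeline for the first problem, assuming the client's 30s timeout is applied to each phase independently:

    // Illustrative worst case with timeout=30s applied per phase:
    // t=0s   nodesInfo() dispatched; one node is stuck in a full GC
    // t=30s  nodesInfo() phase ends only after exhausting its full timeout
    // t=30s  nodesStats() dispatched with a fresh 30s timeout
    // t=60s  nodesStats() phase ends
    // => the client asked for 30s but waited roughly 60s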

Describe the solution you'd like

  1. If the timeout is 30s in _cat/nodes, the overall time should be around 30s.
  2. If some nodes are blocked, the remaining nodes should still return their metrics: cluster().nodesInfo() and cluster().nodesStats() are called separately for each node.

The code could look like this:

    long time1 = threadPool.relativeTimeInMillis();
    return channel -> client.admin().cluster().state(clusterStateRequest, new RestActionListener<>(channel) {
        ......
        long time2 = threadPool.relativeTimeInMillis();
        // remaining budget: the client's timeout minus the time already spent on cluster().state()
        nodesInfoRequest.timeout(timeout - (time2 - time1));
        for (String nodeId : nodeIds) {
            // fan out per node so one blocked node cannot stall the others
            nodesInfoRequest.nodesIds(nodeId);
            client.admin().cluster().nodesInfo(nodesInfoRequest, new RestActionListener<NodesInfoResponse>(channel) {
                ......
                long time3 = threadPool.relativeTimeInMillis();
                nodesStatsRequest.timeout(timeout - (time3 - time1));
                nodesStatsRequest.nodesIds(nodeId);
                client.admin().cluster().nodesStats(nodesStatsRequest, new RestResponseListener<NodesStatsResponse>(channel) {
                    ......
                });
            });
        }
        // after every per-node callback has completed, build and send the table
        channel.sendResponse(RestTable.buildResponse(buildTable(fullId, request, clusterStateResponse, nodesInfoResponse, nodesStatsResponse), channel));
    });
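
The pseudocode above glosses over how the per-node responses are joined before the table is built. One way to coordinate the fan-out is OpenSearch's GroupedActionListener; a minimal sketch, in which mergeAndSendTable, onGroupFailure, and remainingBudgetMillis are hypothetical placeholders rather than code from the linked PR:

    // Sketch only: fire one nodesStats request per node and render the table
    // once all of them have reported back.
    GroupedActionListener<NodesStatsResponse> grouped = new GroupedActionListener<>(
        ActionListener.wrap(
            responses -> mergeAndSendTable(channel, responses), // hypothetical merge-and-respond helper
            e -> onGroupFailure(channel, e)                     // hypothetical error path
        ),
        nodeIds.size() // number of per-node responses to wait for
    );
    for (String nodeId : nodeIds) {
        NodesStatsRequest perNode = new NodesStatsRequest(nodeId);
        perNode.timeout(TimeValue.timeValueMillis(remainingBudgetMillis)); // hypothetical remaining budget
        client.admin().cluster().nodesStats(perNode, grouped);
    }

Note that a single onFailure fails the whole group, so real code would wrap each per-node listener so that a failure becomes a placeholder row instead of aborting the table.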

Related component

Cluster Manager

Describe alternatives you've considered

No response

Additional context

No response

@kkewwei kkewwei added the enhancement and untriaged labels Jul 14, 2024
@kkewwei kkewwei linked a pull request Jul 22, 2024 that will close this issue
@shwetathareja
Member

cc: @aasom143, who was looking into holistic API cancellation across the different cat and cluster APIs

@rajiv-kv
Contributor

rajiv-kv commented Aug 1, 2024

[Triage - attendees 1 2 3 4] - @kkewwei Thanks for creating the issue.
Please check if this can leverage the cancellation framework (#13908).

@kkewwei
Contributor Author

kkewwei commented Aug 23, 2024

[Triage - attendees 1 2 3 4] - @kkewwei Thanks for creating the issue. Please check if this can leverage the cancellation framework (#13908).

@rajiv-kv The cancellation framework can be used here to solve the first problem, but it doesn't solve the second one.

I would like to solve both problems within the cancellation framework, cc @aasom143.

@aasom143
Contributor

aasom143 commented Sep 6, 2024

Hi @kkewwei, thanks for following up. With the new cancellation framework, we have added a new timeout (cancel_after_time_interval) that can be used to address the first problem. For the second issue, we already have a timeout which can be configured for each node's transport call. By setting this timeout, we can prevent our requests from being blocked by a faulty node. I hope this provides clarity on how to resolve the second problem.
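
For reference, that per-node transport timeout is the one set on the nodes-level request itself; a minimal sketch (the 30s value and the handle* callbacks are illustrative assumptions):

    // BaseNodesRequest#timeout bounds how long the coordinator waits for each
    // node's transport response; nodes that miss it surface as failures
    // instead of blocking the whole call.
    NodesStatsRequest statsRequest = new NodesStatsRequest();
    statsRequest.timeout(TimeValue.timeValueSeconds(30)); // example value
    client.admin().cluster().nodesStats(statsRequest, ActionListener.wrap(response -> {
        response.getNodes().forEach(stats -> handleStats(stats)); // nodes that answered in time
        response.failures().forEach(f -> handleFailure(f));       // timed-out or failed nodes land here
    }, e -> handleTransportFailure(e)));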

@aasom143
Contributor

aasom143 commented Sep 6, 2024

To address the first issue, could you please add cancellation support for the cat/nodes API? You can refer to the recent PR regarding cancellation support for the cat/shards API.
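
As a rough idea of the shape such support takes, here is a sketch modeled loosely on the cat/shards change; the parameter name, the default value, and the in-scope task, threadPool, and localNodeId are assumptions to verify against that PR:

    // Sketch only: schedule a task cancellation once cancel_after_time_interval
    // elapses, so a stuck downstream call cannot hold the REST request open forever.
    TimeValue cancelAfter = request.paramAsTime("cancel_after_time_interval", TimeValue.timeValueSeconds(30));
    threadPool.schedule(() -> {
        CancelTasksRequest cancelRequest = new CancelTasksRequest();
        cancelRequest.setTaskId(new TaskId(localNodeId, task.getId())); // the task serving this _cat/nodes call
        cancelRequest.setReason("_cat/nodes timed out");
        client.admin().cluster().cancelTasks(cancelRequest, ActionListener.wrap(r -> {}, e -> {}));
    }, cancelAfter, ThreadPool.Names.GENERIC);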

@kkewwei
Contributor Author

kkewwei commented Sep 11, 2024

To address the first issue, could you please add cancellation support for the cat/nodes API? You can refer to the recent PR regarding cancellation support for the cat/shards API.

Of course, thank you.

@kkewwei
Contributor Author

kkewwei commented Oct 19, 2024

@aasom143, it's ready now, please take a look when you are free: #14853
