Helix ZkClient API to Support Getting a Large Number of Children Using Pagination

Introduction

This wiki outlines the paginated getChildren() feature in ZooKeeper and Helix ZkClient. If a znode has too many children and the response packet exceeds the ZooKeeper client's jute.maxbuffer limit, the getChildren() call fails and disrupts the normal business logic of upstream Helix applications. The paginated getChildren() fetches the child znodes in pages, each of which stays under the client's jute.maxbuffer limit.

Problem Statement

ZK limits the size of a message sent between client and server; on the client side this limit is the jute.maxbuffer config. If a znode has so many children that the getChildren() response exceeds the limit, the zk connection is dropped and an IOException is thrown to the client. zkClient then sees a connection loss and keeps retrying the connection, which could potentially take down ZK servers. So we would like to allow listing a number of children that would normally exceed the ZkClient's jute.maxbuffer.

Feature Summary

Here is a summary of the pagination feature:

No code change is required for helix users to adopt pagination, because helix ZkClient has integrated the existing API getChildren(path) with the native zookeeper pagination API.
Paginated getChildren() requires ZK server to support pagination. Open source ZK does not support pagination yet.
Helix ZkClient is compatible with both open source apache zookeeper and the linkedin fork of zookeeper: linkedin/zookeeper. It automatically detects whether the pagination feature is supported; if not, it falls back to the regular non-paginated getChildren.
If the full list of children exceeds jute.maxbuffer, the zookeeper server returns one page of children just under jute.maxbuffer, and Helix ZkClient automatically sends another call to the zookeeper server to get the next page, until all children are fetched.
Helix user makes only one call to helix ZkClient to get all children, just like the original behavior.
The client's jute.maxbuffer must be equal to or slightly greater than the server's (apache zookeeper also strongly recommends setting the same value on both sides). The ZK server uses its own jute.maxbuffer to determine the maximum number of children that can be returned in one page. If the client's jute.maxbuffer is smaller than the server's, a page that exceeds the client's jute.maxbuffer cannot be returned to the client.
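The page-by-page fetch described in this summary can be sketched as a simple loop. This is only an illustration: PagedChildrenSketch, Page, PageFetcher, and the integer cursor are hypothetical names that do not come from Helix or ZooKeeper, and pageSize stands in for "as many children as fit under jute.maxbuffer".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a client-side paging loop; not the Helix or ZooKeeper API. */
final class PagedChildrenSketch {

  /** One page of child names plus the cursor for the next page (null when done). */
  static final class Page {
    final List<String> children;
    final Integer nextCursor;

    Page(List<String> children, Integer nextCursor) {
      this.children = children;
      this.nextCursor = nextCursor;
    }
  }

  /** Stand-in for the server call that returns one page under the buffer limit. */
  interface PageFetcher {
    Page fetch(String path, Integer cursor);
  }

  /** Requests pages until the fetcher reports that none remain. */
  static List<String> getAllChildren(PageFetcher fetcher, String path) {
    List<String> all = new ArrayList<>();
    Integer cursor = null; // null means "start from the first child"
    do {
      Page page = fetcher.fetch(path, cursor);
      all.addAll(page.children);
      cursor = page.nextCursor;
    } while (cursor != null);
    return all;
  }

  /** In-memory fetcher for demonstration; pageSize must be positive. */
  static PageFetcher inMemory(List<String> stored, int pageSize) {
    return (path, cursor) -> {
      int start = cursor == null ? 0 : cursor;
      int end = Math.min(start + pageSize, stored.size());
      return new Page(stored.subList(start, end), end < stored.size() ? end : null);
    };
  }

  public static void main(String[] args) {
    List<String> stored = Arrays.asList("a", "b", "c", "d", "e");
    // The caller makes one call; the loop issues three page requests internally.
    System.out.println(getAllChildren(inMemory(stored, 2), "/cluster/instances"));
  }
}
```

The caller sees a single getAllChildren call, which mirrors the "one call to get all children" behavior above; how the real server encodes its cursor is not shown here.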

High-Level Design

Helix ZkClient API

Creating a new getChildren API specifically for pagination would require applications to change their code to adopt it, and it would be unclear when to use the existing getChildren versus the new paginated one. So instead of creating a new getChildren API, we keep the existing getChildren to make it simple: public List<String> getChildren(String path)

Internally, it uses Java reflection to check whether the native zk paginated API is available, and from that determines whether pagination is enabled. This strategy keeps the API compatible with both the open source ZK lib and the linkedin forked ZK lib.

The behavior is as follows:

If the native paginated getChildren() is available, this API will use pagination to get children.
- If all children can be returned in a single page, the behavior is the same as the existing API: it returns the list of children from the zk server. The result is strongly consistent with the zk server's children list at the time the server responds.
- If the children must be returned in multiple pages, the final list of children is ordered by each child's czxid (creation zxid), and it may contain children that were deleted or added between the first and the last pagination calls, because of concurrent read/write calls.
If the open source apache zookeeper lib is used (the paginated getChildren is unavailable), the API disables pagination and keeps the original behavior by using the native non-paginated getChildren: it returns the list of children if they fit within the jute.maxbuffer limit, or throws a KeeperException.MarshallingErrorException if there are too many children to fit within the limit.

Data Consistency Model

The original patch is not designed for strong data consistency requirements. If there are any children deleted between the first and the last page, the deleted children may be included in the final result. This causes data inconsistency between the client and the server.

In our use cases, after getting the children we still need to call getData for each child. So even if deleted children appear in the returned list, it is still fine: the current code logic already assumes this situation and is able to handle it. The existing implementation therefore already satisfies our requirements.
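A minimal sketch of this "list, then read each child" pattern, in which a child that was deleted after being listed is simply skipped. StaleChildSketch, DataReader, and fromMap are hypothetical names for illustration; with a real ZooKeeper client the missing-node case would surface as KeeperException.NoNodeException from getData rather than an empty Optional.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: tolerate children that vanish between listing and reading. */
final class StaleChildSketch {

  /** Stand-in for a getData call; an empty result means the znode no longer exists. */
  interface DataReader {
    Optional<String> read(String path);
  }

  /** Reads each listed child and keeps only those that still exist. */
  static Map<String, String> readExisting(
      String parent, List<String> children, DataReader reader) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String child : children) {
      reader.read(parent + "/" + child).ifPresent(data -> result.put(child, data));
    }
    return result;
  }

  /** In-memory reader backed by a map of full path to data. */
  static DataReader fromMap(Map<String, String> store) {
    return path -> Optional.ofNullable(store.get(path));
  }

  public static void main(String[] args) {
    Map<String, String> store = new HashMap<>();
    store.put("/parent/a", "data-a");
    store.put("/parent/c", "data-c");
    // "b" was returned in an earlier page but deleted before it could be read.
    List<String> listed = Arrays.asList("a", "b", "c");
    System.out.println(readExisting("/parent", listed, fromMap(store)).keySet());
  }
}
```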

The data consistency that Helix ZkClient getChildren API will provide is:

If the children fit within the packet length limit, all children are fetched and returned in one response, strongly consistent with the zk server.
If the children are too many to fit in one page and are returned in multiple pages, the final result is not strongly consistent with the zk server: as noted above, it might contain children that were returned in an earlier page and then deleted on the zk server before the last page was returned.

Pros:

Makes the logic and design much less complicated.
With existing implementation, less engineering work is required. So our users can use the pagination feature sooner.

Cons:

If a future use case requires strong data consistency, this model is not sufficient, because the returned result might include deleted children. We would need to come back and re-design it to satisfy the strong data consistency requirement.

Compatibility

Backward Compatibility

Situation: Helix bumps up zookeeper and is built with zookeeper that has the paginated getChildren. But at runtime, another zookeeper version is loaded and it does not have the paginated getChildren. In this case, NoSuchMethodException would be thrown.

Expected: Helix should continue to work even if the zk lib doesn’t have the pagination method. getChildren should still use the non-paginated method and return the children.

In Helix ZkConnection, use Java reflection to determine whether the runtime zookeeper version supports pagination or not, and implement the logic in ZkConnection#getChildren:

For the first call of getChildren, check if the ZooKeeper class has the paginated method.
If yes, use pagination; otherwise, use the original non-paginated method
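The reflection check can be sketched as follows. This is an assumption-laden illustration: OldClient and NewClient are stand-ins for two versions of the zookeeper client class, and the (String, int) overload is a hypothetical signature, not the actual paginated method in the fork.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: detect an optional method once via reflection and cache it. */
final class PaginationSupportSketch {

  /** Stand-in for a client library without the paginated method. */
  static class OldClient {
    public List<String> getChildren(String path) {
      return Collections.emptyList();
    }
  }

  /** Stand-in for a client library with a (hypothetical) paginated overload. */
  static class NewClient extends OldClient {
    public List<String> getChildren(String path, int maxReturned) {
      return Collections.emptyList();
    }
  }

  private final Class<?> clientClass;
  private Boolean supported; // null until the first check

  PaginationSupportSketch(Class<?> clientClass) {
    this.clientClass = clientClass;
  }

  /** Returns true if the class exposes a public method with this name and signature. */
  static boolean hasMethod(Class<?> type, String name, Class<?>... parameterTypes) {
    try {
      type.getMethod(name, parameterTypes);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  /** Performs the reflective lookup on the first call only, then reuses the result. */
  synchronized boolean isPaginationSupported() {
    if (supported == null) {
      supported = hasMethod(clientClass, "getChildren", String.class, int.class);
    }
    return supported;
  }

  public static void main(String[] args) {
    System.out.println(new PaginationSupportSketch(NewClient.class).isPaginationSupported());
    System.out.println(new PaginationSupportSketch(OldClient.class).isPaginationSupported());
  }
}
```

Because the lookup goes through Class.getMethod, a missing method is reported as a caught NoSuchMethodException instead of failing at the call site, which is what lets the non-paginated path remain usable.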

Forward Compatibility

Situation: zookeeper client supports pagination, while zookeeper server does not.

Expected: helix zkClient should use the non-paginated getChildren api.

If paginated getChildren is called in helix zkClient, but the zk server doesn’t support pagination, UnimplementedException is thrown in zookeeper client. To ensure forward compatibility, Helix ZkClient can implement the following strategy:

For the first getChildren call, if the zookeeper dependency supports pagination, the paginated getChildren is called.
If UnimplementedException is thrown, the ZK server does not support pagination, and Helix ZkClient automatically falls back to the non-paginated getChildren.
Cache whether the paginated API is supported, so that later calls always use the same getChildren API.
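The three steps above can be sketched as a try, fall back, and remember pattern. The names here (FallbackSketch, Client, OldServerClient) are hypothetical, and the nested UnimplementedException is a local stand-in for ZooKeeper's KeeperException.UnimplementedException so that the example compiles without the zookeeper jar.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: try the paginated call once, fall back, and cache the outcome. */
final class FallbackSketch {

  /** Local stand-in for KeeperException.UnimplementedException. */
  static final class UnimplementedException extends Exception {
    private static final long serialVersionUID = 1L;
  }

  /** Stand-in for the underlying zookeeper client. */
  interface Client {
    List<String> getChildrenPaginated(String path) throws UnimplementedException;

    List<String> getChildren(String path);
  }

  /** Simulates a server without pagination and counts paginated attempts. */
  static final class OldServerClient implements Client {
    int paginatedAttempts;

    @Override
    public List<String> getChildrenPaginated(String path) throws UnimplementedException {
      paginatedAttempts++;
      throw new UnimplementedException();
    }

    @Override
    public List<String> getChildren(String path) {
      return Arrays.asList("a", "b");
    }
  }

  private final Client client;
  private volatile boolean usePagination;

  FallbackSketch(Client client, boolean clientLibSupportsPagination) {
    this.client = client;
    this.usePagination = clientLibSupportsPagination;
  }

  List<String> getChildren(String path) {
    if (usePagination) {
      try {
        return client.getChildrenPaginated(path);
      } catch (UnimplementedException e) {
        usePagination = false; // remember that the server cannot paginate
      }
    }
    return client.getChildren(path);
  }

  public static void main(String[] args) {
    OldServerClient client = new OldServerClient();
    FallbackSketch zk = new FallbackSketch(client, true);
    System.out.println(zk.getChildren("/p"));
    System.out.println(zk.getChildren("/p"));
    System.out.println("paginated attempts: " + client.paginatedAttempts);
  }
}
```

Caching the outcome means only the first call pays for the failed paginated attempt; every later call goes straight to the non-paginated method.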

Mechanism to Enable Pagination

To enable this feature, both the application's ZK server and client need to upgrade to a zookeeper version that supports the pagination API. The open source apache zookeeper does not yet support pagination; it is supported in this fork of zookeeper: linkedin/zookeeper. To adopt the pagination feature, the user's zookeeper server needs to be deployed from a custom build.

On the application side, Helix ZkClient also needs the custom ZK build. The custom ZK lib needs to be a dependency in the application. In Helix ZkClient, there is a system property config to disable leveraging zookeeper’s paginated getChildren.

Java system property: zk.getChildren.pagination.disabled:

Java system property on the client side. Default value is false.
When false, Helix ZkClient leverages the paginated getChildren API only if both the zk client lib and the server support pagination.
If set to true, Helix ZkClient always calls the non-paginated ZK API, regardless of whether the zk client and server support pagination.
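A minimal sketch of how such a switch could combine with the two support checks. Only the property name comes from this page; PaginationSwitchSketch and isPaginationEnabled are hypothetical, and reading the property with Boolean.getBoolean is an assumption about the implementation.

```java
/** Hypothetical sketch: combine the disable switch with client and server support. */
final class PaginationSwitchSketch {

  static final String DISABLE_PROPERTY = "zk.getChildren.pagination.disabled";

  /**
   * Pagination is used only when the switch is off and both sides support it.
   * Boolean.getBoolean returns false when the property is unset, matching the default.
   */
  static boolean isPaginationEnabled(boolean clientSupports, boolean serverSupports) {
    boolean disabled = Boolean.getBoolean(DISABLE_PROPERTY);
    return !disabled && clientSupports && serverSupports;
  }

  public static void main(String[] args) {
    System.out.println(isPaginationEnabled(true, true));
    System.setProperty(DISABLE_PROPERTY, "true");
    System.out.println(isPaginationEnabled(true, true));
  }
}
```

On the command line, the switch would be passed as a JVM argument such as -Dzk.getChildren.pagination.disabled=true.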

Below is a table to show if the paginated getChildren is enabled in Helix ZkClient, based on whether or not pagination is supported in helix ZK client dependency and ZK server. If and only if both helix ZK client dependency and ZK server support pagination, the paginated getChildren is enabled and used in Helix ZkClient.

(Yes - pagination is supported; No - pagination is not supported)