chore(dataobj): Download pages in 16MB batches #16689

rfratto · 2025-03-11T18:26:25Z

Previously, each page in a call to ReadPages resulted in a separate request to storage. When data objects were backed by object storage, the roundtrip times accumulated and added a lot of latency.

This PR batches the pages in a call to ReadPages into 16MB windows (based on S3's recommendation of using 8MB or 16MB chunks; 16MB was chosen to further reduce roundtrips). Windows are currently downloaded sequentially, though this could be updated to use concurrency if desired.

The effectiveness of this code depends on reading multiple columns and pages at once; this only happens when using dataset.Reader from #16429.

Additionally:

- Column metadata now also supports batching, rather than downloading the metadata for one column at a time.
- Dataset wrappers have been updated to retain the batching specified by the caller.

A common download size for chunked data in S3 is 8MB or 16MB. When downloading a slice of pages, we find pages that fit together into an 8MB or 16MB "window" and download that entire window in a single request. This trades fewer roundtrips for downloading unneeded data: if only two pages are requested and they fit within a 16MB window, the majority of the data in that window could lie outside the range of both pages. This commit adds utilities for identifying windows. The windowing code is generic so that it can window any element in the file, including pages and column metadata.
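A minimal sketch of how such generic windowing could look, assuming each element can report its byte range in the file. The names `region`, `window`, and `windows` are hypothetical and chosen for illustration; the grouping rule here (sort by offset, then greedily extend a window while its total span stays within the limit) is one plausible strategy, not necessarily the one used in this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// region is a hypothetical byte range within a data object.
type region struct {
	Offset, Length int64
}

func (r region) end() int64 { return r.Offset + r.Length }

// window is a group of elements whose combined byte span [Start, End)
// can be fetched with one request.
type window[T any] struct {
	Elements   []T
	Start, End int64
}

// windows groups elems into windows spanning at most maxSize bytes each.
// regionOf reports where an element lives, which lets the same code handle
// pages, column metadata, or anything else with a byte range.
// An element larger than maxSize gets a window of its own.
func windows[T any](elems []T, regionOf func(T) region, maxSize int64) []window[T] {
	// Sort a copy so the caller's slice is left untouched.
	sorted := make([]T, len(elems))
	copy(sorted, elems)
	sort.Slice(sorted, func(i, j int) bool {
		return regionOf(sorted[i]).Offset < regionOf(sorted[j]).Offset
	})

	var out []window[T]
	for _, e := range sorted {
		r := regionOf(e)
		if n := len(out); n > 0 {
			last := &out[n-1]
			newEnd := last.End
			if r.end() > newEnd {
				newEnd = r.end()
			}
			// Extend the current window if the element still fits.
			if newEnd-last.Start <= maxSize {
				last.Elements = append(last.Elements, e)
				last.End = newEnd
				continue
			}
		}
		out = append(out, window[T]{Elements: []T{e}, Start: r.Offset, End: r.end()})
	}
	return out
}

func main() {
	const mb = 1 << 20
	pages := []region{
		{Offset: 0, Length: 1 * mb},
		{Offset: 10 * mb, Length: 2 * mb},
		{Offset: 30 * mb, Length: 1 * mb},
	}

	for _, w := range windows(pages, func(r region) region { return r }, 16*mb) {
		fmt.Printf("bytes [%d, %d): %d page(s)\n", w.Start, w.End, len(w.Elements))
	}
}
```

With these inputs the first two pages share a window even though 9MB of the 12MB span between them is not requested, which is the tradeoff described above.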

rfratto added 2 commits March 7, 2025 11:19

chore(dataobj): allow dataset wrappers to enable bulk downloads

f479430

rfratto requested a review from a team as a code owner March 11, 2025 18:26

pull-request-size bot added the size/XL label Mar 11, 2025

chore(dataobj): use windowing for downloading pages and column metadata

53fa52c

rfratto force-pushed the dataobj-parallel-downloads branch from 8c67b3b to 53fa52c Compare March 11, 2025 18:28
