chore(dataobj): Download pages in 16MB batches #16689

Open: rfratto wants to merge 3 commits into main


@rfratto (Member) commented Mar 11, 2025

Previously, each page in a call to ReadPages resulted in its own request to storage. When data objects were backed by object storage, the roundtrip time of every request accumulated and added a lot of latency.

This PR batches the pages in a call to ReadPages into 16MB windows (based on S3's recommendation of using 8MB or 16MB chunks; 16MB was chosen to further reduce roundtrips). Windows are currently downloaded sequentially, though this could be updated to download them concurrently if desired.

The effectiveness of this code depends on reading multiple columns and pages at once; this only happens when using dataset.Reader from #16429.

Additionally:

  • Column metadata now also supports batching, rather than downloading the metadata for one column at a time.
  • Dataset wrappers have been updated to retain the batching specified by the caller.

rfratto added 2 commits March 7, 2025 11:19
A common download size for chunked data in S3 is 8MB or 16MB. When
downloading a slice of pages, we find pages that align into an 8MB or
16MB "window" and download that entire set of pages in a single request.

This trades fewer roundtrips for downloading unneeded data: if only
two pages are requested and both fit within a single 8MB or 16MB
window, most of the data in that window could lie outside the range of
both pages.

This commit adds utilities for identifying windows. The windowing code
is made generic to permit windowing any arbitrary element in the file,
including pages and column metadata.
@rfratto rfratto requested a review from a team as a code owner March 11, 2025 18:26
@rfratto rfratto force-pushed the dataobj-parallel-downloads branch from 8c67b3b to 53fa52c Compare March 11, 2025 18:28