[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

Open
austintlee opened this issue Jan 13, 2025 · 0 comments

OpenSearchReader (OSR) is the current default Sycamore reader implementation for reading OpenSearch data into DocSets. It uses Point-in-Time (PIT) snapshots and “slices” to achieve parallelism when reading from OpenSearch (https://opensearch.org/docs/latest/search-plugins/searching-data/point-in-time/).
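As a rough illustration of how a PIT and slices combine, the sketch below builds one search request body per slice, all bound to the same PIT. The helper name `build_slice_queries`, the `keep_alive` value, and the page size are assumptions for illustration and are not taken from OSR; the request shape follows the OpenSearch PIT and sliced search documentation.

```python
from typing import Any


def build_slice_queries(pit_id: str, num_slices: int, query: dict[str, Any],
                        page_size: int = 1000) -> list[dict[str, Any]]:
    """Return one search body per slice, all sharing one PIT (hypothetical helper)."""
    base = {
        "query": query,
        "size": page_size,
        # keep_alive of "1m" is an arbitrary illustrative value.
        "pit": {"id": pit_id, "keep_alive": "1m"},
    }
    if num_slices < 2:
        # A single reader needs no slice clause; it just searches the PIT.
        return [base]
    # Each body differs only in its slice id, so the bodies can be run in parallel.
    return [{**base, "slice": {"id": i, "max": num_slices}} for i in range(num_slices)]


bodies = build_slice_queries("example-pit-id", 4, {"match_all": {}})
```

Each body could then be handed to a separate worker, which pages through its own slice independently of the others.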

OpenSearchDatasource (OSD) is an implementation of the Datasource “interface” as defined by Ray’s data API. Ray comes with readers and writers for a number of data sources and sinks and provides examples for writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html). The latest implementation of OSD is checked into a branch:

https://github.com/aryn-ai/sycamore/blob/opensearch-datasource/lib/sycamore/sycamore/connectors/opensearch/opensearch_datasource.py

Both implementations parallelize reads by creating one read task per slice or per parent doc (for document reconstruction). OSR does this with a flat_map operator in which each row corresponds to a slice and is expanded into all matching docs in that slice; it relies on Ray to spread the read tasks across available resources (workers).
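The row-per-slice pattern can be sketched in plain Python as follows. This is not OSR's code: `FAKE_INDEX`, `read_slice`, and the local `flat_map` function are hypothetical stand-ins, and the sequential `flat_map` here only mimics the shape of a distributed one.

```python
from itertools import chain
from typing import Callable, Iterable

# Hypothetical in-memory stand-in for an index: slice id -> matching docs.
FAKE_INDEX = {
    0: [{"doc_id": "a"}, {"doc_id": "b"}],
    1: [{"doc_id": "c"}],
    2: [],
}


def read_slice(row: dict) -> list[dict]:
    """Expand one slice row into all docs in that slice (stand-in for a sliced search)."""
    return FAKE_INDEX[row["slice_id"]]


def flat_map(rows: Iterable[dict], fn: Callable[[dict], list[dict]]) -> list[dict]:
    """Sequential stand-in for a distributed flat_map: apply fn per row, then flatten."""
    return list(chain.from_iterable(fn(row) for row in rows))


# One input row per slice; the output has one row per matching doc.
slice_rows = [{"slice_id": i} for i in range(3)]
docs = flat_map(slice_rows, read_slice)
```

With Ray Data, the same shape would presumably look like `ray.data.from_items(slice_rows).flat_map(read_slice)`, with Ray deciding where each row's read runs.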

OSD explicitly creates ReadTasks which are picked up by available workers. Each read task works on a slice or a parent doc.
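For contrast, here is a minimal sketch of the explicit-task approach. It does not use Ray's actual `ReadTask` class; `SliceReadTask` and `make_read_tasks` are hypothetical names, the docs are fabricated placeholders, and a thread pool stands in for Ray workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class SliceReadTask:
    """Hypothetical stand-in for a read task: a zero-argument reader bound to one slice."""
    slice_id: int
    read_fn: Callable[[], list[dict]]


def make_read_tasks(num_slices: int) -> list[SliceReadTask]:
    """Create one task per slice up front, before any worker starts reading."""
    def make_fn(slice_id: int) -> Callable[[], list[dict]]:
        # Placeholder reader: a real one would run a sliced search against the PIT.
        return lambda: [{"slice_id": slice_id, "doc_id": f"doc-{slice_id}-{n}"}
                        for n in range(2)]
    return [SliceReadTask(i, make_fn(i)) for i in range(num_slices)]


tasks = make_read_tasks(3)
# Workers pick up tasks as they become free; map() returns results in task order.
with ThreadPoolExecutor(max_workers=2) as pool:
    blocks = list(pool.map(lambda task: task.read_fn(), tasks))
docs = [doc for block in blocks for doc in block]
```

The difference from the flat_map version is that the unit of work is an explicit object created ahead of time, rather than a row in a dataset that an operator expands.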

Comparing the performance of OSR and OSD, we seem to be losing a lot of time converting between Pandas DataFrames and Sycamore’s DocSets. Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame.
