[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

austintlee · 2025-01-13T17:21:18Z

OpenSearchReader (OSR) is the current default Sycamore reader implementation for reading OpenSearch data into DocSets. It uses Point-in-Time (PIT) snapshots and “slices” to achieve parallelism when reading from OpenSearch (https://opensearch.org/docs/latest/search-plugins/searching-data/point-in-time/).

OpenSearchDatasource (OSD) is an implementation of the Datasource “interface” as defined by Ray’s data API. Ray comes with readers and writers for a number of data sources and sinks and provides examples for writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html). The latest implementation of OSD is checked into a branch:

https://github.com/aryn-ai/sycamore/blob/opensearch-datasource/lib/sycamore/sycamore/connectors/opensearch/opensearch_datasource.py

Both of the implementations parallelize reads by creating a read task per slice or per parent doc (for document reconstruct). OSR does this by using a flat_map operator where each row corresponds to a slice (each row maps to all matching docs in a slice) and relies on Ray to spread the read tasks across available resources (workers).

OSD explicitly creates ReadTasks which are picked up by available workers. Each read task works on a slice or a parent doc.

When I compare the performance between OSR and OSD, we seem to be losing a lot of time converting between Pandas DataFrames and Sycamore’s DocSets. Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

austintlee commented Jan 13, 2025

[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

Comments

austintlee commented Jan 13, 2025