You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
OpenSearchDatasource (OSD) is an implementation of the Datasource “interface” as defined by Ray’s data API. Ray comes with readers and writers for a number of data sources and sinks and provides examples for writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html). The latest implementation of OSD is checked into a branch:
Both of the implementations parallelize reads by creating a read task per slice or per parent doc (for document reconstruct). OSR does this by using a flat_map operator where each row corresponds to a slice (each row maps to all matching docs in a slice) and relies on Ray to spread the read tasks across available resources (workers).
OSD explicitly creates ReadTasks which are picked up by available workers. Each read task works on a slice or a parent doc.
When I compare the performance between OSR and OSD, we seem to be losing a lot of time converting between Pandas DataFrames and Sycamore’s DocSets. Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame.
The text was updated successfully, but these errors were encountered:
OpenSearchReader (OSR) is the current default Sycamore reader implementation for reading OpenSearch data into DocSets. It uses Point-in-Time (PIT) snapshots and “slices” to achieve parallelism when reading from OpenSearch (https://opensearch.org/docs/latest/search-plugins/searching-data/point-in-time/).
OpenSearchDatasource (OSD) is an implementation of the Datasource “interface” as defined by Ray’s data API. Ray comes with readers and writers for a number of data sources and sinks and provides examples for writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html). The latest implementation of OSD is checked into a branch:
https://github.com/aryn-ai/sycamore/blob/opensearch-datasource/lib/sycamore/sycamore/connectors/opensearch/opensearch_datasource.py
Both of the implementations parallelize reads by creating a read task per slice or per parent doc (for document reconstruct). OSR does this by using a flat_map operator where each row corresponds to a slice (each row maps to all matching docs in a slice) and relies on Ray to spread the read tasks across available resources (workers).
OSD explicitly creates ReadTasks which are picked up by available workers. Each read task works on a slice or a parent doc.
When I compare the performance between OSR and OSD, we seem to be losing a lot of time converting between Pandas DataFrames and Sycamore’s DocSets. Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame.
The text was updated successfully, but these errors were encountered: