[Feature Request] One-to-many and many-to-one Ingest Processors #16029
Labels
enhancement
Enhancement or improvement to existing feature or request
Indexing
Indexing, Bulk Indexing and anything related to indexing
ingest-pipeline
Is your feature request related to a problem? Please describe
A common use case in semantic search is chunking large (unstructured) documents: each document is broken into smaller chunks, and the chunks are converted to (dense) vectors before being stored in a vector database. In OpenSearch this is done with a sequence of two processors: the Text Chunking processor followed by the Text Embedding processor. The problem with these ingest processors is that the resulting embeddings are indexed as nested fields. In my opinion, having to always qualify my queries as "nested" makes for a poor user experience, and there seem to be a lot of limitations when it comes to queries on nested fields. Just take a look at all the open issues in neural-search related to nested fields.
Describe the solution you'd like
A simple (?) solution is to introduce a processor that takes an IngestDocument as input and produces a List of IngestDocuments as output. So, if the text chunking processor produces 100 chunks from a document, then instead of the parent document having 100 nested fields, I would get 100 documents. Continuing with this example, I can then use a batch processor for text embedding, which takes those 100 documents as input and produces 100 documents (with vectors) as output.
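To make the proposed flow concrete, here is a rough sketch in Python (not OpenSearch's Java), with plain dicts standing in for IngestDocument. Everything named here (`chunk_to_documents`, `embed_batch`, `parent_id`, `chunk_index`, the fixed-size word chunking, and the stub embedder) is hypothetical and only illustrates the one-to-many shape; none of it is an existing OpenSearch API.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]  # stand-in for an IngestDocument (assumption)


def chunk_to_documents(doc: Doc, field: str, chunk_size: int) -> List[Doc]:
    """One-to-many step: emit one child document per chunk of `field`."""
    words = doc[field].split()
    children = []
    for start in range(0, len(words), chunk_size):
        children.append({
            "parent_id": doc["_id"],            # link back to the parent
            "chunk_index": start // chunk_size,
            "chunk": " ".join(words[start:start + chunk_size]),
        })
    return children


def embed_batch(
    docs: List[Doc],
    embed: Callable[[List[str]], List[List[float]]],
) -> List[Doc]:
    """Batch step: N documents in, N documents out, each with a vector."""
    vectors = embed([d["chunk"] for d in docs])
    return [{**d, "embedding": v} for d, v in zip(docs, vectors)]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Stub embedder (assumption): a 1-dimensional "vector" per text.
    return [[float(len(t))] for t in texts]


parent = {"_id": "doc-1", "body": "one two three four five"}
children = embed_batch(chunk_to_documents(parent, "body", 2), fake_embed)
```

Each child is a flat, top-level document carrying its own vector, so queries against it would not need the "nested" qualifier.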
I don't have a use case for many-to-one ("reduce") processors, but I thought I would mention them for symmetry. In theory, you can just use a batch processor for this (a List of many in, a List of one out).
Related component
Indexing
Describe alternatives you've considered
Alternatively, we could introduce an operation that turns each nested field into a full document. Since nested fields are already treated internally as documents, maybe this would be relatively easy to do? This could be a feature that works at the index level, where you configure OpenSearch to expand nested fields into explicit documents at ingestion/indexing time, or a separate API that does the same at the document level, or both.
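As an illustration of what "expanding" could mean, here is a small Python sketch over plain dicts. The function name `expand_nested`, the `parent_id` field, and the `<parent>#<index>` id scheme are all made up for this example; this is not how OpenSearch stores or exposes nested documents.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]  # stand-in for an indexed document (assumption)


def expand_nested(doc: Doc, nested_field: str) -> List[Doc]:
    """Return the parent without `nested_field`, followed by one
    top-level document per nested entry."""
    parent = {k: v for k, v in doc.items() if k != nested_field}
    children = [
        {
            **entry,
            "_id": f"{doc['_id']}#{i}",   # hypothetical child id scheme
            "parent_id": doc["_id"],      # link back to the parent
        }
        for i, entry in enumerate(doc.get(nested_field, []))
    ]
    return [parent] + children


doc = {
    "_id": "doc-1",
    "title": "T",
    "chunks": [{"text": "a"}, {"text": "b"}],
}
expanded = expand_nested(doc, "chunks")
```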
Additional context
One issue with the one-to-many processor is that I still want to index the parent document. I would need something like a conditional pipeline that is able to process the parent document in one flow and all the child chunks in another.
I can probably get by without conditional pipelines, because the parent document won't have the input field that the text embedding processor looks for (it'll be a no-op).
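A sketch of the no-op behaviour I have in mind, again in Python with dicts as documents. The helper `embed_if_present` and the field names are hypothetical, and whether the real text embedding processor skips documents that lack the input field is my expectation here, not something this sketch verifies.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]  # stand-in for an IngestDocument (assumption)


def embed_if_present(
    doc: Doc,
    input_field: str,
    output_field: str,
    embed: Callable[[str], List[float]],
) -> Doc:
    """Add a vector only when the input field exists; otherwise pass through."""
    text = doc.get(input_field)
    if text is None:
        return doc  # no-op for the parent document
    return {**doc, output_field: embed(text)}


def fake_embed(text: str) -> List[float]:
    # Stub embedder (assumption): a 1-dimensional "vector".
    return [float(len(text))]


parent = {"_id": "doc-1", "title": "T"}
child = {"_id": "doc-1#0", "parent_id": "doc-1", "chunk": "hello"}

# Parent and children flow through the same step; only children are embedded.
results = [
    embed_if_present(d, "chunk", "embedding", fake_embed)
    for d in [parent, child]
]
```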