[Feature Request] One-to-many and many-to-one Ingest Processors #16029

austintlee · 2024-09-22T02:11:30Z

Is your feature request related to a problem? Please describe

A common use case in semantic search is chunking large unstructured documents: each document is broken into smaller chunks, and the chunks are converted to dense vectors before they are stored in a vector database. In OpenSearch this is done with a sequence of two processors, the Text Chunking processor followed by the Text Embedding processor. The problem with these ingest processors is that the resulting embeddings are indexed as nested fields. In my opinion, having to always qualify my queries as "nested" makes for a poor user experience, and there seem to be a lot of limitations on queries against nested fields. Just take a look at all the open issues in neural-search related to nested fields.
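To illustrate the difference in query shape, here is a rough sketch written as Python dicts that mirror the query DSL. The field names (`passage_chunk_embedding`, `chunk_embedding`) and the vector are assumptions made up for this example; only the extra `nested` wrapper is the point.

```python
# Hypothetical query vector and field names; only the query shapes matter here.
query_vector = [0.1, 0.2, 0.3]

# Today: chunk embeddings live in a nested field, so the k-NN clause
# has to be wrapped in a "nested" query that names the path.
nested_query = {
    "query": {
        "nested": {
            "path": "passage_chunk_embedding",
            "score_mode": "max",
            "query": {
                "knn": {
                    "passage_chunk_embedding.knn": {
                        "vector": query_vector,
                        "k": 10,
                    }
                }
            },
        }
    }
}

# If each chunk were its own top-level document, the same search
# could be a plain k-NN query with no "nested" qualifier.
flat_query = {
    "query": {
        "knn": {
            "chunk_embedding": {"vector": query_vector, "k": 10}
        }
    }
}
```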

Describe the solution you'd like

A simple (?) solution is to introduce a processor that takes an IngestDocument as input and produces a List of IngestDocuments as output. So, if the text chunking processor produces 100 chunks from a document, I would like to produce 100 documents instead of a parent document with 100 nested fields. Continuing with the example, I could then use a batch processor for text embedding that takes those 100 documents as input and produces 100 documents (with vectors) as output.
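A minimal sketch of that flow, in Python rather than the actual ingest framework. Everything here is hypothetical: plain dicts stand in for IngestDocument, `chunk_to_documents` and `embed_batch` are made-up names, the fixed word-count chunking is a placeholder, and `fake_embed` stands in for a real embedding model.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]  # stand-in for IngestDocument (assumption)


def chunk_to_documents(doc: Document, field: str = "text",
                       chunk_size: int = 5) -> List[Document]:
    """One-to-many step: turn one document into one document per chunk."""
    words = str(doc[field]).split()
    parent_id = doc["_id"]
    children = []
    for start in range(0, len(words), chunk_size):
        children.append({
            "_id": f"{parent_id}#{start // chunk_size}",  # hypothetical id scheme
            "parent_id": parent_id,
            "chunk": " ".join(words[start:start + chunk_size]),
        })
    return children


def embed_batch(docs: List[Document],
                embed: Callable[[List[str]], List[List[float]]]) -> List[Document]:
    """Batch step: N chunk documents in, N documents with vectors out."""
    vectors = embed([str(d["chunk"]) for d in docs])
    return [{**d, "embedding": v} for d, v in zip(docs, vectors)]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder for a real embedding model.
    return [[float(len(t))] for t in texts]


parent = {"_id": "doc-1", "text": " ".join(f"w{i}" for i in range(12))}
chunks = chunk_to_documents(parent)
embedded = embed_batch(chunks, fake_embed)
```

With 12 words and a chunk size of 5, the parent yields three chunk documents, each carrying a `parent_id` back-reference so the chunks can still be tied to their source.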

I don't have any use case for many-to-one ("reduce") processors, but I just thought I would mention it for symmetry. In theory, you can just use a batch processor for this (a List of many to a List of one).

Related component

Indexing

Describe alternatives you've considered

Alternatively, we could introduce an operation that turns each nested field into a full document. Since nested fields are already internally treated as documents, maybe this would be relatively easy to do? This could be a feature that works at the index level, where you configure OpenSearch to expand nested fields into explicit documents at ingestion/indexing time, or a separate API that does so at the document level, or both.
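As a rough illustration of what such an expansion could produce (not how OpenSearch handles nested fields internally), here is a sketch over plain dicts. The function name `expand_nested`, the `parent_id` field, and the id scheme are all assumptions.

```python
from typing import Dict, List


def expand_nested(doc: Dict[str, object], nested_field: str) -> List[Dict[str, object]]:
    """Split a document into a parent (without the nested field)
    plus one explicit child document per nested entry."""
    parent = {k: v for k, v in doc.items() if k != nested_field}
    children = [
        {"_id": f"{doc['_id']}#{i}", "parent_id": doc["_id"], **entry}
        for i, entry in enumerate(doc.get(nested_field, []))
    ]
    return [parent] + children


source = {
    "_id": "doc-1",
    "title": "example",
    "chunks": [
        {"text": "first chunk", "embedding": [0.1, 0.2]},
        {"text": "second chunk", "embedding": [0.3, 0.4]},
    ],
}
expanded = expand_nested(source, "chunks")
```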

Additional context

One issue with the one-to-many processor is that I still want to index the parent document. I would need something like a conditional pipeline that is able to process the parent document in one flow and all the child chunks in another.

                    / -- processor 2 - index parent doc
chunk processor 1--| 
                    \ -- processor 3 - create embeddings for chunks

I can probably get by without conditional pipelines, because the parent document won't have the input field that the text embedding processor looks for (so it'll be a no-op for the parent).
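A small sketch of that no-op behavior, assuming a hypothetical `chunk` input field and a stand-in embedding function. The parent and its chunk documents go through the same processor, and only the documents that carry the input field get a vector.

```python
from typing import Dict, List

Document = Dict[str, object]


def embedding_processor(doc: Document, input_field: str = "chunk") -> Document:
    """Stand-in for the text embedding processor (hypothetical behavior)."""
    if input_field not in doc:
        return doc  # no input field: pass the document through unchanged
    text = str(doc[input_field])
    return {**doc, "embedding": [float(len(text))]}  # placeholder vector


def run_pipeline(parent: Document, chunk_texts: List[str]) -> List[Document]:
    """Emit the parent plus one child per chunk, then run one shared processor."""
    children = [
        {"_id": f"{parent['_id']}#{i}", "parent_id": parent["_id"], "chunk": text}
        for i, text in enumerate(chunk_texts)
    ]
    return [embedding_processor(d) for d in [parent] + children]


results = run_pipeline({"_id": "doc-1", "text": "full body"}, ["full", "body"])
```

Here the first result is the untouched parent, and the two chunk documents each gain an `embedding` field, so no branching is needed in the pipeline itself.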

bharath-techie · 2024-09-30T15:13:27Z

Thank you for opening the issue.

bharath-techie · 2024-10-14T15:24:12Z

[Triage attendees - 1 2 3 4]

austintlee added the enhancement and untriaged labels Sep 22, 2024

github-actions bot added the Indexing label Sep 22, 2024

andrross added the ingest-pipeline label Sep 25, 2024

bharath-techie removed the untriaged label Sep 30, 2024
