
[Feature Request] One-to-many and many-to-one Ingest Processors #16029

Open
austintlee opened this issue Sep 22, 2024 · 2 comments
Labels
enhancement · Indexing · ingest-pipeline

Comments

@austintlee
Contributor

Is your feature request related to a problem? Please describe

A common use case in semantic search is chunking of large (unstructured) documents: documents are broken into smaller chunks, and the chunks are then converted to (dense) vectors before they are stored in a vector database. In OpenSearch this is done with a sequence of two processors, the Text Chunking processor followed by the Text Embedding processor. The problem with these ingest processors is that the resulting embeddings are indexed as nested fields. In my opinion, having to always qualify my queries as "nested" leads to a poor user experience, and queries on nested fields come with a lot of limitations. Just take a look at all the open issues in neural-search related to nested fields.

Describe the solution you'd like

A simple (?) solution to this is to introduce a processor that takes an IngestDocument as input and produces a List of IngestDocuments as output. So, if the text chunking processor produces 100 chunks out of a document, instead of the parent document having 100 nested fields, I would like to produce 100 documents. Continuing with the above example, I could then use a batch processor for text embedding which takes those 100 documents as input and produces 100 documents (with vectors) as output.
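To make the shape concrete, here is a rough sketch of what I have in mind. The interface, class, and field names are made up for illustration; only IngestDocument and its copy constructor are existing OpenSearch pieces, and a real implementation would of course hook into the existing Processor/pipeline machinery:

```java
import java.util.ArrayList;
import java.util.List;

import org.opensearch.ingest.IngestDocument;

/**
 * Hypothetical contract for a one-to-many ingest processor: one incoming
 * IngestDocument expands into a list of documents that continue through
 * the rest of the pipeline. Names here are illustrative only.
 */
interface OneToManyProcessor {
    List<IngestDocument> expand(IngestDocument document) throws Exception;
}

/**
 * Example: a chunking processor that emits one child document per chunk
 * instead of writing the chunks into a nested field on the parent.
 */
class ChunkExpandingProcessor implements OneToManyProcessor {

    private final String sourceField; // e.g. "body"
    private final String chunkField;  // e.g. "body_chunk"

    ChunkExpandingProcessor(String sourceField, String chunkField) {
        this.sourceField = sourceField;
        this.chunkField = chunkField;
    }

    @Override
    public List<IngestDocument> expand(IngestDocument parent) {
        String text = parent.getFieldValue(sourceField, String.class);
        List<IngestDocument> out = new ArrayList<>();
        // Keep the parent itself so it still gets indexed (see "Additional context" below).
        out.add(parent);
        int seq = 0;
        for (String chunk : chunk(text)) {
            // Clone the parent's source and metadata, then add the chunk-specific fields.
            IngestDocument child = new IngestDocument(parent);
            child.setFieldValue(chunkField, chunk);
            child.setFieldValue("chunk_seq", seq++); // hypothetical ordinal field
            out.add(child);
        }
        return out;
    }

    private List<String> chunk(String text) {
        // Placeholder for a real chunking strategy (fixed token length, delimiters, etc.).
        return List.of(text.split("\n\n"));
    }
}
```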

I don't have any use case for many-to-one ("reduce") processors, but I just thought I would mention it for symmetry. In theory, you can just use a batch processor for this (a List of many to a List of one).
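For completeness, the many-to-one shape would just be the inverse signature (again, purely illustrative, not an existing API):

```java
import java.util.List;

import org.opensearch.ingest.IngestDocument;

/** Hypothetical many-to-one ("reduce") counterpart: a batch of documents collapses into one. */
interface ManyToOneProcessor {
    IngestDocument reduce(List<IngestDocument> batch) throws Exception;
}
```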

Related component

Indexing

Describe alternatives you've considered

Alternatively, we could introduce an operation that turns each nested field into a full document. Since nested fields are already internally treated as documents, maybe this would be relatively easy to do? This could be an index-level feature where you configure OpenSearch to expand nested fields into explicit documents at ingestion/indexing time, a separate API that does the same at the document level, or both.
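Roughly, the document-level variant could look like the sketch below, which walks a nested field on a parent _source and emits one standalone child source per nested object (field names assumed; not an existing API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch: expand a nested field on a parent _source into standalone child sources.
 * Purely illustrative of the alternative described above; not an existing API.
 */
final class NestedFieldExpander {

    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> expand(Map<String, Object> parentSource, String nestedField) {
        Object nested = parentSource.get(nestedField);
        if (!(nested instanceof List)) {
            return List.of(parentSource); // nothing to expand
        }
        List<Map<String, Object>> children = new ArrayList<>();
        for (Object entry : (List<Object>) nested) {
            Map<String, Object> child = new HashMap<>(parentSource);
            child.remove(nestedField);                 // drop the full nested array from the child
            child.putAll((Map<String, Object>) entry); // promote the nested object's fields to top level
            children.add(child);
        }
        return children;
    }
}
```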

Additional context

One issue with the one-to-many processor is that I still want to index the parent document. I would need something like a conditional pipeline that is able to process the parent document in one flow and all the child chunks in another.

                    / -- processor 2 - index parent doc
chunk processor 1--| 
                    \ -- processor 3 - create embeddings for chunks

I can probably get by without conditional pipelines, because the parent document won't have the input field that the text embedding processor looks for (it will be a no-op).
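That no-op could be as simple as a guard at the top of the embedding step, e.g. (sketch only, field name assumed):

```java
import org.opensearch.ingest.IngestDocument;

/** Sketch: the embedding step passes through documents that lack the chunk field. */
final class EmbeddingGuard {

    static IngestDocument embedIfChunked(IngestDocument doc, String chunkField) {
        if (!doc.hasField(chunkField)) {
            return doc; // parent document: no chunk field, so this step is a no-op
        }
        // ... compute and attach the embedding for doc.getFieldValue(chunkField, String.class) here ...
        return doc;
    }
}
```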

@austintlee added the enhancement and untriaged labels on Sep 22, 2024
@github-actions bot added the Indexing label on Sep 22, 2024
@bharath-techie
Contributor

Thank you for opening the issue.

@bharath-techie
Contributor

[Triage attendees - 1 2 3 4]
