@kw2828 For example, if I want to work with a specific dataset, something like a PDF, how can I approach that? I have tried using embeddings and connecting to OpenAI via LangChain, but the output loses a lot of context. If I import a PDF of a book, embed it, and ask questions like "Summarize the whole book" or "Give me chapter-wise summaries", it loses a lot of context because the embedding search doesn't fetch all the chapters, only a few, so it summarizes just those chapters and says the others don't exist. Any suggestions on how to handle this unstructured data so that I can get a full-book summary or chapter-wise summaries?
Thanks in advance.
spkprav changed the title from "Suggestions on unstructured data" to "Suggestions on unstructured data training" on Jun 13, 2023.
@spkprav I have been able to build a custom QA system using OpenAI embeddings and a local LLM. Here is what I did:
1. Converted my PDF content into text using Python.
2. Split the text into chunks of under 768 tokens, with each chunk representing a section or chapter (mostly a paragraph).
3. Saved the chunks as JSON with metadata like chapter number and title.
4. Called the OpenAI embeddings API on each chunk to get a vector representation, and stored the embeddings in the same JSON.
5. Deserialized the JSON into C# objects and stored them in a List.
6. When the user asks a question, embedded it with OpenAI.
7. Used cosine similarity to find the top 3 most similar chunks from the content.
8. Passed the question, the top chunks, and a prompt with instructions to OpenAI or a local LLM like Vicuna 1.3 7B or 13B.
9. The LLM then generates an answer conditioned on the relevant chunks and the question.
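The retrieval part of the steps above can be sketched roughly like this (in Python rather than C#). The `toy_embed` function is a stand-in for the OpenAI embeddings API, just so the example runs offline, and the chunk texts are illustrative:

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for the OpenAI embeddings API (a bag-of-words count vector),
    # used only so this sketch runs offline; the real pipeline calls the API.
    return dict(Counter(text.lower().split()))

def cosine_similarity(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Chunks with metadata, as described in the steps above (texts are made up).
chunks = [
    {"chapter": 1, "title": "Intro", "text": "The book introduces vector search."},
    {"chapter": 2, "title": "Embeddings", "text": "Embeddings map text to vectors."},
    {"chapter": 3, "title": "Retrieval", "text": "Cosine similarity ranks chunks."},
]
for c in chunks:
    c["embedding"] = toy_embed(c["text"])

def top_k(question, chunks, k=3):
    # Embed the question and return the k most similar chunks.
    q = toy_embed(question)
    return sorted(
        chunks,
        key=lambda c: cosine_similarity(q, c["embedding"]),
        reverse=True,
    )[:k]
```

With real OpenAI embeddings the vectors are dense float lists instead of word counts, but the cosine-similarity ranking works the same way.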
The key is crafting a prompt that provides context for the LLM. This approach worked well for extracting answers from my content.
Here is my system prompt:
The instructions for this conversation are as follows: 1. You will provide an accurate and concise answer using only the provided documents. 2. Engage in a productive and useful conversation. 3. Take into account previous questions, search results, and answers to generate the most relevant answer. 4. Reply in the user's query language. 5. Prefer short answers for follow-ups where possible. 6. Your answer should be related to [insert topic here], visual, logical, and actionable. 7. Cite your answers with [index] at the end of the sentence.
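As a rough sketch (the chunk texts and question below are illustrative), the system prompt and retrieved chunks can be assembled into a chat-style request like this; numbering the chunks in the context is what gives the model something for its [index] citations to point at:

```python
SYSTEM_PROMPT = (
    "The instructions for this conversation are as follows: "
    "1. You will provide an accurate and concise answer using only the "
    "provided documents. ..."  # the full system prompt from above goes here
)

def build_messages(question, chunk_texts):
    # Number each retrieved chunk so the model can cite it as [index],
    # then send the system prompt, context, and question together.
    context = "\n".join(f"[{i}] {t}" for i, t in enumerate(chunk_texts, start=1))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    "What are embeddings used for?",
    ["Embeddings map text to vectors.", "Cosine similarity ranks chunks."],
)
# The same messages list can be sent to OpenAI's chat API or to a local
# model (e.g. Vicuna) served behind an OpenAI-compatible endpoint.
```
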