@kw2828 For example, if I want to work with a specific dataset, something like a PDF, how can I approach that? I have tried using embeddings and connecting to OpenAI via LangChain, but the output loses a lot of context. If I import a PDF of a book, embed it, and ask questions like "Summarize the whole book" or "Give me chapter-wise summaries", it loses a lot of context because the embedding search doesn't fetch all the chapters, only a few, so it summarizes just those chapters and says the others don't exist. Any suggestions on how to handle this unstructured data so that I can get a full-book summary or chapter-wise summaries?
Thanks in advance.
spkprav changed the title from "Suggestions on unstructured data" to "Suggestions on unstructured data training" on Jun 13, 2023.
@spkprav I have been able to build a custom QA system using OpenAI embeddings and a local LLM. Here is what I did:
1. Converted my PDF content into text using Python.
2. Split the text into chunks of under 768 tokens, with each chunk representing a section or chapter (mostly a paragraph).
3. Saved the chunks as JSON with metadata like chapter number and title.
4. Called the OpenAI embeddings API on each chunk to get a vector representation, and stored the embeddings in the same JSON.
5. Deserialized the JSON into C# objects and stored them in a List.
6. When the user asks a question, embedded it with OpenAI.
7. Used cosine similarity to find the top 3 most similar chunks from the content.
8. Passed the question, the top chunks, and a prompt with instructions to OpenAI or a local LLM like Vicuna 1.3 7B or 13B.
9. The LLM then generates an answer conditioned on the relevant chunks and the question.
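The retrieval part of the steps above can be sketched roughly like this (in Python rather than C#). The `toy_embed` function is a stand-in for the OpenAI embeddings API, just so the example runs offline, and the chunk texts are illustrative:

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for the OpenAI embeddings API (a bag-of-words count vector),
    # used only so this sketch runs offline; the real pipeline calls the API.
    return dict(Counter(text.lower().split()))

def cosine_similarity(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Chunks with metadata, as described in the steps above (texts are made up).
chunks = [
    {"chapter": 1, "title": "Intro", "text": "The book introduces vector search."},
    {"chapter": 2, "title": "Embeddings", "text": "Embeddings map text to vectors."},
    {"chapter": 3, "title": "Retrieval", "text": "Cosine similarity ranks chunks."},
]
for c in chunks:
    c["embedding"] = toy_embed(c["text"])

def top_k(question, chunks, k=3):
    # Embed the question and return the k most similar chunks.
    q = toy_embed(question)
    return sorted(
        chunks,
        key=lambda c: cosine_similarity(q, c["embedding"]),
        reverse=True,
    )[:k]
```

With real OpenAI embeddings the vectors are dense float lists instead of word counts, but the cosine-similarity ranking works the same way.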
The key is crafting a prompt that provides context for the LLM. This approach worked well for extracting answers from my content.
Here is my system prompt:
The instructions for this conversation are as follows: 1. You will provide an accurate and concise answer using only the provided documents. 2. Engage in a productive and useful conversation. 3. Take into account previous questions, search results, and answers to generate the most relevant answer. 4. Reply in the user's query language. 5. Prefer short answers for follow-ups where possible. 6. Your answer should be related to [insert topic here], visual, logical, and actionable. 7. Cite your answers with [index] at the end of the sentence.
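As a rough sketch (the chunk texts and question below are illustrative), the system prompt and retrieved chunks can be assembled into a chat-style request like this; numbering the chunks in the context is what gives the model something for its [index] citations to point at:

```python
SYSTEM_PROMPT = (
    "The instructions for this conversation are as follows: "
    "1. You will provide an accurate and concise answer using only the "
    "provided documents. ..."  # the full system prompt from above goes here
)

def build_messages(question, chunk_texts):
    # Number each retrieved chunk so the model can cite it as [index],
    # then send the system prompt, context, and question together.
    context = "\n".join(f"[{i}] {t}" for i, t in enumerate(chunk_texts, start=1))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    "What are embeddings used for?",
    ["Embeddings map text to vectors.", "Cosine similarity ranks chunks."],
)
# The same messages list can be sent to OpenAI's chat API or to a local
# model (e.g. Vicuna) served behind an OpenAI-compatible endpoint.
```
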