# Build a chatbot with always updated data sources using Pathway + LlamaIndex + Streamlit
## Create a RAG App without a Vector DB or fragmented ETL pipelines!
This repository will show you how to build a RAG App that always has up-to-date information from your documents and sources stored in Google Drive, Dropbox, Sharepoint and more.
The setup guide below describes how to build your **App**. You then connect your App to a public **Pathway Vector Store** sandbox, which is in sync with some public Google Drive and Sharepoint folders. Here, you can upload your own non-confidential files, and try out the App with the sandbox. Finally, we will show you how to quickly spin up your very own Pathway Vector Store which is kept in sync with your own private folders.
> ℹ To run the full solution (your very own Pathway Vector Store + App) in a single go in production, with your own private folders, we recommend using this complete [🐋 Dockerized setup 🐋](https://github.com/pathwaycom/llm-app/blob/main/examples/pipelines/demo-document-indexing/README.md) directly.
## What is Pathway
Pathway is an open data processing framework. It allows you to easily develop data transformation pipelines and Machine Learning applications that work with live data sources and changing data. Pathway listens to your documents for changes, additions or removals, and handles loading and indexing without the need for ETL pipelines. Specifically, we will use Pathway's hosted offering, which makes it particularly easy to launch advanced RAG applications with very little overhead.

In this repository, we showcase the integration of LlamaIndex with Pathway's Vector Store solution. You can effortlessly develop advanced chatbots with memory capabilities, providing easy real-time access to your documents. The instructions below are intended as a step-by-step tutorial.
## Why Pathway?
Pathway is a data processing framework that makes it easy to build advanced data processing pipelines. Among other components, it offers [Pathway Vector Store](https://pathway.com/developers/user-guide/llm-xpack/vectorstore_pipeline/), a document indexing solution that is always up to date without the traditional ETL pipelines required by regular vector DBs. It can monitor several data sources (files, S3 folders, cloud storage) and provide the latest information to your LLM application.
This means you do not need to worry about:
- Checking files to see if there are any changes

These are all handled by Pathway.
## App Overview
This demo combines three technologies.
* For always up-to-date knowledge and information retrieval from the documents in our folders, **Pathway Vector Store** is used.
* **LlamaIndex** provides search capabilities to the OpenAI LLM and combines functionality such as chat memory and OpenAI API calls for the app.
* Finally, **Streamlit** powers the easy-to-navigate user interface of the app.

Want to jump right in? Check out the app and the [code](https://github.com/pathway-labs/realtime-indexer-qa-chat).
## Tutorial: Creating an always up-to-date RAG App with Pathway Vector Store + LlamaIndex
## Prerequisites
- An OpenAI API Key (Only needed for OpenAI models)
- A running Pathway Vector Store process (a hosted version is provided for the demo; instructions to self-host one are provided below)
## Adding new documents
First, add example documents to the vector store by uploading files to a Google Drive folder that is registered to the Pathway Vector Store as a source. Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage and any data stream. For this demo, a public Google Drive folder is provided for you to upload files to. It is pre-populated with the Pathway GitHub repository's README. In this demo, we will ask our assistant questions about Pathway, and it will respond based on the files available in the Drive folder.
See [pathway-io](https://pathway.com/developers/api-docs/pathway-io) for more information on available connectors and how to implement custom connectors.
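If you later run your own pipeline (see the section on running a local Pathway Vector Store below), registering a Google Drive folder as a live source comes down to a single connector call. The snippet below is a minimal, hypothetical sketch; the folder id and the service-account credentials path are placeholders you would replace with your own.

```python
import pathway as pw

# Hypothetical values -- replace with your own folder id and credentials file.
documents = pw.io.gdrive.read(
    object_id="<your-google-drive-folder-id>",
    service_user_credentials_file="credentials.json",
)
```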
Next, import the LlamaIndex components used by the app:

```python
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.chat_engine.condense_question import CondenseQuestionChatEngine
```
Then, initialize the retriever with the chosen Pathway Vector Store instance (for an easy start we point to the managed instance) and create the query engine:
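A minimal sketch of this step is shown below; the host and port are placeholders (point them at the public demo instance or your own deployment), and `PathwayRetriever` is LlamaIndex's integration for querying a Pathway Vector Store.

```python
from llama_index.retrievers import PathwayRetriever

# Placeholder connection details -- use the demo instance or your own deployment.
PATHWAY_HOST = "<pathway-host>"
PATHWAY_PORT = 8000

retriever = PathwayRetriever(host=PATHWAY_HOST, port=PATHWAY_PORT)
query_engine = RetrieverQueryEngine.from_args(retriever)

# Wrap the query engine in a chat engine that condenses follow-up questions.
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    verbose=True,
)
```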
When the app is first run, `messages` will not yet be in `st.session_state`, so it is initialized.
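That check and initialization might look like the following sketch (the greeting text is illustrative):

```python
if "messages" not in st.session_state.keys():
    # Seed the conversation with a greeting from the assistant.
    st.session_state.messages = [
        {"role": "assistant", "content": "Ask me a question about the documents!"}
    ]
```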
Then, print the messages from both the user and the assistant. Streamlit works much like a script: the whole file is re-run whenever a component changes, and the session state is the only component that keeps state between runs. This makes it the right place for elements that should not be re-initialized, and it is why all messages are printed iteratively on each run.
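A sketch of that iteration, using the standard Streamlit chat components:

```python
# Replay the stored conversation on every rerun of the script.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])
```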
```python
if prompt := st.chat_input("Your question"):
    # Sketch of the lines elided in this excerpt: store the new user message.
    st.session_state.messages.append({"role": "user", "content": prompt})

if st.session_state.messages[-1]["role"] != "assistant":
    # Sketch: query the chat engine, display the answer and store it in the history.
    with st.chat_message("assistant"):
        response = chat_engine.chat(prompt)
        st.write(response.response)
        st.session_state.messages.append({"role": "assistant", "content": response.response})
```
## 1️⃣ Running the App
### On Streamlit Community Cloud
The demo is hosted on Streamlit Community Cloud [here](https://chat-realtime-sharepoint-gdrive.streamlit.app/). This version of the app uses Pathway's [hosted document pipelines](https://cloud.pathway.com/docindex).
### On your local machine
Clone this repository to your machine.
Create a `.env` file under the root folder; it will store your OpenAI API key, since the demo uses an OpenAI GPT model to answer questions.
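A minimal example of the `.env` contents is shown below; the `OPENAI_API_KEY` variable name follows the usual OpenAI convention and is an assumption here, so check `ui.py` for the exact name the app reads.

```
OPENAI_API_KEY=sk-...
```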
You need access to a running Pathway Vector Store pipeline. For this demo, a public instance is provided that reads documents in [Google Drive](https://drive.google.com/drive/u/2/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs) and [Sharepoint](https://navalgo.sharepoint.com/:f:/s/ConnectorSandbox/EgBe-VQr9h1IuR7VBeXsRfIBuOYhv-8z02_6zf4uTH8WbQ?e=YmlA05). However, it is easy to run your own locally; see the [vector store guide](https://pathway.com/developers/showcases/vectorstore_pipeline) and [Pathway Deployment](https://pathway.com/developers/user-guide/deployment/docker-deployment).

Open a terminal and run `streamlit run ui.py`. This will print a URL; simply click it to open the demo.

Congrats! You are now ready to chat with your documents, with up-to-date knowledge provided by Pathway.
### Running with Docker
We provide a Dockerfile to run the application. From the root folder of the repository, run:
```
docker build -t realtime_chat .
docker run -p 8501:8501 realtime_chat
```
We recommend running in Docker when working on a Windows machine.
## 2️⃣ Running a local Pathway Vector Store
OK, so far you have managed to get the RAG App up and running - but it still connects to the public demo folders! Let's fix that: we will now show you how to connect your very own folders in a private deployment. This means you will need to spin up a lightweight web server that provides the "Pathway Vector Store" service, responsible for the whole document ingestion and indexing pipeline.
The code for the Pathway Vector Store pipeline, along with a Dockerfile, is provided in the [Pathway LLM examples repository](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/demo-document-indexing). Follow the instructions there to run only the vector store pipeline, or to run the pipeline and the Streamlit UI as a joint deployment using `docker compose`.
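For a rough idea of what such a pipeline does, here is a minimal sketch using Pathway's LLM xpack; the source folder, embedder choice and port below are illustrative assumptions, and the actual `demo-document-indexing` pipeline in llm-app is more complete.

```python
import pathway as pw
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.splitters import TokenCountSplitter
from pathway.xpacks.llm.vector_store import VectorStoreServer

# Placeholder source: a local folder; swap in pw.io.gdrive.read(...) for Google Drive.
documents = pw.io.fs.read("./documents", format="binary", with_metadata=True)

# Build the indexing pipeline and serve it over HTTP for the retriever to query.
server = VectorStoreServer(
    documents,
    embedder=OpenAIEmbedder(),
    splitter=TokenCountSplitter(),
)
server.run_server(host="127.0.0.1", port=8754)
```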
Note that if you want to create a RAG application connected to your Google Drive, you need to set up a Google Service account; [refer to the instructions here](https://github.com/pathwaycom/llm-app/blob/main/examples/pipelines/demo-question-answering/README.md#create-a-new-project-in-the-google-api-console).
Also, if you are not planning to use local files in your app, you can skip the `binding local volume` part explained in the llm-app instructions linked above.
## Summing Up
In this tutorial, you learned how to create and deploy a simple yet powerful RAG application with always up-to-date knowledge of your documents, without ETL jobs or buffers to check and re-read documents for changes. You also learned how to get started with LlamaIndex using the Pathway Vector Store, and how easy it is to get going with hosted Pathway, which handles the majority of hurdles for you.
"[View code on GitHub.](https://github.com/pathway-labs/chat-realtime-sharepoint-gdrive)"
60
-
)
60
+
st.markdown(htm, unsafe_allow_html=True)
61
61
62
-
st.markdown(
63
-
"""Pathway pipelines ingest documents from [Google Drive](https://drive.google.com/drive/u/0/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs) and [Sharepoint](https://navalgo.sharepoint.com/:f:/s/ConnectorSandbox/EgBe-VQr9h1IuR7VBeXsRfIBuOYhv-8z02_6zf4uTH8WbQ?e=YmlA05) simultaneously. It automatically manages and syncs indexes enabling RAG applications."""
64
-
)
62
+
st.markdown("\n\n\n\n\n\n\n")
63
+
st.markdown("\n\n\n\n\n\n\n")
64
+
st.markdown(
65
+
"[View code on GitHub.](https://github.com/pathway-labs/chat-realtime-sharepoint-gdrive)"
66
+
)
67
+
st.markdown(
68
+
"""Pathway pipelines ingest documents from [Google Drive](https://drive.google.com/drive/u/0/folders/1cULDv2OaViJBmOfG5WB0oWcgayNrGtVs) and [Sharepoint](https://navalgo.sharepoint.com/:f:/s/ConnectorSandbox/EgBe-VQr9h1IuR7VBeXsRfIBuOYhv-8z02_6zf4uTH8WbQ?e=YmlA05) simultaneously. It automatically manages and syncs indexes enabling RAG applications."""
69
+
)
70
+
else:
71
+
st.markdown(f"**Connected to:** {PATHWAY_HOST}")
72
+
st.markdown(
73
+
"[View code on GitHub.](https://github.com/pathway-labs/chat-realtime-sharepoint-gdrive)"
0 commit comments