added code generation example #164

Merged
merged 3 commits on Mar 20, 2025
1,081 changes: 1,081 additions & 0 deletions data/astrapy.jsonl

Large diffs are not rendered by default.

817 changes: 817 additions & 0 deletions docs/examples/code-generation.ipynb

Large diffs are not rendered by default.

18 changes: 17 additions & 1 deletion docs/examples/index.md
@@ -10,4 +10,20 @@
It loads Wikipedia articles and traverses based on links ("mentions") and named entities (extracted from the content). It retrieves a large number of articles, groups them by community, and extracts claims from each community. The best claims are used to answer the question.

[:material-fast-forward: Lazy Graph RAG Example](lazy-graph-rag.ipynb)
</div>

- :material-code-braces-box:{ .lg .middle } __Code Generation__

---
This example notebook shows how to load documentation for Python packages into a
vector store so that it can be used to provide context to an LLM for code generation.

It uses LangChain and `langchain-graph-retriever` with a custom traversal `Strategy`
in order to improve LLM-generated code output. It shows that using GraphRAG can
provide a significant increase in quality over using either an LLM alone or standard
RAG.

GraphRAG traverses the documentation much as a software engineer would, in order to
determine how to solve a coding problem.

[:material-fast-forward: Code Generation Example](code-generation.ipynb)
</div>
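As a rough illustration of the pattern this new index entry describes, a minimal sketch of loading the fetched documentation into a vector store and traversing it with `langchain-graph-retriever` might look like the following. The embedding model, edge key, strategy parameters, and helper import path are assumptions for illustration; the notebook itself defines a custom traversal `Strategy` rather than the stock `Eager` strategy used here.

```python
# Hedged sketch only -- the edge key ("references"), strategy parameters, and
# helper import path are illustrative assumptions, not the notebook's setup.
from graph_retriever.strategies import Eager
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_graph_retriever import GraphRetriever
from langchain_openai import OpenAIEmbeddings

# Assumed public path for the helper added in this PR.
from graph_rag_example_helpers.datasets.code_generation import fetch_documents

# Embed the AstraPy documentation into an in-memory vector store.
store = InMemoryVectorStore.from_documents(
    documents=fetch_documents(),
    embedding=OpenAIEmbeddings(),
)

# Traverse outward from the best vector matches along links between
# documentation items, much as an engineer follows references between API pages.
retriever = GraphRetriever(
    store=store,
    edges=[("references", "$id")],  # assumed metadata key for doc-to-doc links
    strategy=Eager(k=8, start_k=3, max_depth=2),
)

docs = retriever.invoke("How do I create a collection and insert documents with AstraPy?")
```

The retrieved documents would then be formatted (for example with the `format_docs` helper added below) and passed to an LLM as context for code generation.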
7 changes: 7 additions & 0 deletions docs/examples/lazy-graph-rag.ipynb
@@ -71,6 +71,13 @@
"The last package -- `graph-rag-example-helpers` -- includes some helpers for setting up environment helpers and allowing the loading of wikipedia data to be restarted if it fails."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
2 changes: 2 additions & 0 deletions packages/graph-rag-example-helpers/pyproject.toml
@@ -42,6 +42,7 @@ dependencies = [
"astrapy>=1.5.2",
"backoff>=2.2.1",
"graph-retriever",
"griffe>=1.5.7",
"httpx>=0.28.1",
"langchain-core>=0.3.29",
"python-dotenv>=1.0.1",
@@ -54,6 +55,7 @@ dependencies = [
astrapy = "astrapy"
backoff = "backoff"
graph-retriever = "graph_retriever"
griffe = "griffe"
httpx = "httpx"
langchain-core = "langchain_core"
mypy = "mypy"
@@ -0,0 +1,9 @@
from ...examples.code_generation.format import add_tabs, format_docs, format_document
from .fetch import fetch_documents

__all__ = [
    "fetch_documents",
    "add_tabs",
    "format_document",
    "format_docs",
]
@@ -0,0 +1,40 @@
import json

import requests
from langchain_core.documents import Document

# TODO: revert to main branch before code generation is merged
# ASTRAPY_JSONL_URL = "https://raw.githubusercontent.com/datastax/graph-rag/refs/heads/main/data/astrapy.jsonl"
ASTRAPY_JSONL_URL = "https://raw.githubusercontent.com/datastax/graph-rag/refs/heads/code_generation/data/astrapy.jsonl"


def fetch_documents() -> list[Document]:
    """
    Download and parse a list of Documents for use with Graph Retriever.

    This dataset contains the documentation for the AstraPy project as of version 1.5.2.

    This function downloads the dataset each time -- generally it is preferable
    to invoke this only once and store the documents in memory or a vector
    store.

    Returns
    -------
    :
        The fetched astra-py documentation Documents.

    Notes
    -----
    - The dataset is set up so that the path of each item is the `id`, its pydoc
      description is the `page_content`, and its other attributes are stored in the
      `metadata`.
    - Many documents contain an `id` and `metadata`, but no `page_content`.
    """
    response = requests.get(ASTRAPY_JSONL_URL)
    response.raise_for_status()  # Ensure we got a valid response

    return [
        Document(id=data["id"], page_content=data["text"], metadata=data["metadata"])
        for line in response.text.splitlines()
        if (data := json.loads(line))
    ]
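Per the Notes above, callers will typically want to separate the entries that carry prose from the structural stubs before embedding. A minimal sketch of that split follows; how the stubs are handled (embedded, kept only as graph nodes, or dropped) is a design choice of the example notebook, not something this helper enforces.

```python
# Hedged sketch: partition the fetched docs into entries with prose and
# metadata-only stubs (id and metadata but no page_content).
docs = fetch_documents()

with_text = [doc for doc in docs if doc.page_content]
stubs = [doc for doc in docs if not doc.page_content]

print(f"{len(with_text)} documents with content, {len(stubs)} metadata-only stubs")
```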
@@ -0,0 +1,7 @@
from .format import add_tabs, format_docs, format_document

__all__ = [
    "add_tabs",
    "format_document",
    "format_docs",
]