
Upgrade to latest graphrag library release #213

Merged: 45 commits, merged on Jan 30, 2025

Commits (45)
98fc3b8  refactor of bicep variable names to be more generic and pytest cleanup (jgbradley1, Jan 2, 2025)
2b41976  add RBAC reader role assignment to cosmosdb bicep deployment (jgbradley1, Jan 2, 2025)
699bfa5  update dependabot settings (jgbradley1, Jan 2, 2025)
ff5714a  update code references to new locations in graphrag library (jgbradley1, Jan 3, 2025)
0252646  refactor variable names to be more generic and add integration tests (jgbradley1, Jan 3, 2025)
fbbad71  add new pytests (jgbradley1, Jan 3, 2025)
2f744c1  ruff format updates (jgbradley1, Jan 3, 2025)
4e1b1ba  revert bicep api version changes to a working condition (jgbradley1, Jan 3, 2025)
44d7859  fix bad import (jgbradley1, Jan 3, 2025)
97d8b52  temporary move of import statement (jgbradley1, Jan 4, 2025)
3c96059  update synthetic index dataset (jgbradley1, Jan 4, 2025)
96dcd95  refactor bicep to be cleaner and remove ssh public key generation for… (jgbradley1, Jan 4, 2025)
8d448f8  update common bicep variable name (jgbradley1, Jan 4, 2025)
404eed1  temporarily install from graphrag repo (jgbradley1, Jan 16, 2025)
92af333  working version of indexing endpoint (jgbradley1, Jan 17, 2025)
c171387  update callbacks (jgbradley1, Jan 17, 2025)
e85c9c0  update pyproject.toml (jgbradley1, Jan 17, 2025)
a8bf673  refactor and reorganize indexing code out of api code (jgbradley1, Jan 21, 2025)
4c5d947  fixed appinsights logging (jgbradley1, Jan 21, 2025)
4f2734d  add app insights rbac role assignment (jgbradley1, Jan 22, 2025)
4f8ff8e  reorganize rbac assignments to be cleaner (jgbradley1, Jan 22, 2025)
1db953c  bicep code cleanup (jgbradley1, Jan 22, 2025)
97ea601  add custom cosmosdb rbac role (jgbradley1, Jan 23, 2025)
17845d8  change cicd pytest to use https protocol (jgbradley1, Jan 23, 2025)
ba883dc  update pytest dataset (jgbradley1, Jan 23, 2025)
57e8664  refactor AzureClientManager code and update pytests (jgbradley1, Jan 23, 2025)
8f2f0ea  fix import filepath (jgbradley1, Jan 23, 2025)
72d759a  apply az bicep format --file to all bicep files (jgbradley1, Jan 23, 2025)
84f2770  update how prompts get saved in cosmosdb (jgbradley1, Jan 24, 2025)
7554483  rename environment variable to align with standard azure naming conve… (jgbradley1, Jan 24, 2025)
9a73ef2  convert app to proper python package (jgbradley1, Jan 25, 2025)
d0273c3  reorganize k8s manifest files and scripts (jgbradley1, Jan 26, 2025)
46b2a8c  update notebook (jgbradley1, Jan 26, 2025)
c6ad3f6  fix logging of extra metadata (jgbradley1, Jan 27, 2025)
ec49d32  attempt to cleanup fastapi code (jgbradley1, Jan 28, 2025)
bc9eed0  minor refactoring of code (jgbradley1, Jan 28, 2025)
64b19cd  fix variable references in url path (jgbradley1, Jan 29, 2025)
95a5dff  Merge branch 'joshbradley/fastapi-cleanup' into joshbradley/upgrade-t… (jgbradley1, Jan 29, 2025)
854238b  update pytest to working condition (jgbradley1, Jan 29, 2025)
c3d7f91  remove unnecesary validation check (jgbradley1, Jan 29, 2025)
67663fd  update pytests (jgbradley1, Jan 29, 2025)
c430cf2  update lock file (jgbradley1, Jan 29, 2025)
8c56f7f  improve error logging and mark a working version of global search (jgbradley1, Jan 30, 2025)
2df02c8  cleanup vnet deployment in bicep (jgbradley1, Jan 30, 2025)
a9e9803  temporary removal of multi-index query capability and query streaming… (jgbradley1, Jan 30, 2025)
2 changes: 1 addition & 1 deletion in .github/dependabot.yml

```diff
@@ -6,6 +6,6 @@
 version: 2
 updates:
   - package-ecosystem: "pip"
-    directory: "/"
+    directory: "/backend"
     schedule:
       interval: "weekly"
```
2 changes: 1 addition & 1 deletion in .github/workflows/tests.yaml

```diff
@@ -55,7 +55,7 @@ jobs:
       - name: Run pytests
         working-directory: ${{ github.workspace }}/backend
         run: |
-          pytest --cov=src --junitxml=test-results.xml tests/
+          pytest --cov=graphrag_app --junitxml=test-results.xml tests/
 
       - name: Upload test results
         uses: actions/upload-artifact@v4
```
1 change: 0 additions & 1 deletion in backend/.coveragerc

```diff
@@ -1,4 +1,3 @@
 [run]
 omit =
     **/__init__.py
-    src/models.py
```
22 changes: 22 additions & 0 deletions in backend/README.md (new file; full contents below)

# Web App
This directory contains the source code for a FastAPI application that implements a REST API wrapper around the graphrag library. The app is packaged as a Python package for a cleaner install/deployment experience.

## Package Layout
The code has the following structure:
```shell
backend
├── README.md
├── graphrag_app # contains the main application files
│   ├── __init__.py
│   ├── api # endpoint definitions
│   ├── logger # custom loggers designed for graphrag use
│   ├── main.py # initializes the FastAPI application
│   ├── typing # data validation models
│   └── utils # utility/helper functions
├── manifests # k8s manifest files
├── poetry.lock
├── pyproject.toml
├── pytest.ini
├── scripts # miscellaneous scripts that get executed in k8s
└── tests # pytests (integration tests + unit tests)
```
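For reference, a minimal sketch of serving the packaged app locally. This assumes the FastAPI instance created in `graphrag_app/main.py` is exported as a module-level variable named `app`, which the README does not state explicitly:

```python
# Minimal sketch: serve the packaged FastAPI app locally with uvicorn.
# Assumes graphrag_app/main.py exposes a module-level `app` instance
# (an assumption; verify against graphrag_app/main.py).
import uvicorn

if __name__ == "__main__":
    uvicorn.run("graphrag_app.main:app", host="127.0.0.1", port=8000)
```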
2 files renamed without changes.
133 changes: 66 additions & 67 deletions in backend/src/api/data.py → backend/graphrag_app/api/data.py

```diff
@@ -3,28 +3,30 @@
 
 import asyncio
 import re
+import traceback
 from math import ceil
 from typing import List
 
-from azure.storage.blob import ContainerClient
+from azure.storage.blob.aio import ContainerClient
 from fastapi import (
     APIRouter,
     Depends,
     HTTPException,
     UploadFile,
 )
 
-from src.api.azure_clients import AzureClientManager
-from src.api.common import (
-    delete_blob_container,
-    delete_cosmos_container_item,
-    sanitize_name,
-    validate_blob_container_name,
-)
-from src.logger import LoggerSingleton
-from src.models import (
+from graphrag_app.logger.load_logger import load_pipeline_logger
+from graphrag_app.typing.models import (
     BaseResponse,
     StorageNameList,
 )
+from graphrag_app.utils.common import (
+    delete_cosmos_container_item_if_exist,
+    delete_storage_container_if_exist,
+    get_blob_container_client,
+    get_cosmos_container_store_client,
+    sanitize_name,
+)
 
 data_route = APIRouter(
     prefix="/data",
@@ -34,26 +36,27 @@
 
 @data_route.get(
     "",
-    summary="Get all data storage containers.",
+    summary="Get list of data containers.",
     response_model=StorageNameList,
     responses={200: {"model": StorageNameList}},
 )
-async def get_all_data_storage_containers():
+async def get_all_data_containers():
     """
-    Retrieve a list of all data storage containers.
+    Retrieve a list of all data containers.
     """
-    azure_client_manager = AzureClientManager()
     items = []
     try:
-        container_store_client = azure_client_manager.get_cosmos_container_client(
-            database="graphrag", container="container-store"
-        )
+        container_store_client = get_cosmos_container_store_client()
         for item in container_store_client.read_all_items():
             if item["type"] == "data":
                 items.append(item["human_readable_name"])
-    except Exception:
-        reporter = LoggerSingleton().get_instance()
-        reporter.on_error("Error getting list of blob containers.")
+    except Exception as e:
+        reporter = load_pipeline_logger()
+        reporter.error(
+            message="Error getting list of blob containers.",
+            cause=e,
+            stack=traceback.format_exc(),
+        )
         raise HTTPException(
             status_code=500, detail="Error getting list of blob containers."
         )
@@ -112,10 +115,13 @@ def __exit__(self, *args):
     responses={200: {"model": BaseResponse}},
 )
 async def upload_files(
-    files: List[UploadFile], storage_name: str, overwrite: bool = True
+    files: List[UploadFile],
+    container_name: str,
+    sanitized_container_name: str = Depends(sanitize_name),
+    overwrite: bool = True,
 ):
     """
-    Create a data storage container in Azure and upload files to it.
+    Create a Azure Storage container and upload files to it.
 
     Args:
         files (List[UploadFile]): A list of files to be uploaded.
@@ -128,80 +134,73 @@ async def upload_files(
     Raises:
         HTTPException: If the container name is invalid or if any error occurs during the upload process.
     """
-    sanitized_storage_name = sanitize_name(storage_name)
-    # ensure container name follows Azure Blob Storage naming conventions
     try:
-        validate_blob_container_name(sanitized_storage_name)
-    except ValueError:
-        raise HTTPException(
-            status_code=500,
-            detail=f"Invalid blob container name: '{storage_name}'. Please try a different name.",
-        )
-    try:
-        azure_client_manager = AzureClientManager()
-        blob_service_client = azure_client_manager.get_blob_service_client_async()
-        container_client = blob_service_client.get_container_client(
-            sanitized_storage_name
-        )
-        if not await container_client.exists():
-            await container_client.create_container()
-
         # clean files - remove illegal XML characters
         files = [UploadFile(Cleaner(f.file), filename=f.filename) for f in files]
 
         # upload files in batches of 1000 to avoid exceeding Azure Storage API limits
+        blob_container_client = await get_blob_container_client(
+            sanitized_container_name
+        )
         batch_size = 1000
-        batches = ceil(len(files) / batch_size)
-        for i in range(batches):
+        num_batches = ceil(len(files) / batch_size)
+        for i in range(num_batches):
             batch_files = files[i * batch_size : (i + 1) * batch_size]
             tasks = [
-                upload_file_async(file, container_client, overwrite)
+                upload_file_async(file, blob_container_client, overwrite)
                 for file in batch_files
             ]
             await asyncio.gather(*tasks)
-        # update container-store in cosmosDB since upload process was successful
-        container_store_client = azure_client_manager.get_cosmos_container_client(
-            database="graphrag", container="container-store"
-        )
-        container_store_client.upsert_item({
-            "id": sanitized_storage_name,
-            "human_readable_name": storage_name,
+
+        # update container-store entry in cosmosDB once upload process is successful
+        cosmos_container_store_client = get_cosmos_container_store_client()
+        cosmos_container_store_client.upsert_item({
+            "id": sanitized_container_name,
+            "human_readable_name": container_name,
             "type": "data",
         })
         return BaseResponse(status="File upload successful.")
-    except Exception:
-        logger = LoggerSingleton().get_instance()
-        logger.on_error("Error uploading files.", details={"files": files})
+    except Exception as e:
+        logger = load_pipeline_logger()
+        logger.error(
+            message="Error uploading files.",
+            cause=e,
+            stack=traceback.format_exc(),
+            details={"files": [f.filename for f in files]},
+        )
         raise HTTPException(
             status_code=500,
-            detail=f"Error uploading files to container '{storage_name}'.",
+            detail=f"Error uploading files to container '{container_name}'.",
         )
 
 
 @data_route.delete(
-    "/{storage_name}",
+    "/{container_name}",
     summary="Delete a data storage container",
     response_model=BaseResponse,
     responses={200: {"model": BaseResponse}},
 )
-async def delete_files(storage_name: str):
+async def delete_files(
+    container_name: str, sanitized_container_name: str = Depends(sanitize_name)
+):
     """
     Delete a specified data storage container.
     """
-    # azure_client_manager = AzureClientManager()
-    sanitized_storage_name = sanitize_name(storage_name)
     try:
-        # delete container in Azure Storage
-        delete_blob_container(sanitized_storage_name)
-        # delete entry from container-store in cosmosDB
-        delete_cosmos_container_item("container-store", sanitized_storage_name)
-    except Exception:
-        logger = LoggerSingleton().get_instance()
-        logger.on_error(
-            f"Error deleting container {storage_name}.",
-            details={"Container": storage_name},
+        delete_storage_container_if_exist(sanitized_container_name)
+        delete_cosmos_container_item_if_exist(
+            "container-store", sanitized_container_name
+        )
+    except Exception as e:
+        logger = load_pipeline_logger()
+        logger.error(
+            message=f"Error deleting container {container_name}.",
+            cause=e,
+            stack=traceback.format_exc(),
+            details={"Container": container_name},
         )
         raise HTTPException(
-            status_code=500, detail=f"Error deleting container '{storage_name}'."
+            status_code=500,
+            detail=f"Error deleting container '{container_name}'.",
         )
     return BaseResponse(status="Success")
```
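A recurring pattern in this refactor is resolving the sanitized container name through FastAPI's dependency injection instead of calling `sanitize_name` inside each endpoint body. A self-contained sketch of that pattern follows; the sha256-based sanitizer is illustrative only and is not the actual implementation in `graphrag_app.utils.common`:

```python
# Sketch of the Depends(sanitize_name) pattern used in data.py above.
# FastAPI resolves `container_name` from the URL path and passes it into the
# dependency, so the endpoint receives both the raw and the sanitized values.
import hashlib

from fastapi import Depends, FastAPI

app = FastAPI()


def sanitize_name(container_name: str) -> str:
    # Map arbitrary user input to a deterministic, Azure-safe container name
    # (lowercase hex, truncated to the 63-character container-name limit).
    # Illustrative stand-in for the real helper in graphrag_app.utils.common.
    return hashlib.sha256(container_name.encode()).hexdigest()[:63]


@app.delete("/data/{container_name}")
async def delete_files(
    container_name: str, sanitized_container_name: str = Depends(sanitize_name)
):
    return {"original": container_name, "sanitized": sanitized_container_name}
```

This keeps the sanitization logic in one place and lets every endpoint that declares the dependency share it.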
36 changes: 22 additions & 14 deletions in backend/src/api/graph.py → backend/graphrag_app/api/graph.py

```diff
@@ -1,18 +1,21 @@
 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT License.
 
+import traceback
+
 from fastapi import (
     APIRouter,
     Depends,
     HTTPException,
 )
 from fastapi.responses import StreamingResponse
 
-from src.api.azure_clients import AzureClientManager
-from src.api.common import (
+from graphrag_app.logger.load_logger import load_pipeline_logger
+from graphrag_app.utils.azure_clients import AzureClientManager
+from graphrag_app.utils.common import (
     sanitize_name,
     validate_index_file_exist,
 )
-from src.logger import LoggerSingleton
 
 graph_route = APIRouter(
     prefix="/graph",
@@ -21,31 +24,36 @@
 
 
 @graph_route.get(
-    "/graphml/{index_name}",
+    "/graphml/{container_name}",
     summary="Retrieve a GraphML file of the knowledge graph",
     response_description="GraphML file successfully downloaded",
 )
-async def get_graphml_file(index_name: str):
-    # validate index_name and graphml file existence
+async def get_graphml_file(
+    container_name, sanitized_container_name: str = Depends(sanitize_name)
+):
+    # validate graphml file existence
     azure_client_manager = AzureClientManager()
-    sanitized_index_name = sanitize_name(index_name)
-    graphml_filename = "summarized_graph.graphml"
+    graphml_filename = "graph.graphml"
     blob_filepath = f"output/{graphml_filename}"  # expected file location of the graph based on the workflow
-    validate_index_file_exist(sanitized_index_name, blob_filepath)
+    validate_index_file_exist(sanitized_container_name, blob_filepath)
     try:
         blob_client = azure_client_manager.get_blob_service_client().get_blob_client(
-            container=sanitized_index_name, blob=blob_filepath
+            container=sanitized_container_name, blob=blob_filepath
         )
         blob_stream = blob_client.download_blob().chunks()
         return StreamingResponse(
             blob_stream,
             media_type="application/octet-stream",
             headers={"Content-Disposition": f"attachment; filename={graphml_filename}"},
         )
-    except Exception:
-        logger = LoggerSingleton().get_instance()
-        logger.on_error("Could not retrieve graphml file")
+    except Exception as e:
+        logger = load_pipeline_logger()
+        logger.error(
+            message="Could not fetch graphml file",
+            cause=e,
+            stack=traceback.format_exc(),
+        )
         raise HTTPException(
             status_code=500,
-            detail=f"Could not retrieve graphml file for index '{index_name}'.",
+            detail=f"Could not fetch graphml file for '{container_name}'.",
         )
```
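Because the endpoint wraps the blob's chunk iterator in a `StreamingResponse`, a client can also consume the download incrementally rather than buffering the whole file. A hypothetical client-side sketch, where the base URL and container name are placeholders:

```python
# Sketch: stream the GraphML download to disk chunk by chunk.
# The host, port, and container name below are placeholders.
import httpx

with httpx.stream("GET", "http://localhost:8000/graph/graphml/mydata") as response:
    response.raise_for_status()
    with open("graph.graphml", "wb") as out_file:
        for chunk in response.iter_bytes():
            out_file.write(chunk)
```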