WIP: feat: Initial code to load workspaces from a specific container path #583
New file: src/codegate/workspaces.py
@@ -0,0 +1,80 @@
import asyncio
import json
import uuid
from pathlib import Path
from typing import Dict, List, Optional, Union

from pydantic import BaseModel

from codegate.db.connection import DbRecorder
from codegate.db.models import Workspace


class Folder(BaseModel):
    files: List[str] = []


class Repository(BaseModel):
    name: str
    folder_tree: Dict[str, Folder]
Review comment (on Repository.folder_tree): Is the intent to store the whole directory tree of a repository?

Review comment: What about storing the root of the repo instead of the whole filesystem?

Reply: Yes, the intent is to store the whole directory tree of a repository. The reasoning behind it is to do fast lookups when we see a path in the received code snippets. Right now, we get the path of a code snippet if it was supplied for context to the LLM. Example:

{
  "messages": [
    {
      "role": "user",
      "content": "\n\n```py codegate/src/codegate/pipeline/factory.py (1-57)\nfrom typing import List\n\nfrom codegate.config import Config\nfrom codegate.pipeline.base import PipelineStep, SequentialPipelineProcessor\nfrom codegate.pipeline.codegate_context_retriever.codegate import CodegateContextRetriever\nfrom codegate.pipeline.extract_snippets.extract_snippets import CodeSnippetExtractor\nfrom codegate.pipeline.extract_snippets.output import CodeCommentStep\nfrom codegate.pipeline.output import OutputPipelineProcessor, OutputPipelineStep\nfrom codegate.pipeline.secrets.manager import SecretsManager\nfrom codegate.pipeline.secrets.secrets import (\n    CodegateSecrets,\n    SecretRedactionNotifier,\n    SecretUnredactionStep,\n)\nfrom codegate.pipeline.system_prompt.codegate import SystemPrompt\nfrom codegate.pipeline.version.version import CodegateVersion\n\n\nclass PipelineFactory:\n    def __init__(self, secrets_manager: SecretsManager):\n        self.secrets_manager = secrets_manager\n\n    def create_input_pipeline(self) -> SequentialPipelineProcessor:\n        input_steps: List[PipelineStep] = [\n            # make sure that this step is always first in the pipeline\n            # the other steps might send the request to a LLM for it to be analyzed\n            # and without obfuscating the secrets, we'd leak the secrets during those\n            # later steps\n            CodegateSecrets(),\n            CodegateVersion(),\n            CodeSnippetExtractor(),\n            CodegateContextRetriever(),\n            SystemPrompt(Config.get_config().prompts.default_chat),\n        ]\n        return SequentialPipelineProcessor(input_steps, self.secrets_manager, is_fim=False)\n\n    def create_fim_pipeline(self) -> SequentialPipelineProcessor:\n        fim_steps: List[PipelineStep] = [\n            CodegateSecrets(),\n        ]\n        return SequentialPipelineProcessor(fim_steps, self.secrets_manager, is_fim=True)\n\n    def create_output_pipeline(self) -> OutputPipelineProcessor:\n        output_steps: List[OutputPipelineStep] = [\n            SecretRedactionNotifier(),\n            SecretUnredactionStep(),\n            CodeCommentStep(),\n        ]\n        return OutputPipelineProcessor(output_steps)\n\n    def create_fim_output_pipeline(self) -> OutputPipelineProcessor:\n        fim_output_steps: List[OutputPipelineStep] = [\n            # temporarily disabled\n            # SecretUnredactionStep(),\n        ]\n        return OutputPipelineProcessor(fim_output_steps)\n\n```\nwhats this code doing?"
    }
  ],
  "model": "hosted_vllm/unsloth/Qwen2.5-Coder-32B-Instruct",
  "max_tokens": 4096,
  "stream": true,
  "base_url": "https://inference.codegate.ai/v1"
}

Reply: Gotcha, that makes sense. Let's give this a little bit of thought; every time a file is added or removed we'd have to rewrite the JSON blob into the database, and that's not optimal either.
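The fast-lookup idea described in the reply above could look roughly like the sketch below. This is not code from the PR: the helper name find_folder_for_snippet is hypothetical, and the tree uses plain dicts of file-name lists rather than the pydantic Folder models, purely for brevity.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def find_folder_for_snippet(
    folder_tree: Dict[str, List[str]], snippet_path: str
) -> Optional[str]:
    """Return the directory in folder_tree that contains snippet_path, if any.

    folder_tree maps repo-relative directories to the file names they hold,
    mirroring the mapping built by _read_repository_structure above.
    """
    path = PurePosixPath(snippet_path)
    parent = str(path.parent)
    files = folder_tree.get(parent)
    if files is not None and path.name in files:
        return parent
    return None


tree = {"src/codegate/pipeline": ["factory.py"]}
print(find_folder_for_snippet(tree, "src/codegate/pipeline/factory.py"))  # src/codegate/pipeline
print(find_folder_for_snippet(tree, "src/missing.py"))  # None
```

Because the lookup is a single dict access on the snippet's parent directory, it stays fast regardless of repository size, which is the motivation the author gives for storing the whole tree.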
class FolderRepoScanner:

    def __init__(self, ignore_paths: Optional[List[str]] = None):
        if ignore_paths is None:
            ignore_paths = []
        self.ignore_paths = ignore_paths

    def _should_skip(self, path: Path):
        """Skip certain paths that are not relevant for scanning."""
        return any(part in path.parts for part in self.ignore_paths)

    def _read_repository_structure(self, repo_path: Path) -> Dict[str, Folder]:
        folder_tree: Dict[str, Folder] = {}
        for path in repo_path.rglob('*'):
            if self._should_skip(path):
                continue

            relative_path = path.relative_to(repo_path)
            if path.is_dir():
                folder_tree[str(relative_path)] = Folder()
            else:
                parent_dir = str(relative_path.parent)
                if parent_dir not in folder_tree:
                    folder_tree[parent_dir] = Folder()
                folder_tree[parent_dir].files.append(path.name)
        return folder_tree

    def read(self, path_str: Union[str, Path]) -> List[Repository]:
        path_dir = Path(path_str)
        if not path_dir.is_dir():
            print(f"Path {path_dir} is not a directory")
            return []

        found_repos = []
        for child_path in path_dir.rglob('*'):
            if child_path.is_dir() and (child_path / ".git").exists():
                repo_structure = self._read_repository_structure(child_path)
                new_repo = Repository(name=child_path.name, folder_tree=repo_structure)
                found_repos.append(new_repo)
                print(f"Found repository at {child_path}.")

        return found_repos
class Workspaces:

    def __init__(self):
        self._db_recorder = DbRecorder()

    def read_workspaces(self, path: str, ignore_paths: Optional[List[str]] = None) -> None:
        repos = FolderRepoScanner(ignore_paths).read(path)
        workspaces = [
            Workspace(
                id=str(uuid.uuid4()),
                name=repo.name,
                # Folder is a pydantic model, so dump each one to a plain dict
                # first; json.dumps raises TypeError on the models themselves.
                folder_tree_json=json.dumps(
                    {name: folder.model_dump() for name, folder in repo.folder_tree.items()}
                )
            )
            for repo in repos
        ]
        asyncio.run(self._db_recorder.record_workspaces(workspaces))

Review comment (on the uuid-based id): Does this mean that if I restart the container I get a new Workspace per repo in the tree?

Reply: Yes, I still need to add functionality to avoid creating a new workspace if the repo already exists.
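The duplicate-avoidance the author mentions could be sketched as a name check before inserting. This is a hypothetical illustration only: dedupe_workspaces is not part of the PR, and how existing names are fetched from DbRecorder is an assumption, so plain dicts stand in for Workspace records here.

```python
from typing import Dict, List


def dedupe_workspaces(
    existing_names: List[str], candidates: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return only the candidate workspaces whose names are not yet stored."""
    seen = set(existing_names)
    return [w for w in candidates if w["name"] not in seen]


existing = ["codegate"]
candidates = [{"name": "codegate"}, {"name": "new-repo"}]
print(dedupe_workspaces(existing, candidates))  # [{'name': 'new-repo'}]
```

With a check like this keyed on the repository name, restarting the container would re-scan the tree but only insert workspaces for repositories it has not seen before.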
Review comment (on ignore_paths): Would it make sense to just include the contents of gitignore? Sounds like we should make this configurable down the road.

Reply: I did consider the contents of .gitignore, but using that would mean skipping files that may contain secrets yet could still be leaked to LLMs. I was planning to make ignore_paths_workspaces configurable through the CLI, I just didn't have time to do so. The values here would be the defaults.
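The distinction argued in that reply can be shown with a small sketch. The default list below is hypothetical (the PR does not define one); the point is that a gitignored file like .env is deliberately not skipped, so the secrets pipeline can still see and redact it.

```python
from pathlib import Path

# Hypothetical defaults for an explicit ignore list; .gitignore entries such
# as ".env" are intentionally absent so those files are still scanned.
DEFAULT_IGNORE_PATHS = [".git", "node_modules", "__pycache__", ".venv"]


def should_skip(path: Path, ignore_paths=DEFAULT_IGNORE_PATHS) -> bool:
    # Same part-matching rule as FolderRepoScanner._should_skip above.
    return any(part in path.parts for part in ignore_paths)


print(should_skip(Path("myrepo/node_modules/lodash/index.js")))  # True
print(should_skip(Path("myrepo/.env")))  # False: still scanned
```

Making the list a CLI option, as the reply suggests, would let deployments extend these defaults without silently inheriting every exclusion in a repository's .gitignore.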