WIP: feat: Initial code to load workspaces from a specific container path #583
Conversation
…path

Related: #454

This is the initial work to create workspaces when the server is initialized. The idea is that the user mounts a volume at a specific location, `/app/codegate_workspaces`, and the git repositories are read from there.
```diff
@@ -54,6 +54,9 @@ class Config:
     force_certs: bool = False

     max_fim_hash_lifetime: int = 60 * 5  # Time in seconds. Default is 5 minutes.
+    ignore_paths_workspaces = [
+        ".git", "__pycache__", ".venv", ".DS_Store", "node_modules", ".pytest_cache", ".ruff_cache"
+    ]
```
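The ignore list above is used when scanning the mounted volume for repositories. As a hedged sketch (the PR's actual `FolderRepoScanner` implementation is not shown here), a scanner could prune these directory names during the walk so it never descends into them, while collecting any directory that contains a `.git` folder:

```python
# Hypothetical sketch of a folder scan that skips the ignored paths while
# looking for git repositories. Only the ignore list comes from the PR;
# the function name and logic are illustrative.
import os
from pathlib import Path
from typing import Iterable, List

IGNORE_PATHS_WORKSPACES = [
    ".git", "__pycache__", ".venv", ".DS_Store",
    "node_modules", ".pytest_cache", ".ruff_cache",
]

def find_git_repos(root: str, ignore: Iterable[str] = IGNORE_PATHS_WORKSPACES) -> List[Path]:
    """Walk `root`, pruning ignored directories, and collect dirs holding a .git folder."""
    ignore_set = set(ignore)
    repos: List[Path] = []
    for dirpath, dirnames, _ in os.walk(root):
        if ".git" in dirnames:
            repos.append(Path(dirpath))
        # Prune in place so os.walk does not descend into ignored directories.
        dirnames[:] = [d for d in dirnames if d not in ignore_set]
    return repos
```

Pruning `dirnames` in place is what prevents `os.walk` from entering, say, `node_modules`, so a vendored repository inside it would not be picked up as a workspace.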
Would it make sense to just include the contents of `.gitignore`?

Sounds like we should make this configurable down the road.
I did consider the contents of `.gitignore`, but using it would mean skipping files that may contain secrets yet could still be leaked to LLMs. I was planning to make `ignore_paths_workspaces` configurable through the CLI; I just didn't have time to do so. The values here would be the defaults.
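The CLI configurability described above could look roughly like the following sketch (the flag name and the use of `argparse` are assumptions for illustration, not the actual codegate CLI):

```python
# Hedged sketch: overriding the default ignore list from the command line.
# The --ignore-paths-workspaces flag is hypothetical.
import argparse

DEFAULT_IGNORE = [
    ".git", "__pycache__", ".venv", ".DS_Store",
    "node_modules", ".pytest_cache", ".ruff_cache",
]

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ignore-paths-workspaces",
        nargs="*",
        default=DEFAULT_IGNORE,
        help="Directory names skipped when scanning mounted workspaces.",
    )
    return parser.parse_args(argv)
```

With no flag supplied, the hard-coded defaults apply; passing the flag replaces the list entirely.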
```python
repos = FolderRepoScanner(ignore_paths).read(path)
workspaces = [
    Workspace(
        id=str(uuid.uuid4()),
```
Does this mean that if I restart the container I get a new Workspace per repo in the tree?
Yes, I still need to add functionality to avoid creating a new workspace if the repo already exists.
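The missing idempotency check could be sketched as follows (an in-memory stand-in for what the PR would presumably do against the database; the `WorkspaceStore` name and shape are hypothetical):

```python
# Hedged sketch: reuse the existing workspace for a repo name instead of
# minting a new UUID on every container restart.
import uuid
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Workspace:
    id: str
    name: str

@dataclass
class WorkspaceStore:
    _by_name: Dict[str, Workspace] = field(default_factory=dict)

    def get_or_create(self, repo_name: str) -> Workspace:
        # Look up by repo name first; only create when unseen.
        ws = self._by_name.get(repo_name)
        if ws is None:
            ws = Workspace(id=str(uuid.uuid4()), name=repo_name)
            self._by_name[repo_name] = ws
        return ws
```

Keying on the repository name makes workspace creation idempotent across restarts, at the cost of requiring repo names to be unique within the mounted volume.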
```python
class Repository(BaseModel):
    name: str
    folder_tree: Dict[str, Folder]
```
Is the intent to store the whole directory tree of a repository?
What about storing the root of the repo instead of the whole filesystem?
Yes, the intent is to store the whole directory tree of a repository. The reasoning is to allow fast lookups when we see a path in the received code snippets. Right now, we get the path of a code snippet if it was supplied as context to the LLM. Example:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "\n\n```py codegate/src/codegate/pipeline/factory.py (1-57)\nfrom typing import List\n\nfrom codegate.config import Config\nfrom codegate.pipeline.base import PipelineStep, SequentialPipelineProcessor\nfrom codegate.pipeline.codegate_context_retriever.codegate import CodegateContextRetriever\nfrom codegate.pipeline.extract_snippets.extract_snippets import CodeSnippetExtractor\nfrom codegate.pipeline.extract_snippets.output import CodeCommentStep\nfrom codegate.pipeline.output import OutputPipelineProcessor, OutputPipelineStep\nfrom codegate.pipeline.secrets.manager import SecretsManager\nfrom codegate.pipeline.secrets.secrets import (\n CodegateSecrets,\n SecretRedactionNotifier,\n SecretUnredactionStep,\n)\nfrom codegate.pipeline.system_prompt.codegate import SystemPrompt\nfrom codegate.pipeline.version.version import CodegateVersion\n\n\nclass PipelineFactory:\n def __init__(self, secrets_manager: SecretsManager):\n self.secrets_manager = secrets_manager\n\n def create_input_pipeline(self) -> SequentialPipelineProcessor:\n input_steps: List[PipelineStep] = [\n # make sure that this step is always first in the pipeline\n # the other steps might send the request to a LLM for it to be analyzed\n # and without obfuscating the secrets, we'd leak the secrets during those\n # later steps\n CodegateSecrets(),\n CodegateVersion(),\n CodeSnippetExtractor(),\n CodegateContextRetriever(),\n SystemPrompt(Config.get_config().prompts.default_chat),\n ]\n return SequentialPipelineProcessor(input_steps, self.secrets_manager, is_fim=False)\n\n def create_fim_pipeline(self) -> SequentialPipelineProcessor:\n fim_steps: List[PipelineStep] = [\n CodegateSecrets(),\n ]\n return SequentialPipelineProcessor(fim_steps, self.secrets_manager, is_fim=True)\n\n def create_output_pipeline(self) -> OutputPipelineProcessor:\n output_steps: List[OutputPipelineStep] = [\n SecretRedactionNotifier(),\n SecretUnredactionStep(),\n CodeCommentStep(),\n ]\n return OutputPipelineProcessor(output_steps)\n\n def create_fim_output_pipeline(self) -> OutputPipelineProcessor:\n fim_output_steps: List[OutputPipelineStep] = [\n # temporarily disabled\n # SecretUnredactionStep(),\n ]\n return OutputPipelineProcessor(fim_output_steps)\n\n```\nwhats this code doing?"
    }
  ],
  "model": "hosted_vllm/unsloth/Qwen2.5-Coder-32B-Instruct",
  "max_tokens": 4096,
  "stream": true,
  "base_url": "https://inference.codegate.ai/v1"
}
```
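The fast-lookup idea described above can be sketched as follows. This is a hypothetical stand-in for the PR's `folder_tree` structure, assuming the tree is stored as a flat dict keyed by relative path, so matching a snippet path reduces to membership checks on progressively shorter suffixes:

```python
# Hedged sketch: match a path seen in a code snippet against repositories
# whose file trees are stored as flat path -> metadata dicts. The Repository
# shape here is illustrative, not the PR's exact Pydantic model.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Repository:
    name: str
    folder_tree: Dict[str, dict]  # relative path -> file/folder metadata

def find_repo_for_snippet(path: str, repos: Dict[str, Repository]) -> Optional[Repository]:
    """Return the repository whose tree contains a suffix of `path`.

    A snippet path like "codegate/src/codegate/pipeline/factory.py" may be
    prefixed with directories outside the repo, so progressively shorter
    suffixes are checked against each tree.
    """
    parts = path.split("/")
    for i in range(len(parts)):
        candidate = "/".join(parts[i:])
        for repo in repos.values():
            if candidate in repo.folder_tree:
                return repo
    return None
```

Each suffix check is an O(1) dict lookup, which is the payoff of storing the whole tree rather than just the repo root.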
Gotcha, that makes sense. Let's give this a little bit of thought: every time a file is added or removed we'd have to rewrite the whole JSON blob in the database, and that's not optimal either.
Related: #583

We had been using a single DB schema that didn't change until now. This introduces migrations using `alembic`. To create a new migration one can use:

```sh
alembic revision -m "My migration"
```

That should generate an empty migration file that needs to be hand-filled, specifically the `upgrade` method, which will be the one executed when running the migration.

```python
"""My migration

Revision ID: <some_hash>
Revises: <previous_hash>
Create Date: YYYY-MM-DD HH:MM:SS.XXXXXX

"""
from alembic import op
import sqlalchemy as sa

revision = '<some_hash>'
down_revision = '<previous_hash>'
branch_labels = None
depends_on = None


def upgrade():
    pass


def downgrade():
    pass
```
The effort to automatically detect repositories from the information provided by the client has been stopped. At the moment we don't have enough information to accurately pinpoint the repository a user is working in. The workspaces effort will continue with #600.