
WIP: feat: Initial code to load workspaces from a specific container path #583

Closed
wants to merge 1 commit into from

Conversation

aponcedeleonch
Contributor

Related: #454

This is the initial work to create workspaces when the server is initialized. The idea is that the user mounts a volume at a specific location, `/app/codegate_workspaces`, and the git repositories are read from there.

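The loading idea described above could be sketched roughly as follows (a minimal sketch; `find_git_repos` and the scanning logic are illustrative assumptions, not the PR's actual code):

```python
from pathlib import Path

# Minimal sketch of the idea: the user mounts a volume at
# /app/codegate_workspaces and we look for git repositories inside it.
# The helper name and logic are illustrative, not the PR's actual code.
def find_git_repos(root: Path) -> list[Path]:
    """Return directories directly under `root` that contain a .git folder."""
    if not root.is_dir():
        return []
    return [
        child
        for child in sorted(root.iterdir())
        if child.is_dir() and (child / ".git").exists()
    ]

repos = find_git_repos(Path("/app/codegate_workspaces"))
```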
@aponcedeleonch aponcedeleonch marked this pull request as draft January 14, 2025 15:08
@@ -54,6 +54,9 @@ class Config:
force_certs: bool = False

max_fim_hash_lifetime: int = 60 * 5 # Time in seconds. Default is 5 minutes.
ignore_paths_workspaces = [
".git", "__pycache__", ".venv", ".DS_Store", "node_modules", ".pytest_cache", ".ruff_cache"
]
Contributor

Would it make sense to just include the contents of `.gitignore`?
Sounds like we should make this configurable down the road.

Contributor Author

I did consider the contents of `.gitignore`, but using that would mean skipping files that may contain secrets but could still be leaked to LLMs.

I was planning to make `ignore_paths_workspaces` configurable through the CLI; I just didn't have time to do so. The values here would be the defaults.
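For illustration, making the ignore list CLI-configurable could look like the sketch below (argparse and the flag name are assumptions; codegate's actual CLI may differ):

```python
import argparse

# Defaults taken from the diff above; the flag name is an assumption.
DEFAULT_IGNORES = [
    ".git", "__pycache__", ".venv", ".DS_Store",
    "node_modules", ".pytest_cache", ".ruff_cache",
]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ignore-paths-workspaces",
        nargs="*",
        default=DEFAULT_IGNORES,
        help="Directory names skipped when scanning mounted workspaces",
    )
    return parser
```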

repos = FolderRepoScanner(ignore_paths).read(path)
workspaces = [
Workspace(
id=str(uuid.uuid4()),
Contributor

Does this mean that if I restart the container I get a new Workspace per repo in the tree?

Contributor Author

Yes, I still need to add functionality to avoid creating a new workspace if the repo already exists.
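The look-up-before-create behaviour mentioned here could be sketched like this (the in-memory mapping is an assumption; the real implementation would query the database):

```python
import uuid

# Hypothetical sketch of "don't recreate existing workspaces":
# look up by repo name first, and only mint a new id when the repo is new.
def get_or_create_workspace(existing: dict[str, str], repo_name: str) -> str:
    """Return the workspace id for `repo_name`, creating one only if absent."""
    if repo_name not in existing:
        existing[repo_name] = str(uuid.uuid4())
    return existing[repo_name]
```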


class Repository(BaseModel):
name: str
folder_tree: Dict[str, Folder]
Contributor

Is the intent to store the whole directory tree of a repository?

Contributor

What about storing the root of the repo instead of the whole filesystem?

Contributor Author

Yes, the intent is to store the whole directory tree of a repository. The reasoning behind it is to do fast lookups when we see a path in the received code snippets. Right now, we get the path of a code snippet if it was supplied as context to the LLM. Example:

{
  "messages": [
    {
      "role": "user",
      "content": "\n\n```py codegate/src/codegate/pipeline/factory.py (1-57)\nfrom typing import List\n\nfrom codegate.config import Config\nfrom codegate.pipeline.base import PipelineStep, SequentialPipelineProcessor\nfrom codegate.pipeline.codegate_context_retriever.codegate import CodegateContextRetriever\nfrom codegate.pipeline.extract_snippets.extract_snippets import CodeSnippetExtractor\nfrom codegate.pipeline.extract_snippets.output import CodeCommentStep\nfrom codegate.pipeline.output import OutputPipelineProcessor, OutputPipelineStep\nfrom codegate.pipeline.secrets.manager import SecretsManager\nfrom codegate.pipeline.secrets.secrets import (\n    CodegateSecrets,\n    SecretRedactionNotifier,\n    SecretUnredactionStep,\n)\nfrom codegate.pipeline.system_prompt.codegate import SystemPrompt\nfrom codegate.pipeline.version.version import CodegateVersion\n\n\nclass PipelineFactory:\n    def __init__(self, secrets_manager: SecretsManager):\n        self.secrets_manager = secrets_manager\n\n    def create_input_pipeline(self) -> SequentialPipelineProcessor:\n        input_steps: List[PipelineStep] = [\n            # make sure that this step is always first in the pipeline\n            # the other steps might send the request to a LLM for it to be analyzed\n            # and without obfuscating the secrets, we'd leak the secrets during those\n            # later steps\n            CodegateSecrets(),\n            CodegateVersion(),\n            CodeSnippetExtractor(),\n            CodegateContextRetriever(),\n            SystemPrompt(Config.get_config().prompts.default_chat),\n        ]\n        return SequentialPipelineProcessor(input_steps, self.secrets_manager, is_fim=False)\n\n    def create_fim_pipeline(self) -> SequentialPipelineProcessor:\n        fim_steps: List[PipelineStep] = [\n            CodegateSecrets(),\n        ]\n        return SequentialPipelineProcessor(fim_steps, self.secrets_manager, is_fim=True)\n\n    def create_output_pipeline(self) -> 
OutputPipelineProcessor:\n        output_steps: List[OutputPipelineStep] = [\n            SecretRedactionNotifier(),\n            SecretUnredactionStep(),\n            CodeCommentStep(),\n        ]\n        return OutputPipelineProcessor(output_steps)\n\n    def create_fim_output_pipeline(self) -> OutputPipelineProcessor:\n        fim_output_steps: List[OutputPipelineStep] = [\n            # temporarily disabled\n            # SecretUnredactionStep(),\n        ]\n        return OutputPipelineProcessor(fim_output_steps)\n\n```\nwhats this code doing?"
    }
  ],
  "model": "hosted_vllm/unsloth/Qwen2.5-Coder-32B-Instruct",
  "max_tokens": 4096,
  "stream": true,
  "base_url": "https://inference.codegate.ai/v1"
}
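The fast-lookup idea could be sketched as a flat path index (the data shapes here are assumptions, not the PR's actual models):

```python
# Sketch: map every known file path of a repository back to the repo name,
# so a path seen in a code snippet (e.g. "src/codegate/pipeline/factory.py")
# resolves to a repository with a single dict lookup.
def build_path_index(repos: dict[str, list[str]]) -> dict[str, str]:
    """Map each known file path to the repository that contains it."""
    index: dict[str, str] = {}
    for repo_name, paths in repos.items():
        for path in paths:
            index[path] = repo_name
    return index
```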

Contributor

Gotcha, that makes sense. Let's give this a little bit of thought; every time a file is added or removed we'd have to rewrite the JSON blob in the database, and that's not optimal either.

aponcedeleonch added a commit that referenced this pull request Jan 15, 2025
Related: #583

We had been using a single DB schema that didn't change until now.
This introduces migrations using `alembic`. To create a new migration
one can use:
```sh
alembic revision -m "My migration"
```
That should generate an empty migration file that needs to be filled in by hand, specifically the `upgrade` method, which is the one executed when running the migration.
```python
"""My migration

Revision ID: <some_hash>
Revises: <previous_hash>
Create Date: YYYY-MM-DD HH:MM:SS.XXXXXX
"""
from alembic import op
import sqlalchemy as sa

revision = '<some_hash>'
down_revision = '<previous_hash>'
branch_labels = None
depends_on = None

def upgrade():
    pass

def downgrade():
    pass
```
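As an illustration, a hand-filled `upgrade`/`downgrade` pair might look like this (the `workspaces` table and its columns are made up for the example):

```python
"""Add workspaces table

Revision ID: <some_hash>
Revises: <previous_hash>
"""
import sqlalchemy as sa
from alembic import op

revision = '<some_hash>'
down_revision = '<previous_hash>'
branch_labels = None
depends_on = None


def upgrade():
    # Executed when running the migration: create the new table.
    op.create_table(
        "workspaces",
        sa.Column("id", sa.Text(), primary_key=True),
        sa.Column("name", sa.Text(), nullable=False, unique=True),
    )


def downgrade():
    # Revert the migration.
    op.drop_table("workspaces")
```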
@aponcedeleonch
Contributor Author

The effort to automatically detect repositories from the information provided by the client has been stopped. At the moment we don't have enough information to accurately pinpoint the repository a user is working on. The workspaces effort will continue in #600.
