
WIP: feat: Initial code to load workspaces from a specific container path #583

Closed
wants to merge 1 commit into from

Conversation

aponcedeleonch
Contributor

Related: #454

This is the initial work to create workspaces when the server is initialized. The idea is that the user mounts a volume at a specific location, `/app/codegate_workspaces`, and the git repositories are read from there.

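The loading idea described above could be sketched roughly as follows (a minimal sketch; `find_git_repos` and the scanning logic are illustrative assumptions, not the PR's actual code):

```python
from pathlib import Path

# Minimal sketch of the idea: the user mounts a volume at
# /app/codegate_workspaces and we look for git repositories inside it.
# The helper name and logic are illustrative, not the PR's actual code.
def find_git_repos(root: Path) -> list[Path]:
    """Return directories directly under `root` that contain a .git folder."""
    if not root.is_dir():
        return []
    return [
        child
        for child in sorted(root.iterdir())
        if child.is_dir() and (child / ".git").exists()
    ]

repos = find_git_repos(Path("/app/codegate_workspaces"))
```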
@aponcedeleonch aponcedeleonch marked this pull request as draft January 14, 2025 15:08
@@ -54,6 +54,9 @@ class Config:
force_certs: bool = False

max_fim_hash_lifetime: int = 60 * 5 # Time in seconds. Default is 5 minutes.
ignore_paths_workspaces = [
".git", "__pycache__", ".venv", ".DS_Store", "node_modules", ".pytest_cache", ".ruff_cache"
]
Contributor

Would it make sense to just include the contents of `.gitignore`?
Sounds like we should make this configurable down the road.

Contributor Author

I did consider the contents of `.gitignore`, but using that would mean skipping files that may contain secrets but could still be leaked to LLMs.

I was planning to make `ignore_paths_workspaces` configurable through the CLI; I just didn't have time to do so. The values here would be the defaults.
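For illustration, making the ignore list CLI-configurable could look like the sketch below (argparse and the flag name are assumptions; codegate's actual CLI may differ):

```python
import argparse

# Defaults taken from the diff above; the flag name is an assumption.
DEFAULT_IGNORES = [
    ".git", "__pycache__", ".venv", ".DS_Store",
    "node_modules", ".pytest_cache", ".ruff_cache",
]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ignore-paths-workspaces",
        nargs="*",
        default=DEFAULT_IGNORES,
        help="Directory names skipped when scanning mounted workspaces",
    )
    return parser
```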

repos = FolderRepoScanner(ignore_paths).read(path)
workspaces = [
Workspace(
id=str(uuid.uuid4()),
Contributor

Does this mean that if I restart the container I get a new Workspace per repo in the tree?

Contributor Author

Yes, I still need to add functionality to avoid creating a new workspace if the repo already exists.
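The look-up-before-create behaviour mentioned here could be sketched like this (the in-memory mapping is an assumption; the real implementation would query the database):

```python
import uuid

# Hypothetical sketch of "don't recreate existing workspaces":
# look up by repo name first, and only mint a new id when the repo is new.
def get_or_create_workspace(existing: dict[str, str], repo_name: str) -> str:
    """Return the workspace id for `repo_name`, creating one only if absent."""
    if repo_name not in existing:
        existing[repo_name] = str(uuid.uuid4())
    return existing[repo_name]
```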


class Repository(BaseModel):
name: str
folder_tree: Dict[str, Folder]
Contributor

Is the intent to store the whole directory tree of a repository?

Contributor

What about storing the root of the repo instead of the whole filesystem?

Contributor Author

Yes, the intent is to store the whole directory tree of a repository. The reasoning behind it is to do fast lookups when we see a path in the received code snippets. Right now, we get the path of a code snippet if it was supplied as context to the LLM. Example:

{
  "messages": [
    {
      "role": "user",
      "content": "\n\n```py codegate/src/codegate/pipeline/factory.py (1-57)\nfrom typing import List\n\nfrom codegate.config import Config\nfrom codegate.pipeline.base import PipelineStep, SequentialPipelineProcessor\nfrom codegate.pipeline.codegate_context_retriever.codegate import CodegateContextRetriever\nfrom codegate.pipeline.extract_snippets.extract_snippets import CodeSnippetExtractor\nfrom codegate.pipeline.extract_snippets.output import CodeCommentStep\nfrom codegate.pipeline.output import OutputPipelineProcessor, OutputPipelineStep\nfrom codegate.pipeline.secrets.manager import SecretsManager\nfrom codegate.pipeline.secrets.secrets import (\n    CodegateSecrets,\n    SecretRedactionNotifier,\n    SecretUnredactionStep,\n)\nfrom codegate.pipeline.system_prompt.codegate import SystemPrompt\nfrom codegate.pipeline.version.version import CodegateVersion\n\n\nclass PipelineFactory:\n    def __init__(self, secrets_manager: SecretsManager):\n        self.secrets_manager = secrets_manager\n\n    def create_input_pipeline(self) -> SequentialPipelineProcessor:\n        input_steps: List[PipelineStep] = [\n            # make sure that this step is always first in the pipeline\n            # the other steps might send the request to a LLM for it to be analyzed\n            # and without obfuscating the secrets, we'd leak the secrets during those\n            # later steps\n            CodegateSecrets(),\n            CodegateVersion(),\n            CodeSnippetExtractor(),\n            CodegateContextRetriever(),\n            SystemPrompt(Config.get_config().prompts.default_chat),\n        ]\n        return SequentialPipelineProcessor(input_steps, self.secrets_manager, is_fim=False)\n\n    def create_fim_pipeline(self) -> SequentialPipelineProcessor:\n        fim_steps: List[PipelineStep] = [\n            CodegateSecrets(),\n        ]\n        return SequentialPipelineProcessor(fim_steps, self.secrets_manager, is_fim=True)\n\n    def create_output_pipeline(self) -> 
OutputPipelineProcessor:\n        output_steps: List[OutputPipelineStep] = [\n            SecretRedactionNotifier(),\n            SecretUnredactionStep(),\n            CodeCommentStep(),\n        ]\n        return OutputPipelineProcessor(output_steps)\n\n    def create_fim_output_pipeline(self) -> OutputPipelineProcessor:\n        fim_output_steps: List[OutputPipelineStep] = [\n            # temporarily disabled\n            # SecretUnredactionStep(),\n        ]\n        return OutputPipelineProcessor(fim_output_steps)\n\n```\nwhats this code doing?"
    }
  ],
  "model": "hosted_vllm/unsloth/Qwen2.5-Coder-32B-Instruct",
  "max_tokens": 4096,
  "stream": true,
  "base_url": "https://inference.codegate.ai/v1"
}
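The fast-lookup idea could be sketched as a flat path index (the data shapes here are assumptions, not the PR's actual models):

```python
# Sketch: map every known file path of a repository back to the repo name,
# so a path seen in a code snippet (e.g. "src/codegate/pipeline/factory.py")
# resolves to a repository with a single dict lookup.
def build_path_index(repos: dict[str, list[str]]) -> dict[str, str]:
    """Map each known file path to the repository that contains it."""
    index: dict[str, str] = {}
    for repo_name, paths in repos.items():
        for path in paths:
            index[path] = repo_name
    return index
```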

Contributor

Gotcha, that makes sense. Let's give this a little bit of thought; every time a file is added or removed we'd have to rewrite the JSON blob in the database, and that's not optimal either.

aponcedeleonch added a commit that referenced this pull request Jan 15, 2025
Related: #583

We had been using a single DB schema that didn't change until now.
This introduces migrations using `alembic`. To create a new migration
one can use:
```sh
alembic revision -m "My migration"
```
That should generate an empty migration file that needs to be filled in by hand, specifically the `upgrade` method, which is the one executed when running the migration.
```python
"""My migration

Revision ID: <some_hash>
Revises: <previous_hash>
Create Date: YYYY-MM-DD HH:MM:SS.XXXXXX
"""
from alembic import op
import sqlalchemy as sa

revision = '<some_hash>'
down_revision = '<previous_hash>'
branch_labels = None
depends_on = None

def upgrade():
    pass

def downgrade():
    pass
```
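As an illustration, a hand-filled `upgrade`/`downgrade` pair might look like this (the `workspaces` table and its columns are made up for the example):

```python
"""Add workspaces table

Revision ID: <some_hash>
Revises: <previous_hash>
"""
import sqlalchemy as sa
from alembic import op

revision = '<some_hash>'
down_revision = '<previous_hash>'
branch_labels = None
depends_on = None


def upgrade():
    # Executed when running the migration: create the new table.
    op.create_table(
        "workspaces",
        sa.Column("id", sa.Text(), primary_key=True),
        sa.Column("name", sa.Text(), nullable=False, unique=True),
    )


def downgrade():
    # Revert the migration.
    op.drop_table("workspaces")
```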
@aponcedeleonch
Contributor Author

The effort to automatically detect repositories from the information provided by the client has been stopped. At the moment we don't have enough information to accurately pinpoint the repository a user is working on. The workspaces effort will continue in #600.
