Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid race conditions when installing multiple dependencies from the same git repository. #9658

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

gustavgransbo
Copy link

@gustavgransbo gustavgransbo commented Aug 30, 2024

Multiple installs from the same git repository causes a race condition when the repository is cloned.
This change only allows one operation per repository to be executed by the parallel workers.
Any extra operation on the same git repository will be performed serially.
Since the git repository is cached after the first operation, this is fast.

Building on the setup of @JonathanRayner, I was able to consistently reproduce the error in this repository.

Thank you @JonathanRayner for the hard work!

Pull Request Check List

Resolves: #6958

  • Added tests for changed code.

The race condition occurs inside poetry.vcs.git.Git.clone, this method is mocked in all tests: conftest.py.
This makes the actual race condition hard to test.

Also, since the race condition required quite a bit of external setup, I think it could be challenging to add a good test for this.

  • Updated documentation for changed code.
    This change should not require any new documentation.

Multiple installs from the same git repository causes arace condition
when the repository is cloned.
This commit only allows one operation per repository to be executed by
the parallel workers.
Any extra operation will be performed serially.
Since the git repository is cached after the first operation, this is
blazingly fast.

Resolves: python-poetry#6958
@seeker25
Copy link

Thank you so much

Copy link
Member

@radoering radoering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a minor nitpick.

I thought a bit about how to write a unit test but even with mocking I could not come up with a good test. Maybe, we can just add kind of a smoke test like test_executor_should_append_subdirectory_for_git but with two packages from the same git repo and check that both are still installed. At least, the order should be fix now, since they are installed serially.

src/poetry/installation/executor.py Outdated Show resolved Hide resolved
@seeker25
Copy link

seeker25 commented Sep 5, 2024

A workaround for now, that works for 1.8.3

poetry config installer.max-workers 1

if operation.package.source_type == "git" and not operation.skipped:
# If this is the first operation cloning this repository,
# allow it to be executed in parallel
if operation.package.source_url not in git_repositories_to_clone:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work on forked vs non-forked?
https://github.com/repo1/code
https://github.com/repo2/code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it does not! The race condition stems from having multiple installs that try to clone to the same local directory. The name for this directory is created here.

In your example, both repositories would get the same target: <source_root>/code.

But as you point out, I would not catch this in my checks for identical source_urls, since the forks have different urls.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two ways this can be resolved:

  1. In the Git.clone method, making sure that a larger portion of the source url is used to craft the target directory.
  2. In the executor logic, ensuring that packages that will have the same target directory will be considered parallel unsafe.

Does anyone have an opinion on what route to take?

Copy link
Contributor

@TheSven73 TheSven73 Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or 3. fix the race condition ?

From the little I understand, it appears that the root cause is concurrent execution of poetry.vcs.git.Git.clone() in the same physical directory. Neither the git executable, nor dulwich support concurrent execution in the same directory. Does this sound right?

What if we prevent concurrent clone execution in the same directory? We could use threading.Lock.

In vcs/git/backend.py this would look like:

from collections import defaultdict
from threading import Lock

_clone_lock: dict[Path, Lock] = defaultdict(Lock)

@classmethod
def clone(
    cls,
    url: str,
    name: str | None = None,
    branch: str | None = None,
    tag: str | None = None,
    revision: str | None = None,
    source_root: Path | None = None,
    clean: bool = False,
) -> Repo:
    source_root = source_root or cls.get_default_source_root()
    source_root.mkdir(parents=True, exist_ok=True)

    name = name or cls.get_name_from_source_url(url=url)
    target = source_root / name
    refspec = GitRefSpec(branch=branch, revision=revision, tag=tag)

    # Prohibit concurrent clones in the same target Path
    with cls._clone_lock[target]:
      # rest of clone follows, only the indentation changes

This will deadlock if clone() calls back into clone() using the same directory. And although recursive calls do happen via submodules, it doesn't look like they use the same directory.

I tried this out on the sample repo in #6958 and it seems to fix the issue, but I'm not familiar enough with the poetry code to know if this would be a viable approach.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Fixing the problem at it's root would of course be great.

An unfortunate functional drawback would be that when multiple installs from the same repository are executed in parallel, they would all execute past the cache lookup, and then proceed to wait right before the clone.

I have not dug deep into this, but it seems like this would cause the repository to be downloaded multiple times instead of just once. With the solution in this PR, it's only downloaded once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. My suggestion in #9658 (comment) could potentially work, but would cause a performance hit for sure.

So how about this: git operations with the same url name (from Git.get_name_from_source_url()) might conflict, because they're cloned in <source_root>/src/<url name>. So we serialize all git operations which share an url name.

For example, on top of poetry main:

diff --git a/src/poetry/installation/executor.py b/src/poetry/installation/executor.py
index 196cdd5a..ff439d50 100644
--- a/src/poetry/installation/executor.py
+++ b/src/poetry/installation/executor.py
@@ -32,6 +32,7 @@ from poetry.utils.helpers import pluralize
 from poetry.utils.helpers import remove_directory
 from poetry.utils.isolated_build import IsolatedBuildError
 from poetry.utils.isolated_build import IsolatedBuildInstallError
+from poetry.vcs.git import Git
 
 
 if TYPE_CHECKING:
@@ -144,6 +145,7 @@ class Executor:
         for _, group in groups:
             tasks = []
             serial_operations = []
+            git_operations = []
             for operation in group:
                 if self._shutdown:
                     break
@@ -162,8 +164,32 @@ class Executor:
                     serial_operations.append(operation)
                     continue
 
+                # Defer git operations, they might need to be serialized
+                if operation.package.source_type == "git" and not operation.skipped:
+                    git_operations.append(operation)
+                    continue
+
                 tasks.append(self._executor.submit(self._execute_operation, operation))
 
+            def url_name_from_operation(operation: Operation) -> str | None:
+                if url := operation.package.repository_url:
+                    return Git.get_name_from_source_url(url)
+                return None
+
+            # git operations on the same url name end up being cloned in the same directory,
+            # so these must execute serially or they'll interfere with each other.
+            for _, git_ops_group in itertools.groupby(
+                git_operations,
+                key=url_name_from_operation,
+            ):
+                ops = list(git_ops_group)
+
+                def _serialize(ops: list[Operation] = ops) -> None:
+                    for op in ops:
+                        self._execute_operation(op)
+
+                tasks.append(self._executor.submit(_serialize))
+
             try:
                 wait(tasks)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards ensuring that forked repositories (actually any repository that happen to have the same name) are not cloned into the same target directory. I'd have to examine if there are any implicit dependencies on the name of the target folder though.

If that route seems acceptable, I would also consider implementing it in a separate PR, since it could be viewed as a separate fix.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate PR sounds right to me.

I am wondering what the use case behind this issue is? If you have two forks with the same name the two packages that are built from the forks are likely to have the same name, right? If you have two packages with the same name from different repos you will get a conflict because you can only install one package with a specific name into an environment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem occurs when the repositories have the same name, not the package. So you would encounter this problem with dependencies like this:

pkg_1 = {git = "[email protected]:foo/my_packages.git", subdirectory = "pkg_1"}
pkg_2 = {git = "[email protected]:bar/my_packages.git", subdirectory = "pkg_2"}

It doesn't matter if one repository is a fork of the other or not, having the same name should be enough to cause the conflict.

I'm not sure how common this case is though, so I would be happy to defer it to a separate fix.

Operations for a single repository are grouped and executed serially by
a single worker.
@radoering
Copy link
Member

@gustavgransbo @TheSven73 Just to be sure before I take another detailed look and potentially merge it: If we outsource #9658 (comment) into a separate issue/PR, this PR is ready from your end, isn't it?

@gustavgransbo
Copy link
Author

@gustavgransbo @TheSven73 Just to be sure before I take another detailed look and potentially merge it: If we outsource #9658 (comment) into a separate issue/PR, this PR is ready from your end, isn't it?

Yes, from my perspective it's ready to be merged!

self._executor.submit(
functools.partial(
_serialize,
repository_serial_operations=repository_git_operations,
Copy link
Contributor

@TheSven73 TheSven73 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is functools.partial an unnecessary complication here? Isn't this code just equivalent to

tasks.append(self._executor.submit(_serialize, repository_git_operations))

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NotGitRepository error when installing multiple packages from one git repository
4 participants