Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pip cache remove pattern matching. #13094

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions news/13086.bugfix.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix ``pip cache remove`` pattern matching.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As evident by all of the edge cases, this is moreso a step in the right direction than a total fix. It's a good step forward and certainly needed, but I don't want to claim everything will work w/o issues now :)

ALSO, this PR will affect pip cache list too. That's a good thing, we may as well mention it.

55 changes: 45 additions & 10 deletions src/pip/_internal/commands/cache.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
import fnmatch
import os
import re
import textwrap
from optparse import Values
from typing import Any, List
Expand Down Expand Up @@ -207,22 +209,55 @@ def _find_http_files(self, options: Values) -> List[str]:
def _find_wheels(self, options: Values, pattern: str) -> List[str]:
wheel_dir = self._cache_dir(options, "wheels")

# The wheel filename format, as specified in PEP 427, is:
# The wheel filename format, as originally specified in PEP 427, is:
# {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
#
# Additionally, non-alphanumeric values in the distribution are
# normalized to underscores (_), meaning hyphens can never occur
# before `-{version}`.
Comment on lines 215 to 217
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, due to bugs in build backends (like pypa/setuptools#3777), this isn't guaranteed. Dashes are always normalized (as the wheel filename would be invalid otherwise) but the name may contain periods. For example, zope.sqlalchemy-3.1-py3-none-any.whl.

Normalization of the wheel filenames is going to be necessary to make this more robust. That can be saved for later, of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you realize this, but this is pre-existing commentary. As Pex maintainer I've become intimately familiar with packaging history and reality.

#
# Given that information:
# - If the pattern we're given contains a hyphen (-), the user is
# providing at least the version. Thus, we can just append `*.whl`
# to match the rest of it.
# - If the pattern we're given doesn't contain a hyphen (-), the
# user is only providing the name. Thus, we append `-*.whl` to
# match the hyphen before the version, followed by anything else.
# - If the pattern we're given is not a glob:
# + We attempt project name normalization such that pyyaml matches PyYAML,
# etc. as outlined here:
# https://packaging.python.org/specifications/name-normalization.
# The one difference is we normalize non-alphanumeric to `_` instead of `-`
# since that is how the wheel filename components are normalized
# (facepalm).
Comment on lines +221 to +226
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this isn't quite accurate? The match is always case insensitive as we're using re.IGNORECASE. What is happening for non-globs is that any non-alphanumeric characters are normalized so, say, pip cache remove typing.extensions will match typing_extensions-4.12.2-py3-none-any.whl.

# + If the pattern we're given contains a hyphen (-), the user is providing
# at least the version. Thus, we can just append `*.whl` to match the rest
# of it if it doesn't already glob to the end of the wheel file name.
Comment on lines +227 to +229
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# + If the pattern we're given contains a hyphen (-), the user is providing
# at least the version. Thus, we can just append `*.whl` to match the rest
# of it if it doesn't already glob to the end of the wheel file name.
# + If the pattern contains a hyphen (-), the user is providing
# the version. Thus, we can just append `*.whl` to match the rest
# of the wheel filename.

I realize that this comment is pre-existing, but it's also too dense to be readable. Also, this addition always occurs, no? This bullet point shouldn't be under the If the pattern we're given is not a glob: header. Ah, this is repeated for the glob case below.

Comment on lines +227 to +229
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is not always true. pip cache remove typing-extensions should remove a wheel named typing_extensions-4.12.2-py3-none-any.whl but this assumption prevents this. However, this doesn't work on main either (due to lack of normalization), so I wouldn't mind punting this limitation.1

Footnotes

  1. It quickly becomes tricky with multiple dashes... like pip cache remove typing-extensions-4.12.2 🙃.

# + If the pattern we're given is not a glob and doesn't contain a
# hyphen (-), the user is only providing the name. Thus, we append `-*.whl`
# to match the hyphen before the version, followed by anything else.
Comment on lines +230 to +232
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# + If the pattern we're given is not a glob and doesn't contain a
# hyphen (-), the user is only providing the name. Thus, we append `-*.whl`
# to match the hyphen before the version, followed by anything else.
# + If the pattern does NOT contain a
# hyphen (-), the user is only providing the name. Thus, we append `-*.whl`
# to match the hyphen before the version, followed by anything else.

This has already been implied by the preceding comments.

# - If the pattern is a glob, we ensure `*.whl` is appended if needed but
# cowardly do not attempt glob parsing / project name normalization. The user
# is on their own at that point.
#
# PEP 427: https://www.python.org/dev/peps/pep-0427/
Comment on lines 236 to 237
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#
# PEP 427: https://www.python.org/dev/peps/pep-0427/

If we're linking to the canonical specification, the reference to the original PEP isn't needed anymore.

pattern = pattern + ("*.whl" if "-" in pattern else "-*.whl")

return filesystem.find_files(wheel_dir, pattern)
# And: https://packaging.python.org/specifications/binary-distribution-format/
Comment on lines 237 to +238
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Pip / PyPA knowledgeable reviewers - what the heck is up with this by the way? I've always found it off-putting since the PEP banners have started appearing on some of the PyPA PEPs re-directing to these new standards pages. The fact that some PEPs get a new page and some don't and what process the ones with the new page go through to get updated is all mysterious feeling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We’ve been pushing people toward the standard pages on packaging.python.org since those are the up-to-date standard documents, while PEP pages are no longer updated once they are done, which can be a problem when you want to track done how to do a thing that’s been revised a couple of times. I don’t have knowledge how the auto redirection is actually implemented though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The procedure for modern packaging PEPs is that once accepted, the specification is transferred to the PUG (Packaging User Guide, which is the name for packaging.python.org). This is documented in PyPA procedure:

[Proposals for new interoperability specifications] must be accompanied by a pull request to the Python Packaging User Guide repository, against the PyPA Specifications section, that adds a new subsection defining the purpose of the new specification and the role it plays in the wider Python packaging ecosystem.

I'd imagine that this doesn't happen for all packaging PEPs though... (due to forgetting or not realizing this is a requirement).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yeah, I was hoping for a better reason. I personally don't trust the non-PEP pages and never read them. I read the PEPs and their backlinks from follow-on PEPs to know the actually agreed on thing and the history. I need to be sure of both as a packing ecosystem maintainer that maintains a tool that never breaks backward compatibility (knowingly), including never dropping old Python version support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's entirely possible according to the process for the canonical specification (on packaging.python.org) to be updated without needing a PEP. So using only the PEPs actually risks not following the current standards. In practice, it's quite rare that this happens (it requires an "uncontroversial" change, and we all know how likely that is 🙄)

The canonical spec should include a changelog (see here for an example, which also includes a change that wasn't in a PEP) but as with everything else, it relies on people maintaining it correctly.

Copy link
Contributor Author

@jsirois jsirois Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a packaging ecosystem maintainer I find that horrifying. I'd much rather have a chain of PEPs than that chain + maybe a summary that may also have new content on top and may or may not have a changelog fully filled out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're interpreting it wrongly. The spec documents are definitive, and PEPs are proposals to change that set of documents. Just like the relationship between the Python source and PEPs that modify it. The problem with treating PEPs as definitive is that they are not allowed to change, so you end up having to read a series of documents, applying them as a series of diffs, to understand the actual spec.

Criticising the process because people make mistakes seems unreasonable to me (and honestly, I don't think mistakes are frequent - most discrepancies between the defined process and reality are cases of PEPs that were written before the current process was defined, and no-one has volunteered yet to write the necessary historical documents)

uses_glob_patterns = any(
glob_pattern_char in pattern for glob_pattern_char in ("[", "*", "?")
)
Comment on lines +239 to +241
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uses_glob_patterns = any(
glob_pattern_char in pattern for glob_pattern_char in ("[", "*", "?")
)
uses_glob_patterns = any(char in "[]*?" for char in pattern)

if not uses_glob_patterns:
project_name, sep, rest = pattern.partition("-")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B. I documented punting on deciphering glob patterns with respect to name normalization. Someone braver than I might try that later. I assume forward progress here is better than no progress though.

I did not, however, call the shot on version normalization. Again, I was cowardly, but that problem still remains. If a wheel in the cache is foo-1.0-py3-none-any.whl, then, in my opinion, pip cache remove foo-1.0.0 should remove it, but does not today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume forward progress here is better than no progress though.

Indeed!

I documented punting on deciphering glob patterns with respect to name normalization.

Hmm, how does this limitation come into play? The pattern is matched with case sensitivity disabled so that's not a problem.

I did not, however, call the shot on version normalization

The proper way to handle this would be performing a fully normalized version match. Extract the version from the pattern, convert it to packaging.version.Version, do the same for the cached wheels, and compare them. For wildcards, packaging.specifiers.Specifier could be used (although we'd have to be careful to only accept * and reject any other operator because pip cache remove should not support PEP 440 semantics).

# N.B.: This only normalizes non-alphanumeric characters in the name, which
# is 1/2 the job. Stripping case sensitivity is the other 1/2 and is
# handled below in the conversion from a glob to a regex executed with the
# IGNORECASE flag active.
Comment on lines +244 to +247
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# N.B.: This only normalizes non-alphanumeric characters in the name, which
# is 1/2 the job. Stripping case sensitivity is the other 1/2 and is
# handled below in the conversion from a glob to a regex executed with the
# IGNORECASE flag active.
# NOTE: This only normalizes non-alphanumeric characters in the name.
# Stripping case sensitivity is handled below in the conversion from a glob
# to a regex executed with the IGNORECASE flag.

As someone who likes to write wordy comments, I totally get where you're coming from, but being concise is a good thing :)

partially_normalized_project_name = re.sub(r"[^\w.]+", "_", project_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
partially_normalized_project_name = re.sub(r"[^\w.]+", "_", project_name)
partially_normalized_project_name = re.sub(r"[-_.]+", "_", project_name)

It's probably safer to simply reuse the pattern from the specification.

if not sep:
sep = "-"
pattern = f"{partially_normalized_project_name}{sep}{rest}"

if not pattern.endswith((".whl", "*")):
pattern = f"{pattern}*.whl"

regex = fnmatch.translate(pattern)

return [
os.path.join(root, filename)
for root, _, files in os.walk(wheel_dir)
for filename in files
if re.match(regex, filename, flags=re.IGNORECASE)
]
Comment on lines +256 to +263
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that filesystem.find_files() gained an optional case_insensitive keyword parameter that did this switch to regex dance under the hood.

104 changes: 85 additions & 19 deletions tests/functional/test_cache.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,23 +68,26 @@ def populate_http_cache(http_cache_dir: str) -> List[Tuple[str, str]]:
return files


@pytest.fixture
def populate_wheel_cache(wheel_cache_dir: str) -> List[Tuple[str, str]]:
def _populate_wheel_cache(
wheel_cache_dir: str, *name_and_versions: str
) -> List[Tuple[str, str]]:
destination = os.path.join(wheel_cache_dir, "arbitrary", "pathname")
os.makedirs(destination)

files = [
("yyy-1.2.3", os.path.join(destination, "yyy-1.2.3-py3-none-any.whl")),
("zzz-4.5.6", os.path.join(destination, "zzz-4.5.6-py3-none-any.whl")),
("zzz-4.5.7", os.path.join(destination, "zzz-4.5.7-py3-none-any.whl")),
("zzz-7.8.9", os.path.join(destination, "zzz-7.8.9-py3-none-any.whl")),
]

for _name, filename in files:
wheel_info = []
for name_and_version in name_and_versions:
filename = os.path.join(destination, f"{name_and_version}-py3-none-any.whl")
with open(filename, "w"):
pass
wheel_info.append((name_and_version, filename))
return wheel_info

return files

@pytest.fixture
def populate_wheel_cache(wheel_cache_dir: str) -> List[Tuple[str, str]]:
return _populate_wheel_cache(
wheel_cache_dir, "yyy-1.2.3", "zzz-4.5.6", "zzz-4.5.7", "zzz-7.8.9"
)


@pytest.fixture
Expand Down Expand Up @@ -332,34 +335,97 @@ def test_cache_remove_too_many_args(script: PipTestEnvironment) -> None:
script.pip("cache", "remove", "aaa", "bbb", expect_error=True)


@pytest.mark.parametrize(
"pattern",
[
"zzz",
"ZZZ",
"zZz-*-py3-none-any.whl",
"zZz-*.whl",
"zzz-*",
"zzz*",
"zz[a-zA-Z]?",
"[!y]",
],
)
@pytest.mark.usefixtures("populate_wheel_cache")
def test_cache_remove_name_match(
script: PipTestEnvironment, remove_matches_wheel: RemoveMatches
script: PipTestEnvironment, pattern: str, remove_matches_wheel: RemoveMatches
) -> None:
"""Running `pip cache remove zzz` should remove zzz-4.5.6 and zzz-7.8.9,
but nothing else."""
result = script.pip("cache", "remove", "zzz", "--verbose")
"""Running `pip cache remove <zzz matching pattern>` should remove zzz-4.5.6,
zzz-4.5.7 and zzz-7.8.9, but nothing else."""
result = script.pip("cache", "remove", pattern, "--verbose")

assert not remove_matches_wheel("yyy-1.2.3", result)
assert remove_matches_wheel("zzz-4.5.6", result)
assert remove_matches_wheel("zzz-4.5.7", result)
assert remove_matches_wheel("zzz-7.8.9", result)


@pytest.mark.parametrize(
"pattern",
[
"zzz-4.5.6",
"ZZZ-4.5.6",
"zZz-4.5.6-py3-none-any.whl",
"zZz-4.5.6-*.whl",
"zZz-4.5.[0-6]",
],
)
@pytest.mark.usefixtures("populate_wheel_cache")
def test_cache_remove_name_and_version_match(
script: PipTestEnvironment, remove_matches_wheel: RemoveMatches
script: PipTestEnvironment, pattern: str, remove_matches_wheel: RemoveMatches
) -> None:
"""Running `pip cache remove zzz-4.5.6` should remove zzz-4.5.6, but
nothing else."""
result = script.pip("cache", "remove", "zzz-4.5.6", "--verbose")
"""Running `pip cache remove <zzz-4.5.6 matching pattern>` should remove zzz-4.5.6,
but nothing else."""
result = script.pip("cache", "remove", pattern, "--verbose")

assert not remove_matches_wheel("yyy-1.2.3", result)
assert remove_matches_wheel("zzz-4.5.6", result)
assert not remove_matches_wheel("zzz-4.5.7", result)
assert not remove_matches_wheel("zzz-7.8.9", result)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also include some test cases with periods and runs of underscores? The latter is not something I'd expect an user to pass to pip, but it'd be good to validate this is following the specifications.

"pattern",
[
"pyfftw",
"pyfftw-0.15",
"pyfftw-0.15.0",
"pyFFTW",
"pyFFTW-0.15",
"pyFFTW-0.15.0",
],
)
def test_issue_13086_op_case(
script: PipTestEnvironment,
wheel_cache_dir: str,
pattern: str,
remove_matches_wheel: RemoveMatches,
) -> None:
_populate_wheel_cache(wheel_cache_dir, "pyFFTW-0.15.0", "foo-1.0")
result = script.pip("cache", "remove", pattern, "--verbose")

assert remove_matches_wheel("pyFFTW-0.15.0", result)
assert not remove_matches_wheel("foo-1.0", result)


def test_issue_13086_glob_case(
script: PipTestEnvironment,
wheel_cache_dir: str,
remove_matches_wheel: RemoveMatches,
) -> None:
_populate_wheel_cache(
wheel_cache_dir, "foo0-1.0", "foo1-1.0", "foo2-1.0", "foo1bob-1.0"
)
result = script.pip("cache", "remove", "foo[0-2]", "--verbose")

assert remove_matches_wheel("foo0-1.0", result)
assert remove_matches_wheel("foo1-1.0", result)
assert remove_matches_wheel("foo2-1.0", result)
assert not remove_matches_wheel("foo1bob", result)


@pytest.mark.usefixtures("populate_http_cache", "populate_wheel_cache")
def test_cache_purge(
script: PipTestEnvironment,
Expand Down