Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glob.translate incorrectly matches path separator in character ranges #130942

Open
euangoodbrand opened this issue Mar 7, 2025 · 23 comments
Open
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@euangoodbrand
Copy link

euangoodbrand commented Mar 7, 2025

Bug report

Bug description:

Description

The glob.translate function in Python 3.13 doesn't properly handle path separators when they're included in character ranges. According to the documentation, "wildcards do not match path separators," but this isn't enforced for character ranges that include the path separator.

Expected behavior

Character ranges in glob patterns should not match path separators, consistent with the behavior of single character wildcards (?) and the documented behavior of the module.

Actual behavior

Character ranges that include the path separator (e.g., [%-0] which includes the / character) will match the path separator, contradicting the documented behavior.

Code to reproduce

import glob
import re

# Proper behavior with ? wildcard
r1 = re.compile(glob.translate('a?c', seps=['/']))
print(r1)  # (?s:a[^/]c)\Z
print(r1.match('abc'))  # match object
print(r1.match('a/c'))  # None, correctly doesn't match path separator

# Incorrect behavior with character range
r2 = re.compile(glob.translate('a[%-0]c', seps=['/']))
print(r2)  # (?s:a[%-0]c)\Z
print(r2.match('a0c'))  # match object
print(r2.match('a/c'))  # match object, incorrectly matches path separator

### CPython versions tested on:

3.13

### Operating systems tested on:

_No response_

Link to original find: https://stackoverflow.com/questions/79492274/glob-translate-incorrectly-matches-path-separator-when-using-ranges

Linked PRs

@euangoodbrand euangoodbrand added the type-bug An unexpected behavior, bug, or error label Mar 7, 2025
@ZeroIntensity ZeroIntensity added the stdlib Python modules in the Lib dir label Mar 7, 2025
@dkaszews
Copy link

dkaszews commented Mar 7, 2025

Hi, I am the author of the original find, so let me add a bit of context and rationale.

The reason I consider this a bug in glob.translate is that it does not match the behavior of bash, nor of glob.glob which correctly ignore the path separator inside a character class, even if a single character and not a part of range:

$ mkdir a
$ touch a/c
$ shopt -s failglob
$ ls -1 a/c
a/c
$ ls -1 a[/]c
-bash: no match: a[/]c
$ ls -1 a[%-0]c
-bash: no match: a[%-0]c
$ python -c 'import glob; print(glob.glob("a/c"))' 
[ 'a/c' ]
$ python -c 'import glob; print(glob.glob("a[/]c"))'
[]
$ python -c 'import glob; print(glob.glob("a[%-0]c"))'
[]

From man fnmatch(3):

FNM_PATHNAME
              If this flag is set, match a slash in string only with a
              slash in pattern and not by an asterisk (*) or a question
              mark (?) metacharacter, nor by a bracket expression ([])
              containing a slash.

@dmitya26
Copy link

dmitya26 commented Mar 8, 2025

Hello. I can make a PR for this issue.

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

The relevant manpage paragraph for this issue is (emphasis mine)

Globbing is applied on each of the components of a pathname separately.
A '/' in a pathname cannot be matched by a '?' or '*' wildcard, or by a range
like "[.-0]"
. A range containing an
explicit '/' character is syntactically incorrect. (POSIX requires that
syntactically incorrect patterns are left unchanged.)

As for the second issue, please open a separate ticket.

I think the issue stems from the fact that glob.translate is based on fnmatch.translate for which separators have no special meaning. However I am on mobile and it's hard to debug this.

cc @barneygale

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

Now, before opening a PR, I want to be sure that the pattern translated by glob and the pattern used by glob.glob are equivalent in the sense that:

glob.glob(p) == [
  s in all_paths
  if re.match(glob.translate(p), s)
]

namely regex + glob.translate is the slow alternative of glob.glob.

@dkaszews
Copy link

dkaszews commented Mar 8, 2025

@picnixz Thank you for pulling up that manpage, globbing is documented in several of them, that one is in fact most explicitly relevant here.

As for the second issue, please open a separate ticket

Created #130985

pattern translated by glob and the pattern used by glob.glob

That would be my expectation as well. In fact, I found this issue by trying to use Python's glob.translate to generate test cases for my C++ code which does exactly that - implement globbing via regex translation.

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

Sorry for having misread the second issue and make you open a separate issue but I don't think there is an issue with [][!] (let's comtinue the discussion in the other issue)

@dmitya26
Copy link

dmitya26 commented Mar 8, 2025

Once I verify that glob.glob(p) and re.match(glob.translate(s), p) have the same output given the issue's testcase, would it be okay for me to proceed with the PR?

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

Probably, but I also want @barneygale's input on this beforehand as I think he's the one who added glob.translate() in 3.13 (but I may be wrong). I may be wrong in my assumptions as I don't remember exactly the specs here (though my gut feeling is telling me that there's an issue as we don't align with the glob manpage).

@barneygale
Copy link
Contributor

This is a legitimate bug in glob.translate() - it shouldn't be possible to match a path separator with a wildcard (including a [group]). Thanks for reporting - happy to review a fix.

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

@dmitya26 go ahead with a PR :')

@dmitya26
Copy link

dmitya26 commented Mar 8, 2025

I will, I'm adding the unit test for it and will open one after I finish with that in a few minutes.

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

No rush needed! I just wanted to confirm that your PR will be welcome!

@dmitya26
Copy link

dmitya26 commented Mar 8, 2025

Thanks!!

I'm currently working on the fix, but am done implementing the testcase, so can I just open the PR against my fork and then @ you when I'm done implementing the fix?

@picnixz
Copy link
Member

picnixz commented Mar 8, 2025

Yes, just create a draft PR until you're good for reviews and then, once everything is fine, mark it as ready for review and you can @ me and Barney (I am travelling so I might take a bit of time to reply) Don't forget to read the devguide and add a blurb entry (NEWS entry).

@dkaszews
Copy link

dkaszews commented Mar 9, 2025

How should a path separator literal be handled? I am not sure how to interpret "are left unchanged" in:

A range containing an
explicit '/' character is syntactically incorrect. (POSIX requires that
syntactically incorrect patterns are left unchanged.)

I see couple options:

  1. "Unchanged" means leave it as be, [xy/] stays [xy/]
  2. "Unchanged" means escape the entire class, [xy/] becomes \[xy/\]
  3. Remove the offending character, [xy/] becomes [xy], but [/] cannot become [] because Python regex does not allow empty classes
  4. Raise an error

In my test directory:

$ tree --charset=ascii
.
|-- a
|   `-- c
|-- a[
|   `-- ]c
`-- a[xy
    `-- ]c
3 directories, 3 files

glob.glob('a[/]c') and glob.glob('a[xy/]c') match the latter two, meaning option 2. is correct by the glob.glob == re.match(glob.translate) equivalence argument. Running same two patterns through bash ls confirms this.

@dmitya26
Copy link

dmitya26 commented Mar 9, 2025

That sounds like a good idea.
I was fiddling around with invalid patterns in glob.translate such as "foo[z-a]bar" and "foo[0-%]bar", and they seemed to do a similar thing, which was replace the invalid range with an empty negative lookahead making the pattern unmatchable.

I think I'm going to proceed with this approach if you don't think we need to consult anybody else about it.

Thanks! :)

@dkaszews
Copy link

dkaszews commented Mar 9, 2025

@dmitya26 I edited my comment because I was actually incorrect. Negative lookahead can be used for ranges including path separator (remember to check seps and not assume it equal /), but path separator literal has to be handled in a different way.

@dkaszews
Copy link

dkaszews commented Mar 9, 2025

Oh, I may have found an actual bug in glob.glob where it behaves differently to bash globs. Backwards ranges don't match anything, whereas in bash they cause the entire character class to be invalid and get escaped:

$ touch 'a[y-x]b'
$ ls -1 a[y-x]b
'a[y-x]b'
$ python -c 'import glob; print(glob.glob("a[y-x]b"))'
[]

In this case, glob.translate is consistent with glob.glob as it generates (?s:(?!))\Z, where the empty negative lookahead always fails. But the inconsistency with bash looks to me like a much bigger problem.

@picnixz @barneygale I cannot find any man page nor Google anything that specifies behavior of glob backwards ranges. Could you please quickly confirm this before I open a separate issue?

@picnixz
Copy link
Member

picnixz commented Mar 9, 2025

glob.translare() is based on fnmatch.translate() which itself removes empty ranges IIRC. Maybe we should escape them instead? I don't know, as this is the first time someone found that inconsistency I think =/

@picnixz
Copy link
Member

picnixz commented Mar 9, 2025

Could it also be something that is ls-specific and not just bash specific?

@dkaszews
Copy link

dkaszews commented Mar 9, 2025

ls does not handle globs, they are all expanded by shell. If I swap ls with touch, wc, rm or any other command, they will still match that file. This is an issue for anyone who tries to expand globs for themselves, expecting subprocess.run('ls', [ '--', pattern ]) and subprocess.run('ls', [ '--' ] + map(glob.escape, glob.glob(pattern))) to be equivalent.

@picnixz
Copy link
Member

picnixz commented Mar 9, 2025

I guess this indeed counts as a separate issue. I think we need to amend how fnmatch.translate() handles empty ranges but maybe we should just inline the call to the internal helper of fnmatch.translate() in glob.translate as it becomes more and more glob-specific (note that fnmatch(3) and Python's fnmatch differ as the former support special classes like [:digit:] while the latter doesn't).

What's the recommended plan of action @barneygale?

@dkaszews
Copy link

dkaszews commented Mar 9, 2025

Okay, another false alarm. Python is correct here, issue is with bash's stupid default handing of unmatched globs, passing them to commands as literals for them to match with exact filename.

$ touch 'a[y-x]'
$ touch a
$ ls -1 a[y-x]b  # No match, equivalent to `ls -1 'a[y-x]b'`
'a[y-x]b'
$ shopt -s failglob
$ ls -1 a[y-x]b
-bash: no match: a[y-x]b
$ shopt -u failglob
$ shopt -s nullglob
$ ls -1 a[y-x]b  # No match, equivalent to `ls -1`
a
'a[y-x]b' 

This behavior can be seen in a simpler case:

$ touch '[x]'
$ touch y
$ ls -1 [x]  # No match, equivalent to `ls -1 '[x]'`
'[x]'
$ shopt -s failglob
$ ls -1 [x]
-bash: no match: [x]
$ shopt -u failglob
$ shopt -s nullglob
$ ls -1 [x] # No match, equivalent to `ls -1`
'[x]'
y

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

6 participants