Merge subpattern references #18
Be sure to keep track of named subpattern references as well as the highest numbered subpattern reference encountered.
…ERT. Also, keep track of which registers have been referenced by number.
…patterns referenced in the regex.
… register closure. This required several things that may not have been necessary and will have to be revisited. First of all, for every register, we now create two inner matchers: one that matches the contents of the register and what follows the register, and one that only matches the contents of the register. Also, we now stop accumulating into STARTS-WITH once we encounter a register or subpattern reference. With this patch, subpattern references seem to work for the most part. They do not yet work with repetitions.
At this point, one thing that doesn't work quite right is the determination of register offsets for registers accessed indirectly by subpattern references. For example: (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))") says that the second register is at position (3, 6), though it should be (1, 6). Fixing this will require binding a special variable from subpattern reference closures that tells register closures not to touch the register offsets.
…ctly from a subpattern reference. With this patch, the following invocation: (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))") gives the correct offset values for the second register as (1,6). One problem that remains is the danger of infinite recursion during backtracking. The following invocation: (cl-ppcre:scan "(?1)(?2)(a|b|(?1))(c)" "acba") causes a stack overflow because the second (?1) is called endlessly during backtracking without the match position advancing through the string. Such behavior might be remedied by having the subpattern reference's closure keep track of where in *STRING* it has already been called.
This is going to be reverted immediately, since apparently Perl isn't smart enough to do this and will itself overflow the stack.
This reverts commit 69f0d7c.
1634 and 1635 currently don't work.
…trings in patterns containing subpattern references.
Currently, the following tests fail: 1638, 1639, 1641, 1642, 1643, 1644, 1645, 1646.
…th subpattern references.
…specific functions.
This went undetected for so long because of a bug in SBCL (and ECL, apparently). The way it was written, it shouldn't have worked, but it did--except on CLISP, which is how the bug was caught.
Tested on SBCL, ECL, and CLISP. The documentation says that subpattern refs were added in version 2.1.0, so you might want to change that if you don't bump the version number like that. I wrote the pull request as plain text (ignoring GitHub's "markdown") so it could be used as the merge's commit message.
;; only push the register states for this register and registers
;; local to it
(loop for idx from num upto (+ num subregister-count) do
  (let ()
Remove this gratuitous let.
My review does not constitute a willingness to merge the change, which is up to Edi to decide.
Suggested changes have been made.
This should be a SPECIAL declaration, not a type declaration.
This PR is beyond my ability to review. It's also quite old. Is anyone still interested?
Yeah, that looks tricky. Don't worry about leaving old PRs; maybe someday it'll be useful. A closed PR will never find the light.
Subpattern references enable the matching of self-similar strings by
way of recursion. Unlike backreferences, which refer to the string
matched by a register, subpattern references refer to the pattern
contained within the register and cause the regex engine to recurse,
as though by an actual function call, to the referenced subpattern.
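
The "function call" analogy can be made concrete with a hand-written
matcher. The sketch below is not part of CL-PPCRE; MATCH-GROUP-1 is a
hypothetical name, and the function only models register 1 of the
pattern (\([^()]*((?1)\)|\))) used in the commit messages of this
pull request. Where the pattern says (?1), the function calls itself.

```lisp
;; Hand-written model of register 1 in  (\([^()]*((?1)\)|\)))
;; MATCH-GROUP-1 is a hypothetical name used for illustration only.
(defun match-group-1 (string start)
  "Return the end index of a match for register 1 at START, or NIL."
  (let ((len (length string)))
    ;; \(  : the register must begin with an opening parenthesis
    (when (and (< start len) (char= (char string start) #\())
      ;; [^()]*  : skip everything up to the next parenthesis
      (let ((pos (or (position-if (lambda (c) (member c '(#\( #\))))
                                  string :start (1+ start))
                     len)))
        (or
         ;; first alternative, (?1)\)  : recurse, then expect ")"
         (let ((inner-end (match-group-1 string pos)))
           (when (and inner-end
                      (< inner-end len)
                      (char= (char string inner-end) #\)))
             (1+ inner-end)))
         ;; second alternative, \)  : a plain closing parenthesis
         (when (and (< pos len) (char= (char string pos) #\)))
           (1+ pos)))))))
```

A backreference could not express this, since it would only re-match
the literal text a register had already captured. Here each recursive
call matches the pattern afresh at a new position.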
SYNTAX
A subpattern reference node has the form
where is a positive fixnum denoting a register number or a
string or symbol denoting a register name.
Using the Perl syntax, a subpattern reference looks like
(?N)
or
(?&NAME)
where N is a positive (decimal) integer and NAME is a register name.
API CHANGES
There are no API changes.
KNOWN ISSUES
Perl Incompatibilities
The semantics of subpattern references (or "sub calls") in Perl are
not well defined. In particular, as of version 5.19.9, the
interaction between subpattern references and backreferences is
inconsistent. This issue was recently raised on the p5p mailing
list, and the Perl devs seem to be seriously considering adopting
the semantics implemented here. See
https://rt.perl.org/Public/Bug/Display.html?id=121299 for details.
Embedded Modifiers
The interaction between subpattern references and embedded modifiers
(e.g. :CASE-INSENSITIVE-P) is undefined for now and will be
addressed in a future release.
AllegroCL Compatibility Mode
So far as I know, the AllegroCL compatibility mode (enabled by
adding :USE-ACL-REGEXP2-ENGINE to *FEATURES* before compiling) does
not support this feature.
Other Bugs
Several outstanding bugs are known to at least indirectly affect
subpattern references. Cf. #17 and #12, for example.
IMPLEMENTATION DETAILS
During the match phase, the subpattern reference closure calls the
register closure, passing it an extra argument: the match
continuation.
When the register closure sees that it has been called with an extra
argument, it knows that it has been entered via subpattern
reference. At this point, it saves the state of the local
registers' offsets and creates new dynamic "bindings" for them.
Then it calls the register's inner matcher, restoring the register
offsets state upon return therefrom. If the inner matcher has
succeeded, the subpattern reference's continuation is called.
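
The mechanism described above can be sketched as a small, standalone
model. This is not the CL-PPCRE source: every name below (LIT, ALT,
SUBPATTERN-REF, REGISTER, SCAN-NESTED and the special variables) is
hypothetical, the model hard-codes the pattern (\(((?1)\)|\))), and it
simplifies by saving all register offsets rather than only the local
ones and by not undoing offsets on ordinary backtracking.

```lisp
(defvar *input* "" "String being matched.")
(defvar *reg-starts* (vector nil nil) "Start offset per register.")
(defvar *reg-ends* (vector nil nil) "End offset per register.")
(defvar *registers* (make-hash-table) "Register number -> closure.")

(defun lit (ch next)
  "Match the character CH, then continue with NEXT."
  (lambda (pos)
    (and (< pos (length *input*))
         (char= (char *input* pos) ch)
         (funcall next (1+ pos)))))

(defun alt (a b)
  "Try matcher A; fall back to matcher B if A fails."
  (lambda (pos)
    (or (funcall a pos) (funcall b pos))))

(defun subpattern-ref (num next)
  "Call register NUM, passing NEXT as the extra continuation argument."
  (lambda (pos)
    ;; Looked up at match time, so a register may reference itself.
    (funcall (gethash num *registers*) pos next)))

(defun register (num build-body next)
  "BUILD-BODY maps a continuation to a matcher for the register body."
  (let* ((idx (1- num))
         ;; inner matcher 1: register contents plus what follows
         (with-rest (funcall build-body
                             (lambda (end)
                               (setf (aref *reg-ends* idx) end)
                               (funcall next end))))
         ;; inner matcher 2: register contents only
         (contents-only (funcall build-body #'identity)))
    (setf (gethash num *registers*)
          (lambda (pos &optional (continuation nil via-reference-p))
            (cond (via-reference-p
                   ;; Entered via subpattern reference: save offsets,
                   ;; match the contents alone, restore, then continue.
                   (let* ((starts (copy-seq *reg-starts*))
                          (ends (copy-seq *reg-ends*))
                          (end (funcall contents-only pos)))
                     (replace *reg-starts* starts)
                     (replace *reg-ends* ends)
                     (and end (funcall continuation end))))
                  (t
                   (setf (aref *reg-starts* idx) pos)
                   (funcall with-rest pos)))))))

(defun scan-nested (string)
  "Match the modelled pattern at position 0 of STRING.
Returns the match end, register starts and register ends, or NIL."
  (let ((*input* string)
        (*reg-starts* (vector nil nil))
        (*reg-ends* (vector nil nil))
        (*registers* (make-hash-table)))
    (let* ((group-2 (lambda (k)
                      (alt (subpattern-ref 1 (lit #\) k))
                           (lit #\) k))))
           (group-1 (lambda (k)
                      (lit #\( (register 2 group-2 k))))
           (end (funcall (register 1 group-1 #'identity) 0)))
      (when end
        (values end (copy-seq *reg-starts*) (copy-seq *reg-ends*))))))
```

In this model, (scan-nested "((()))") leaves register 2 at (1, 6).
Without the save and restore step, the recursive entries into
register 1 would overwrite register 2's start offset, which is the
(3, 6) symptom described earlier in this pull request.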
The presence of one or more subpattern references precludes certain
optimizations. However, the performance for existing code (i.e.,
for regular expressions not containing subpattern references) should
be unaffected hereby.
OTHER CHANGES
The testing code has been overhauled. Of note: