-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
experimental: Add eager outputs, similar to endids but eagerly matched. #29
Conversation
When combining several unanchored regexes it becomes VERY expensive to handle combinations of matches via the end state -- essentially, the whole reachable DFA gets separate matching and non-matching copies for each pattern, leading to a DFA whose size is proportional to the number of *possible combinations* of matches. With eager outputs, we can set a flag for matching as we reach the end of the original pattern (before looping back and possibly also matching other patterns), which keeps the state count from blowing up in fsm_determinise. To see how much difference this makes, the test tests/eager_output/run7 combines 26 different patterns. It should finish very quickly (~50 msec, just now). Try running it with `env FORCE_ENDIDS=N` for N increasing from 4 to 26. Around 10-11 it will start taking several seconds, and memory usage will roughly double with each step. This PR adds `fsm_union_repeated_pattern_group`, a variant of `fsm_union_array` that combines a set of DFAs into a single NFA, but correctly handles a mix of anchored and unanchored ends without the state count blowing up. It currently needs flags passed in for each fsm indicating whether the start and/or end are anchored, and there is a hacky special case that removes self-edges from states with eager outputs and instead connects them to a single overall unanchored end loop. I haven't yet figured out how to handle this properly in the general case, but it works for this specific use case, provided all the DFAs are combined at once. (Combining multiple DFAs each produced by determinising fsm_union_repeated_pattern_group's result probably won't work correctly.) I have tried detecting and ignoring those edges in fsm_determinise, after epsilon removal, but so far either it still causes the graph size to blow up or subtly breaks something else. This is still experimental, and the code generation for `-lc` here is quite hacky -- it expects the caller to define a `FSM_SET_EAGER_OUTPUT` acro, since the code generation interface doesn't define where the match info will go yet. A later PR will add a new code generation mode with better support for eager outputs, and I plan to eventually integrate this better with rx, AMBIG_MULTIPLE, and so on. (This squashes down a couple false starts.)
fsm_eager_output_count and fsm_eager_output_get aligns better with the endid interface, and now fsm_eager_output_get ensures the buffer contents are sorted,
Do the same check with eager outputs as it does with endids -- any two states with distinct sets for either should end up in different equivalence classes.
/* If .fragment is set and the state has eager outputs, then emit a call to a | ||
* macro (the caller is expected to define). This is a temporary interface. */ | ||
for (size_t i = 0; i < cs->eager_outputs->count; i++) { | ||
fprintf(f, "\t\t\tFSM_SET_EAGER_OUTPUT(%u);\n", cs->eager_outputs->ids[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really ugly. I figured we should talk about what the code generation interface should look like. It'd be awkward to generate non-fragment function prototypes with or without extra argument(s) for the eager outputs, but I'm not sure always adding it is a good plan either -- it probably won't be used as widely as endids. There could be a flag in the print options for whether to include it, and require the flag to be set when eager outputs are used?
Also, this interface passes the buck on whether the IDs are dense or sparse. In DD's case it's a regex ID, so there's a bit array 0..N, but the API doesn't prevent someone from using non-continguous values like 32-bit hashes. Here, the macro can expand to something that includes other variables from the caller's scope, whether that's a bit vector or some other kind of collection.
struct fsm * | ||
fsm_union_repeated_pattern_group(size_t entry_count, | ||
struct fsm_union_entry *entries, struct fsm_combined_base_pair *bases); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we be able to use this from the <re/strings.h> interfaces?
include/re/re.h
Outdated
re_is_anchored(enum re_dialect dialect, re_getchar_fun *f, void *opaque, | ||
enum re_flags flags, struct re_err *err, | ||
struct re_anchoring_info *info); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self; I requested pulling this out to its own PR, straight to upstream
@@ -405,7 +428,7 @@ hash_iss(interned_state_set_id iss) | |||
} | |||
|
|||
static struct mapping * | |||
map_first(struct map *map, struct map_iter *iter) | |||
map_first(const struct map *map, struct map_iter *iter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always makes me happy to see a const
* as the combined pattern count increases; it's essentially what I'm | ||
* trying to avoid by adding eager output support in the first place. | ||
* | ||
* FIXME: instead of actively removing these, filter in fsm_determinise? */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 subject to syncing from upstream etc
Addressed merge conflict on upstreamed variant of re_is_anchored.
Fixed merge conflict with changed version of |
Use the first byte of input as a seed, when available.
When combining several unanchored regexes it becomes VERY expensive to handle combinations of matches via the end state -- essentially, the whole reachable DFA gets separate matching and non-matching copies for each pattern, leading to a DFA whose size is proportional to the number of possible combinations of matches. With eager outputs, we can set a flag for matching as we reach the end of the original pattern (before looping back and possibly also matching other patterns), which keeps the state count from blowing up in fsm_determinise.
To see how much difference this makes, the test
tests/eager_output/run7
combines 26 different patterns. It should finish very quickly (~50 msec, just now). Try running it withenv FORCE_ENDIDS=N
for N increasing from 4 to 26. Around 10-11 it will start taking several seconds, and memory usage will roughly double with each step.This PR adds
fsm_union_repeated_pattern_group
, a variant offsm_union_array
that combines a set of DFAs into a single NFA, but correctly handles a mix of anchored and unanchored ends without the state count blowing up. It currently needs flags passed in for each fsm indicating whether the start and/or end are anchored, and there is a hacky special case that removes self-edges from states with eager outputs and instead connects them to a single overall unanchored end loop. I haven't yet figured out how to handle this properly in the general case, but it works for this specific use case, provided all the DFAs are combined at once. (Combining multiple DFAs each produced by determinising fsm_union_repeated_pattern_group's result probably won't work correctly.) I have tried detecting and ignoring those edges in fsm_determinise, after epsilon removal, but so far either it still causes the graph size to blow up or subtly breaks something else.This is still experimental, and the code generation for
-lc
here is quite hacky -- it expects the caller to define aFSM_SET_EAGER_OUTPUT
acro, since the code generation interface doesn't define where the match info will go yet. A later PR will add a new code generation mode with better support for eager outputs, and I plan to eventually integrate this better with rx, AMBIG_MULTIPLE, and so on.(This squashes down a couple false starts. Sorry, it's a large PR, but there isn't a meaningful intermediate state here.)