Optimize regular expressions used for splitting by ~20% #234
Conversation
By combining the contractions into a single non-capturing group prefixed by `'`, we can speed up matches by roughly 20%. By using possessive quantifiers in the word and punctuation groups of `cl100k_base`, we avoid some backtracking. The last whitespace groups can also be simplified to match a single newline explicitly, since the preceding whitespace group already matches it. Overall the regex matches the exact same sequence of characters as before, for any case and for Unicode sequences.
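For illustration, here is a minimal sketch (not part of the PR) that checks the old and new `cl100k_base` patterns split a sample string identically. It uses the third-party `regex` module, which supports `\p{...}` classes and possessive quantifiers; the sample text is made up.

```python
# Sketch: verify the old and new cl100k_base patterns produce identical
# splits on a sample string (the sample is illustrative, not the PR's
# actual test corpus).
import regex

OLD = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
NEW = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

sample = "I'LL say they've 1234 things, won't they?!\r\n\n  done"
assert regex.findall(OLD, sample) == regex.findall(NEW, sample)
```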
```diff
@@ -73,7 +73,7 @@ def cl100k_base():
     }
     return {
         "name": "cl100k_base",
-        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
+        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
```
Besides grouping the contractions, as in the other regexes, here I've also optimized with possessive quantifiers to avoid backtracking.
The changes were guided by JMH benchmarks for the same regex: https://github.com/knuddelsgmbh/jtokkit/pull/75/files#r1434984132
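As a quick illustration of the semantics (a made-up example, not from the PR): a greedy quantifier hands characters back when the rest of the pattern fails to match, while a possessive one keeps everything it consumed and fails immediately.

```python
# Demonstrating greedy vs. possessive quantifiers with the third-party
# `regex` module (Python's stdlib `re` also supports ++ from 3.11 on).
import regex

assert regex.fullmatch(r"\d+4", "1234")            # \d+ backtracks to "123", then 4 matches
assert regex.fullmatch(r"\d++4", "1234") is None   # \d++ keeps "1234"; no backtracking, fail

# Safe in the tokenizer pattern: [^\s\p{L}\p{N}]++ can never steal characters
# that the following [\r\n]* needs, because \r and \n are whitespace and thus
# excluded from the possessive class.
```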
This is great, thank you! I reproduced the benchmarks. In some configurations / datasets, I actually see much more than a 20% win. I also tested that the possessive quantifier change preserves behaviour on a large and varied corpus, just in case I was missing something.
I'll get to your next PR soon. I appreciate this change and your patience and wanted to find a way to say thank you — please check your email :-)
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via [possessive quantifiers](https://www.regular-expressions.info/possessive.html).

<img width="400" alt="image" src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569">

Interestingly, these possessives make the encoding a lot faster again in `fancy-regex`.

Before this change (but with the large byte pair merge PR cherry-picked):

```
num_threads: 1, num_bytes: 98379553
tiktoken 	11,946,036 bytes / s
tiktoken 	11,961,343 bytes / s
tiktoken 	11,995,846 bytes / s
tiktoken 	11,951,263 bytes / s
tiktoken 	11,983,405 bytes / s
```

Same, with these changes applied:

```
num_threads: 1, num_bytes: 98379553
tiktoken 	14,511,827 bytes / s
tiktoken 	14,638,134 bytes / s
tiktoken 	14,644,029 bytes / s
tiktoken 	14,729,030 bytes / s
tiktoken 	14,666,903 bytes / s
```

Updating the regex libs makes it a tiny bit faster still:

```
num_threads: 1, num_bytes: 98379553
tiktoken 	14,485,590 bytes / s
tiktoken 	14,854,049 bytes / s
tiktoken 	14,891,086 bytes / s
tiktoken 	14,843,007 bytes / s
tiktoken 	14,874,520 bytes / s
```

This is almost 2x faster than [before any of the optimizations](#234).

-------

Opened an issue for increasing the [default backtrack limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407), see fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.

---------

Co-authored-by: Lőrinc <[email protected]>
This is the first part of the optimizations I did for jtokkit, reducing the tokenization time from ~10.5 seconds to ~1.6 seconds over several big steps.
If this change is accepted I'll continue migrating the changes I've made.
I've modified `benchmark.py` locally to measure the improvement; the speedup is as follows (see the reproduction sketch after the results):
Before:
After regex optimization:
The other 50k tokenizers are also sped up slightly, not just cl100k_base.
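For context, a hypothetical sketch of the kind of throughput measurement described above (the corpus file name and loop count are made up; this is not the actual `benchmark.py`):

```python
# Hypothetical throughput measurement for a tiktoken encoding; the corpus
# path is a placeholder.
import time

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("corpus.txt", encoding="utf-8") as f:
    data = f.read()
num_bytes = len(data.encode("utf-8"))

for _ in range(5):
    start = time.perf_counter()
    enc.encode(data)
    elapsed = time.perf_counter() - start
    print(f"tiktoken \t{num_bytes / elapsed:,.0f} bytes / s")
```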