
1) Optimize regular expressions used for splitting by ~20% #234

Merged
merged 3 commits into openai:main from paplorinc/optimize-regex
Feb 9, 2024

Conversation


@l0rinc l0rinc commented Dec 31, 2023

By combining the contractions into a single non-capturing group prefixed by `'`, we can speed up matches by roughly 20%.

By using possessive quantifiers in the word and punctuation groups of the cl100k_base pattern, we avoid some backtracking.

The last whitespace groups can also be simplified so that a single newline is matched explicitly, since the preceding `\s*` already matches any others.

Overall the regex matches exactly the same sequences of characters as before, for any casing and for Unicode input.
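
For a quick sanity check of that claim, the old and new cl100k_base patterns can be run side by side with Python's third-party `regex` module (which, unlike the stdlib `re`, supports `\p{...}` classes and possessive quantifiers). This is only an illustrative sketch with hand-picked inputs, not a corpus-level verification, and the `regex` engine is not the `fancy-regex` engine tiktoken actually uses:

```python
# Spot-check: the old and new cl100k_base split patterns should produce
# identical token pieces. Requires `pip install regex`.
import regex

OLD_PAT = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
NEW_PAT = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

samples = [
    "I'd've said they'RE right, isn't it?",  # contractions, mixed case
    "árvíztűrő tükörfúrógép 123456",         # non-ASCII letters, digit runs
    "foo   \n\n\r\n  bar...  ",              # whitespace and newline handling
]

for text in samples:
    old, new = regex.findall(OLD_PAT, text), regex.findall(NEW_PAT, text)
    assert old == new, (text, old, new)
print("old and new patterns split all samples identically")
```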

This is the first part of the optimizations I did for jtokkit, reducing the tokenization time from ~10.5 seconds to ~1.6 seconds in several big steps.
If this change is accepted I'll continue migrating the changes I've made.

I've modified benchmark.py locally to measure the improvement:

```python
import os
import time

import tiktoken


def benchmark_batch(documents: list[str]) -> None:
    num_threads = int(os.environ.get("RAYON_NUM_THREADS", "1"))
    num_bytes = sum(map(len, map(str.encode, documents)))
    print(f"num_threads: {num_threads}, num_bytes: {num_bytes}")

    enc = tiktoken.get_encoding("cl100k_base")
    enc.encode("warmup")  # exclude one-time initialisation from the timings

    for _ in range(5):
        start = time.perf_counter_ns()
        enc.encode_ordinary_batch(documents, num_threads=num_threads)
        end = time.perf_counter_ns()
        bytes_per_second = num_bytes / (end - start) * 1e9  # ns -> s
        print(f"tiktoken \t{bytes_per_second:,.0f} bytes / s")
```

Here the speedup is as follows:

Before:

```
num_threads: 1, num_bytes: 98359164
tiktoken 	8,040,959 bytes / s
tiktoken 	8,047,612 bytes / s
tiktoken 	8,059,961 bytes / s
tiktoken 	8,097,749 bytes / s
tiktoken 	8,125,161 bytes / s
```

After regex optimization:

```
num_threads: 1, num_bytes: 98359164
tiktoken 	9,861,159 bytes / s
tiktoken 	9,888,486 bytes / s
tiktoken 	9,918,514 bytes / s
tiktoken 	9,902,705 bytes / s
tiktoken 	9,917,494 bytes / s
```

The other 50k tokenizers are also sped up slightly, not just cl100k_base.

@l0rinc changed the title from "Optimize regular expressions used for splitting by ~20%" to "1) Optimize regular expressions used for splitting by ~20%" on Jan 6, 2024
```diff
@@ -73,7 +73,7 @@ def cl100k_base():
     }
     return {
         "name": "cl100k_base",
-        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
+        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
```
@l0rinc l0rinc (Contributor, Author) commented on the changed line:

Besides the contraction grouping that was also applied to the other regexes, here I've additionally optimized with possessive quantifiers to avoid backtracking.
The changes were guided by JMH benchmarks for the same regex: https://github.com/knuddelsgmbh/jtokkit/pull/75/files#r1434984132
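
As a tiny illustration of what a possessive quantifier does (shown with Python's third-party `regex` module purely for demonstration; tiktoken itself uses the Rust `fancy-regex` crate):

```python
# A possessive quantifier matches greedily and never gives characters back,
# so the engine cannot backtrack into it.
import regex

text = "12345"

# Greedy: \d+ first grabs "12345", then backtracks one digit so the final
# \d can match, giving the overall match "12345".
print(regex.findall(r"\d+\d", text))   # ['12345']

# Possessive: \d++ grabs "12345" and refuses to give anything back, so the
# final \d has nothing left to match and the pattern fails entirely.
print(regex.findall(r"\d++\d", text))  # []
```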

@hauntsaninja hauntsaninja (Collaborator) left a comment:

This is great, thank you! I reproduced the benchmarks. In some configurations / datasets, I actually see much more than a 20% win. I also tested that the possessive quantifier change preserves behaviour on a large and varied corpus, just in case I was missing something.

I'll get to your next PR soon. I appreciate this change and your patience and wanted to find a way to say thank you — please check your email :-)

@hauntsaninja hauntsaninja merged commit 6cc3a46 into openai:main Feb 9, 2024
31 of 42 checks passed
@l0rinc l0rinc deleted the paplorinc/optimize-regex branch February 9, 2024 09:21
hauntsaninja pushed a commit that referenced this pull request Oct 3, 2024
Fixes the crash in #245 by
prohibiting the regex engine from backtracking catastrophically via
[possessive
quantifiers](https://www.regular-expressions.info/possessive.html).


Interestingly, these possessive quantifiers make the encoding a lot faster again in
`fancy-regex`.

Before this change (but with the large byte pair merge PR cherry-picked):
```
num_threads: 1, num_bytes: 98379553
tiktoken 	11,946,036 bytes / s
tiktoken 	11,961,343 bytes / s
tiktoken 	11,995,846 bytes / s
tiktoken 	11,951,263 bytes / s
tiktoken 	11,983,405 bytes / s
```
Same, with these changes applied:
```
num_threads: 1, num_bytes: 98379553
tiktoken 	14,511,827 bytes / s
tiktoken 	14,638,134 bytes / s
tiktoken 	14,644,029 bytes / s
tiktoken 	14,729,030 bytes / s
tiktoken 	14,666,903 bytes / s
```
Updating the regex libs makes it a tiny bit faster still:
```
num_threads: 1, num_bytes: 98379553
tiktoken 	14,485,590 bytes / s
tiktoken 	14,854,049 bytes / s
tiktoken 	14,891,086 bytes / s
tiktoken 	14,843,007 bytes / s
tiktoken 	14,874,520 bytes / s
```

This is almost 2x faster than [before any of the
optimizations](#234).

-------

Opened an issue for increasing the [default backtrack
limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407),
see: fancy-regex/fancy-regex#134, but it
shouldn't be necessary here anymore.

---------

Co-authored-by: Lőrinc <[email protected]>