This is an implementation of the Tiktoken tokenizer, the byte pair encoding (BPE) tokenizer used by OpenAI's models. It's a partial Dart port of OpenAI's original tiktoken library, but with a much nicer API.
Although there are other tokenizers available on pub.dev, as of November 2024, none of them support the GPT-4o and o1 model families. This package was created to fill that gap.
The following models are supported:
- GPT-4
- GPT-4o
- GPT-4o-mini
- o1
- o1-mini
- o1-preview
Just as important, this is a Dart-only package: it does not require any platform channels to work, and tokenization is done synchronously.
Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:
- Whether the text is too long for a model to process.
- How much an OpenAI API call will cost (as usage is priced by token).
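As a rough illustration of both checks, here is a minimal sketch in plain Dart. The helper names `fitsInContext` and `estimateCostUsd`, the 128,000-token limit, and the price of 2.50 USD per million tokens are all assumptions made for this example. They are not part of this package, and they are not OpenAI's actual limits or prices. In real code the token count would come from this package's `count` method.

```dart
/// Returns true if [tokenCount] fits within [maxContextTokens].
/// The limit is supplied by the caller; look it up for your model.
bool fitsInContext(int tokenCount, int maxContextTokens) =>
    tokenCount <= maxContextTokens;

/// Estimates the cost in USD of [tokenCount] tokens at [usdPerMillionTokens].
/// The price is supplied by the caller; check the current pricing page.
double estimateCostUsd(int tokenCount, double usdPerMillionTokens) =>
    tokenCount / 1000000 * usdPerMillionTokens;

void main() {
  // In real code this would be the result of `tiktoken.count(text)`.
  const tokenCount = 1500;

  // Hypothetical values, for illustration only.
  const maxContextTokens = 128000;
  const usdPerMillionTokens = 2.50;

  print('Fits in context: ${fitsInContext(tokenCount, maxContextTokens)}');
  print('Estimated cost: ${estimateCostUsd(tokenCount, usdPerMillionTokens)} USD');
}
```

Keeping the limit and the price as parameters avoids hard-coding values that change from model to model and over time.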
To see it in action, run the example app. Basic usage looks like this:
// Create a Tiktoken instance for the model you want to use.
var tiktoken = Tiktoken(OpenAiModel.gpt_4);
// Encode a text string into tokens.
var encoded = tiktoken.encode("hello world");
// Decode the tokens back into text.
var decoded = tiktoken.decode(encoded);
// Count the number of tokens in a text string.
int numberOfTokens = tiktoken.count("hello world");
Alternatively, you can use the static helper functions getEncoder and getEncoderForModel to get a TiktokenEncoder instance first:
// Look up the encoder by encoding type...
var o200kEncoder = Tiktoken.getEncoder(TiktokenEncodingType.o200k_base);
// ...or by model.
var gpt4oEncoder = Tiktoken.getEncoderForModel(OpenAiModel.gpt_4o);
The TiktokenEncoder instance gives you more fine-grained control over the encoding process, as you now have access to more advanced methods:
Uint32List encode(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
});
Uint32List encodeOrdinary(String text);
(List<int>, Set<List<int>>) encodeWithUnstable(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
});
int encodeSingleToken(List<int> bytes);
Uint8List decodeBytes(List<int> tokens);
String decode(List<int> tokens, {bool allowMalformed = true});
Uint8List decodeSingleTokenBytes(int token);
List<Uint8List> decodeTokenBytes(List<int> tokens);
int? get eotToken;
I've added many tests to make sure this Dart implementation is correct, but you can also compare the output of this package with the output of the original implementation yourself, by visiting the online Tiktokenizer.
What's the relationship between words and tokens? Every language has a different word-to-token ratio. Here are a few general rules:
- For English: 1 word is about 1.3 tokens
- For Spanish and French: 1 word is about 2 tokens
- For punctuation marks, special characters, and emojis: each punctuation mark (like ,:;?!) counts as 1 token, special characters (like ∝√∅°¬) range from 1 to 3 tokens, and emojis (like 😁🙂🤩) range from 2 to 3 tokens.
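The ratios above can be turned into a quick estimate when only a word count is available. The sketch below is plain Dart. The `tokensPerWord` map, its language-code keys, and the `estimateTokens` helper are hypothetical names introduced for this example, not part of this package. The result is only a ballpark figure; use the tokenizer itself when you need an exact count.

```dart
/// Approximate tokens-per-word ratios, taken from the general rules above.
/// The language codes used as keys are an assumption of this example.
const Map<String, double> tokensPerWord = {
  'en': 1.3,
  'es': 2.0,
  'fr': 2.0,
};

/// Estimates the number of tokens for [wordCount] words in [languageCode].
/// Throws an [ArgumentError] if no ratio is known for the language.
int estimateTokens(int wordCount, String languageCode) {
  final ratio = tokensPerWord[languageCode];
  if (ratio == null) {
    throw ArgumentError.value(languageCode, 'languageCode', 'No ratio known');
  }
  return (wordCount * ratio).round();
}

void main() {
  print(estimateTokens(100, 'en')); // Prints 130
  print(estimateTokens(100, 'es')); // Prints 200
}
```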
In this package I provide a word counter. Here is how you can use it:
var wordCounter = WordCounter();
// Prints 0
print(wordCounter.count(''));
// Prints 1
print(wordCounter.count('hello'));
// Prints 2
print(wordCounter.count('hello world!'));
Counting words is complex because each language has its own rules for what constitutes a word. For this reason, the provided word counter is only an approximation and will give reasonable results only for languages written in the Latin alphabet.
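To see why word counting is only approximate, consider the naive counter sketched below. This is not the implementation of this package's WordCounter; `naiveWordCount` is a hypothetical stand-in written for this example. It counts whitespace-separated chunks that contain at least one letter or digit, which works for space-delimited text but treats a whole sentence written without spaces as a single word.

```dart
/// Counts whitespace-separated chunks containing at least one letter or digit.
/// A deliberately simple stand-in, not the package's WordCounter.
int naiveWordCount(String text) {
  // Matches any Unicode letter or number.
  final wordLike = RegExp(r'[\p{L}\p{N}]', unicode: true);
  return text
      .split(RegExp(r'\s+'))
      .where((chunk) => wordLike.hasMatch(chunk))
      .length;
}

void main() {
  print(naiveWordCount('')); // Prints 0
  print(naiveWordCount('hello world!')); // Prints 2
  // Chinese is written without spaces, so the whole phrase counts as one word.
  print(naiveWordCount('你好世界')); // Prints 1
}
```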
This package's code was mostly adapted from https://pub.dev/packages/langchain_tiktoken, from publisher dragonx.cloud. I've added more encodings, added tests, and made the API more user-friendly.
By Marcelo Glasberg
glasberg.dev
github.com/marcglasberg
linkedin.com/in/marcglasberg/
twitter.com/glasbergmarcelo
stackoverflow.com/users/3411681/marcg
medium.com/@marcglasberg
My article in the official Flutter documentation:
The Flutter packages I've authored:
- async_redux
- provider_for_redux
- i18n_extension
- align_positioned
- network_to_file_image
- image_pixels
- matrix4_transform
- back_button_interceptor
- indexed_list_view
- animated_size_and_fade
- assorted_layout_widgets
- weak_map
- themed
- bdd_framework
- tiktoken_tokenizer_gpt4o_o1
My Medium Articles: