
Add Byte Pair Encoding (BPE) class for subword tokenization #3056

Open
Cydral wants to merge 4 commits into master
Conversation

Cydral (Contributor) commented Feb 15, 2025

Description:

This PR introduces a new bpe_tokenizer class to Dlib, implementing the Byte Pair Encoding (BPE) algorithm for subword tokenization. BPE is a widely used technique in natural language processing (NLP) for handling out-of-vocabulary words and reducing vocabulary size while preserving the ability to represent arbitrary text.

Key Features:

  • BPE Algorithm: Implements the BPE algorithm as described in Sennrich et al., 2016.
  • Special Tokens: Supports predefined special tokens (e.g., <text>, <url>, <image>) for marking specific elements in the text.
  • Training and Encoding: Provides methods for training the tokenizer on a text corpus and encoding/decoding text into subword tokens.
  • Serialization: Supports saving and loading the tokenizer model and vocabulary for reuse.
  • Multi-Threaded Training: Utilizes multi-threading for efficient frequency-statistics computation during training.

Usage:

dlib::bpe_tokenizer tokenizer;
tokenizer.train(corpus_text, target_vocab_size, true); // Train on a text corpus
std::vector<int> tokens = tokenizer.encode("Sample text to tokenize."); // Encode text
std::string decoded_text = tokenizer.decode(tokens); // Decode tokens back to text

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests
@Cydral Cydral changed the title Add Byte Pair Encoding Class for Subword Tokenization Add Byte Pair Encoding (BPE) class for subword tokenization Feb 15, 2025