A Swift implementation of HuggingFace tokenizers using a Rust -> C -> Swift bridge.
This is an experimental implementation. For production use, prefer the battle-tested tokenizer from swift-transformers.
In contrast to the Tokenizer in swift-transformers, this implementation uses the original Rust Tokenizers as its core. We then use cbindgen to generate C headers from the Rust code, which can then be imported into Swift.
Rust (Core Tokenizer) -> C (Bridge) -> Swift (API)
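To make the Swift end of that chain more concrete, here is a minimal sketch of how a Swift class might own an opaque handle handed out by a C API and free it when the object is released. Everything prefixed with demo_ is a hypothetical stand-in written in Swift so the sketch runs on its own; the real symbols in the cbindgen-generated header may have different names and signatures.

```swift
// Stand-in for the Rust-side tokenizer object. In the real bridge this
// lives behind the C ABI and Swift never sees its layout.
final class NativeTokenizerStandIn {
    let name: String
    init(name: String) { self.name = name }
}

// Hypothetical C-style entry points (assumptions, not the real API).
// Creates a native object and returns it as an opaque handle.
func demo_tokenizer_new(_ name: UnsafePointer<CChar>) -> OpaquePointer? {
    let native = NativeTokenizerStandIn(name: String(cString: name))
    return OpaquePointer(Unmanaged.passRetained(native).toOpaque())
}

// Reads a value through the handle without taking ownership.
func demo_tokenizer_name_length(_ handle: OpaquePointer) -> Int {
    let native = Unmanaged<NativeTokenizerStandIn>
        .fromOpaque(UnsafeRawPointer(handle))
        .takeUnretainedValue()
    return native.name.utf8.count
}

// Releases the native object; the handle must not be used afterwards.
func demo_tokenizer_free(_ handle: OpaquePointer) {
    Unmanaged<NativeTokenizerStandIn>
        .fromOpaque(UnsafeRawPointer(handle))
        .release()
}

struct BridgeError: Error {}

// Swift-facing wrapper: owns the opaque handle and frees it in deinit,
// so callers never manage native memory by hand.
final class TokenizerHandle {
    private let handle: OpaquePointer

    init(name: String) throws {
        // A Swift String is passed as a temporary C string for the call.
        guard let handle = demo_tokenizer_new(name) else { throw BridgeError() }
        self.handle = handle
    }

    var nameLength: Int { demo_tokenizer_name_length(handle) }

    deinit { demo_tokenizer_free(handle) }
}

let tokenizerHandle = try TokenizerHandle(name: "demo")
print(tokenizerHandle.nameLength)
```

Keeping the pointer private and freeing it in deinit is one common way to confine unsafe code to a single type; whether this package does the same is not stated here.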
Currently this only works on ARM Macs. Building for other platforms takes some manual adaptation (for example, producing a .so instead of a .dylib).
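As a rough illustration of the dylib -> so difference, the sketch below maps a platform to the file name a shared library conventionally has there. The Platform enum and the sharedLibraryFileName helper are hypothetical and not part of this package; the names follow the usual conventions and should be checked against what the build actually produces.

```swift
// Hypothetical helper, for illustration only.
enum Platform {
    case macOS
    case linux
    case windows
}

// Returns the conventional shared-library file name for a library whose
// base name is `base` (assumed here to be "tokenizers_sys").
func sharedLibraryFileName(for platform: Platform,
                           base: String = "tokenizers_sys") -> String {
    switch platform {
    case .macOS:
        return "lib\(base).dylib"
    case .linux:
        return "lib\(base).so"
    case .windows:
        return "\(base).dll"
    }
}

print(sharedLibraryFileName(for: .macOS))
print(sharedLibraryFileName(for: .linux))
```

On a non-macOS platform, the copy step in the build instructions would then target the corresponding file name instead of the .dylib.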
- Create a parent directory and cd into it.
- Clone tokenizers-sys.
- Clone swift-tokenizers.
- cd tokenizers-sys
- Run ./compile-ex.sh.
- Check that ./target/release/libtokenizers_sys.dylib exists.
- cd ..
- cp ./tokenizers-sys/target/release/libtokenizers_sys.dylib ./swift-tokenizers/dependencies/libtokenizers_sys.dylib
- cd swift-tokenizers
- swift build
- swift test
- 😎
func NLLBTokenizer() async throws {
    // Load the tokenizer by model name.
    let tokenizer = try Tokenizer.fromPretrained(name: "facebook/nllb-200-distilled-600M")

    // Encode text into token ids.
    let encoding = try tokenizer.encode("how much wood could a woodchuck chuck?")
    print(encoding.ids)

    // Decode the ids back into text.
    let decoded = try tokenizer.decode(encoding.ids)
    print(decoded)
}
Right now this just links a dylib compiled from tokenizers-sys, so packaging for all platforms still needs to be worked out.
- Pass 100% of swift-transformers Tokenizer tests
- C API won't expose async, so we may want to use the Hub package and avoid using fromPretrained from the Rust package.
- Cross-platform packaging
- Drop-in replacement for swift-transformers tokenizer
- Implement Chat Templates