Wrapping C `tokenizers` in ergonomic, safe Swift.

bpkeene/swift-tokenizers

 
 


Swift HuggingFace Tokenizers

A Swift implementation of HuggingFace tokenizers using a Rust -> C -> Swift bridge.

⚠️ EXPERIMENTAL WARNING

This is an experimental implementation. For anything beyond experimentation, use the battle-tested tokenizer from swift-transformers.

Overview

In contrast to the Tokenizer in swift-transformers, this implementation uses the original Rust tokenizers library as its core. cbindgen generates C headers from the Rust code, and Swift imports those headers to call into the compiled library.

Rust (Core Tokenizer) -> C (Bridge) -> Swift (API)

Build the project

The build currently works only on ARM Macs. Building for other platforms requires manual adjustments, such as producing a .so instead of a .dylib.

  1. Create a parent directory and cd into it.
  2. Clone tokenizers-sys.
  3. Clone swift-tokenizers.
  4. cd tokenizers-sys
  5. Run ./compile-ex.sh.
  6. Check that ./target/release/libtokenizers_sys.dylib exists.
  7. cd ..
  8. cp ./tokenizers-sys/target/release/libtokenizers_sys.dylib ./swift-tokenizers/dependencies/libtokenizers_sys.dylib
  9. cd swift-tokenizers
  10. swift build
  11. swift test
  12. 😎
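Steps 4 through 11 can be sketched as a single script. This is a minimal sketch, not part of the repository: it assumes both repositories are already cloned side by side in the current directory (steps 1 to 3), and the function names and the `--run` flag are hypothetical conventions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch of build steps 4-11. Run from the parent directory that contains
# both tokenizers-sys and swift-tokenizers.
set -euo pipefail

LIB_NAME="libtokenizers_sys.dylib"

# Steps 4-7: build the Rust library and confirm the dylib was produced.
build_rust_library() {
    (cd tokenizers-sys && ./compile-ex.sh)
    if [ ! -f "tokenizers-sys/target/release/${LIB_NAME}" ]; then
        echo "error: ${LIB_NAME} was not produced" >&2
        return 1
    fi
}

# Step 8: copy the dylib into the Swift package's dependencies directory.
stage_library() {
    cp "tokenizers-sys/target/release/${LIB_NAME}" \
       "swift-tokenizers/dependencies/${LIB_NAME}"
}

# Steps 9-11: build and test the Swift package.
build_swift_package() {
    (cd swift-tokenizers && swift build && swift test)
}

# Hypothetical flag: the full pipeline runs only when --run is passed.
if [ "${1:-}" = "--run" ]; then
    build_rust_library
    stage_library
    build_swift_package
fi
```

Saved under a name of your choice in the parent directory, the script would be invoked with `--run`. Because of `set -e`, a failure in any step stops the script before `swift build` runs against a missing library.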

Usage

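func NLLBTokenizer() async throws {
    // Load the tokenizer for the NLLB model by name.
    let tokenizer = try Tokenizer.fromPretrained(name: "facebook/nllb-200-distilled-600M")
    // Encode the input text and print the resulting token IDs.
    let encoding = try tokenizer.encode("how much wood could a woodchuck chuck?")
    print(encoding.ids)
    // Decode the token IDs back into text.
    let decoded = try tokenizer.decode(encoding.ids)
    print(decoded)
}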

Packaging

Right now the package simply links a dylib compiled from tokenizers-sys. Packaging for all platforms is not yet solved and remains a separate step.
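One small piece of cross-platform packaging is that the shared library's file name differs by operating system. The sketch below maps the output of `uname -s` to the file name the build is expected to produce. The function name `shared_library_name` is hypothetical, and only macOS and Linux are covered, on the assumption that the library keeps the name `tokenizers_sys` on both.

```shell
#!/usr/bin/env bash
# Print the expected shared library file name for a given `uname -s` value.
shared_library_name() {
    case "$1" in
        Darwin) echo "libtokenizers_sys.dylib" ;;  # macOS
        Linux)  echo "libtokenizers_sys.so" ;;     # Linux
        *)
            echo "unsupported platform: $1" >&2
            return 1
            ;;
    esac
}

# Print the file name for the machine this script runs on.
shared_library_name "$(uname -s)"
```

A build script could use this name in place of the hard-coded `libtokenizers_sys.dylib` when checking for and copying the compiled library.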

TODO

  • Pass 100% of the swift-transformers Tokenizer tests
  • The C API cannot expose async functions, so we may want to use the Hub package and avoid calling fromPretrained from the Rust package.
  • Cross-platform packaging
  • Drop-in replacement for the swift-transformers tokenizer
  • Implement chat templates
