
Cryptic error when using AutoTokenizer with SentencePiece tokenizers without sentencepiece installed #36291

Open · yifanmai opened this issue on Feb 19, 2025 · 0 comments

System Info

  • transformers version: 4.49.0
  • Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.29.0
  • Safetensors version: 0.4.2
  • Accelerate version: not installed
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.1.2+cu121 (False)
  • Tensorflow version (GPU?): 2.11.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.11 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using distributed or parallel set-up in script?: no

Who can help?

tokenizers: @ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce:

  1. Uninstall sentencepiece (if installed)
  2. Run AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b") (see the snippet below)
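
A minimal reproduction sketch (assumes access to the gated meta-llama/Llama-2-7b repo; any SentencePiece-based checkpoint should behave the same way):

    # First: pip uninstall -y sentencepiece
    from transformers import AutoTokenizer

    # With transformers 4.49.0 and sentencepiece absent, this raises the
    # tiktoken-related ValueError shown under "Error message" below.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")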

Expected behavior: The call should fail with an error message telling the user to install sentencepiece. (Or should it succeed and return a slow tokenizer?)
Actual behavior: The error message incorrectly suggests that the problem is inside tiktoken.

Underlying problems:

  1. The code that knows whether sentencepiece is installed and needed for the current tokenizer is quite far from where the exception is thrown, so there is no easy way to surface this information to the user.
  2. The code always falls back to trying tiktoken, even when the tokenizer is not a tiktoken tokenizer, which leads to a confusing error message (see the sketch after this list).
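
A minimal sketch of the kind of guard that could make the failure explicit. Note that _check_sentencepiece_or_raise is a hypothetical helper, not existing transformers code, and keying on the .model extension is only an assumption about how such a guard might detect SentencePiece files; is_sentencepiece_available does exist in transformers.utils:

    from transformers.utils import is_sentencepiece_available

    def _check_sentencepiece_or_raise(vocab_file: str) -> None:
        # Hypothetical helper: if the vocab file looks like a SentencePiece
        # model and sentencepiece is missing, fail loudly instead of falling
        # through to the tiktoken converter.
        if vocab_file.endswith(".model") and not is_sentencepiece_available():
            raise ImportError(
                f"{vocab_file} appears to be a SentencePiece model, but the "
                "sentencepiece package is not installed. "
                "Run pip install sentencepiece."
            )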

Error message:

Traceback (most recent call last):
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1725, in convert_slow_tokenizer
    ).converted()
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1622, in converted
    tokenizer = self.tokenizer()
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1615, in tokenizer
    vocab_scores, merges = self.extract_vocab_merges_from_model(self.vocab_file)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1591, in extract_vocab_merges_from_model
    bpe_ranks = load_tiktoken_bpe(tiktoken_url)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/tiktoken/load.py", line 148, in load_tiktoken_bpe
    return {
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/tiktoken/load.py", line 150, in <dictcomp>
    for token, rank in (line.split() for line in contents.splitlines() if line)
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 963, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2052, in from_pretrained
    return cls._from_pretrained(
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2292, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 157, in __init__
    super().__init__(
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 139, in __init__
    fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1727, in convert_slow_tokenizer
    raise ValueError(
ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']

Expected behavior

The call should fail with an error message telling the user to install sentencepiece. (Or should it succeed and return a slow tokenizer?)
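
For concreteness, the kind of error being asked for might look like this (the wording is illustrative, not an existing transformers message):

    raise ImportError(
        "Loading this tokenizer requires the sentencepiece library, which is "
        "not installed in your environment. Run pip install sentencepiece "
        "and retry."
    )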
