
Cryptic error when using AutoTokenizer with SentencePiece tokenizers without sentencepiece installed #36291

Open · yifanmai opened this issue on Feb 19, 2025 · 0 comments

System Info

  • transformers version: 4.49.0
  • Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.29.0
  • Safetensors version: 0.4.2
  • Accelerate version: not installed
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.1.2+cu121 (False)
  • Tensorflow version (GPU?): 2.11.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.11 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using distributed or parallel set-up in script?: no

Who can help?

tokenizers: @ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce:

  1. Uninstall sentencepiece (if installed)
  2. Run AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b") (see the snippet below)
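
A minimal reproduction sketch (assumes access to the gated meta-llama/Llama-2-7b repo; any SentencePiece-based checkpoint should behave the same way):

    # First: pip uninstall -y sentencepiece
    from transformers import AutoTokenizer

    # With transformers 4.49.0 and sentencepiece absent, this raises the
    # tiktoken-related ValueError shown under "Error message" below.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")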

Expected behavior: The call should fail with an error message telling the user to install sentencepiece. (Or should it succeed and return a slow tokenizer?)
Actual behavior: The error message incorrectly suggests that the problem is inside tiktoken.

Underlying problems:

  1. The code that knows whether sentencepiece is installed and needed for the current tokenizer is quite far from where the exception is thrown, so there is no easy way to surface this information to the user.
  2. The code always falls back to trying tiktoken, even when the tokenizer is not a tiktoken tokenizer, which leads to a confusing error message (see the sketch after this list).
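
A minimal sketch of the kind of guard that could make the failure explicit. Note that _check_sentencepiece_or_raise is a hypothetical helper, not existing transformers code, and keying on the .model extension is only an assumption about how such a guard might detect SentencePiece files; is_sentencepiece_available does exist in transformers.utils:

    from transformers.utils import is_sentencepiece_available

    def _check_sentencepiece_or_raise(vocab_file: str) -> None:
        # Hypothetical helper: if the vocab file looks like a SentencePiece
        # model and sentencepiece is missing, fail loudly instead of falling
        # through to the tiktoken converter.
        if vocab_file.endswith(".model") and not is_sentencepiece_available():
            raise ImportError(
                f"{vocab_file} appears to be a SentencePiece model, but the "
                "sentencepiece package is not installed. "
                "Run pip install sentencepiece."
            )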

Error message:

Traceback (most recent call last):
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1725, in convert_slow_tokenizer
    ).converted()
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1622, in converted
    tokenizer = self.tokenizer()
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1615, in tokenizer
    vocab_scores, merges = self.extract_vocab_merges_from_model(self.vocab_file)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1591, in extract_vocab_merges_from_model
    bpe_ranks = load_tiktoken_bpe(tiktoken_url)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/tiktoken/load.py", line 148, in load_tiktoken_bpe
    return {
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/tiktoken/load.py", line 150, in <dictcomp>
    for token, rank in (line.split() for line in contents.splitlines() if line)
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 963, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2052, in from_pretrained
    return cls._from_pretrained(
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2292, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 157, in __init__
    super().__init__(
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 139, in __init__
    fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
  File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1727, in convert_slow_tokenizer
    raise ValueError(
ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']

Expected behavior

The call should fail with an error message telling the user to install sentencepiece. (Or should it succeed and return a slow tokenizer?)
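
For concreteness, the kind of error being asked for might look like this (the wording is illustrative, not an existing transformers message):

    raise ImportError(
        "Loading this tokenizer requires the sentencepiece library, which is "
        "not installed in your environment. Run pip install sentencepiece "
        "and retry."
    )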
