System Info
transformers version: 4.49.0
Who can help?
tokenizers: @ArthurZucker and @itazap
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
To reproduce:
1. Uninstall sentencepiece (if installed).
2. Run AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b") (a minimal snippet follows below).
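A minimal reproduction snippet (assuming transformers 4.49.0 in an environment where sentencepiece has been uninstalled, e.g. with pip uninstall sentencepiece, and that you already have access to the gated meta-llama repo):

```python
# With sentencepiece absent, this raises the misleading
# "Converting from Tiktoken failed" ValueError shown in the traceback below.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
```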
Expected behavior: The call should fail with an error message telling the user to install sentencepiece. (Or should it succeed and return a slow tokenizer?)
Actual behavior: The error message incorrectly suggests that the problem is inside tiktoken.
Underlying problems:
1. The code that knows whether sentencepiece is installed, and whether the current tokenizer needs it, is quite far from where the exception is raised, so there is no easy way to surface that information to the user.
2. The code always falls back to trying tiktoken even when the tokenizer is not a tiktoken tokenizer, which leads to a confusing error message (see the sketch after this list for one possible shape of a fix).
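One possible shape for a fix, shown only as a hedged sketch (this is not the actual transformers conversion code; the function name and the needs_sentencepiece / has_tiktoken_vocab flags are illustrative assumptions, while is_sentencepiece_available is a real helper in transformers.utils): check for sentencepiece before attempting the tiktoken fallback, at a point where the tokenizer's requirements are still known.

```python
# Illustrative guard, not the real convert_slow_tokenizer code path.
from transformers.utils import is_sentencepiece_available


def guarded_slow_to_fast_conversion(name: str, needs_sentencepiece: bool, has_tiktoken_vocab: bool):
    """Raise actionable errors before any tiktoken fallback is attempted (hypothetical)."""
    if needs_sentencepiece and not is_sentencepiece_available():
        # Name the real missing dependency instead of letting a later
        # tiktoken conversion fail with an unrelated ValueError.
        raise ImportError(
            f"Building a fast tokenizer for {name!r} requires the `sentencepiece` "
            "package. Install it with `pip install sentencepiece`, or use a repo "
            "that already ships a tokenizer.json."
        )
    if not has_tiktoken_vocab:
        # Don't try (and then blame) tiktoken for a checkpoint that never
        # shipped a tiktoken vocabulary.
        raise ValueError(f"{name!r} has no tiktoken vocabulary; skipping the tiktoken fallback.")
    # ... otherwise proceed with the normal slow->fast conversion ...
```

A guard like this, placed where the tokenizer class (and therefore its sentencepiece dependency) is still known, would let the error message name the missing package instead of blaming tiktoken.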
Error message:
Traceback (most recent call last):
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1725, in convert_slow_tokenizer
).converted()
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1622, in converted
tokenizer = self.tokenizer()
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1615, in tokenizer
vocab_scores, merges = self.extract_vocab_merges_from_model(self.vocab_file)
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1591, in extract_vocab_merges_from_model
bpe_ranks = load_tiktoken_bpe(tiktoken_url)
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/tiktoken/load.py", line 148, in load_tiktoken_bpe
return {
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/tiktoken/load.py", line 150, in <dictcomp>
for token, rank in (line.split() for line in contents.splitlines() if line)
ValueError: not enough values to unpack (expected 2, got 1)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 963, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2052, in from_pretrained
return cls._from_pretrained(
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2292, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 157, in __init__
super().__init__(
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 139, in __init__
fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
File "/home/yifanmai/.pyenv/versions/crfm-helm/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1727, in convert_slow_tokenizer
raise ValueError(
ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']
Expected behavior
The call should fail with an error message telling the user to install sentencepiece. (Or should it succeed and return a slow tokenizer?)