I am following https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
and trying to replicate its result using train_new_from_iterator:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('gpt2')
corpus = [
"This is the Hugging Face Course.",
"This chapter is about tokenization.",
"This section shows several tokenizer algorithms.",
"Hopefully, you will be able to understand how they are trained and generate tokens.",
]
tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=50)
But the result is totally different, and tokenizer.vocab_size is 257 rather than the requested 50. Why?
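One possible explanation, offered as an assumption and not confirmed in this thread: the gpt2 tokenizer is byte-level BPE, so the new tokenizer may start from a base alphabet of all 256 byte values and keep the one special token (`<|endoftext|>`). That would put a floor of 257 under the vocabulary and leave no room for merges when only 50 entries are requested. The sketch below is just that arithmetic; `vocab_size_floor` and `merge_budget` are hypothetical helpers, not part of transformers.

```python
# Arithmetic sketch only. These helpers are hypothetical, not a transformers API.
# Assumption: a byte-level BPE tokenizer begins with one symbol per byte value
# and keeps the special tokens of the tokenizer it was derived from.

BYTE_ALPHABET_SIZE = 2 ** 8          # 256 possible byte values
GPT2_SPECIAL_TOKENS = ["<|endoftext|>"]  # gpt2 defines a single special token


def vocab_size_floor(alphabet_size=BYTE_ALPHABET_SIZE,
                     num_special_tokens=len(GPT2_SPECIAL_TOKENS)):
    """Smallest vocabulary possible under the assumption above."""
    return alphabet_size + num_special_tokens


def merge_budget(requested_vocab_size):
    """How many learned merges still fit once the floor is accounted for."""
    return max(0, requested_vocab_size - vocab_size_floor())


print(vocab_size_floor())   # floor under the stated assumption
print(merge_budget(50))     # merges that fit when 50 entries are requested
print(merge_budget(300))    # merges that fit when 300 entries are requested
```

If this is the cause, requesting a vocab_size above 257 (for example 300) should leave room for learned merges, although on a corpus this small training may run out of pairs to merge before reaching the requested size.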