When training a new tokenizer, why doesn't the vocab size setting work? #801

Open
fengyunflya opened this issue Feb 27, 2025 · 0 comments

@fengyunflya

Following https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt, I am trying to replicate the result using train_new_from_iterator:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('gpt2')

corpus = [
"This is the Hugging Face Course.",
"This chapter is about tokenization.",
"This section shows several tokenizer algorithms.",
"Hopefully, you will be able to understand how they are trained and generate tokens.",
]

tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=50)

But the result is completely different from the course, and tokenizer.vocab_size is 257 rather than 50. Why?
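For reference, here is a minimal sketch of what seems to be going on (assuming the transformers and tokenizers packages used in the course are installed). GPT-2 uses a byte-level BPE, so the trainer's initial alphabet already contains 256 byte symbols, and the inherited <|endoftext|> special token brings the floor to 257; a vocab_size below that cannot be honored, and the trainer simply learns no merges:

from tokenizers import pre_tokenizers
from transformers import AutoTokenizer

# GPT-2's byte-level pre-tokenizer maps every possible byte to one symbol;
# these symbols form the initial alphabet of any BPE trained on top of it.
print(len(pre_tokenizers.ByteLevel.alphabet()))  # 256

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
corpus = ["This is the Hugging Face Course."]

# vocab_size=50 is below the 256-symbol initial alphabet, so no merges are learned.
tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=50)

print(tokenizer.vocab_size)                 # expected 257: 256 byte symbols + <|endoftext|>
print(tokenizer.tokenize("tokenization"))   # expected to fall back to single-character pieces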
