Tokenizer Mismatch Problem #27

Open
Jacobsonradical opened this issue Oct 6, 2024 · 2 comments

Comments

@Jacobsonradical commented Oct 6, 2024

Hi, I have a question about the tokenizer mismatch.

When the reference model is fixed to "gpt-j-6B", several scoring models, such as "gpt-neox-20b" and "llama", do not share the same tokenizer. In this case, this line of code

    assert torch.all(tokenized.input_ids[:, 1:] == labels), "tokenizer_mismatch"

raises the following runtime error instead of an assertion error:

    RuntimeError: The size of tensor a (92) must match the size of tensor b (93) at non-singleton dimension 1
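
To illustrate (a minimal sketch assuming the Hugging Face transformers library, not code from this repo): the two tokenizers split the same text into different numbers of tokens, so the element-wise comparison fails on tensor shape before the assertion message is ever reached.

    # Minimal sketch, assuming Hugging Face `transformers` and the public
    # hub tokenizers for the two models in question.
    from transformers import AutoTokenizer

    text = "Fast-DetectGPT compares a sampling model against a scoring model."
    ref_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    score_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    ref_ids = ref_tok(text).input_ids
    score_ids = score_tok(text).input_ids
    # Different vocabularies -> different token counts for the same text,
    # hence tensors of different sizes and the RuntimeError above.
    print(len(ref_ids), len(score_ids))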

In your original experiments, what did you do to solve this problem? Also, why don't the "opt-" models use GPT2TokenizerFast? Isn't the fast-tokenizer option set to false by default? Thank you.

Edit:
From the .sh files, it seems that in the fast-detect-gpt experiments M1 and M2 are fixed to the best sampler and best scorer. However, without running experiments over all model combinations, how did you know which ones would be best? And is it an underlying assumption of fast-detect-gpt that the tokenizers of the two models must be the same? Many thanks.

@baoguangsheng (Owner) commented

The sampling and scoring models must share the same vocabulary; otherwise the sampled tokens may not have corresponding entries in the scoring distribution. This is guaranteed when the same model is used for both sampling and scoring.

However, when we use different models for the two roles, we need to check the vocabularies first. As we know, gpt-j-6B shares the same vocabulary with gpt-2 and gpt-neo-2.7B, so we only test combinations among these models (see Appendix E in the paper).
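
For instance, such a pre-check could look like the following (a hypothetical helper, not part of this repo), comparing two tokenizers' vocabularies via Hugging Face transformers before pairing a sampler with a scorer:

    # Hypothetical compatibility check, assuming Hugging Face `transformers`.
    from transformers import AutoTokenizer

    def same_vocab(name_a: str, name_b: str) -> bool:
        """Return True if the two models' tokenizers map tokens to ids identically."""
        vocab_a = AutoTokenizer.from_pretrained(name_a).get_vocab()
        vocab_b = AutoTokenizer.from_pretrained(name_b).get_vocab()
        return vocab_a == vocab_b

    # gpt2, gpt-neo-2.7B and gpt-j-6B all use the GPT-2 BPE vocabulary,
    # while gpt-neox-20b uses a different one:
    print(same_vocab("gpt2", "EleutherAI/gpt-neo-2.7B"))  # expected: True
    print(same_vocab("gpt2", "EleutherAI/gpt-neox-20b"))  # expected: False

Running a check like this up front turns the silent shape mismatch into an explicit, early failure.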

Hope it helps.

@Jacobsonradical (Author) commented

Makes sense. Thank you very much for the explanation.
