Tokenizer Mismatch Problem #27

Open
Jacobsonradical opened this issue Oct 6, 2024 · 2 comments

Comments

@Jacobsonradical commented Oct 6, 2024

Hi, I have a question about the tokenizer mismatch.

When the reference model is fixed to "gpt-j-6B", several scoring models, such as "gpt-neox-20b" and "llama", do not share the same tokenizer. In this case, this line of code

    assert torch.all(tokenized.input_ids[:, 1:] == labels), "tokenizer_mismatch"

raises the following runtime error instead of an assertion error:

    RuntimeError: The size of tensor a (92) must match the size of tensor b (93) at non-singleton dimension 1
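
To illustrate (a minimal sketch assuming the Hugging Face transformers library, not code from this repo): the two tokenizers split the same text into different numbers of tokens, so the element-wise comparison fails on tensor shape before the assertion message is ever reached.

    # Minimal sketch, assuming Hugging Face `transformers` and the public
    # hub tokenizers for the two models in question.
    from transformers import AutoTokenizer

    text = "Fast-DetectGPT compares a sampling model against a scoring model."
    ref_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    score_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    ref_ids = ref_tok(text).input_ids
    score_ids = score_tok(text).input_ids
    # Different vocabularies -> different token counts for the same text,
    # hence tensors of different sizes and the RuntimeError above.
    print(len(ref_ids), len(score_ids))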

In your original experiments, what did you do to solve this problem? Also, why don't the "opt-" models use GPT2TokenizerFast? Isn't the fast-tokenizer option set to false by default? Thank you.

Edit:
From the .sh files, it seems that in the fast-detect-gpt experiments M1 and M2 are fixed to the best sampler and best scorer. However, without running experiments over all model combinations, how did you know which ones would be best? And is it an underlying assumption of fast-detect-gpt that the tokenizers of the two models must be the same? Many thanks.

@baoguangsheng (Owner) commented

The sampling and scoring models must share the same vocabulary; otherwise the sampled tokens may not have corresponding entries in the scoring distribution. This is guaranteed when the same model is used for both sampling and scoring.

However, when we use different models for the two roles, we need to check the vocabularies first. As we know, gpt-j-6B shares the same vocabulary with gpt-2 and gpt-neo-2.7B, so we only test combinations among these models (see Appendix E in the paper).
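
For instance, such a pre-check could look like the following (a hypothetical helper, not part of this repo), comparing two tokenizers' vocabularies via Hugging Face transformers before pairing a sampler with a scorer:

    # Hypothetical compatibility check, assuming Hugging Face `transformers`.
    from transformers import AutoTokenizer

    def same_vocab(name_a: str, name_b: str) -> bool:
        """Return True if the two models' tokenizers map tokens to ids identically."""
        vocab_a = AutoTokenizer.from_pretrained(name_a).get_vocab()
        vocab_b = AutoTokenizer.from_pretrained(name_b).get_vocab()
        return vocab_a == vocab_b

    # gpt2, gpt-neo-2.7B and gpt-j-6B all use the GPT-2 BPE vocabulary,
    # while gpt-neox-20b uses a different one:
    print(same_vocab("gpt2", "EleutherAI/gpt-neo-2.7B"))  # expected: True
    print(same_vocab("gpt2", "EleutherAI/gpt-neox-20b"))  # expected: False

Running a check like this up front turns the silent shape mismatch into an explicit, early failure.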

Hope it helps.

@Jacobsonradical (Author) commented

Makes sense. Thank you very much for the explanation.
