Figure 4.17 Explanation (4.7 Generating text) #308
Hi @rasbt, could you please explain why we choose the last vector in the output matrix from the model (the box labeled "GPT") in Figure 4.17 as the vector "which corresponds to the next token that the GPT model is supposed to generate"? Thank you.
Section 5.1.2 probably answers my question in more detail, thanks a lot for it!
These are good points, and it sounds like there are two related questions. Let's talk about inference first ("generate"), which you mentioned at the top of this thread. Here, we take only the last token's output vector, because we already have all the other input tokens from the provided prompt. E.g., consider the input
"Sunday is my favorite day of the week, because"
In this case, it would be wasteful (and error prone) to have the model regenerate the input shifted by +1 token as we do during training. Instead, we are only interested in the token that comes after "because".
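To make that concrete, here is a minimal sketch of the selection step. The shapes and the random logits are illustrative stand-ins (not the book's actual model code): a GPT forward pass returns one logits vector per input position, and for generation we index only the last position, whose logits score the token that follows "because".

```python
import numpy as np

# Illustrative shapes: 1 prompt of 9 tokens, GPT-2-sized vocabulary.
# The logits here are random stand-ins for a real model's output.
batch_size, seq_len, vocab_size = 1, 9, 50257

rng = np.random.default_rng(123)
logits = rng.standard_normal((batch_size, seq_len, vocab_size))

# The model emits one vector per input position; positions 0..seq_len-2
# just predict tokens we already have. Only the LAST position predicts
# the genuinely new token, so we slice it out:
next_token_logits = logits[:, -1, :]   # shape: (batch_size, vocab_size)

# Greedy decoding: pick the highest-scoring token id.
next_token_id = int(next_token_logits.argmax(axis=-1)[0])
print(next_token_logits.shape, next_token_id)
```

During training, by contrast, all `seq_len` output positions are used, since each one is compared against its shifted-by-one target token.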
Then, you mentioned
For example, as I understand, for the first token in the training sample we have corresponding target (next token), b…