Why does instruction finetuning not freeze the parameters? #528
-
Hi, I have a simple question. I am not sure why the instruction finetuning here does not freeze the base model parameters.
Replies: 4 comments · 2 replies
-
Hi there, this is commonly done due to modeling performance reasons. For classification finetuning, which is a simpler task, you don't need to update the previous layers as you assume that the previous layers are already good for extracting general information from text. For instruction finetuning, you are changing the behavior of the LLM, hence you update more layers (and often all layers). Appendix E discusses LoRA, which is a variant where you actually keep main model parameters frozen during instruction finetuning.
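To make the difference concrete, here is a minimal PyTorch sketch of the two setups. The model and its dimensions are hypothetical stand-ins, not the book's actual `GPTModel`:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained model (not the book's GPTModel).
model = nn.Sequential(
    nn.Embedding(50257, 768),  # token embedding
    nn.Linear(768, 768),       # stand-in for the transformer blocks
    nn.Linear(768, 50257),     # output head
)

# Classification finetuning: freeze everything and train only a new, small
# output head; the earlier layers are assumed to already extract
# general-purpose features from text.
for param in model.parameters():
    param.requires_grad = False
model[-1] = nn.Linear(768, 2)  # new 2-class head; requires_grad=True by default

# Instruction finetuning: update all layers, since the goal is to change
# the model's overall behavior rather than to add a small task head.
for param in model.parameters():
    param.requires_grad = True
```

And here is a sketch of the LoRA idea in the spirit of Appendix E, where the base weights stay frozen and only two small low-rank matrices per layer are trained. The class names and the rank/alpha defaults below are illustrative, not necessarily the appendix's exact code:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Trainable low-rank update: alpha * (x @ A @ B)."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))  # zero init: no change at start
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    """Wraps a frozen nn.Linear and adds the trainable LoRA update to its output."""
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear  # keep frozen during training
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)
```

With this setup, only the `A` and `B` matrices receive gradients, so you get the behavior change of instruction finetuning while the main model parameters stay frozen, which is why LoRA is the exception mentioned above.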
-
Thank you for your response.
-
Thank you very much for your help. I also have another question; could you please help me?
-
Thank you for your answer; it has been very helpful to me.