Trainable Tokens: Support for Weight Tying #2399
Draft
This is a follow-up PR to #2376 that adds support for weight tying. Do not merge before the other PR is merged.
## What is this
Some models, such as gpt2, tie the weights between the LM head and the input embeddings for various reasons. If we use the trainable tokens adapter, we change the result of the `forward()` of the input embeddings, but we do not change the weights themselves (unless we `merge()`). This means that the changes are not reflected in the tied weights, such as the LM head, leading to wrong results when training.

## How it is solved
The current approach searches for tied layers and puts `TrainableTokensLayer` adapters on them as well, but initialized to use the parameters from the embedding layer's `TrainableTokensLayer`. This is done via the `tied_adapter` argument of `TrainableTokensLayer.__init__()`.
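The approach above can be sketched in plain Python. Everything here (the `TrainableTokensSketch` class, its attributes, the list-based weights) is an illustrative stand-in for the actual PEFT implementation, not the real API; only the `tied_adapter` idea of sharing one set of parameters between two adapter instances is taken from the description above.

```python
# Minimal sketch of the weight-tying problem and the tied-adapter fix.
# All names here are illustrative stand-ins, not the actual PEFT code.

class TrainableTokensSketch:
    """Adapter that overrides the rows of selected token indices at forward time."""
    def __init__(self, token_indices, dim, tied_adapter=None):
        if tied_adapter is not None:
            # Reuse the parameters of another adapter instead of owning new
            # ones, mirroring the `tied_adapter` argument described above.
            self.token_indices = tied_adapter.token_indices
            self.deltas = tied_adapter.deltas
        else:
            self.token_indices = token_indices
            self.deltas = {i: [0.0] * dim for i in token_indices}

    def patched_rows(self, weight):
        # Return the weight matrix with the trainable rows swapped in.
        return [self.deltas.get(i, row) for i, row in enumerate(weight)]

# One weight matrix shared (tied) between input embeddings and LM head.
tied_weight = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

emb_adapter = TrainableTokensSketch(token_indices=[1], dim=2)
emb_adapter.deltas[1] = [9.0, 9.0]          # pretend training updated token 1

# Without the fix: only the embedding sees the adapter; the LM head reads
# the untouched tied weight and is now inconsistent with the embedding.
emb_view = emb_adapter.patched_rows(tied_weight)
head_view_stale = tied_weight

# With the fix: the LM head gets its own adapter, initialized via
# `tied_adapter` so it shares the very same parameters.
head_adapter = TrainableTokensSketch(token_indices=None, dim=2,
                                     tied_adapter=emb_adapter)
head_view_fixed = head_adapter.patched_rows(tied_weight)

print(emb_view[1])         # updated row
print(head_view_stale[1])  # stale row: the bug
print(head_view_fixed[1])  # same updated row: tied adapter stays in sync
```

Because the tied adapter holds references to the same parameter objects, any training update to the embedding's adapter is automatically visible on the LM-head side.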
## What needs to be done
- `TrainableTokens` adapter
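The "searching for tied layers" step mentioned above can be sketched by looking for modules whose weight is the *same object* as the input embedding's weight. The `Module` class, the model dict, and the `find_tied_layers` helper below are all hypothetical illustrations, not the PEFT implementation.

```python
# Hedged sketch of tied-layer discovery: two modules are tied when their
# weights are the same underlying object, so an identity check finds them.

class Module:
    def __init__(self, weight):
        self.weight = weight

shared = [[0.1, 0.2], [0.3, 0.4]]           # one tensor, tied in two places
model = {
    "wte": Module(shared),                  # input embeddings (gpt2-style)
    "lm_head": Module(shared),              # tied LM head
    "h.0.mlp": Module([[5.0, 6.0]]),        # unrelated weight
}

def find_tied_layers(model, embedding_name):
    """Return names of all other modules sharing the embedding's weight object."""
    emb_weight = model[embedding_name].weight
    return [name for name, mod in model.items()
            if name != embedding_name and mod.weight is emb_weight]

tied = find_tied_layers(model, "wte")
print(tied)  # ['lm_head']
```

Each layer found this way would then receive its own adapter initialized with `tied_adapter` pointing at the embedding's adapter.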