Trainable Tokens: Support for Weight Tying #2399
Draft
This is a follow-up PR to #2376 that adds support for weight tying. Do not merge before the other PR is merged.
## What is this
Some models, such as gpt2, tie the weights between the LM head and the input embeddings for various reasons. If we use the trainable tokens adapter, we change the result of the `forward()` of the input embeddings, but we do not change the weights themselves (unless we `merge()`). This means that the changes are not reflected in the tied weights, such as the LM head, leading to wrong results when training.

## How it is solved
The current approach searches for tied layers and puts `TrainableTokensLayer` adapters on them as well, but initialized to use the parameters from the embedding layer's `TrainableTokensLayer`. This is done via the `tied_adapter` argument of `TrainableTokensLayer.__init__()`.
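The approach above can be sketched in plain Python. Everything here (the `TrainableTokensSketch` class, its attributes, the list-based weights) is an illustrative stand-in for the actual PEFT implementation, not the real API; only the `tied_adapter` idea of sharing one set of parameters between two adapter instances is taken from the description above.

```python
# Minimal sketch of the weight-tying problem and the tied-adapter fix.
# All names here are illustrative stand-ins, not the actual PEFT code.

class TrainableTokensSketch:
    """Adapter that overrides the rows of selected token indices at forward time."""
    def __init__(self, token_indices, dim, tied_adapter=None):
        if tied_adapter is not None:
            # Reuse the parameters of another adapter instead of owning new
            # ones, mirroring the `tied_adapter` argument described above.
            self.token_indices = tied_adapter.token_indices
            self.deltas = tied_adapter.deltas
        else:
            self.token_indices = token_indices
            self.deltas = {i: [0.0] * dim for i in token_indices}

    def patched_rows(self, weight):
        # Return the weight matrix with the trainable rows swapped in.
        return [self.deltas.get(i, row) for i, row in enumerate(weight)]

# One weight matrix shared (tied) between input embeddings and LM head.
tied_weight = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

emb_adapter = TrainableTokensSketch(token_indices=[1], dim=2)
emb_adapter.deltas[1] = [9.0, 9.0]          # pretend training updated token 1

# Without the fix: only the embedding sees the adapter; the LM head reads
# the untouched tied weight and is now inconsistent with the embedding.
emb_view = emb_adapter.patched_rows(tied_weight)
head_view_stale = tied_weight

# With the fix: the LM head gets its own adapter, initialized via
# `tied_adapter` so it shares the very same parameters.
head_adapter = TrainableTokensSketch(token_indices=None, dim=2,
                                     tied_adapter=emb_adapter)
head_view_fixed = head_adapter.patched_rows(tied_weight)

print(emb_view[1])         # updated row
print(head_view_stale[1])  # stale row: the bug
print(head_view_fixed[1])  # same updated row: tied adapter stays in sync
```

Because the tied adapter holds references to the same parameter objects, any training update to the embedding's adapter is automatically visible on the LM-head side.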
## What needs to be done
- `TrainableTokens` adapter
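The "searching for tied layers" step mentioned above can be sketched by looking for modules whose weight is the *same object* as the input embedding's weight. The `Module` class, the model dict, and the `find_tied_layers` helper below are all hypothetical illustrations, not the PEFT implementation.

```python
# Hedged sketch of tied-layer discovery: two modules are tied when their
# weights are the same underlying object, so an identity check finds them.

class Module:
    def __init__(self, weight):
        self.weight = weight

shared = [[0.1, 0.2], [0.3, 0.4]]           # one tensor, tied in two places
model = {
    "wte": Module(shared),                  # input embeddings (gpt2-style)
    "lm_head": Module(shared),              # tied LM head
    "h.0.mlp": Module([[5.0, 6.0]]),        # unrelated weight
}

def find_tied_layers(model, embedding_name):
    """Return names of all other modules sharing the embedding's weight object."""
    emb_weight = model[embedding_name].weight
    return [name for name, mod in model.items()
            if name != embedding_name and mod.weight is emb_weight]

tied = find_tied_layers(model, "wte")
print(tied)  # ['lm_head']
```

Each layer found this way would then receive its own adapter initialized with `tied_adapter` pointing at the embedding's adapter.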