
Standalone Custom Tokens Tuner and integrated into LoRA #2376

Merged · 92 commits · Feb 26, 2025

Conversation

githubnemo (Collaborator) commented Feb 13, 2025

This PR is based on the nifty addition of @marcusinthesky from #1541.

I took the liberty of bringing the branch up to date and extending the concept: we not only have CustomTokens as a PEFT method for selectively re-training tokens, but also a trainable_token_indices parameter in LoRA to combine both approaches (and possibly other methods in the future).

What is this

When adding tokens or fine-tuning the representation of specific tokens, we currently have little choice but to retrain the whole embedding matrix, which can be huge and adds to the memory footprint (in RAM but also on disk). This method creates a sparse matrix of shape (n, embed_dim), where n is the number of tokens to be customized, and only trains these few values.
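For illustration, a minimal sketch of the idea in plain PyTorch (names and shapes are assumed for the example; this is not the PR's actual code):

```py
import torch

vocab_size, embed_dim = 32, 8
token_indices = [0, 1, 2]                               # tokens to customize

weight = torch.randn(vocab_size, embed_dim)             # full embedding matrix
values = torch.zeros(len(token_indices), embed_dim)     # the only trainable values

# sparse delta whose entries are zero everywhere except in the customized rows
rows = torch.tensor(token_indices).repeat_interleave(embed_dim)
cols = torch.arange(embed_dim).repeat(len(token_indices))
delta = torch.sparse_coo_tensor(
    torch.stack([rows, cols]), values.flatten(), size=(vocab_size, embed_dim)
)

merged = weight + delta.to_dense()                      # effective embedding weights
```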

How to use this

Two possibilities:

  1. Use the CustomTokens PEFT method:
peft_config = CustomTokensConfig(target_modules=['embed_tokens'], token_indices=[0, 1, 2])
peft_model = get_peft_model(model, peft_config)
  2. Use it in conjunction with LoRA:
peft_config = LoraConfig(
    target_modules='all-linear',
    trainable_token_indices={'embed_tokens': [0, 1, 2]},
)
peft_model = get_peft_model(model, peft_config)
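In either case, saving should work like for any other PEFT adapter, and the checkpoint should only contain the handful of customized token rows (plus the LoRA weights, if any); the directory name below is just an example:

```py
peft_model.save_pretrained("custom-tokens-adapter")
```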

Implementation details

This is an early draft since I found no better way of implementing it without touching the modules_to_save infrastructure.

The idea is to abstract the ModulesToSaveWrapper into an AuxiliaryTrainingWrapper that allows for more functionality than simply setting requires_grad_(True) on specific modules and saving them alongside other modules. There are now three classes:

  • AuxiliaryTrainingWrapper: the base class that provides a common interface for wrapping modules and forwarding getattr/forward calls from said modules
  • ModulesToSaveWrapper: the same as before, but extended with a method to get the state dict of the wrapped modules for the given adapter, so that we know which modules to save without having to match state dict names
  • NewTokensWrapper: a thin wrapper around CustomTokensLayer that can be applied to layers specified by the trainable_token_indices parameter of LoraConfig (and of other configs in the future)

To load and save these modules, we iterate over the model's named_modules to filter all AuxiliaryTrainingWrapper instances, get their state dicts and, depending on load or save, read adapter-specific names and write them out to be adapter-less, or vice versa. In theory this should handle saving modules_to_save as well as trainable_token_indices, but that's one point that needs verification and careful review.
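A minimal sketch of the save path described above (the helper name and key handling are illustrative, not the PR's actual functions, and the import path for AuxiliaryTrainingWrapper is an assumption):

```py
# illustrative only; the import path is assumed
from peft.utils.other import AuxiliaryTrainingWrapper


def collect_auxiliary_state_dict(model, adapter_name="default"):
    to_save = {}
    for module_name, module in model.named_modules():
        if not isinstance(module, AuxiliaryTrainingWrapper):
            continue
        for key, value in module.state_dict().items():
            # strip the adapter-specific part so the stored keys are adapter-less
            clean_key = key.replace(f".{adapter_name}", "")
            to_save[f"{module_name}.{clean_key}"] = value
    return to_save
```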

Things I have not explicitly addressed yet:

  • I'm unsure how weight-tying comes into play here. Writing an explicit test for this is one of my immediate next steps, but I think it should be fine as long as we restore the embedding weight matrix properly.
  • get_peft_model_state_dict will probably also mark the embedding layer as a target since it has a valid embedding layer name, which is useless here. We could prevent this by overriding the default setting for save_embedding_layers, but I'm unsure if that is a good idea (a minimal sketch of that override follows below). We could also just tell the user that they can delete the weights if they want to. Not sure about this yet.
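A minimal sketch of the save_embedding_layers override mentioned in the second point (save_embedding_layers is the existing get_peft_model_state_dict parameter; whether overriding it here is a good idea is exactly the open question):

```py
from peft import get_peft_model_state_dict

# assuming `peft_model` is the model created above
state_dict = get_peft_model_state_dict(peft_model, save_embedding_layers=False)
```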

Open tasks

  • Add documentation and an example on how to use this method (with LoRA and standalone)
  • Add custom model tests with TrainableTokensConfig being used directly
  • Add a test that runs on GPU
  • Allow for different token_indices per adapter
  • Add tests that cover having multiple custom token tuners at once
  • Add tests that cover having multiple targets for custom tokens

Marcus Gawronsky and others added 17 commits March 6, 2024 14:56
This change makes it possible to combine the `CustomTokens` tuner
with LoRA (and potentially other) tuners.
Particularly interesting is the method for enabling adapters which
now needs to check for `AuxiliaryTrainingWrapper` instead of
`ModulesToSaveWrapper`. This is something that ought to be done
for each tuner that does this type of enabling.
This will probably be moved to somewhere else but these are necessary
for development so they can live here for now.
It turns out that it is more common than I thought for the embedding
layer to be called something else so we need to support dictionary
inputs to the `trainable_token_indices` parameter.
It was too late to make that change.
There's a dependency of `super().__init__()` on `.update()` but
the latter depends on an attribute that is set in the child class.

Therefore initialization of that attribute now happens in `.update()`
which is not ideal but better than changing the parent class even
more.
In theory there are now two parts that handle modules to save so
a next step is to see if there are conflicts between the two.

@marcusinthesky

Thanks for the shout-out. Looks super cool.

@BenjaminBossan (Member) left a comment

Thanks a lot for picking up this feature, having this should be very useful for many PEFT users.

At this point, I have only done a quick review to iterate fast. In addition to my inline comments, I have some more general points:

  1. Naming: I wonder if "custom tokens" is the right name for the feature. WDYT about "extra tokens", "trainable tokens", or "additional tokens"? Or even something like "sparse embedding update" or so? Let's just hold on for a sec and ensure we find the best name, as we can't change it once the feature is out.
  2. I'm just wondering out loud whether a sparse matrix is the best way to implement this. If we have a high embedding dimension, the matrix will contain a lot of items; I'm not sure if this could be inefficient. If it's implemented as a dense matrix (of course, only for the relevant columns), would that be possible? Perhaps via usage of index_add or scatter_add. I haven't investigated this option, just throwing some ideas out there (see the sketch below this list).
  3. Let's try to address as many TODOs as possible before merging, or else they tend to stick around.
  4. Did you run any realistic tests to ensure that this saves memory and reduces file size? I can help with that.
  5. We should have updates to the docs and examples to show the standalone version and the LoRA integration. It would be fine to do that in a separate PR after this one, but ideally it'll be added here.
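To make the dense alternative from point 2 concrete, here is a hedged sketch of an index_add-based update (row-wise layout assumed; this is not the PR's implementation):

```py
import torch

vocab_size, embed_dim = 32, 8
weight = torch.randn(vocab_size, embed_dim)          # original embedding weights
token_indices = torch.tensor([1, 3, 7])              # tokens to customize
delta = torch.zeros(len(token_indices), embed_dim)   # dense, trainable values only

# out-of-place merge: only the selected rows receive an update
merged = weight.index_add(0, token_indices, delta)
```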
Regarding testing:

Python 3.9 tests are failing because the foo: bar | baz type annotation syntax is not yet supported. Please add a from __future__ import annotations import where necessary.

Moreover, let's add a test case to test_custom_models.py, like here:

("Vanilla MLP 5 LoRA", "MLP", LoraConfig, {"target_modules": ["lin0"], "modules_to_save": ["lin1"]}),

This should result in a nice bump in test coverage.

if target_layer in self.modules_to_save:
    raise ValueError(
        "The embedding layer is already marked to be trained fully, either specify "
        f'`modules_to_save=[..., "{target_layer}", ...]` or `trainable_tokens=x` but not both.'
Member

Replace x in the message with target_layer?


@dataclass
class CustomTokensConfig(PeftConfig):
    token_indices: List[int] = field(default_factory=list)
Member

Let's add a help text here too and a docstring for the config.
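For illustration, one way the requested help text and docstring could look (a sketch only, not the PR's final code):

```py
from dataclasses import dataclass, field
from typing import List

from peft import PeftConfig


@dataclass
class CustomTokensConfig(PeftConfig):
    """Configuration for the CustomTokens tuner, which re-trains only selected token embeddings."""

    token_indices: List[int] = field(
        default_factory=list,
        metadata={"help": "Indices of the tokens whose embeddings should be trained."},
    )
```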

Comment on lines 58 to 60
values = torch.rand(
(self.num_trainable_embeddings * self.base_layer.weight.shape[-1],)
) # we initialize the values from a normal distribution N(0, 1), as in https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
Member

Maybe I'm missing something, but could we not take the values from the actual embedding matrix and use a copy of those to initialize the weights?

Member

I think I'm wrong, since this would add the same value twice, right? So to keep the initial values, this would have to be zeros.

I'm wondering: If a user extends the embedding for new tokens and then uses the custom token tuner, after training time when they load the model, they would need to ensure that when they resize the embedding again, the seed is exactly the same right? Or else they would need to save the original embedding, but that would mean much larger file sizes for the adapter, which we want to avoid.

This is okay I guess, but not super user friendly. For instance, with modules_to_save, we don't have this issue as the adapter weights contain all the info we need (but of course it's a full copy of the original weights, so quite large).

In an ideal world, the extra params for the custom tokens would replace the params of the original state dict, so that users won't have to worry about restoring those. Not sure if that's possible. As an alternative, I wonder if we can save a checksum of the weights that are being replaced as a buffer and then, when loading, raise an error if the checksum does not match?
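For concreteness, a hedged sketch of what such a checksum could look like (purely illustrative; no such helper exists in PEFT):

```py
import hashlib

import torch


def embedding_rows_checksum(weight: torch.Tensor, token_indices) -> str:
    """Hash the original rows that the tuner is about to replace."""
    rows = weight[token_indices].detach().to(torch.float32).cpu()
    return hashlib.sha256(rows.numpy().tobytes()).hexdigest()
```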

Collaborator Author

I changed the initialization of the delta values to zero. I don't think it is important to initialize them randomly, as the affected tokens are probably used in different contexts anyway. This is also nice because most tests assume that initializing a PEFT model does not change the parameters.


@githubnemo I agree.

Since reparameterized PEFT aims to train the deltas, initializing at zero may help when the user accidentally includes tokens which are not in their post-training/fine-tuning corpora. This may also be a safer option with respect to 'token drag'.

orig_weights += self.sparse_delta_tokens[active_adapter]

if safe_merge and not torch.isfinite(orig_weights).all():
    raise ValueError(
Member

If this fails, the original weights have still been mutated, right? The idea of safe_merge is that if it fails, the model stays in its original state. It is acceptable if that means we need to create a copy in case of safe_merge=True, but for safe_merge=False, copies should be avoided.
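A small sketch of the merge pattern described here (names and shapes are illustrative):

```py
import torch


def merge_delta(base_weight: torch.Tensor, delta: torch.Tensor, safe_merge: bool = False):
    if safe_merge:
        merged = base_weight.clone()       # work on a copy so failure leaves the model untouched
        merged += delta
        if not torch.isfinite(merged).all():
            raise ValueError("Merged weights contain NaNs/Infs, aborting merge")
        base_weight.copy_(merged)          # commit only after the check passed
    else:
        base_weight += delta               # no extra copy in the non-safe path
```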

Comment on lines 43 to 44
output_mod = peft_model.forward(output_hidden_states=True, **X)
output_org = original_model.forward(output_hidden_states=True, **X)
Member

Suggested change:
- output_mod = peft_model.forward(output_hidden_states=True, **X)
- output_org = original_model.forward(output_hidden_states=True, **X)
+ output_mod = peft_model(output_hidden_states=True, **X)
+ output_org = original_model(output_hidden_states=True, **X)

I was also confused about _org. Maybe change to _orig?

"input_ids": torch.tensor([[0, 1, 2, 3]]),
"attention_mask": torch.tensor([[1, 1, 1, 1]]),
}
output_trn = peft_model.forward(output_hidden_states=True, **X)
Member

_trn?

assert not torch.allclose(W_mod[:, :3], W_org[:, :3])
assert torch.allclose(W_mod[:, 3:], W_org[:, 3:])

def test_combined_with_lora_usage(self, model, tokenizer, tmp_path):
Member

Would it make sense to refactor the test to avoid most duplication? Especially if we plan on supporting other PEFT methods too.

Collaborator Author

I parametrized peft_config. The model is not parametrized, since we can also test such combinations in the common tests at a later point.


def test_stand_alone_usage(self, model, tokenizer, tmp_path):
original_model = copy.deepcopy(model)
peft_config = CustomTokensConfig(target_modules=["embed_tokens"], token_indices=[0, 1, 2])
Member

Just wondering: Would the test cover corner cases a little better if token_indices were not consecutive tokens starting at 0? So e.g. [1, 3] instead?

assert torch.allclose(W_mod, W_trn)

assert not torch.allclose(W_mod[:, :3], W_org[:, :3])
assert torch.allclose(W_mod[:, 3:], W_org[:, 3:])
Member

We should also ensure that there are tests that cover:

  • multiple targets for custom tokens
  • having multiple custom token tuners at once

githubnemo (Collaborator, Author) commented Feb 13, 2025

> 1. Naming: I wonder if "custom tokens" is the right name for the feature. WDYT about "extra tokens", "trainable tokens", or "additional tokens"? Or even something like "sparse embedding update" or so? Let's just hold on for a sec and ensure we find the best name, as we can't change it once the feature is out.

Agreed. A part of me wants it to be more general but I think that TrainableTokens is general enough without being too specific. I'll change it.

> 2. I'm just wondering out loud about whether a sparse matrix is the best way to implement this. If we have a high embedding dimension, the matrix will contain a lot of items, not sure if this could be inefficient. If it's implemented as a dense matrix (of course, only for the relevant columns), would that be possible? Perhaps via usage of index_add or scatter_add. I haven't investigated this option, just throwing some ideas out there.

Naïvely I would expect the sparse implementation to cope with this, but I agree, it is not certain that this is the best way of implementing it (or the best single way, depending on the conditions). Let's park this discussion until we have a benchmark in place.

> 3. Let's try to address as many TODOs as possible before merging or else they tend to stick around.

Yep. Most of the TODOs are points where I was unsure about how to proceed before the initial review(s). Getting on these now.

> 4. Did you run any realistic tests to ensure that this saves memory and reduces file size? I can help with that.

Nope, just functional tests. It would be great if you could do a bit of benchmarking, especially with the points from above regarding efficiency with larger embedding sizes.

> 5. We should have updates to the docs and examples to show the standalone version and the LoRA integration. It would be fine to do that in a separate PR after this one but ideally it'll be added here.

Yes, adding it as a to do item in the PR description.

@BenjaminBossan (Member)

> Agreed. A part of me wants it to be more general but I think that TrainableTokens is general enough without being too specific. I'll change it.

👍

> It would be great if you could do a bit of benchmarking, especially with the points from above regarding efficiency with larger embedding sizes.

I'll do a comparison with what would be the current approach, adding the embedding to modules_to_save. I plan to check that tomorrow.

nemo added 5 commits February 13, 2025 17:50
Merge onto the base weights only after checks have completed.
Refactor PEFT method as parameter and use non-consecutive indices for testing the layer modification
githubnemo marked this pull request as ready for review on February 24, 2025, 12:45
@BenjaminBossan (Member) left a comment

Thanks for the updates, this is almost good to go. I have some smaller comments, but those should not require big changes. Apart from those, two points:

  1. Do we know what happens when embedding weights are tied?
  2. Please add the GPU test that I shared with you to test_gpu_examples.py

if isinstance(peft_config.trainable_token_indices, dict):
    target_layers = peft_config.trainable_token_indices
else:
    target_layers = {"embedding": peft_config.trainable_token_indices}
Member

How did you determine that embedding is the best default name? Subjectively, I'd say that embed_tokens is more common.

Comment on lines 973 to 974
# `ModulesToSaveWrapper`. There are some places in the PEFT code base where the modules to save
# wrapper is applied based on this attribute which would lead to conflicts.
Member

I don't get the 2nd sentence, is it relevant here?

@@ -273,6 +273,13 @@ class LoraConfig(PeftConfig):
parameter when you want to apply LoRA to the ColumnParallelLinear and RowParallelLinear layers of megatron.
megatron_core (`Optional[str]`):
The core module from Megatron to use, defaults to `"megatron.core"`.
trainable_token_indices (`Optional[Union[List[int], dict[str, List[int]]]]`)
Lets you specify which token indices to selectively fine-tune without requiring to re-train the whole
embedding matrix using the `peft.TrainableTokensModel` method. You can either specify a list of indices
Member

> using the peft.TrainableTokensModel method.

I wonder if that's important info or can just be dropped.

Collaborator Author

I like pointers like these because I am incentivized to follow up on this without looking at the code and learning something new.

Member

Makes sense.

):
"""Wraps modules that are supposed to be re-trained either normally, i.e. marking them to require gradients and
saving them alongside other modules, or with certain methods that go alongside PEFT methods, such as retraining
specific token indices using sparse matrices.
Member

"sparse" no longer fits

Comment on lines 92 to 93
output_load = peft_model.forward(output_hidden_states=True, **X)
output_orig = original_model.forward(output_hidden_states=True, **X)
Member

Suggested change:
- output_load = peft_model.forward(output_hidden_states=True, **X)
- output_orig = original_model.forward(output_hidden_states=True, **X)
+ output_load = peft_model(output_hidden_states=True, **X)
+ output_orig = original_model(output_hidden_states=True, **X)

@marcusinthesky mentioned this pull request on Feb 24, 2025
@BenjaminBossan (Member) left a comment

I did a pass on the testing and doc changes specifically. Overall looks good, just some smaller comments.

@@ -272,6 +272,49 @@ trainer = Trainer(
)
```

## Efficiently train tokens alongside LoRA

Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning of specific tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched which saves memory and doesn't throw away learned context of existing token embeddings in contrast to when training the whole embedding matrix. Under the hood this method uses the [`~TrainableTokenLayer`].
Member

Suggested change:
- Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning of specific tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched which saves memory and doesn't throw away learned context of existing token embeddings in contrast to when training the whole embedding matrix. Under the hood this method uses the [`~TrainableTokenLayer`].
+ Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning of specific tokens alongside fine-tuning of other layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched. This saves memory and doesn't throw away learned context of existing token embeddings in contrast to training the whole embedding matrix. Under the hood this method uses the [`~TrainableTokenLayer`].

A bit more readable, WDYT?

tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

# make room for new tokens in the embedding matrix
base_model.resize_token_embeddings(len(tokenizer))
Member

I wonder if we should change this to:

Suggested change:
- base_model.resize_token_embeddings(len(tokenizer))
+ base_model.resize_token_embeddings(max(len(tokenizer), base_model.model.embed_tokens.num_embeddings))

For this specific model, it makes no difference. However, for some models the embedding matrix is actually larger than the vocab size (e.g. so that its size is a multiple of some power of 2). See e.g. the Qwen models. Thus, len(tokenizer.vocab) could be smaller than the embedding size, even after adding new tokens. In that case, transformers actually shrinks the embedding, which is not a good idea most of the time.

Collaborator Author

Yep, good point. It is not widely known, I think, that this is a thing. Makes it even more important.

peft_model = get_peft_model(base_model, lora_config)

# proceed to train the model like normal
[...]
Member

Some results from my training script (all with LoRA rank 32 and bfloat16):

  1. modules_to_save=['embed_tokens']

cuda memory avg: 15038MB
cuda memory max: 16316MB
total time: 10.81s
file size of checkpoint: 302.0MB

  2. LoRA on embedding

cuda memory avg: 14056MB
cuda memory max: 15581MB
total time: 9.75s
file size of checkpoint: 306.4MB

  3. Trainable tokens (6 indices)

cuda memory avg: 14039MB
cuda memory max: 15562MB
total time: 9.02s
file size of checkpoint: 52.1MB

It's not a huge saving in terms of VRAM, but it can make a difference.

# Trainable Tokens

The Trainable Tokens method provides a way to target specific token embeddings for fine-tuning without resorting to
training the full embedding matrix or using a low-rank adapter. It is based on the initial implementation from
Member

Suggested change:
- training the full embedding matrix or using a low-rank adapter. It is based on the initial implementation from
+ training the full embedding matrix or using an adapter on the embedding matrix. It is based on the initial implementation from

To make it less LoRA specific.


Some preliminary benchmarks acquired with [this script](https://github.com/huggingface/peft/blob/main/scripts/train_memory.py)
suggest that for `gemma-2-2b` (which has a rather large embedding matrix) you can save 4.8GiB VRAM with Trainable Tokens
over fully fine-tuning. While LoRA will use even less memory (-6.3GiB total over fine-tuning) it might also target
Member

Suggested change:
- over fully fine-tuning. While LoRA will use even less memory (-6.3GiB total over fine-tuning) it might also target
+ over fully fine-tuning the embedding matrix. While LoRA will use even less memory (-6.3GiB total over fine-tuning) it might also target

Member

I ran the check again (all with LoRA rank 32 and bfloat16):

  1. modules_to_save=['embed_tokens']

cuda memory avg: 9621MB
cuda memory max: 10880MB
total time: 11.78s
file size of checkpoint: 1149.4MB

  2. LoRA on embedding

cuda memory avg: 5245MB
cuda memory max: 6988MB
total time: 9.60s
file size of checkpoint: 1180.9MB

  3. Trainable tokens (6 indices)

cuda memory avg: 5117MB
cuda memory max: 6890MB
total time: 10.28s
file size of checkpoint: 24.4MB

So LoRA on embedding vs trainable tokens is pretty much on par when it comes to VRAM.

Collaborator Author

Is this gemma2 2b again?

@@ -1550,6 +1550,98 @@ def on_optimizer_step(self, args, state, control, **kwargs):
# assert loss is not None
assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.single_gpu_tests
Member

You put this test into the wrong class, which results in it trying to load a GPTQ-quantized base model. Please put it in the previous test class at line ~1393.

)

model = AutoModelForCausalLM.from_pretrained(
self.causal_lm_model_id,
Member

This attribute is undefined, you can use "facebook/opt-350m".

Collaborator Author

That's from being in the wrong test class.

)

# add 2 new tokens
tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
Member

Same

githubnemo (Collaborator, Author)

Thanks for the review :) Addressed your comments.

Weight-tying will be handled in a follow-up PR: #2399

@BenjaminBossan (Member) left a comment

Great work, really thorough PR and it should be helpful to many users. I have noticed a few minor issues still with the docs, up to you if you want to fix them. Anyway, feel free to merge once the CI is green.


Note that this method does not add tokens for you, you have to add tokens to the tokenizer yourself and resize the
embedding matrix of the model accordingly. This method will only re-train the embeddings for the tokens you specify.
This method can also be used in conjunction with LoRA layers! See [`~peft.LoraConfig.trainable_token_indices`].
Member

Link does not appear to be working :/


@@ -272,6 +272,50 @@ trainer = Trainer(
)
```

## Efficiently train tokens alongside LoRA

Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning of other tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched. This saves memory and doesn't throw away learned context of existing token embeddings in contrast to when training the whole embedding matrix. Under the hood this method uses the [`~TrainableTokenLayer`].
Member

This link is also broken :/


@@ -273,6 +273,13 @@ class LoraConfig(PeftConfig):
parameter when you want to apply LoRA to the ColumnParallelLinear and RowParallelLinear layers of megatron.
megatron_core (`Optional[str]`):
The core module from Megatron to use, defaults to `"megatron.core"`.
trainable_token_indices (`Optional[Union[List[int], dict[str, List[int]]]]`)
Lets you specify which token indices to selectively fine-tune without requiring to re-train the whole
embedding matrix using the `peft.TrainableTokensModel` method. You can either specify a list of indices
Member

Makes sense.

Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning of other tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched. This saves memory and doesn't throw away learned context of existing token embeddings in contrast to when training the whole embedding matrix. Under the hood this method uses the [`~TrainableTokenLayer`].

```py
# for layer 'embedding'
Member

Suggested change:
- # for layer 'embedding'
+ # for layer 'embed_tokens'

githubnemo merged commit f51203f into huggingface:main on Feb 26, 2025 (13 of 14 checks passed).