
Fix training #774

Open · wants to merge 24 commits into main
Conversation

@michaelbenayoun (Member) commented Jan 31, 2025

What does this PR do?

  • Updates the NeuronTrainer to match the Transformers version.
  • Fixes the way mixed-precision training is handled; it was breaking several trainings, such as Llama. We no longer use the XLA_USE_BF16 and XLA_DOWNCAST_BF16 flags, following the instructions from here.
  • Fixes support for gradient clipping: it now always happens between the reduction of the gradients across devices and the optimizer step. The gradient norm is also now always reported when logging.
  • Adds support for ignore_index in the parallel_cross_entropy_loss. There was a big issue when training with TP: the model was not learning. After investigating, it was linked to the input being padded and the vanilla parallel_cross_entropy from neuronx_distributed not supporting ignore_index (see the sketch after this list):
    • First, loss.mean() does not work in this case because the loss for the ignored tokens is not zeroed.
    • Second, the ignored tokens contributed to the gradient, which effectively destroys training.
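For context, here is a minimal sketch of the masked-loss idea behind the ignore_index fix. It is not the actual parallel_cross_entropy_loss code from this PR; the function name and the IGNORE_INDEX value are illustrative:

```python
import torch

# Illustrative value only, matching the usual Transformers convention.
IGNORE_INDEX = -100

def masked_mean_loss(per_token_loss: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Reduce a per-token loss while skipping ignored (e.g. padded) positions.

    `per_token_loss` and `labels` share the same shape, e.g. (batch, seq_len).
    Positions where labels == IGNORE_INDEX contribute neither to the mean nor
    to the gradient.
    """
    mask = (labels != IGNORE_INDEX).to(per_token_loss.dtype)
    # Zero out the ignored positions so they cannot leak into the gradient...
    masked_loss = per_token_loss * mask
    # ...and average over the number of real tokens, not the full sequence length.
    return masked_loss.sum() / mask.sum().clamp(min=1)
```

A plain per_token_loss.mean() would instead divide by the total number of positions and back-propagate through the padded tokens, which is exactly the failure described above.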

For now, DP + TP can lead to compilation issues with SDK 2.20, but they seem to be gone with SDK 2.21.

Tests performed

  • Llama (HuggingFaceTB/SmolLM2-135M-Instruct) can overfit with dp=1 tp=1
  • Llama (HuggingFaceTB/SmolLM2-135M-Instruct) + LoRA can overfit with dp=1 tp=1
  • Llama (meta-llama/Llama-3.2-1B) can overfit with dp=1 tp=2
  • Llama + LoRA (meta-llama/Llama-3.2-1B) can overfit with dp=1 tp=2
  • Llama (meta-llama/Llama-3.2-1B) can overfit with dp=4 tp=2 (only tested on SDK 2.21; otherwise a compiler error occurs)
  • Actual training of Llama (meta-llama/Llama-3.2-1B) with dp=4 tp=2 on SDK 2.21, compared against GPU runs

[W&B chart, 20/02/2025 17:35:18]

[W&B chart, 20/02/2025 17:35:38]

To be done in follow-up PRs

  • Add AdamW_FP32Params for more stable training in mixed precision (see the illustrative sketch below)
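For reference, the usual idea behind such an optimizer is to keep an FP32 master copy of every (possibly BF16) parameter and apply the AdamW update in full precision before copying the result back. The sketch below is purely illustrative and is not the planned AdamW_FP32Params implementation; the class name and structure are assumptions:

```python
import torch

class AdamWWithFP32Master(torch.optim.AdamW):
    """Illustrative only: step on FP32 master copies of (possibly BF16) parameters."""

    def __init__(self, params, **kwargs):
        params = list(params)
        self.model_params = params
        # Keep an FP32 master copy of every model parameter.
        self.master_params = [
            p.detach().clone().float().requires_grad_(True) for p in params
        ]
        super().__init__(self.master_params, **kwargs)

    def step(self, closure=None):
        with torch.no_grad():
            # Copy the (possibly low-precision) gradients onto the FP32 masters.
            for p, master in zip(self.model_params, self.master_params):
                master.grad = None if p.grad is None else p.grad.detach().float()
        loss = super().step(closure)
        with torch.no_grad():
            # Write the updated FP32 weights back into the model parameters.
            for p, master in zip(self.model_params, self.master_params):
                p.copy_(master.to(p.dtype))
        return loss
```

The trade-off is extra memory (an FP32 copy of every parameter plus FP32 optimizer states) in exchange for more stable updates when the model itself runs in BF16.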

@tengomucho (Collaborator) left a comment


I find it difficult to understand what your changes accomplish; could you give more details about that, please?
Also, would you mind pointing to a test (or example) that now works after your changes?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun changed the title from "Fixes and updates training code for Transformers 4.48.1" to "Fix training" on Feb 20, 2025
@dacorvo (Collaborator) left a comment

Several failures in the CI:

  • style check,
  • errors in the training code directly related to some of the changes (_PARALLEL_CROSS_ENTROPY_SHOULD_PRESERVE_INPUT is not found).

Maybe this should be rebased on the 2.21.1 branch once it is merged.

@JingyaHuang (Collaborator) left a comment

Not much to comment on; kudos for finding the fix!!

(will approve when the CIs pass)

@@ -426,6 +434,37 @@ def _peft_tuner_embedding_to_parallel_embedding(
return parent, parallel_linear


class ParallelEmbeddingsFixed(layers.ParallelEmbedding):
# TODO: remove when updating to neuronx_distributed >= 0.10.0
A collaborator left a comment on this diff:
What about raising an error when the version is >= 0.10.0 (so we don't forget to remove it)?
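For illustration, a minimal sketch of such a guard. It assumes neuronx_distributed exposes a __version__ attribute; if it does not, importlib.metadata could be used instead:

```python
from packaging import version

import neuronx_distributed

# Assumption: neuronx_distributed exposes __version__; otherwise use
# importlib.metadata.version("neuronx_distributed").
if version.parse(neuronx_distributed.__version__) >= version.parse("0.10.0"):
    raise RuntimeError(
        "ParallelEmbeddingsFixed is only needed for neuronx_distributed < 0.10.0; "
        "this workaround should now be removed."
    )
```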

@michaelbenayoun (Member, Author) replied:
> Several failures in the CI:
>
>   • style check,
>   • errors in the training code directly related to some of the changes (_PARALLEL_CROSS_ENTROPY_SHOULD_PRESERVE_INPUT is not found).
>
> Maybe this should be rebased on the 2.21.1 branch once it is merged.

I am working on fixing the CI failures.
If you want, you can merge the 2.21.1 branch and then I will rebase and adapt: some changes in this PR are specific to SDK 2.20, so if we are moving to 2.21 very soon it does not make sense to add them.

@dacorvo (Collaborator) commented Feb 21, 2025

I bumped the SDK version to 2.21.1: you can now rebase your branch and drop the 2.20 specifics.
