Fix training #774
base: main
Conversation
I find it difficult to understand what your changes have accomplished; can you give more details about that, please?
Also, would you mind pointing to a test (or example) that now works after your changes?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 3fb0438 to a1412e2
Several failures in the CI:
- style check,
- errors in the training code directly related to some of the changes (_PARALLEL_CROSS_ENTROPY_SHOULD_PRESERVE_INPUT is not found).
Maybe this should be rebased on the 2.21.1 branch once it is merged.
Not much to comment on, kudos for finding the fix!!
(will approve when the CIs pass)
optimum/neuron/distributed/utils.py (Outdated)
@@ -426,6 +434,37 @@ def _peft_tuner_embedding_to_parallel_embedding(
        return parent, parallel_linear


class ParallelEmbeddingsFixed(layers.ParallelEmbedding):
    # TODO: remove when updating to neuronx_distributed >= 0.10.0
What about raising an error when the version is >= 0.10.0 (so we don't forget to remove it)?
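A minimal sketch of that suggestion, assuming neuronx_distributed exposes a __version__ attribute; the exact check, placement, and message are illustrative, not the actual patch:

```python
# Fail loudly once the upstream fix ships, so the workaround cannot be forgotten.
# Assumes neuronx_distributed exposes __version__; check and message are illustrative.
from packaging import version

import neuronx_distributed

if version.parse(neuronx_distributed.__version__) >= version.parse("0.10.0"):
    raise RuntimeError(
        "ParallelEmbeddingsFixed is a workaround for neuronx_distributed < 0.10.0; "
        "it should be removed now that the fixed ParallelEmbedding is available."
    )
```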
I am working on fixing the CI failures.
I bumped the SDK version to 2.21.1: you can now rebase your branch and drop the 2.20 specifics.
What does this PR do?
- Updated the NeuronTrainer to match the Transformers version.
- Stopped setting the XLA_USE_BF16 and XLA_DOWNCAST_BF16 flags, following the instructions from here.
- Added support for ignore_index in parallel_cross_entropy_loss. There was a big issue in training when using TP: the model was not learning. After investigating, it was linked to the input being padded and the vanilla parallel_cross_entropy from neuronx_distributed not supporting ignore_index: loss.mean() does not work in this case because the loss for the ignored tokens is not zeroed (a sketch of the masking idea is shown below).

For now, DP + TP can lead to compilation issues with SDK 2.20, but they seem to be gone with SDK 2.21.
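For readers unfamiliar with the problem, here is a minimal sketch of the masking idea described above. This is not the PR's actual implementation, and the import path of parallel_cross_entropy may differ across SDK versions:

```python
import torch

# The vocab-parallel cross entropy from neuronx_distributed returns an unreduced,
# per-token loss and has no notion of ignore_index (import path may vary by SDK version).
from neuronx_distributed.parallel_layers.loss_functions import parallel_cross_entropy


def parallel_cross_entropy_with_ignore_index(logits, labels, ignore_index=-100):
    mask = labels == ignore_index
    # Ignored positions must first be mapped to a valid label id, because the
    # parallel loss indexes into the (sharded) vocabulary with the label values.
    safe_labels = torch.where(mask, torch.zeros_like(labels), labels)
    per_token_loss = parallel_cross_entropy(logits, safe_labels)
    # Zero the loss on padded tokens and average over the real ones only;
    # a plain loss.mean() would average in the padded positions as well.
    per_token_loss = per_token_loss.masked_fill(mask, 0.0)
    return per_token_loss.sum() / (~mask).sum().clamp(min=1)
```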
Tests performed
- HuggingFaceTB/SmolLM2-135M-Instruct can overfit with dp=1 tp=1
- HuggingFaceTB/SmolLM2-135M-Instruct + LoRA can overfit with dp=1 tp=1
- meta-llama/Llama-3.2-1B can overfit with dp=1 tp=2
- meta-llama/Llama-3.2-1B can overfit with dp=1 tp=2
- meta-llama/Llama-3.2-1B can overfit with dp=4 tp=2 (only tested on SDK 2.21, otherwise compiler error)
- meta-llama/Llama-3.2-1B trained with dp=4 tp=2 on SDK 2.21 and compared to GPUs

To be done in following PRs
- AdamW_FP32Params for a more stable training in mixed precision (see the rough sketch below)
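The AdamW_FP32Params class itself is not part of this PR; as a rough illustration of the underlying idea (FP32 master copies of the parameters for the optimizer while the model runs in bf16), using plain PyTorch:

```python
import torch

# Rough illustration of the FP32-master-params idea (not the AdamW_FP32Params class):
# the model computes in bf16, but AdamW updates fp32 copies of the parameters.
model = torch.nn.Linear(16, 16).to(torch.bfloat16)
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.AdamW(master_params, lr=1e-4)


def optimizer_step():
    # Called after loss.backward(): move bf16 grads onto the fp32 masters, step,
    # then copy the updated fp32 weights back into the bf16 model.
    for master, param in zip(master_params, model.parameters()):
        master.grad = param.grad.detach().float()
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        for master, param in zip(master_params, model.parameters()):
            param.copy_(master.to(param.dtype))
```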