
Bug about num_update_steps_per_epoch in function _inner_training_loop #36297

Open · onenotell opened this issue Feb 20, 2025 · 1 comment
onenotell commented Feb 20, 2025

System Info

python 3.10
transformers 4.48.3

Reproduction

In the _inner_training_loop method of /usr/local/lib/python3.10/dist-packages/transformers/trainer.py, num_update_steps_per_epoch is calculated inconsistently in two places, which leads to the following issues:

1、When calculating max_steps, the expression

num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps

rounds the number of update steps per epoch down.
Before the training loop, the calculation

total_updates = steps_in_epoch // args.gradient_accumulation_steps + 1

rounds it up. Then, when fetching batches during training,

num_batches = args.gradient_accumulation_steps if update_step != (total_updates - 1) else remainder

performs a final update on fewer than gradient_accumulation_steps batches, yet do_sync_step is still set to True and global_step is incremented. As a result, the actual number of optimizer steps per epoch exceeds the value used to compute max_steps, so training terminates before the dataset has been trained for the requested number of epochs.

This issue is particularly noticeable with small datasets. For example, with the following configuration, the training actually finished after only 6.8 epochs:

dataset length: 91
epochs: 10
batch size (per device): 10
GPUs: 2
gradient accumulation steps: 2
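The mismatch can be reproduced with plain arithmetic. A minimal sketch, assuming each of the 2 processes sees ceil(91 / 2) = 46 samples and therefore ceil(46 / 10) = 5 batches per epoch (consistent with the log below):

```python
# Hedged sketch of the step-count mismatch for the configuration above.
import math

num_examples, num_gpus, per_device_batch = 91, 2, 10
grad_accum, num_epochs = 2, 10

# Assumed per-process dataloader length: ceil(ceil(91/2) / 10) = 5 batches.
len_dataloader = math.ceil(math.ceil(num_examples / num_gpus) / per_device_batch)

# Path used to compute max_steps: floor division rounds DOWN.
num_update_steps_per_epoch = len_dataloader // grad_accum   # 5 // 2 = 2
max_steps = num_update_steps_per_epoch * num_epochs         # 20, as in the log

# Path used by the training loop: "// ... + 1" rounds UP.
total_updates = len_dataloader // grad_accum + 1            # 3 optimizer steps/epoch

# global_step therefore reaches max_steps before 10 epochs complete.
epochs_actually_trained = max_steps / total_updates         # 20 / 3 ≈ 6.67
print(max_steps, total_updates, round(epochs_actually_trained, 2))
```

With 3 real optimizer steps per epoch but max_steps budgeted for 2, global_step hits 20 partway through the 7th epoch, matching the "epoch: 6.8" in the log below.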

[INFO|trainer.py:2369] 2025-02-20 05:31:54,186 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-20 05:31:54,187 >> Num examples = 91
[INFO|trainer.py:2371] 2025-02-20 05:31:54,187 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-02-20 05:31:54,187 >> Instantaneous batch size per device = 10
[INFO|trainer.py:2375] 2025-02-20 05:31:54,187 >> Total train batch size (w. parallel, distributed & accumulation) = 40
[INFO|trainer.py:2376] 2025-02-20 05:31:54,187 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2377] 2025-02-20 05:31:54,187 >> Total optimization steps = 20
[INFO|trainer.py:2378] 2025-02-20 05:31:54,229 >> Number of trainable parameters = 20,185,088
{'loss': 4.8291, 'grad_norm': 0.7818735241889954, 'learning_rate': 4.9692208514878444e-05, 'epoch': 0.4, 'num_input_tokens_seen': 2240}
{'loss': 4.5171, 'grad_norm': 1.987959384918213, 'learning_rate': 2.5e-05, 'epoch': 3.4, 'num_input_tokens_seen': 18880}
{'loss': 3.8988, 'grad_norm': 1.4728916883468628, 'learning_rate': 0.0, 'epoch': 6.8, 'num_input_tokens_seen': 38000}

{'train_runtime': 4008.0113, 'train_samples_per_second': 0.227, 'train_steps_per_second': 0.005, 'train_tokens_per_second': 6.786, 'train_loss': 4.223528718948364, 'epoch': 6.8, 'num_input_tokens_seen': 38000}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [1:06:47<00:00, 200.40s/it]

***** train metrics *****
epoch = 6.8
num_input_tokens_seen = 38000
total_flos = 1505672GF
train_loss = 4.2235
train_runtime = 1:06:48.01
train_samples_per_second = 0.227
train_steps_per_second = 0.005
train_tokens_per_second = 6.786

2、When the dataset is small and gradient_accumulation_steps is large, so that a single epoch does not contain enough batches for one full accumulation cycle, training never triggers

do_sync_step = (step + 1) % args.gradient_accumulation_steps == 0 or (step + 1) == steps_in_epoch

As a result, do_sync_step remains False and global_step stays at 0, so the model never actually trains. Example configuration:

dataset length: 91
epochs: 10
batch size (per device): 4
GPUs: 2
gradient accumulation steps: 16
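Again as plain arithmetic, assuming each of the 2 processes sees ceil(ceil(91 / 2) / 4) = 12 batches per epoch (consistent with "Total optimization steps = 10" in the log below): the accumulation clause of do_sync_step can never be satisfied, because fewer than gradient_accumulation_steps batches exist in an epoch.

```python
# Hedged sketch for the configuration above.
import math

num_examples, num_gpus, per_device_batch = 91, 2, 4
grad_accum, num_epochs = 16, 10

# Assumed per-process batches per epoch: ceil(ceil(91/2) / 4) = 12.
steps_in_epoch = math.ceil(math.ceil(num_examples / num_gpus) / per_device_batch)

# 12 // 16 == 0, clamped to 1 update/epoch -> max_steps = 10, as logged.
num_update_steps_per_epoch = max(steps_in_epoch // grad_accum, 1)
max_steps = num_update_steps_per_epoch * num_epochs

# The first clause of do_sync_step, (step + 1) % 16 == 0, has no solution
# for step in range(12), so only the end-of-epoch clause could ever fire.
accum_clause_fires = any((step + 1) % grad_accum == 0 for step in range(steps_in_epoch))
print(max_steps, accum_clause_fires)
```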

[INFO|trainer.py:2369] 2025-02-20 07:26:40,274 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-20 07:26:40,533 >> Num examples = 91
[INFO|trainer.py:2371] 2025-02-20 07:26:40,939 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-02-20 07:26:41,320 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2375] 2025-02-20 07:26:42,179 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2376] 2025-02-20 07:26:42,498 >> Gradient Accumulation steps = 16
[INFO|trainer.py:2377] 2025-02-20 07:26:43,023 >> Total optimization steps = 10
[INFO|trainer.py:2378] 2025-02-20 07:26:43,810 >> Number of trainable parameters = 20,185,088
0%| | 0/10 [00:00<?, ?it/s][INFO|trainer.py:2643] 2025-02-20 07:38:24,660 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 515.6494, 'train_samples_per_second': 1.765, 'train_steps_per_second': 0.019, 'train_tokens_per_second': 47.164, 'train_loss': 45420.82977294922, 'epoch': 0, 'num_input_tokens_seen': 44128}
0%| | 0/10 [08:20<?, ?it/s]
[INFO|trainer.py:3910] 2025-02-20 07:38:31,120 >> Saving model checkpoint to /workspace/models/trained/DeepSeek-R1-Distill-Qwen-7B-007/adapter

***** train metrics *****
epoch = 0
num_input_tokens_seen = 44128
total_flos = 1748481GF
train_loss = 45420.8298
train_runtime = 0:08:35.64
train_samples_per_second = 1.765
train_steps_per_second = 0.019
train_tokens_per_second = 47.164

3、When the dataset is small and gradient_accumulation_steps is large, other configurations can trigger additional issues; further examples are omitted here.

4、The checkpoint-resume logic also depends on this calculation:

        epochs_trained = int(self.state.global_step // num_update_steps_per_epoch)
        if not args.ignore_data_skip:
            steps_trained_in_current_epoch = self.state.global_step % (num_update_steps_per_epoch)
            steps_trained_in_current_epoch *= args.gradient_accumulation_steps
        else:
            steps_trained_in_current_epoch = 0

When resuming training from a checkpoint, epochs_trained depends on num_update_steps_per_epoch, so the computed epoch can be greater than the epoch actually recorded in the checkpoint. steps_trained_in_current_epoch is inaccurate in the same way.
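A hypothetical resume scenario illustrates this, reusing the first configuration (2 counted update steps per epoch from the floor division, but 3 actual optimizer steps per epoch):

```python
# Hypothetical resume scenario built on the first configuration above.
num_update_steps_per_epoch = 2   # len_dataloader // grad_accum = 5 // 2
actual_updates_per_epoch = 3     # steps_in_epoch // grad_accum + 1

# Suppose a checkpoint was saved after exactly 2 real epochs:
global_step = 2 * actual_updates_per_epoch                   # 6

# The resume logic then infers one epoch too many:
epochs_trained = global_step // num_update_steps_per_epoch   # 6 // 2 = 3
print(epochs_trained)  # 3, although only 2 epochs actually completed
```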

Expected behavior

Please confirm the above issues.
Thank you very much!

@onenotell onenotell added the bug label Feb 20, 2025
@Rocketknight1 (Member) commented:

cc @muellerzr @SunMarc
