System Info
python 3.10
transformers 4.48.3
Reproduction
In the _inner_training_loop method of /usr/local/lib/python3.10/dist-packages/transformers/trainer.py, the calculation logic for num_update_steps_per_epoch is inconsistent, which leads to the following issues:
1、When calculating max_steps, the logic
num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
rounds down the number of steps.
Before the training loop, the calculation
total_updates = steps_in_epoch // args.gradient_accumulation_steps + 1
rounds up the number of steps. When fetching data during training,
num_batches = args.gradient_accumulation_steps if update_step != (total_updates - 1) else remainder
still trains the last, incomplete batch group, which does not contain enough data for a full gradient_accumulation_steps: do_sync_step is set to True and global_step is incremented. The number of optimizer updates actually performed per epoch therefore exceeds the value used to compute max_steps, so global_step reaches max_steps early and training terminates before the dataset has been trained for the configured number of epochs.
This issue is particularly noticeable with small datasets. For example, with the following configuration, the training actually finished after only 6.8 epochs:
dataset length: 91
epochs: 10
batch size per device: 10
GPUs: 2
gradient_accumulation_steps: 2
[INFO|trainer.py:2369] 2025-02-20 05:31:54,186 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-20 05:31:54,187 >> Num examples = 91
[INFO|trainer.py:2371] 2025-02-20 05:31:54,187 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-02-20 05:31:54,187 >> Instantaneous batch size per device = 10
[INFO|trainer.py:2375] 2025-02-20 05:31:54,187 >> Total train batch size (w. parallel, distributed & accumulation) = 40
[INFO|trainer.py:2376] 2025-02-20 05:31:54,187 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2377] 2025-02-20 05:31:54,187 >> Total optimization steps = 20
[INFO|trainer.py:2378] 2025-02-20 05:31:54,229 >> Number of trainable parameters = 20,185,088
{'loss': 4.8291, 'grad_norm': 0.7818735241889954, 'learning_rate': 4.9692208514878444e-05, 'epoch': 0.4, 'num_input_tokens_seen': 2240}
{'loss': 4.5171, 'grad_norm': 1.987959384918213, 'learning_rate': 2.5e-05, 'epoch': 3.4, 'num_input_tokens_seen': 18880}
{'loss': 3.8988, 'grad_norm': 1.4728916883468628, 'learning_rate': 0.0, 'epoch': 6.8, 'num_input_tokens_seen': 38000}
{'train_runtime': 4008.0113, 'train_samples_per_second': 0.227, 'train_steps_per_second': 0.005, 'train_tokens_per_second': 6.786, 'train_loss': 4.223528718948364, 'epoch': 6.8, 'num_input_tokens_seen': 38000}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [1:06:47<00:00, 200.40s/it]
***** train metrics *****
epoch = 6.8
num_input_tokens_seen = 38000
total_flos = 1505672GF
train_loss = 4.2235
train_runtime = 1:06:48.01
train_samples_per_second = 0.227
train_steps_per_second = 0.005
train_tokens_per_second = 6.786
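For reference, here is a minimal arithmetic sketch of the mismatch, using only the two formulas quoted above and the numbers from this example; the figure of 5 batches per epoch per process is an assumption derived from 91 samples, a per-device batch size of 10, and 2 GPUs (drop_last=False), not something printed in the log.
import math

num_examples = 91
per_device_batch_size = 10
num_processes = 2
gradient_accumulation_steps = 2
num_train_epochs = 10

# Batches per epoch seen by each process (assumed dataloader layout).
len_dataloader = math.ceil(num_examples / (per_device_batch_size * num_processes))  # 5

# Formula used to derive max_steps (rounds down):
num_update_steps_per_epoch = len_dataloader // gradient_accumulation_steps  # 2
max_steps = num_update_steps_per_epoch * num_train_epochs                   # 20, matches "Total optimization steps = 20"

# Formula used inside the training loop (the trailing +1 adds the partial accumulation group):
total_updates = len_dataloader // gradient_accumulation_steps + 1           # 3

# global_step advances 3 times per epoch, so max_steps (20) is reached after
# roughly 20 / 3 ≈ 6.7 epochs, consistent with the "epoch = 6.8" reported above.
print(max_steps, total_updates, max_steps / total_updates)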
2、When the dataset is small and gradient_accumulation_steps is large, so that a single epoch does not contain enough batches for even one full accumulation group, training never triggers
do_sync_step = (step + 1) % args.gradient_accumulation_steps == 0 or (step + 1) == steps_in_epoch
As a result, do_sync_step remains False and global_step stays at 0, causing the model to fail to train properly. Example configuration:
dataset length: 91
epochs: 10
batch size per device: 4
GPUs: 2
gradient_accumulation_steps: 16
[INFO|trainer.py:2369] 2025-02-20 07:26:40,274 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-20 07:26:40,533 >> Num examples = 91
[INFO|trainer.py:2371] 2025-02-20 07:26:40,939 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-02-20 07:26:41,320 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2375] 2025-02-20 07:26:42,179 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2376] 2025-02-20 07:26:42,498 >> Gradient Accumulation steps = 16
[INFO|trainer.py:2377] 2025-02-20 07:26:43,023 >> Total optimization steps = 10
[INFO|trainer.py:2378] 2025-02-20 07:26:43,810 >> Number of trainable parameters = 20,185,088
0%| | 0/10 [00:00<?, ?it/s][INFO|trainer.py:2643] 2025-02-20 07:38:24,660 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 515.6494, 'train_samples_per_second': 1.765, 'train_steps_per_second': 0.019, 'train_tokens_per_second': 47.164, 'train_loss': 45420.82977294922, 'epoch': 0, 'num_input_tokens_seen': 44128}
0%| | 0/10 [08:20<?, ?it/s]
[INFO|trainer.py:3910] 2025-02-20 07:38:31,120 >> Saving model checkpoint to /workspace/models/trained/DeepSeek-R1-Distill-Qwen-7B-007/adapter
***** train metrics *****
epoch = 0
num_input_tokens_seen = 44128
total_flos = 1748481GF
train_loss = 45420.8298
train_runtime = 0:08:35.64
train_samples_per_second = 1.765
train_steps_per_second = 0.019
train_tokens_per_second = 47.164
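For reference, the same kind of arithmetic sketch for this configuration; the figure of 12 batches per epoch and the clamp of num_update_steps_per_epoch to at least 1 are assumptions inferred from the numbers above and from the "Total optimization steps = 10" line, not quoted from the code.
import math

num_examples = 91
per_device_batch_size = 4
num_processes = 2
gradient_accumulation_steps = 16
num_train_epochs = 10

# Batches per epoch seen by each process (assumed dataloader layout).
len_dataloader = math.ceil(num_examples / (per_device_batch_size * num_processes))  # 12

# max_steps derivation; the floor result (0) is presumably clamped to 1, which would
# explain why the log reports "Total optimization steps = 10".
num_update_steps_per_epoch = max(len_dataloader // gradient_accumulation_steps, 1)  # 1
max_steps = num_update_steps_per_epoch * num_train_epochs                           # 10

# An epoch contains only 12 batches, so the modulo half of the do_sync_step condition
# can never be satisfied within an epoch:
print(any((step + 1) % gradient_accumulation_steps == 0 for step in range(len_dataloader)))  # False
This sketch only shows that the modulo half of the condition cannot fire; the logs above show that in practice global_step never advances past 0.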
3、Other combinations of a small dataset and a large gradient_accumulation_steps may trigger additional issues; further examples are not provided here.
4、The same calculation also affects the checkpoint-resume logic:
epochs_trained = int(self.state.global_step // num_update_steps_per_epoch)
if not args.ignore_data_skip:
    steps_trained_in_current_epoch = self.state.global_step % (num_update_steps_per_epoch)
    steps_trained_in_current_epoch *= args.gradient_accumulation_steps
else:
    steps_trained_in_current_epoch = 0
When resuming training from a checkpoint, epochs_trained is derived from num_update_steps_per_epoch, so the computed epoch can be greater than the epoch actually recorded in the checkpoint; steps_trained_in_current_epoch is similarly inaccurate.
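To illustrate the resume problem with the numbers from the first configuration above (a hypothetical scenario; the checkpoint value of global_step = 15 is an assumed example, not taken from the logs):
# Per issue 1: max_steps is computed assuming 2 updates per epoch,
# but the loop actually performs 3 updates per epoch.
num_update_steps_per_epoch = 2   # value the resume logic uses (floor division)
actual_updates_per_epoch = 3     # value observed at runtime
gradient_accumulation_steps = 2

global_step = 5 * actual_updates_per_epoch  # hypothetical checkpoint after 5 full epochs -> 15

epochs_trained = int(global_step // num_update_steps_per_epoch)  # 7, although only 5 epochs actually ran
steps_trained_in_current_epoch = (global_step % num_update_steps_per_epoch) * gradient_accumulation_steps  # 2
print(epochs_trained, steps_trained_in_current_epoch)
On resume, the data-skip logic would therefore assume 7 completed epochs even though only 5 were actually trained.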
Expected behavior
Please confirm the above issues.
Thank you very much!