System Info
python 3.10
transformers 4.48.3
Reproduction
In the _inner_training_loop method of /usr/local/lib/python3.10/dist-packages/transformers/trainer.py, the calculation logic for num_update_steps_per_epoch is inconsistent, which leads to the following issues:
1、When calculating max_steps, the logic
num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
rounds down the number of steps.
Before the training loop, the calculation
total_updates = steps_in_epoch // args.gradient_accumulation_steps + 1
rounds up the number of steps. When fetching data during training,
num_batches = args.gradient_accumulation_steps if update_step != (total_updates - 1) else remainder
still trains the last, incomplete batch group, which does not contain enough data for a full gradient_accumulation_steps: do_sync_step is set to True and global_step is incremented. The number of optimizer updates actually performed per epoch therefore exceeds the value used to compute max_steps, so global_step reaches max_steps early and training terminates before the dataset has been trained for the configured number of epochs.
This issue is particularly noticeable with small datasets. For example, with the following configuration, the training actually finished after only 6.8 epochs:
dataset length: 91
epochs: 10
batch size per device: 10
GPUs: 2
gradient_accumulation_steps: 2
[INFO|trainer.py:2369] 2025-02-20 05:31:54,186 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-20 05:31:54,187 >> Num examples = 91
[INFO|trainer.py:2371] 2025-02-20 05:31:54,187 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-02-20 05:31:54,187 >> Instantaneous batch size per device = 10
[INFO|trainer.py:2375] 2025-02-20 05:31:54,187 >> Total train batch size (w. parallel, distributed & accumulation) = 40
[INFO|trainer.py:2376] 2025-02-20 05:31:54,187 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2377] 2025-02-20 05:31:54,187 >> Total optimization steps = 20
[INFO|trainer.py:2378] 2025-02-20 05:31:54,229 >> Number of trainable parameters = 20,185,088
{'loss': 4.8291, 'grad_norm': 0.7818735241889954, 'learning_rate': 4.9692208514878444e-05, 'epoch': 0.4, 'num_input_tokens_seen': 2240}
{'loss': 4.5171, 'grad_norm': 1.987959384918213, 'learning_rate': 2.5e-05, 'epoch': 3.4, 'num_input_tokens_seen': 18880}
{'loss': 3.8988, 'grad_norm': 1.4728916883468628, 'learning_rate': 0.0, 'epoch': 6.8, 'num_input_tokens_seen': 38000}
{'train_runtime': 4008.0113, 'train_samples_per_second': 0.227, 'train_steps_per_second': 0.005, 'train_tokens_per_second': 6.786, 'train_loss': 4.223528718948364, 'epoch': 6.8, 'num_input_tokens_seen': 38000}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [1:06:47<00:00, 200.40s/it]
***** train metrics *****
epoch = 6.8
num_input_tokens_seen = 38000
total_flos = 1505672GF
train_loss = 4.2235
train_runtime = 1:06:48.01
train_samples_per_second = 0.227
train_steps_per_second = 0.005
train_tokens_per_second = 6.786
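For reference, here is a minimal arithmetic sketch of the mismatch, using only the two formulas quoted above and the numbers from this example; the figure of 5 batches per epoch per process is an assumption derived from 91 samples, a per-device batch size of 10, and 2 GPUs (drop_last=False), not something printed in the log.
import math

num_examples = 91
per_device_batch_size = 10
num_processes = 2
gradient_accumulation_steps = 2
num_train_epochs = 10

# Batches per epoch seen by each process (assumed dataloader layout).
len_dataloader = math.ceil(num_examples / (per_device_batch_size * num_processes))  # 5

# Formula used to derive max_steps (rounds down):
num_update_steps_per_epoch = len_dataloader // gradient_accumulation_steps  # 2
max_steps = num_update_steps_per_epoch * num_train_epochs                   # 20, matches "Total optimization steps = 20"

# Formula used inside the training loop (the trailing +1 adds the partial accumulation group):
total_updates = len_dataloader // gradient_accumulation_steps + 1           # 3

# global_step advances 3 times per epoch, so max_steps (20) is reached after
# roughly 20 / 3 ≈ 6.7 epochs, consistent with the "epoch = 6.8" reported above.
print(max_steps, total_updates, max_steps / total_updates)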
2、When the dataset is small and gradient_accumulation_steps is large, so that a single epoch does not contain enough batches for even one full accumulation group, training never triggers
do_sync_step = (step + 1) % args.gradient_accumulation_steps == 0 or (step + 1) == steps_in_epoch
As a result, do_sync_step remains False and global_step stays at 0, causing the model to fail to train properly. Example configuration:
dataset length: 91
epochs: 10
batch size per device: 4
GPUs: 2
gradient_accumulation_steps: 16
[INFO|trainer.py:2369] 2025-02-20 07:26:40,274 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-20 07:26:40,533 >> Num examples = 91
[INFO|trainer.py:2371] 2025-02-20 07:26:40,939 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-02-20 07:26:41,320 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2375] 2025-02-20 07:26:42,179 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2376] 2025-02-20 07:26:42,498 >> Gradient Accumulation steps = 16
[INFO|trainer.py:2377] 2025-02-20 07:26:43,023 >> Total optimization steps = 10
[INFO|trainer.py:2378] 2025-02-20 07:26:43,810 >> Number of trainable parameters = 20,185,088
0%| | 0/10 [00:00<?, ?it/s][INFO|trainer.py:2643] 2025-02-20 07:38:24,660 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 515.6494, 'train_samples_per_second': 1.765, 'train_steps_per_second': 0.019, 'train_tokens_per_second': 47.164, 'train_loss': 45420.82977294922, 'epoch': 0, 'num_input_tokens_seen': 44128}
0%| | 0/10 [08:20<?, ?it/s]
[INFO|trainer.py:3910] 2025-02-20 07:38:31,120 >> Saving model checkpoint to /workspace/models/trained/DeepSeek-R1-Distill-Qwen-7B-007/adapter
***** train metrics *****
epoch = 0
num_input_tokens_seen = 44128
total_flos = 1748481GF
train_loss = 45420.8298
train_runtime = 0:08:35.64
train_samples_per_second = 1.765
train_steps_per_second = 0.019
train_tokens_per_second = 47.164
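For reference, the same kind of arithmetic sketch for this configuration; the figure of 12 batches per epoch and the clamp of num_update_steps_per_epoch to at least 1 are assumptions inferred from the numbers above and from the "Total optimization steps = 10" line, not quoted from the code.
import math

num_examples = 91
per_device_batch_size = 4
num_processes = 2
gradient_accumulation_steps = 16
num_train_epochs = 10

# Batches per epoch seen by each process (assumed dataloader layout).
len_dataloader = math.ceil(num_examples / (per_device_batch_size * num_processes))  # 12

# max_steps derivation; the floor result (0) is presumably clamped to 1, which would
# explain why the log reports "Total optimization steps = 10".
num_update_steps_per_epoch = max(len_dataloader // gradient_accumulation_steps, 1)  # 1
max_steps = num_update_steps_per_epoch * num_train_epochs                           # 10

# An epoch contains only 12 batches, so the modulo half of the do_sync_step condition
# can never be satisfied within an epoch:
print(any((step + 1) % gradient_accumulation_steps == 0 for step in range(len_dataloader)))  # False
This sketch only shows that the modulo half of the condition cannot fire; the logs above show that in practice global_step never advances past 0.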
3、Other combinations of a small dataset and a large gradient_accumulation_steps may trigger additional issues; further examples are not provided here.
4、The same calculation also affects the checkpoint-resume logic:
epochs_trained = int(self.state.global_step // num_update_steps_per_epoch)
if not args.ignore_data_skip:
    steps_trained_in_current_epoch = self.state.global_step % (num_update_steps_per_epoch)
    steps_trained_in_current_epoch *= args.gradient_accumulation_steps
else:
    steps_trained_in_current_epoch = 0
When resuming training from a checkpoint, epochs_trained is derived from num_update_steps_per_epoch, so the computed epoch can be greater than the epoch actually recorded in the checkpoint; steps_trained_in_current_epoch is similarly inaccurate.
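To illustrate the resume problem with the numbers from the first configuration above (a hypothetical scenario; the checkpoint value of global_step = 15 is an assumed example, not taken from the logs):
# Per issue 1: max_steps is computed assuming 2 updates per epoch,
# but the loop actually performs 3 updates per epoch.
num_update_steps_per_epoch = 2   # value the resume logic uses (floor division)
actual_updates_per_epoch = 3     # value observed at runtime
gradient_accumulation_steps = 2

global_step = 5 * actual_updates_per_epoch  # hypothetical checkpoint after 5 full epochs -> 15

epochs_trained = int(global_step // num_update_steps_per_epoch)  # 7, although only 5 epochs actually ran
steps_trained_in_current_epoch = (global_step % num_update_steps_per_epoch) * gradient_accumulation_steps  # 2
print(epochs_trained, steps_trained_in_current_epoch)
On resume, the data-skip logic would therefore assume 7 completed epochs even though only 5 were actually trained.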
Expected behavior
Please confirm the above issues.
Thank you very much!