Allow defining a training schedule, i.e. shifting the weights of datasets during training, per step.
One example could be:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        path: ds1
      - weight:
          __module__: megatron.energon
          __class__: WeightSchedule
          linear:  # Maybe "linear" or "step"?
            0: 100    # At iteration 0 (i.e. 0 items yielded on each rank), the weight is 100
            100: 10   # At iteration 100, the weight is 10
            1000: 0   # At iteration 1000 (and onwards), the weight is 0
        path: ds2
```
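To pin down the proposed semantics, here is a minimal Python sketch (not existing energon code; `resolve_weight` and its signature are made up for illustration) of how such a keypoint mapping could be evaluated at a given iteration, either with linear interpolation or as a step function:

```python
from bisect import bisect_right

# Hypothetical helper (not existing energon code): resolve a dataset's weight at a
# given iteration from the keypoints above, either by linear interpolation between
# keypoints or as a step function that holds the last passed keypoint's value.
def resolve_weight(keypoints, iteration, mode="linear"):
    points = sorted(keypoints.items())
    iters = [it for it, _ in points]
    # Clamp outside the defined range (before the first / after the last keypoint).
    if iteration <= iters[0]:
        return float(points[0][1])
    if iteration >= iters[-1]:
        return float(points[-1][1])
    idx = bisect_right(iters, iteration)
    (it0, w0), (it1, w1) = points[idx - 1], points[idx]
    if mode == "step":
        return float(w0)
    # Linear interpolation between the two surrounding keypoints.
    return w0 + (w1 - w0) * (iteration - it0) / (it1 - it0)

# Keypoints from the example above: weight 100 at iteration 0, 10 at 100, 0 from 1000 on.
schedule = {0: 100, 100: 10, 1000: 0}
assert resolve_weight(schedule, 50) == 55.0       # halfway between 100 and 10
assert resolve_weight(schedule, 50, "step") == 100.0
assert resolve_weight(schedule, 2000) == 0.0      # clamped after the last keypoint
```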
Also for epochized_blend:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    epochized_blend:
      - repetitions: 1
        path: ds1
      - repetitions: 2
        # Combination with "weight" to fade in / out a dataset? The outer
        # repetitions still hold (except when weight becomes 0).
        weight:
          __module__: megatron.energon
          __class__: WeightSchedule
          linear:  # Maybe "linear" or "step"?
            0: 100    # At iteration 0 (i.e. 0 items yielded on each rank), the weight is 100
            100: 10   # At iteration 100, the weight is 10
            1000: 0   # At iteration 1000 (and onwards), the weight is 0
        path: ds2
```
With an outer schedule:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    sequential_schedule:  # Or just "schedule" or just "sequential" or "curriculum"?
      # Does this need an option to end iterating at the end of a stage?
      # Otherwise, the shufbuf will mix stages.
      # I guess, inside we cannot handle "blend", but only "blend_epochized"
      # or a dataset directly.
      # This is stage 1 of the training, until the repetitions are done.
      - epochized_blend:  # Blend the first part consisting of these datasets
          - repetitions: 1
            path: ds1
          - repetitions: 2
            path: ds2
      # This is stage 2 of the training, until the repetitions are done.
      - weight: 1
        path: ds3
```
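The intended stage semantics could be sketched roughly as follows (plain Python, purely illustrative, not energon code): each stage is a finite iterable that ends once its repetitions are exhausted, and the next stage only starts afterwards. A shuffle buffer layered on top of the chained stages would mix samples across the boundary, which is exactly the open question in the comment above.

```python
import itertools

# Purely illustrative sketch of the "sequential_schedule" semantics, not energon code:
# each stage is a finite iterable, and stage N+1 only starts once stage N is exhausted.
def sequential_schedule(stages):
    return itertools.chain.from_iterable(stages)

stage1 = ["ds1-sample", "ds2-sample-a", "ds2-sample-b"]  # epochized blend of ds1 + ds2
stage2 = ["ds3-sample-a", "ds3-sample-b"]                # plain ds3 (stage 2)
print(list(sequential_schedule([stage1, stage2])))
# ['ds1-sample', 'ds2-sample-a', 'ds2-sample-b', 'ds3-sample-a', 'ds3-sample-b']
```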
Discussion:
- The schedule depends on the number of dataset iterations. This may not equal the number of gradient updates, e.g. with gradient accumulation. Should we make gradacc / steps_per_iter configurable?
- Maybe rather use `type: linear` instead of `linear:` and `step:`? This should be unified with typical lr-schedulers (a possible shape is sketched below).
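For reference, the `type:`-based variant could look roughly like this (hypothetical syntax; the `points` key name is made up here, the proposal is only to move the interpolation mode into a `type` field):

```yaml
- weight:
    __module__: megatron.energon
    __class__: WeightSchedule
    type: linear   # or "step", mirroring typical lr-scheduler configs
    points:        # hypothetical key name for the iteration -> weight mapping
      0: 100
      100: 10
      1000: 0
  path: ds2
```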