Support ZBVZeroBubbleSchedule #817
base: main
Conversation
elif is_last_stage:
    losses = []
    pp_schedule.step(target=labels, losses=losses)
targets = labels if has_last_stage else None
Kinda nitpicking, but I feel like if the stage object inside the schedule already knows whether it is first or last, we could avoid having this logic in the training loop too.
OTOH, it seems nice to be explicit at the train.py layer about whether we are asking to compute the loss or not.
Thoughts?
@tianyu-l
It feels nice to only explicitly pass in targets/losses when they are meaningful, rather than when we're not sure if they'll be properly accessed, so I'm OK with these if-else statements.
But how different is input_ids? Can we just unify everything into pp_schedule.step(input_ids, target=targets, losses=losses) and pass input_ids = None when not has_first_stage?
We can't do input_ids=None right now, since we have logic that automatically splits all *args into microbatches. For example, if the user wants to do step(tensors, None), that would be split up into microbatches of (tensor1, None), (tensor2, None), ... We could update the splitting logic, but I'm not sure it is worth it.
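To make the constraint concrete, here is a minimal sketch of the kind of *args splitting described above. This is not torchtitan/PyTorch code: the splitter name is hypothetical and plain lists stand in for tensors, but it shows why a None positional argument breaks naive per-arg chunking.

```python
def split_args_into_microbatches(args, n_microbatches):
    """Naive splitting: every positional arg is assumed to be chunkable
    along its first dimension (plain lists stand in for tensors here)."""
    def chunk(seq, n):
        size = len(seq) // n  # raises TypeError if seq is None
        return [seq[i * size:(i + 1) * size] for i in range(n)]
    per_arg_chunks = [chunk(a, n_microbatches) for a in args]
    # Zip the chunks back together so each microbatch receives one piece
    # of every positional argument.
    return list(zip(*per_arg_chunks))

batch = list(range(8))
assert split_args_into_microbatches((batch,), 2) == [
    ([0, 1, 2, 3],),
    ([4, 5, 6, 7],),
]

# Passing None positionally breaks the splitter, which is why
# step(tensors, None) is not supported without changing this logic.
try:
    split_args_into_microbatches((batch, None), 2)
except TypeError:
    pass
```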
GPU CI failed; I'm not sure if it's due to the issue I commented on above.
targets = labels if has_last_stage else None
losses = [] if has_last_stage else None
if has_first_stage:
    pp_schedule.step(input_ids, target=targets, losses=losses)
What if a schedule has has_last_stage = True and has_first_stage = False for the output layer -- will it miss the chance to feed in losses?
Oops yeah, that was the issue. Updated it and will let the CI run again
This is dependent on the changes in this pytorch stack: pytorch/pytorch#146217
Add support for running ZBVZeroBubbleSchedule and v-shaped CSV schedules in torchtitan.
Fixes #774