🐛 Bug
See the tests in AI-Hypercomputer/torchprime#146: we can train on 2 slices with the 2.6 stable release, but every nightly version I tested fails. The proximate cause is that tracing the training loop triggers MegaScale device discovery a second time (likely within flash attention). We fixed this once in #8609, but it is failing again.
To Reproduce
Run the checks on PR AI-Hypercomputer/torchprime#146 with a nightly Docker image of your choice.
Expected behavior
Training works on 2 slices.
Additional context
We fixed one source of double MegaScale discovery in #8609.