Training on multi-slice v6e discovers MegaScale twice again #8812

tengyifei · 2025-03-10T07:53:37Z

🐛 Bug

As the tests in AI-Hypercomputer/torchprime#146 show, we can train on 2 slices with the 2.6 stable release, but every nightly version I tested fails. The proximate cause is that tracing the training loop triggers MegaScale device discovery a second time (likely within flash attention). We fixed this once in #8609, but it's failing again.

To Reproduce

Run checks on the PR AI-Hypercomputer/torchprime#146 with a nightly docker image of your choice.

Expected behavior

Training works on 2 slices.

Additional context

We fixed one source of double MegaScale discovery in #8609.

bhavya01 · 2025-03-14T18:12:45Z

This should be fixed by #8819, but testing is currently blocked for lack of a healthy cluster.

ysiraichi added the bug Something isn't working label Mar 10, 2025

bhavya01 self-assigned this Mar 10, 2025
