Training on multi-slice v6e discovers MegaScale twice again #8812

Open
tengyifei opened this issue Mar 10, 2025 · 1 comment

@tengyifei
Collaborator

🐛 Bug

Per the tests in AI-Hypercomputer/torchprime#146, we can train on 2 slices with the 2.6 stable release, but every nightly version I tested fails. The proximate cause is that tracing the training loop triggers MegaScale device discovery a second time (likely within flash attention). We fixed this once in #8609, but it's failing again.

To Reproduce

Run checks on the PR AI-Hypercomputer/torchprime#146 with a nightly docker image of your choice.

Expected behavior

Training works on 2 slices.

Additional context

We fixed one source of double MegaScale discovery in #8609.

@ysiraichi ysiraichi added the bug Something isn't working label Mar 10, 2025
@bhavya01 bhavya01 self-assigned this Mar 10, 2025
@bhavya01
Collaborator

This should be fixed by #8819, but testing is currently blocked by the lack of a healthy cluster.
