
ProcessGroupNCCL,Manager: surface async abort errors correctly #147

Merged
d4l3k merged 1 commit into main from d4l3k/async_err
Mar 26, 2025

Conversation

d4l3k
Member

@d4l3k d4l3k commented Mar 21, 2025

This adds a mechanism for propagating abort errors to Manager via ProcessGroup.errored().

Without this change, Manager.should_commit doesn't wait for all operations to complete. Even if an abort occurs, we don't detect it: the operations run asynchronously on the GPU, and an NCCL abort causes them to return successfully.

This intentionally adds a synchronization point in should_commit.
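
For readers skimming the description, here is a rough, torch-free sketch of the idea. It is not the code in this PR: `ToyProcessGroup`, `ToyManager`, and `track` are hypothetical names made up for illustration, and plain `concurrent.futures.Future` objects stand in for asynchronous GPU work. Only the names `errored()` and `should_commit` come from the description above. The sketch assumes `errored()` returns the recorded exception, or `None` if nothing went wrong.

```python
import threading
from concurrent.futures import Future
from typing import List, Optional


class ToyProcessGroup:
    """Hypothetical stand-in for a process group; not the torchft class."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._error: Optional[Exception] = None

    def abort(self) -> None:
        # Record the abort so it can be observed later, even if the
        # in-flight work still completes without raising.
        with self._lock:
            if self._error is None:
                self._error = RuntimeError("process group aborted")

    def errored(self) -> Optional[Exception]:
        # Assumed contract: the recorded error, or None if healthy.
        with self._lock:
            return self._error


class ToyManager:
    """Hypothetical stand-in for Manager; only models should_commit."""

    def __init__(self, pg: ToyProcessGroup) -> None:
        self._pg = pg
        self._pending: List[Future] = []

    def track(self, fut: Future) -> None:
        # Remember outstanding work so should_commit can wait on it.
        self._pending.append(fut)

    def should_commit(self) -> bool:
        pending, self._pending = self._pending, []
        ok = True
        for fut in pending:
            # Synchronization point: exception() blocks until the
            # future has finished.
            if fut.exception() is not None:
                ok = False
        # Work that "succeeded" is not enough; also ask the process
        # group whether an abort was recorded in the meantime.
        return ok and self._pg.errored() is None


# An aborted operation that still reports success must block the commit.
pg = ToyProcessGroup()
manager = ToyManager(pg)
fut: Future = Future()
manager.track(fut)
pg.abort()
fut.set_result(None)
commit_after_abort = manager.should_commit()

# A healthy group with completed work commits normally.
healthy = ToyManager(ToyProcessGroup())
done: Future = Future()
done.set_result(None)
healthy.track(done)
commit_healthy = healthy.should_commit()
```

In this toy, `commit_after_abort` is `False` even though the tracked future completed without an exception, which is the failure mode described above: success of the operation alone says nothing about whether an abort happened.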

This also includes a small fix in the resiliency tests to avoid stream dependencies between different worker threads.

Test plan:

pytest -o log_cli=1 torchft/process_group_test.py -v -s -x -k 'NormalNcclMultiPgTest'
pytest torchft/manager_test.py

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 21, 2025
@d4l3k d4l3k force-pushed the d4l3k/async_err branch from b06100f to 8b2581f Compare March 21, 2025 21:28
@d4l3k d4l3k force-pushed the d4l3k/async_err branch from 8b2581f to 0e8bf01 Compare March 21, 2025 22:28
@d4l3k d4l3k merged commit 2b3cd8d into main Mar 26, 2025
7 checks passed
@d4l3k d4l3k deleted the d4l3k/async_err branch March 26, 2025 20:25