
Adds more collectives to ProcessGroups #108

Draft · wants to merge 15 commits into main
Conversation

allenwang28
Contributor

What does this PR do?

Continuation of work for #97.

This PR:

  • Adds more collectives (a usage sketch follows this list):
    • allreduce_coalesced
    • alltoall_base
    • barrier
    • reduce_scatter
    • send/recv
  • Extends process_group_test.py to accommodate the above collectives
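
For readers unfamiliar with these ops, here is a minimal sketch of their semantics using plain torch.distributed on a single-rank gloo group rather than torchft's ProcessGroup wrappers. This is illustrative only, not the PR's test code; the collectives that require more than one rank (reduce_scatter, send/recv) are omitted here.

```python
# Minimal sketch (not the PR's test code) exercising the new collectives
# through plain torch.distributed on a single-rank gloo group.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")  # hypothetical free port
dist.init_process_group("gloo", rank=0, world_size=1)

# allreduce_coalesced: reduce a list of tensors in a single call
tensors = [torch.ones(2), torch.ones(3)]
dist.all_reduce_coalesced(tensors, op=dist.ReduceOp.SUM)

# alltoall_base: exchange equal-sized chunks between ranks
out = torch.empty(4)
dist.all_to_all_single(out, torch.arange(4.0))

# barrier: block until every rank arrives
dist.barrier()

dist.destroy_process_group()
```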

Concerns and possible follow up actions

Missing ops in backends

Notably, allgather_into_tensor_coalesced and reduce_scatter_tensor_coalesced were not added in this PR.

They were part of the plan, but I started seeing errors like:

E       AttributeError: 'torch._C._distributed_c10d.ProcessGroupNCCL' object has no attribute 'allgather_into_tensor_coalesced'

This was confusing, but sure enough it's true; if we run dir(pg) we see:

['NCCLConfig', 'Options', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_add_ephemeral_timeout', '_allgather_base', '_end_coalescing', '_get_backend_name', '_get_sequence_number_for_group', '_group_end', '_group_start', '_is_initialized', '_pybind11_conduit_v1_', '_reduce_scatter_base', '_set_default_timeout', '_set_sequence_number_for_group', '_shutdown', '_start_coalescing', '_verify_work_timeout', 'abort', 'allgather', 'allgather_coalesced', 'allreduce', 'allreduce_coalesced', 'alltoall', 'alltoall_base', 'barrier', 'bound_device_id', 'broadcast', 'comm_split_count', 'deregister_mem_pool', 'eager_connect_single_device', 'gather', 'monitored_barrier', 'name', 'options', 'perform_nocolor_split', 'rank', 'recv', 'recv_anysource', 'reduce', 'reduce_scatter', 'register_mem_pool', 'scatter', 'send', 'size', 'supports_splitting', 'uid']

After some digging I realized it was because ProcessGroupNCCL (and the other pg backends) inherit the collectives defined here, which do not include allgather_into_tensor_coalesced or reduce_scatter_tensor_coalesced. This is confusing, since we know that e.g. allgather_into_tensor_coalesced is implemented for the ProcessGroupNCCL backend.

This will likely require changes in PyTorch to enable, so we defer to a future PR.
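
A possible stopgap, sketched here only as an illustration and not part of this PR, would be to emulate the coalesced op with one allgather_into_tensor call per output/input pair. The function name below is hypothetical.

```python
# Hypothetical fallback (not in this PR): emulate allgather_into_tensor_coalesced
# by issuing one allgather_into_tensor per pair and waiting on all the work
# handles, since the coalesced binding is missing on the backend object.
import torch
import torch.distributed as dist

def allgather_into_tensor_coalesced_fallback(outputs, inputs, group=None):
    works = [
        dist.all_gather_into_tensor(out, inp, group=group, async_op=True)
        for out, inp in zip(outputs, inputs)
    ]
    for work in works:
        work.wait()
```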

Test complexity, runtime and coverage

Tests are taking longer and could benefit from more coverage.

process_group_test.py currently tests all of the collectives, often including the creation/teardown of GlooProcessGroups and NCCLProcessGroups. This takes a while, but some collectives, e.g. send/recv, need to be modeled this way to work correctly, so we added tests for both gloo and NCCL (test_gloo_send_recv, test_baby_gloo_send_recv, and test_baby_nccl_send_recv_2gpu).
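
For context, a two-rank send/recv test has roughly this shape, sketched with plain torch.distributed primitives rather than torchft's wrappers; the worker name and port below are illustrative, not the actual test code.

```python
# Rough shape of a gloo send/recv test: spawn two processes, build a group
# over a shared TCPStore, and exchange one tensor.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int, port: int) -> None:
    store = dist.TCPStore("localhost", port, world_size, rank == 0)
    dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)
    if rank == 0:
        dist.send(torch.full((4,), 42.0), dst=1)
    else:
        t = torch.zeros(4)
        dist.recv(t, src=0)
        assert torch.equal(t, torch.full((4,), 42.0))
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(_worker, args=(2, 29501), nprocs=2)  # hypothetical port
```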

This style of testing is fairly comprehensive and representative, but it's not straightforward to get this coverage on all collectives while keeping the test runtime in check. This PR alone increases the execution time of this test file from 15s to 32.66s, which is concerning.
