
Adds more collectives to ProcessGroups #108

Draft · wants to merge 15 commits into main
Conversation

allenwang28
Contributor

What does this PR do?

Continuation of work for #97.

This PR:

  • Adds more collectives (a usage sketch follows this list):
    • allreduce_coalesced
    • alltoall_base
    • barrier
    • reduce_scatter
    • send/recv
  • Extends process_group_test.py to accommodate the above collectives
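
For readers unfamiliar with these ops, here is a minimal sketch of their semantics using plain torch.distributed on a single-rank gloo group rather than torchft's ProcessGroup wrappers. This is illustrative only, not the PR's test code; the collectives that require more than one rank (reduce_scatter, send/recv) are omitted here.

```python
# Minimal sketch (not the PR's test code) exercising the new collectives
# through plain torch.distributed on a single-rank gloo group.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")  # hypothetical free port
dist.init_process_group("gloo", rank=0, world_size=1)

# allreduce_coalesced: reduce a list of tensors in a single call
tensors = [torch.ones(2), torch.ones(3)]
dist.all_reduce_coalesced(tensors, op=dist.ReduceOp.SUM)

# alltoall_base: exchange equal-sized chunks between ranks
out = torch.empty(4)
dist.all_to_all_single(out, torch.arange(4.0))

# barrier: block until every rank arrives
dist.barrier()

dist.destroy_process_group()
```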

Concerns and possible follow up actions

Missing ops in backends

Notably, allgather_into_tensor_coalesced and reduce_scatter_tensor_coalesced were not added in this PR.

They were part of the plan, but I started seeing errors like:

E       AttributeError: 'torch._C._distributed_c10d.ProcessGroupNCCL' object has no attribute 'allgather_into_tensor_coalesced'

This was confusing, but sure enough it's true; if we run dir(pg) we see:

['NCCLConfig', 'Options', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_add_ephemeral_timeout', '_allgather_base', '_end_coalescing', '_get_backend_name', '_get_sequence_number_for_group', '_group_end', '_group_start', '_is_initialized', '_pybind11_conduit_v1_', '_reduce_scatter_base', '_set_default_timeout', '_set_sequence_number_for_group', '_shutdown', '_start_coalescing', '_verify_work_timeout', 'abort', 'allgather', 'allgather_coalesced', 'allreduce', 'allreduce_coalesced', 'alltoall', 'alltoall_base', 'barrier', 'bound_device_id', 'broadcast', 'comm_split_count', 'deregister_mem_pool', 'eager_connect_single_device', 'gather', 'monitored_barrier', 'name', 'options', 'perform_nocolor_split', 'rank', 'recv', 'recv_anysource', 'reduce', 'reduce_scatter', 'register_mem_pool', 'scatter', 'send', 'size', 'supports_splitting', 'uid']

After some digging I realized it was because ProcessGroupNCCL (and the other pg backends) inherit the collectives defined here, which do not include allgather_into_tensor_coalesced or reduce_scatter_tensor_coalesced. This is confusing, since we know that e.g. allgather_into_tensor_coalesced is implemented for the ProcessGroupNCCL backend.

This will likely require changes in PyTorch to enable, so we defer to a future PR.
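
A possible stopgap, sketched here only as an illustration and not part of this PR, would be to emulate the coalesced op with one allgather_into_tensor call per output/input pair. The function name below is hypothetical.

```python
# Hypothetical fallback (not in this PR): emulate allgather_into_tensor_coalesced
# by issuing one allgather_into_tensor per pair and waiting on all the work
# handles, since the coalesced binding is missing on the backend object.
import torch
import torch.distributed as dist

def allgather_into_tensor_coalesced_fallback(outputs, inputs, group=None):
    works = [
        dist.all_gather_into_tensor(out, inp, group=group, async_op=True)
        for out, inp in zip(outputs, inputs)
    ]
    for work in works:
        work.wait()
```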

Test complexity, runtime and coverage

Tests are taking longer and could benefit from more coverage.

process_group_test.py currently tests all of the collectives, often including the creation/teardown of GlooProcessGroups and NCCLProcessGroups. This takes a while, but some collectives, e.g. send/recv, need to be modeled this way to work correctly, so we added tests for both gloo and NCCL (test_gloo_send_recv, test_baby_gloo_send_recv, and test_baby_nccl_send_recv_2gpu).
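
For context, a two-rank send/recv test has roughly this shape, sketched with plain torch.distributed primitives rather than torchft's wrappers; the worker name and port below are illustrative, not the actual test code.

```python
# Rough shape of a gloo send/recv test: spawn two processes, build a group
# over a shared TCPStore, and exchange one tensor.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int, port: int) -> None:
    store = dist.TCPStore("localhost", port, world_size, rank == 0)
    dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)
    if rank == 0:
        dist.send(torch.full((4,), 42.0), dst=1)
    else:
        t = torch.zeros(4)
        dist.recv(t, src=0)
        assert torch.equal(t, torch.full((4,), 42.0))
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(_worker, args=(2, 29501), nprocs=2)  # hypothetical port
```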

This style of testing is fairly comprehensive and representative, but it's not straightforward to get this coverage on all collectives while keeping the test runtime in check. This PR alone increases the execution time of this test file from 15s to 32.66s, which is concerning.
