Adds `reduce_scatter` into `torchft` #102

allenwang28 · 2025-02-06T21:55:19Z

What does this PR do?

Partially addresses #97 by adding reduce_scatter into torchft.

Concretely, this consists of a few pieces:

- Introduces reduce_scatter into ProcessGroup, following the signature [here](https://github.com/pytorch/pytorch/blob/11f69808c64a65c68a4452250ba7719dcff27c78/torch/csrc/distributed/c10d/PyProcessGroup.hpp#L203).
- In the ProcessGroup* classes, it follows the behavior of the other collectives:
  - In ProcessGroupWrapper, it depends on the parent implementation.
  - In ProcessGroupDummy, it writes the first input into the output.
  - In ProcessGroupBaby, it asserts on the inputs and moves the underlying storage into shared memory.
- Adds ReduceScatterOptions to _PickleSafeOptions.
- Adds reduce_scatter as an option in _test_pg. This required a new function, _should_run_collective, because e.g. Gloo does not support reduce_scatter. The function takes the collective, backend, and device and mirrors the logic of the published supported-collective matrix.
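As a rough illustration of the gating described in the last item, here is a minimal sketch, not the code from this PR. The function name `should_run_collective`, the `_UNSUPPORTED` table, and its contents are assumptions for illustration; the table only encodes the Gloo-on-CPU gaps mentioned in this thread, not the full published matrix.

```python
# Hypothetical, partial table: (backend, device type) -> collectives to skip.
# Only the Gloo/CPU gaps mentioned in this PR are listed; the real
# support matrix covers more backends and collectives.
_UNSUPPORTED = {
    ("gloo", "cpu"): {"reduce_scatter", "all_to_all"},
}


def should_run_collective(collective: str, backend: str, device: str) -> bool:
    """Return False only when the table marks the combination as unsupported."""
    skipped = _UNSUPPORTED.get((backend.lower(), device), frozenset())
    return collective not in skipped
```

Backends that are absent from the table fall through and return True, so in this sketch unknown backends are always exercised.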

Tests

Presubmits, and:

$ pytest torchft/process_group_test.py 
============================================= test session starts =============================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/allencwang/workspace/torchft
configfile: pytest.ini
plugins: typeguard-2.13.3
collected 16 items                                                                                            

torchft/process_group_test.py ................                                                          [100%]

============================================= 16 passed in 31.44s =============================================
[rank0]:[W206 14:54:24.777939032 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Next steps

The logic of _should_run_collective is a bit confusing: it lets backends that are not defined in the matrix, such as ErrorSwallowing*, through in order to mimic the behavior before this change. Testing here could become unwieldy as we add more collectives, so a future step could be to refactor the tests.

One nice change could be to parameterize the tests by collective. This would make failing collectives more explicit and reduce the time it takes to run individual tests. I can likely do this in the next PR.
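A minimal sketch of what per-collective parameterization might look like, using the standard library's `unittest` and `subTest` so it stays self-contained (`pytest.mark.parametrize` would be the pytest equivalent). The `COLLECTIVES` list, `run_collective` stub, and `CollectiveTest` class are placeholders, not names from torchft.

```python
import unittest

# Hypothetical list of collectives under test; the names are placeholders.
COLLECTIVES = ["allreduce", "allgather", "broadcast", "reduce_scatter"]


def run_collective(name: str) -> str:
    """Stand-in for issuing a collective on a process group (assumption)."""
    return f"{name}:ok"


class CollectiveTest(unittest.TestCase):
    def test_each_collective(self) -> None:
        for name in COLLECTIVES:
            # Each collective is reported separately, so a failure names
            # the collective that broke rather than only the test method.
            with self.subTest(collective=name):
                self.assertEqual(run_collective(name), f"{name}:ok")


def run_suite() -> unittest.TestResult:
    """Load and run the test case programmatically."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CollectiveTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With `subTest`, all collectives still run inside one test method; a parametrized pytest test would additionally allow selecting a single collective from the command line.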


d4l3k · 2025-02-07T00:35:46Z

torchft/process_group_test.py

+                return True
+            return False
+        else:  # cpu
+            if collective_str in ["reduce_scatter", "all_to_all"]:


oh wow -- didn't realize we don't support these on Gloo, good to know! cc @c-p-i-o

ye, we miss many APIs on Gloo.

this approach seems nice and explicit. but is it possible to instead just try: the test, and except: some specific NYI error? (i'm not sure if we raise a consistent type of NYI exception from backends?)
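For reference, the try/except idea could look roughly like the sketch below. The helper name `run_or_skip` is hypothetical, and the sketch assumes an unsupported collective surfaces as `NotImplementedError`, which, as noted above, may not hold consistently across backends.

```python
import unittest


def run_or_skip(test: unittest.TestCase, collective_fn, *args):
    """Run a collective; skip the test if it is reported as unimplemented.

    Assumes the backend raises NotImplementedError, which may not be
    true for every backend.
    """
    try:
        return collective_fn(*args)
    except NotImplementedError as e:
        # skipTest raises unittest.SkipTest, which marks the test as skipped.
        test.skipTest(f"collective not implemented: {e}")
```

The trade-off against an explicit support check is that a skip here can also hide a collective that was expected to work but regressed into raising the same exception.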

d4l3k · 2025-02-07T00:36:18Z

torchft/process_group_test.py

+        device = example_tensor.device
+        if type(device) is torch.device:
+            device = device.type
+    except NotImplementedError as e:


why are we getting a NotImplementedError? which backend is this?

This is just for the ErrorSwallowingProcessGroupWrapper. I have a follow up PR to refactor the tests to get rid of this entire function though!

d4l3k

LGTM, thanks for adding this!

allenwang28 added 2 commits February 6, 2025 12:39

initial commit for reduce_scatter

d076a54

fixes reduce_scatter function signature, refactors test and adds reduce_scatter test

a425493

facebook-github-bot added the CLA Signed label Feb 6, 2025

fixes test

5190414

d4l3k reviewed Feb 7, 2025


allenwang28 marked this pull request as ready for review February 7, 2025 16:55

d4l3k approved these changes Feb 7, 2025


allenwang28 mentioned this pull request Feb 7, 2025

Refactors process_group_tests.py #103

Open
