
Refactors process_group_tests.py #103

Open
wants to merge 8 commits into main
Conversation


@allenwang28 commented Feb 7, 2025

What does this PR do?

As part of #97, this PR refactors process_group_test:

  • Renames _test_pg to run_collectives and extends it to accept a given list of collectives by name.
  • Breaks up ProcessGroupTest into three test classes: GlooTest, NCCLTests, and DummyTests:
    • GlooTest runs every test case with the Gloo backend, NCCLTests with NCCL, etc.
    • This allows some niceties, like marking once that all NCCL tests should be skipped.
  • Adds shutdown() calls, garbage collection, etc. to avoid extraneous messages and warnings like:
Traceback (most recent call last):
  File "/home/allencwang/workspace/torchft/torchft/process_group.py", line 824, in _future_handler
    cmd = future_queue.get(timeout=timedelta(seconds=10.0))
  File "/home/allencwang/workspace/torchft/torchft/multiprocessing.py", line 45, in get
    raise RuntimeError(f"process is not alive {self._p.exitcode}")
RuntimeError: process is not alive -15
[rank0]:[W207 10:50:12.933128109 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Why is this needed?

As part of #102, I noticed that there were some mismatches between which collectives ran on which backends (matrix is here). This logical grouping of tests by backend therefore lets us define explicitly which collectives should be tested on each backend.
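For context, a minimal sketch of the structure this description implies. Everything here is illustrative: only run_collectives, the backend-grouped test classes, and shutdown() come from this PR; the setup helper, argument construction, and collective list are assumptions.

from typing import List
from unittest import TestCase, skipUnless

import torch


def run_collectives(pg: object, collectives: List[str]) -> None:
    """Invoke each named collective on ``pg`` and wait for it to finish.

    The real helper builds per-collective arguments; the single-tensor call
    below is only a placeholder to show the dispatch-by-name idea.
    """
    for name in collectives:
        tensor = torch.rand(2, 3)
        work = getattr(pg, name)([tensor])  # argument construction is assumed
        work.wait()


@skipUnless(torch.cuda.is_available(), "NCCL tests require a GPU")
class NCCLTests(TestCase):
    """Grouping by backend lets one decorator skip every NCCL case at once."""

    def test_collectives(self) -> None:
        pg = self._make_nccl_pg()  # hypothetical setup helper
        try:
            run_collectives(pg, ["allreduce", "broadcast"])
        finally:
            pg.shutdown()  # per the PR: avoids "process is not alive" noise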

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 7, 2025
@allenwang28 allenwang28 marked this pull request as ready for review February 7, 2025 20:28
@@ -391,18 +394,19 @@ def test_error_swallowing_process_group_wrapper(self) -> None:
     wrapper = ErrorSwallowingProcessGroupWrapper(pg)
     self.assertIs(wrapper.parent, pg)

-    works = _test_pg(wrapper)
+    works = run_collective(pg=wrapper, collective="allreduce")
Member

this seems like a pretty big decrease in coverage?

Author

Added functionality back

shape: torch.Size = example_tensor.shape
dtype: torch.dtype = example_tensor.dtype
coll = getattr(pg, collective)
args_list = _build_args(pg=pg, collective=collective, example_tensor=example_tensor)
Member

What's the intention behind pulling this out? I'm not really convinced that this makes it all that much cleaner

In some ways I think I'd prefer if we got rid of the arg generation and instead flatten this out with direct calls i.e.

if collective == "allreduce":
    work = pg.allreduce(...)

work.wait()
...

Author

I agree; I've removed the arg generation and inlined it in run_collective
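For illustration, the flattened dispatch might end up looking roughly like this (a sketch only; the exact argument construction and the set of supported collectives in the PR may differ):

import torch
from torch.distributed import AllreduceOptions, BroadcastOptions


def run_collective(pg, collective: str, example_tensor: torch.Tensor) -> None:
    # Direct per-collective calls, as suggested above, instead of a
    # separate argument-generation helper.
    tensor = example_tensor.clone()
    if collective == "allreduce":
        work = pg.allreduce([tensor], AllreduceOptions())
    elif collective == "broadcast":
        opts = BroadcastOptions()
        opts.rootRank = 0
        work = pg.broadcast([tensor], opts)
    else:
        raise ValueError(f"unsupported collective: {collective}")
    work.wait()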

Comment on lines +176 to +178
pg = ProcessGroupBabyNCCL(timeout=timedelta(seconds=10))
try:
pg.configure(self.store_addr, 0, 1)
Member

This seems really slow -- how fast does this run? Launching the subprocess is pretty slow so would actually prefer to run these all on the same PG

If you want prettier printing we can use subtests?

i.e.

for collective in collectives:
    with self.subTest(collective=collective):
        ...

Author

@allenwang28 Feb 7, 2025

Good callout, with parameterized it took ~36s, without it took ~16s. I've removed parameterized.
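So the test ends up looping over collectives on a single process group, roughly like this (a sketch assuming the run_collective helper discussed above; the merged code may differ in details):

from datetime import timedelta
from unittest import TestCase

from torchft.process_group import ProcessGroupBabyNCCL


class NCCLTests(TestCase):
    store_addr: str  # populated in setUp() in the real test

    def test_baby_nccl_collectives(self) -> None:
        # Configure the subprocess-backed PG once and reuse it for every
        # collective; subTest keeps the per-collective reporting readable.
        pg = ProcessGroupBabyNCCL(timeout=timedelta(seconds=10))
        try:
            pg.configure(self.store_addr, 0, 1)
            for collective in ("allreduce", "broadcast"):
                with self.subTest(collective=collective):
                    run_collective(pg=pg, collective=collective)
        finally:
            pg.shutdown()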
