
Refactors process_group_tests.py #103

Open
wants to merge 8 commits into main
Conversation


@allenwang28 commented Feb 7, 2025

What does this PR do?

As part of #97, this PR refactors process_group_test:

  • Renames _test_pg to run_collectives and extends it to accept a given list of collectives by name.
  • Breaks up ProcessGroupTest into three test classes: GlooTest, NCCLTests, and DummyTests:
    • GlooTest runs every test case with the Gloo backend, NCCLTests with NCCL, etc.
    • This allows some niceties, like marking once that all NCCL tests should be skipped.
  • Adds shutdown() calls, garbage collection, etc. to avoid extraneous messages and warnings like:
Traceback (most recent call last):
  File "/home/allencwang/workspace/torchft/torchft/process_group.py", line 824, in _future_handler
    cmd = future_queue.get(timeout=timedelta(seconds=10.0))
  File "/home/allencwang/workspace/torchft/torchft/multiprocessing.py", line 45, in get
    raise RuntimeError(f"process is not alive {self._p.exitcode}")
RuntimeError: process is not alive -15
[rank0]:[W207 10:50:12.933128109 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Why is this needed?

As part of #102, I noticed that there were some mismatches between which collectives ran on which backends (matrix is here). This logical grouping of tests by backend therefore lets us define explicitly which collectives should be tested on each backend.
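For context, a minimal sketch of the structure this description implies. Everything here is illustrative: only run_collectives, the backend-grouped test classes, and shutdown() come from this PR; the setup helper, argument construction, and collective list are assumptions.

from typing import List
from unittest import TestCase, skipUnless

import torch


def run_collectives(pg: object, collectives: List[str]) -> None:
    """Invoke each named collective on ``pg`` and wait for it to finish.

    The real helper builds per-collective arguments; the single-tensor call
    below is only a placeholder to show the dispatch-by-name idea.
    """
    for name in collectives:
        tensor = torch.rand(2, 3)
        work = getattr(pg, name)([tensor])  # argument construction is assumed
        work.wait()


@skipUnless(torch.cuda.is_available(), "NCCL tests require a GPU")
class NCCLTests(TestCase):
    """Grouping by backend lets one decorator skip every NCCL case at once."""

    def test_collectives(self) -> None:
        pg = self._make_nccl_pg()  # hypothetical setup helper
        try:
            run_collectives(pg, ["allreduce", "broadcast"])
        finally:
            pg.shutdown()  # per the PR: avoids "process is not alive" noise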

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 7, 2025
@allenwang28 allenwang28 marked this pull request as ready for review February 7, 2025 20:28
@@ -391,18 +394,19 @@ def test_error_swallowing_process_group_wrapper(self) -> None:
     wrapper = ErrorSwallowingProcessGroupWrapper(pg)
     self.assertIs(wrapper.parent, pg)

-    works = _test_pg(wrapper)
+    works = run_collective(pg=wrapper, collective="allreduce")
Member

this seems like a pretty big decrease in coverage?

Author

Added functionality back

shape: torch.Size = example_tensor.shape
dtype: torch.dtype = example_tensor.dtype
coll = getattr(pg, collective)
args_list = _build_args(pg=pg, collective=collective, example_tensor=example_tensor)
Member

What's the intention behind pulling this out? I'm not really convinced that this makes it all that much cleaner

In some ways I think I'd prefer if we got rid of the arg generation and instead flatten this out with direct calls i.e.

if collective == "allreduce":
    work = pg.allreduce(...)

work.wait()
...

Author

I agree; I've removed the arg generation and inlined it in run_collective
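For illustration, the flattened dispatch might end up looking roughly like this (a sketch only; the exact argument construction and the set of supported collectives in the PR may differ):

import torch
from torch.distributed import AllreduceOptions, BroadcastOptions


def run_collective(pg, collective: str, example_tensor: torch.Tensor) -> None:
    # Direct per-collective calls, as suggested above, instead of a
    # separate argument-generation helper.
    tensor = example_tensor.clone()
    if collective == "allreduce":
        work = pg.allreduce([tensor], AllreduceOptions())
    elif collective == "broadcast":
        opts = BroadcastOptions()
        opts.rootRank = 0
        work = pg.broadcast([tensor], opts)
    else:
        raise ValueError(f"unsupported collective: {collective}")
    work.wait()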

Comment on lines +176 to +178
pg = ProcessGroupBabyNCCL(timeout=timedelta(seconds=10))
try:
pg.configure(self.store_addr, 0, 1)
Member

This seems really slow -- how fast does this run? Launching the subprocess is pretty slow so would actually prefer to run these all on the same PG

If you want prettier printing we can use subtests?

i.e.

for collective in collectives:
    with self.subTest(collective=collective):
        ...

Author

@allenwang28 Feb 7, 2025

Good callout, with parameterized it took ~36s, without it took ~16s. I've removed parameterized.
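So the test ends up looping over collectives on a single process group, roughly like this (a sketch assuming the run_collective helper discussed above; the merged code may differ in details):

from datetime import timedelta
from unittest import TestCase

from torchft.process_group import ProcessGroupBabyNCCL


class NCCLTests(TestCase):
    store_addr: str  # populated in setUp() in the real test

    def test_baby_nccl_collectives(self) -> None:
        # Configure the subprocess-backed PG once and reuse it for every
        # collective; subTest keeps the per-collective reporting readable.
        pg = ProcessGroupBabyNCCL(timeout=timedelta(seconds=10))
        try:
            pg.configure(self.store_addr, 0, 1)
            for collective in ("allreduce", "broadcast"):
                with self.subTest(collective=collective):
                    run_collective(pg=pg, collective=collective)
        finally:
            pg.shutdown()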
