
ring-based reduce-scatter pipelining, ATen implementation #2950

Open
wants to merge 27 commits into main
Conversation

@samnordmann (Collaborator) commented Sep 17, 2024

ATen implementation of a GEMM+Reduce-Scatter operation, with the Reduce-Scatter decomposed into a ring algorithm. This reproduces a technique used in TransformerEngine to achieve comm/compute overlap in Megatron.

Experiment

The following Nsight profile clearly shows that we achieve overlap:
(screenshot: Nsight Systems profile, 2024-09-18)
setup: DGX 8*V100 32GB
params: backend=NCCL with coalescence, M=K=N=2048, S=8, number_of_streams=3

Arbitrary number of steps

The algorithm we provide is slightly more general than the classical one: it allows decomposing the reduce-scatter into an arbitrarily large number of steps, not only num_devices_ steps as in the classical algorithm. More precisely, the parameter S (the number of steps, i.e., the number of interleaved comms and computes), which is classically equal to num_devices_ for the ring algorithm, is only required to be a multiple of num_devices_ in our version. This parameter is important because it gives more flexibility in the size of the interleaved chunks and could therefore lead to better overlap -- but a thorough perf analysis remains to be done. If S > num_devices_, only a fraction num_devices_/S of each classical chunk (i.e., 1/S of the buffer) is computed and communicated to the peers at each iteration.
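To make the schedule concrete, here is a minimal Python sketch using the torch.distributed front-end rather than this PR's ATen/c10d C++ code. It assumes a row-parallel GEMM whose partial results are reduce-scattered along the output rows; the function name, the chunk indexing, and the chunk-to-rank assignment (rank r keeps chunk r) are illustrative assumptions, not the PR's actual layout.

```python
# Minimal sketch (not this PR's ATen implementation) of a pipelined
# GEMM + ring reduce-scatter with S = k * num_devices steps.
# Launch with: torchrun --nproc_per_node=<num_devices> sketch.py
import os

import torch
import torch.distributed as dist


def pipelined_gemm_reduce_scatter(a, b, S):
    """a: [M, K_local], b: [K_local, N]. Mathematically equivalent to
    reduce-scattering sum_r(a_r @ b_r) along rows; rank r keeps chunk r."""
    D, r = dist.get_world_size(), dist.get_rank()
    M, N = a.shape[0], b.shape[1]
    assert S % D == 0 and M % S == 0
    k, rows = S // D, M // S                  # sub-chunks per chunk, rows per step
    nxt, prv = (r + 1) % D, (r - 1) % D
    out = torch.empty(M // D, N, device=a.device, dtype=a.dtype)

    def partial(chunk, sub):                  # local GEMM for one output sub-chunk
        row0 = (chunk * k + sub) * rows
        return a[row0:row0 + rows] @ b

    for j in range(k):                        # pipeline each sub-chunk around the ring
        nxt_partial, recv = partial((r - 1) % D, j), None
        for t in range(D):
            acc = nxt_partial if recv is None else recv + nxt_partial
            if t == D - 1:                    # acc is now the fully reduced sub-chunk
                out[j * rows:(j + 1) * rows] = acc
                break
            # Coalesced send/recv: with NCCL this is grouped between
            # ncclGroupStart/ncclGroupEnd and progressed as a single work.
            recv = torch.empty_like(acc)
            works = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, acc, nxt),
                dist.P2POp(dist.irecv, recv, prv),
            ])
            # Enqueue the GEMM slice needed at the next step *before* waiting,
            # so it overlaps with the communication on the GPU.
            nxt_partial = partial((r - t - 2) % D, j)
            for w in works:
                w.wait()
    return out


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    M = K = N = 2048
    S = 2 * dist.get_world_size()             # any multiple of num_devices
    a = torch.randn(M, K, device="cuda")
    b = torch.randn(K, N, device="cuda")
    out = pipelined_gemm_reduce_scatter(a, b, S)
    # Reference: unpipelined GEMM followed by a fused reduce-scatter.
    ref = torch.empty_like(out)
    dist.reduce_scatter_tensor(ref, a @ b)
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-1)
    dist.destroy_process_group()
```

The key point is that the GEMM slice needed at step t+1 is enqueued before waiting on step t's coalesced send/recv, which is what produces the overlap visible in the profile above.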

Coalescing

NCCL

ProcessGroupNCCL provides two methods, startCoalescing and endCoalescing, which internally correspond to ncclGroupStart and ncclGroupEnd (see the NCCL group-call documentation). These calls group p2p operations so that they are progressed together: a single work handle, returned by endCoalescing, is progressed for the whole group. This has the following main advantages:

  • calls are progressed concurrently;
  • since NICs are bidirectional, a send and a recv call need to be coalesced to achieve full bandwidth;
  • if not coalesced, we can easily reach a deadlock when the send/recv pairs are not ordered correctly across ranks, e.g.:

    rank 0:
        send to rank 1
        recv from rank 1

    rank 1:
        send to rank 0
        recv from rank 0

    This deadlocks because neither rank can post its recv before its send completes, and each send is waiting for the other rank's recv.

It is in general preferable to coalesce send/recv calls. The only drawback is that we lose fine-grained control over synchronization: we can only synchronize with the coalesced communication as a whole.
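For illustration, here is a minimal sketch of such a coalesced exchange from the Python side (an assumption of this write-up, not code from this PR): dist.batch_isend_irecv groups the two p2p ops, which for the NCCL backend goes through the startCoalescing/endCoalescing path, so both ranks can issue their ops in the same order without deadlocking and can only wait on the group as a whole.

```python
import torch
import torch.distributed as dist


def coalesced_ring_exchange(t_send, next_rank, prev_rank):
    """Send t_send to next_rank and receive from prev_rank as one coalesced group."""
    t_recv = torch.empty_like(t_send)
    works = dist.batch_isend_irecv([
        dist.P2POp(dist.isend, t_send, next_rank),
        dist.P2POp(dist.irecv, t_recv, prev_rank),
    ])
    # Coarse-grained synchronization only: we wait on the grouped
    # communication as a whole, not on the individual send or recv.
    for w in works:
        w.wait()
    return t_recv
```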

Remark:

note that NCCL does not support tags in send/recv.

UCC

ProcessGroupUCC does not implement coalesced groups for now. Full bidirectional bandwidth should still be achievable with two non-blocking send/recv calls, but batching more than two ops will be suboptimal. Adding coalescing to UCC was discussed but has not been added to the plan of record (POR) for now, for lack of a good use case.
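For comparison, a sketch of the same exchange without coalescing: posting both non-blocking ops before waiting on either avoids the ordering deadlock and lets the two directions proceed concurrently, which is why two send/recv calls should still reach full bidirectional bandwidth on a backend without coalesced groups (whether ucc is the configured backend is an assumption here).

```python
import torch
import torch.distributed as dist


def uncoalesced_ring_exchange(t_send, next_rank, prev_rank):
    """Post both non-blocking p2p ops first, then wait: no coalescing needed
    and no ordering deadlock, since neither call blocks at issue time."""
    t_recv = torch.empty_like(t_send)
    send_req = dist.isend(t_send, dst=next_rank)
    recv_req = dist.irecv(t_recv, src=prev_rank)
    send_req.wait()
    recv_req.wait()
    return t_recv
```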

Reducing memory footprint

Further optimizations, which we leave as TODOs, could reduce the footprint of the work buffers src_buffer_ and dst_buffer_:

  • One slice can be shared between src and dst, corresponding to the "local" send/recv at the last iteration.
  • It is probably possible to reduce the outermost dimension to the number of streams.

@samnordmann (Collaborator, Author)
!build

@samnordmann (Collaborator, Author)
!build

@samnordmann (Collaborator, Author)
I am seeing issues with UCC, both locally and in the CI. I need to investigate. In the meantime, I am disabling UCC from being tested.

@samnordmann (Collaborator, Author)
!build

@cowanmeg (Collaborator)
Not needed for this PR, but can we run this with larger sizes and compare it against a matmul + reduce-scatter baseline? I'm curious whether we see any time savings.
