ring-based reduce-scatter pipelining, ATen implementation #2950
Open
samnordmann wants to merge 27 commits into NVIDIA:main from samnordmann:host_irs/ring_reducescatter_ATen
Conversation
I am seeing issues with UCC, both locally and in the CI. I need to investigate. In the meantime, I am disabling UCC from being tested.
Not needed for this PR, but can we run this with larger sizes and compare it against a matmul + reduce-scatter baseline? I'm curious whether we can see any time savings.
ATen implementation of a GEMM+Reduce-scatter operation, with a decomposition of the Reduce-scatter into a ring algorithm. This reproduces a technique present in TransformerEngine for achieving comm/compute overlap in Megatron.
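As a rough illustration of the ring decomposition (a pure-Python sketch of the classical algorithm, not the ATen implementation; all names here are ours), a reduce-scatter over `D` devices can be expressed as `D - 1` ring steps in which each rank forwards a partial sum to its neighbor and reduces the chunk it receives -- which is what lets each step's communication be overlapped with the matching compute chunk:

```python
# Single-process simulation of a ring reduce-scatter over D "ranks".
# inputs[r][c] is rank r's value for chunk c (a scalar here, a tensor chunk
# in the real algorithm). After D - 1 steps, rank r owns the fully reduced
# chunk (r + 1) % D. Illustrative only -- not the nvFuser/ATen code.

def ring_reduce_scatter(inputs):
    D = len(inputs)
    acc = [list(row) for row in inputs]  # running partial sums per rank
    for s in range(D - 1):
        # Snapshot sends so all ranks exchange "simultaneously" at this step.
        sends = [(r, (r - s) % D, acc[r][(r - s) % D]) for r in range(D)]
        for src, chunk, val in sends:
            acc[(src + 1) % D][chunk] += val  # recv + local reduce at the neighbor
    return [acc[r][(r + 1) % D] for r in range(D)]
```

For example, with `D = 3` and `inputs[r][c] = 10*r + c`, each rank ends up holding the sum over ranks of one chunk.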
Experiment
The following Nsight profile clearly shows that we achieve overlap:
Setup: DGX 8×V100 32GB
Params: backend=NCCL with coalescing, M=K=N=2048, S=8, number_of_streams=3
Arbitrary number of steps
The algorithm we provide is slightly more general than the classical case: it allows decomposing the reduce-scatter into an arbitrarily large number of steps, not only `num_devices_` steps as in the classical algorithm. More precisely, the parameter `S` (which stands for the number of steps, i.e., the number of interleaved comms and computes), classically equal to `num_devices_` for the ring algorithm, is only assumed to be a multiple of `num_devices_` in our version. This is an important parameter, as it gives more flexibility in the size of the interleaved chunks and could therefore lead to better overlap -- but a thorough perf analysis remains to be done. If `S > num_devices_`, each classical chunk is split into `S / num_devices_` pieces, and only one such piece of the buffer is computed and communicated to the peers at each iteration.
Coalescing
NCCL
ProcessGroupNCCL provides two methods, `startCoalescing` and `endCoalescing`, which internally correspond to `ncclGroupStart` and `ncclGroupEnd` (see the doc here). Those calls group p2p calls that need to be progressed together: a single global work handle, returned by `endCoalescing`, needs to be progressed. Consider, for example, the schedule where each rank first posts its send and then its recv:
rank0: send to rank1, then recv from rank1
rank1: send to rank0, then recv from rank0
Without coalescing, this situation creates a deadlock because no rank can receive before it has sent.
It is in general preferable to coalesce send/recv calls. The only drawback is that we lose fine-grained control over synchronicity; in other words, we can only synchronize with the bulked communication as a whole.
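The deadlock can be reproduced with a toy model of blocking p2p calls (our own simplification, not NCCL itself: a send/recv pair completes only when both ranks are blocked on the matching ops, which approximates non-coalesced sends that cannot complete eagerly):

```python
# Toy deadlock detector for blocking point-to-point schedules (not NCCL).
# programs[r] lists rank r's ops in order: ('send', peer) or ('recv', peer).
# A pair completes only when both ranks sit on matching head ops.

def deadlocks(programs):
    pc = [0] * len(programs)  # each rank's current op index
    progressed = True
    while progressed:
        progressed = False
        for r, prog in enumerate(programs):
            if pc[r] >= len(prog):
                continue
            op, peer = prog[pc[r]]
            match = ('recv', r) if op == 'send' else ('send', r)
            if pc[peer] < len(programs[peer]) and programs[peer][pc[peer]] == match:
                pc[r] += 1   # both sides of the pair complete together
                pc[peer] += 1
                progressed = True
    # Deadlock if any rank still has pending ops once nothing can progress.
    return any(pc[r] < len(prog) for r, prog in enumerate(programs))
```

Two ranks that both post send before recv never make progress, while reordering one side (or, with the real API, grouping the pair between `startCoalescing`/`endCoalescing`) unblocks the exchange.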
Remark: note that NCCL doesn't support tags in send/recv.
UCC
ProcessGroupUCC does not implement coalesced groups for now. It should not be a problem to achieve full bidirectional bandwidth with two send/recv calls, though; however, having more than two ops in a batch will be suboptimal. Adding UCC coalescing was discussed but not added to the POR for now, for lack of a good use case.
Reducing memory footprint
Further optimizations, which we leave as TODOs, are possible to reduce the footprint of the work buffers `src_buffer_` and `dst_buffer_`: