Question about Distributed Muon #16
Hi @Niccolo-Ajroldi, this is a brilliant question and thanks for pointing it out! Let me try to explain it here:

1. DP-based ZeRO-1 Distributed Optimizer

a. Non-Distributed Optimizer
DP0: params P0-P5, momentum M0-M5, master weights MP0-MP5, gradients G0_dp0-G5_dp0
Since the optimizer is non-distributed, P, M, and MP are identical and kept in sync on every rank. Gradients differ because each DP rank consumed different data, so they need communication (e.g. an all-reduce) to sync. Since all matrices are full, every rank can directly perform the update, but M and MP are kept in full everywhere and cost a lot of memory.

b. ZeRO-1
DP0: params P0-P5, gradients G0_dp0-G5_dp0, momentum M0, M1, M2 (1st half), master weights MP0, MP1, MP2 (1st half)
You might think that every optimizer state tensor (M and MP) is sharded across DP individually. It is not. ZeRO-1 actually first concatenates all of them into a list, flattens it, and then shards the flat buffer across DP ranks (see the sketch after this list).

2. Adapting ZeRO-1 to Distributed Muon

a. Operations
DP0:
DP1:

b. Analysis
Some friends suggested an even smarter partitioning that avoids splitting a single matrix. If you handle this carefully, the communication ratio will be close to 1.0.
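To make the flatten-then-shard point above concrete, here is a minimal, hypothetical PyTorch sketch (not the Megatron-LM code, just an illustration of the idea): all state tensors are concatenated into one flat buffer and each DP rank takes a contiguous slice, so a matrix is only split where a slice boundary happens to fall.

```python
# Minimal sketch (not the Megatron-LM implementation) of ZeRO-1 style
# flatten-then-shard partitioning: concatenate all state tensors into one
# flat buffer, then give each DP rank a contiguous slice of that buffer.
import torch

def zero1_shard(tensors, dp_world_size, dp_rank):
    """Return this rank's contiguous slice of the flattened, padded buffer."""
    flat = torch.cat([t.reshape(-1) for t in tensors])
    shard_len = (flat.numel() + dp_world_size - 1) // dp_world_size
    pad = shard_len * dp_world_size - flat.numel()
    flat = torch.nn.functional.pad(flat, (0, pad))
    return flat[dp_rank * shard_len : (dp_rank + 1) * shard_len]

# Hypothetical example with 2 DP ranks: P0 has 4 elements, P1 has 6.
p0, p1 = torch.randn(2, 2), torch.randn(2, 3)
shard_dp0 = zero1_shard([p0, p1], dp_world_size=2, dp_rank=0)  # all of P0 + 1st element of P1
shard_dp1 = zero1_shard([p0, p1], dp_world_size=2, dp_rank=1)  # remaining 5 elements of P1
# Only the matrix that straddles a slice boundary (P1 here) gets split,
# which is why DP0 above ends up with "M0, M1, M2 (1st half)" rather than
# a slice of every single matrix.
```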
See a proof-of-concept implementation and more discussion here: NVIDIA/Megatron-LM#1428
Amazing answer, love it! My main misunderstanding was that I thought that under ZeRO-1 every optimizer state is sharded across all DP ranks, but as you pointed out, it's not. Your comment makes it very clear, thank you!
Thanks a lot for your interest, and this is a great question! This has been discussed a bit on X as well. Your understanding isn't really a misunderstanding, as it can be correct under some parallelism settings. It's just that we found Megatron-LM's ZeRO-1 design to be perfectly suited to Distributed Muon!
This is more of a question than an issue.
Question
I can't really wrap my head around how DP Gather incurs a communication cost lower than a classical AllGather op.
My understanding is that if each parameter weight matrix is sharded across all the devices (in ZeRO-1 style), then each device has to collect the remaining shards from the other devices, hence resorting to an AllGather call.
Here is a small example:
Consider a model with 2 parameter weight matrices: p1, p2 with corresponding moments m1, m2 and grads g1, g2
We shard the optimizer state across 2 devices. After ReduceScatter on the gradients and after applying momentum, each device holds:
DP1: p11, p21, m11, m21, g'11, g'21
DP2: p12, p22, m12, m22, g'12, g'22
where pij is the j-th shard of parameter i, the optimizer state consists of p and m, and g' denotes the gradient after the momentum update (following Algorithm 1 notation)
In order to perform Newton-Schulz and compute the updates for p11 and p21, DP1 needs to collect all the remaining gradient shards. Since it already has g'11 and g'21, it will send those to DP2 and receive g'12 and g'22 from it. How is this different from a normal AllGather op?
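For concreteness, here is a minimal, hypothetical torch.distributed sketch of the communication pattern described in this example (per-matrix sharding followed by gathering every gradient shard); it illustrates what the question assumes, not necessarily what Distributed Muon actually does.

```python
# Hypothetical sketch of the pattern described in the question: each DP rank
# holds one shard of every momentum-updated gradient and all-gathers the rest
# so it can run Newton-Schulz on the full matrix. Assumes an initialized
# torch.distributed process group.
import torch
import torch.distributed as dist

def gather_full_grad(local_shard: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather the shards of one gradient matrix across the DP group."""
    world_size = dist.get_world_size(group)
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard, group=group)
    return torch.cat(shards, dim=0)  # e.g. g'1 = [g'11; g'12]

# In the 2-device example: DP1 holds g'11 and g'21, DP2 holds g'12 and g'22.
# Calling gather_full_grad on each shard gives every rank the full g'1 and
# g'2 before Newton-Schulz, i.e. a plain AllGather per parameter -- which is
# exactly the cost the question is asking about.
```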
Any help in understanding is greatly appreciated!