redesign for faster cpu/gpu synch #1869

Open · awni wants to merge 17 commits into main from faster_cpu_gpu_synch
Conversation

@awni (Member) commented Feb 15, 2025

Notes on important changes:

  • All eval_cpu implementations are now logically split in two parts and look a lot like eval_gpu:
    • Setup, which allocates the output and sets flags, strides, etc.
    • Dispatch, which runs the kernel. For now this just puts a task on the stream's thread; in the future we may make it multi-threaded.
    • Like the GPU, the CPU task now registers inputs and outputs with the command encoder via add_input_array and add_output_array.
    • The CPU task is also required to register temporaries with add_temporary.
    (See the sketch after this list for the rough shape of an eval_cpu.)
  • All task setup for any stream (CPU/GPU/etc.) is now done on the main thread. Concretely, eval_cpu and eval_gpu are now called on the main thread. It didn't make much sense from a perf standpoint to have sub-threads for this, and it was overly complicated from a synchronization standpoint (one would need to synchronize for both setup and completion). Furthermore, this really simplifies things like thread safety of the eval moving forward. I think it's overall a nice change. In fact, there is now some room to pipeline CPU setup with CPU work, so there's potential for speedups there.
  • Use fence for synchronization within an eval across streams. Use event for synchronization outside the eval loop or across multiple eval loops (e.g. when using async_eval).
  • The way donation is done has changed. There is no longer a requirement to move the data shared pointer out of the array when donating (the array::move_shared_buffer API is deleted). That was a bit tedious, so instead the eval driver handles it by not retaining input buffers that were donated to their respective outputs.
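
To make the new shape concrete, here is a minimal sketch of what an eval_cpu roughly looks like now. The add_input_array / add_output_array calls are the APIs described above; the Add primitive, cpu::get_command_encoder, and encoder.dispatch are illustrative assumptions rather than the exact MLX code.

```cpp
// Hypothetical sketch of the two-part eval_cpu (illustrative, not the
// exact MLX implementation).
void Add::eval_cpu(const std::vector<array>& inputs, array& out) {
  // Part 1, setup: runs on the main thread. Allocate the output and set
  // any flags/strides here.
  out.set_data(allocator::malloc(out.nbytes()));

  // Register inputs/outputs so the encoder keeps their buffers alive
  // until the dispatched task completes.
  auto& encoder = cpu::get_command_encoder(stream());
  encoder.add_input_array(inputs[0]);
  encoder.add_input_array(inputs[1]);
  encoder.add_output_array(out);

  // Part 2, dispatch: the kernel itself runs later on the stream's
  // thread. In the future this could become multi-threaded.
  encoder.dispatch([a = inputs[0], b = inputs[1], o = out]() mutable {
    // ... element-wise add over the raw buffers ...
  });
}
```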

CPU-only benchmarks:

| Bench                         | Pre   | Post  |
|-------------------------------|-------|-------|
| MNIST layers=10, compiled     | 0.662 | 0.659 |
| MNIST layers=10, not compiled | 0.785 | 0.740 |

@awni force-pushed the faster_cpu_gpu_synch branch 9 times, most recently from 3833c82 to 7e94ecb on February 23, 2025
@awni force-pushed the faster_cpu_gpu_synch branch from 5d8e1a0 to 04094c2 on February 25, 2025
@angeloskath (Member) left a comment:

So far it looks great. I've read about halfway through, though I haven't gotten to the important bits yet: the changes in event.h, transforms.cpp, and fence.h. The API changes look really great IMHO and unify the CPU and GPU primitive implementations.

```cpp
// Remove the output if it was donated to by an input
if (auto it = buffers.find(arr.data_shared_ptr()); it != buffers.end()) {
  buffers.erase(it);
}
```
@angeloskath (Member) commented:

Shouldn't this also be done for the siblings? Otherwise we guarantee they won't be donated later on. Something like the following in the loop ought to do it.

```cpp
if (buffers.find(s.data_shared_ptr()) == buffers.end()) {
  buffers.insert(s.data_shared_ptr());
}
```

@awni (Member, Author) replied:

Unfortunately it's not so simple. You'll notice that right now siblings are also never donated in the Metal task submission.

The problem is that siblings can have 0 references after being detached (because they may not have parents in the graph), and in that case it's a bug to not hold their buffers.

If you recall, the optimization I did a while ago that was a bit buggy (#1858) also fixed this issue.

I actually have a diff on top of this branch that tries to resolve this by doing something similar to #1858, which is easier now because the task submission is on the main thread, so we don't have to worry about race conditions when checking the use count. But there is still the matter of handling async_eval correctly (references to user-held arrays can be deleted before we are done with the computation).
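
For reference, a rough sketch of the use-count check that diff relies on. This is illustrative only: can_donate is a hypothetical helper, and the exact count bookkeeping depends on how the shared pointer is held.

```cpp
// Hypothetical sketch: because task submission now happens on the main
// thread, the buffer's reference count can be inspected without racing
// another thread.
bool can_donate(const array& s) {
  // Only donate if this eval holds the sole reference to the buffer;
  // otherwise another owner (user code, a sibling, an in-flight task)
  // may still need to read it.
  return s.data_shared_ptr().use_count() == 1;
}
```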

```cpp
  return;
}
scheduler::enqueue(stream_, [arrays = std::move(arrays)]() {});
}
```
@angeloskath (Member) commented:

Not necessarily a problem, but add_temporaries on the CPU "CommandEncoder" has to be called after dispatch, while on the GPU it can be called before. One solution would be for the CommandEncoder to collect temporaries and then have cpu::eval dispatch the task that keeps them in memory, possibly together with the rest of the buffers.

This would require cpu::eval to know about the CommandEncoder, but that is already the case on the Metal side, so I don't think it's a problem, and it would allow us to use add_temporary where it makes intuitive sense and reduce the number of tasks in the queue.
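
A minimal sketch of that approach (names and structure are illustrative assumptions, not this PR's code):

```cpp
// The encoder only collects temporaries instead of enqueuing a
// keep-alive task per add_temporary call.
struct CommandEncoder {
  std::vector<array> temporaries;
  void add_temporary(array t) {
    temporaries.push_back(std::move(t));
  }
};

// At the end of the eval, a single task retains all of them: its capture
// keeps the buffers alive until the stream's thread drains prior work.
void eval(Stream stream, CommandEncoder& encoder) {
  // ... primitives have already run setup and dispatched their kernels,
  // calling add_temporary wherever it makes intuitive sense ...
  scheduler::enqueue(
      stream, [temps = std::move(encoder.temporaries)]() {});
}
```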

@awni (Member, Author) replied:

I think it's a very good suggestion. It's really not good that add_temporary has to be called at just the right spot; otherwise it can fail silently.

@awni force-pushed the faster_cpu_gpu_synch branch from 720b7ac to f995477 on February 26, 2025
@awni force-pushed the faster_cpu_gpu_synch branch from f995477 to 64cf095 on February 26, 2025
@awni (Member, Author) commented Feb 26, 2025

This is ready for review. I left some notes at the top on the important changes; those should guide you to the files that need special attention. Namely:

  • fence.h and implementations
  • transforms.cpp (eval loop)
  • cpu/eval.*
  • cpu/encoder.*

I'll share some benchmarks shortly.

@awni requested review from barronalex and jagrit06 on February 26, 2025
@jagrit06 (Member) left a comment:

Looks great! I still need to go through the fence and event parts in detail, will do that soon!

Comment on lines +148 to +159
```cpp
wt_stride_O = wt.strides()[0],
wt_stride_H = wt.strides()[1],
wt_stride_W = wt.strides()[2],
wt_stride_C = wt.strides()[3],

out_stride_N = out.strides()[0],
out_stride_H = out.strides()[1],
out_stride_W = out.strides()[2],
out_stride_O = out.strides()[3],

padding,
wt_strides,
```
@jagrit06 (Member) commented:

It looks like we end up making 2 copies of wt_strides here. These are the slower CPU conv kernels, so it's not an urgent change though.

@awni (Member, Author) replied:

It's not two copies; the naming here is a bit unfortunate:

  • wt_strides is the strides for applying the kernel.
  • wt.strides() and wt_stride_* are the strides of the underlying array.

This is as it was, though it's due for a bit of a cleanup.
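
A hypothetical example of the distinction (the concrete values are made up for illustration):

```cpp
// For a contiguous weight array wt with shape (O, H, W, C):
//   wt.strides()            -> {H * W * C, W * C, C, 1}  // memory strides of the buffer
//   wt_strides (conv param) -> e.g. {1, 1}               // how far the filter steps per output position
```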

@awni marked this pull request as ready for review on February 27, 2025
@awni (Member, Author) commented Feb 27, 2025

Fast synchronization works well with the redesign. On an M1 Max, synchronization latencies are 10x lower, which matches what we got when doing custom synchronization within the primitive.

```
mpirun -np 2 python benchmarks/python/synchronize_bench.py

All Reduce: time per iteration 0.232273 (ms)
All gather: time per iteration 0.234606 (ms)

MLX_METAL_FAST_SYNCH=1 mpirun -np 2 python benchmarks/python/synchronize_bench.py

All Reduce: time per iteration 0.019380 (ms)
All gather: time per iteration 0.021714 (ms)
```

@awni (Member, Author) commented Feb 27, 2025

For the GPU, benchmarks are not noticeably different.

Generation speed is unchanged:

```
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein" -m 512

Pre:  Generation: 512 tokens, 129.661 tokens-per-sec
Post: Generation: 512 tokens, 129.765 tokens-per-sec
```

Transformer training speed / memory use is unchanged:

```
Pre:  Iter 30: Train loss 7.969, It/sec 6.254, Peak memory 5.525 (GB)
Post: Iter 30: Train loss 7.981, It/sec 6.253, Peak memory 5.525 (GB)
```

@awni (Member, Author) commented Feb 27, 2025

I added a limit on the number of outstanding tasks (set to 100; we could possibly go lower with minimal perf loss) in the CPU task encoder to avoid memory blowup. Basically it keeps the eval from running too far ahead of the actual work while still allowing some pipelining (see the sketch at the end of this comment). For example, for the following:

```python
import mlx.core as mx

mx.set_default_device(mx.cpu)
a = mx.ones((2048, 2048))
b = mx.ones((2048, 2048))
for _ in range(1000):
    a = mx.matmul(a, b)
mx.eval(a)
print(mx.metal.get_peak_memory() / 2**20)
```

Instead of using 16GB it uses 1.6GB with the limit. (Each 2048x2048 float32 output is 16MB, so 1000 in-flight outputs is roughly 16GB, while capping at ~100 outstanding tasks keeps it around 1.6GB.)

It's pretty similar now to the way we limit the number of outstanding command buffers.
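
As a sketch of the mechanism, here is a minimal, self-contained illustration using a condition variable. TaskLimiter and all of its names are assumptions for illustration, not the actual encoder code.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the outstanding-task limit. The dispatching
// (main) thread blocks once `max_` tasks are in flight; each task
// releases its slot when it finishes on the stream's thread.
class TaskLimiter {
 public:
  explicit TaskLimiter(int max_outstanding) : max_(max_outstanding) {}

  // Called on the main thread before enqueuing a task.
  void acquire() {
    std::unique_lock<std::mutex> lk(mtx_);
    cv_.wait(lk, [this] { return outstanding_ < max_; });
    ++outstanding_;
  }

  // Called from the task itself once it completes.
  void release() {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      --outstanding_;
    }
    cv_.notify_one();
  }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  int outstanding_ = 0;
  int max_;
};
```

A counting semaphore would work equally well; the condition variable form just makes the cap explicit.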
