redesign for faster cpu/gpu synch #1869

Open · awni wants to merge 17 commits into main from faster_cpu_gpu_synch
Conversation

@awni (Member) commented Feb 15, 2025

Notes on important changes:

  • All eval_cpu implementations are now logically split in two parts and look a lot like eval_gpu:
    • Setup, which allocates the output and sets flags, strides, etc.
    • Dispatch, which runs the kernel. For now this just puts a task on the stream's thread; in the future we may make it multi-threaded.
    • Like the GPU, the CPU task now registers inputs and outputs with the command encoder via add_input_array and add_output_array.
    • The CPU task is also required to register temporaries with add_temporary.
    (See the sketch after this list for the rough shape of an eval_cpu.)
  • All task setup for any stream (CPU/GPU/etc.) is now done on the main thread. Concretely, eval_cpu and eval_gpu are now called on the main thread. It didn't make much sense from a perf standpoint to have sub-threads for this, and it was overly complicated from a synchronization standpoint (one would need to synchronize for both setup and completion). Furthermore, this really simplifies things like thread safety of the eval moving forward. I think it's overall a nice change. In fact, there is now some room to pipeline CPU setup with CPU work, so there's potential for speedups there.
  • Use fence for synchronization within an eval across streams. Use event for synchronization outside the eval loop or across multiple eval loops (e.g. when using async_eval).
  • The way donation is done has changed. There is no longer a requirement to move the data shared pointer out of the array when donating (the array::move_shared_buffer API is deleted). That was a bit tedious, so instead the eval driver handles it by not retaining input buffers that were donated to their respective outputs.
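
To make the new shape concrete, here is a minimal sketch of what an eval_cpu roughly looks like now. The add_input_array / add_output_array calls are the APIs described above; the Add primitive, cpu::get_command_encoder, and encoder.dispatch are illustrative assumptions rather than the exact MLX code.

```cpp
// Hypothetical sketch of the two-part eval_cpu (illustrative, not the
// exact MLX implementation).
void Add::eval_cpu(const std::vector<array>& inputs, array& out) {
  // Part 1, setup: runs on the main thread. Allocate the output and set
  // any flags/strides here.
  out.set_data(allocator::malloc(out.nbytes()));

  // Register inputs/outputs so the encoder keeps their buffers alive
  // until the dispatched task completes.
  auto& encoder = cpu::get_command_encoder(stream());
  encoder.add_input_array(inputs[0]);
  encoder.add_input_array(inputs[1]);
  encoder.add_output_array(out);

  // Part 2, dispatch: the kernel itself runs later on the stream's
  // thread. In the future this could become multi-threaded.
  encoder.dispatch([a = inputs[0], b = inputs[1], o = out]() mutable {
    // ... element-wise add over the raw buffers ...
  });
}
```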

CPU-only benchmarks:

| Bench                         | Pre   | Post  |
|-------------------------------|-------|-------|
| MNIST layers=10, compiled     | 0.662 | 0.659 |
| MNIST layers=10, not compiled | 0.785 | 0.740 |

@awni force-pushed the faster_cpu_gpu_synch branch 9 times, most recently from 3833c82 to 7e94ecb on February 23, 2025
@awni force-pushed the faster_cpu_gpu_synch branch from 5d8e1a0 to 04094c2 on February 25, 2025
@angeloskath (Member) left a comment:

So far it looks great. I've read about halfway through, though I haven't gotten to the important bits yet: the changes in event.h, transforms.cpp, and fence.h. The API changes look really great IMHO and unify the CPU and GPU primitive implementations.

```cpp
// Remove the output if it was donated to by an input
if (auto it = buffers.find(arr.data_shared_ptr()); it != buffers.end()) {
  buffers.erase(it);
}
```
@angeloskath (Member) commented:

Shouldn't this also be done for the siblings? Otherwise we guarantee they won't be donated later on. Something like the following in the loop ought to do it.

```cpp
if (buffers.find(s.data_shared_ptr()) == buffers.end()) {
  buffers.insert(s.data_shared_ptr());
}
```

@awni (Member, Author) replied:

Unfortunately it's not so simple. You'll notice that right now siblings are also never donated in the Metal task submission.

The problem is that siblings can have 0 references after being detached (because they may not have parents in the graph), and in that case it's a bug to not hold their buffers.

If you recall, the optimization I did a while ago that was a bit buggy (#1858) also fixed this issue.

I actually have a diff on top of this branch that tries to resolve this by doing something similar to #1858, which is easier now because the task submission is on the main thread, so we don't have to worry about race conditions when checking the use count. But there is still the matter of handling async_eval correctly (references to user-held arrays can be deleted before we are done with the computation).
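
For reference, a rough sketch of the use-count check that diff relies on. This is illustrative only: can_donate is a hypothetical helper, and the exact count bookkeeping depends on how the shared pointer is held.

```cpp
// Hypothetical sketch: because task submission now happens on the main
// thread, the buffer's reference count can be inspected without racing
// another thread.
bool can_donate(const array& s) {
  // Only donate if this eval holds the sole reference to the buffer;
  // otherwise another owner (user code, a sibling, an in-flight task)
  // may still need to read it.
  return s.data_shared_ptr().use_count() == 1;
}
```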

```cpp
  return;
}
scheduler::enqueue(stream_, [arrays = std::move(arrays)]() {});
}
```
@angeloskath (Member) commented:

Not necessarily a problem, but add_temporaries on the CPU "CommandEncoder" has to be called after dispatch, while on the GPU it can be called before. One solution would be for the CommandEncoder to collect temporaries and then have cpu::eval dispatch the task that keeps them in memory, possibly together with the rest of the buffers.

This would require cpu::eval to know about the CommandEncoder, but that is already the case on the Metal side, so I don't think it's a problem, and it would allow us to use add_temporary where it makes intuitive sense and reduce the number of tasks in the queue.
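
A minimal sketch of that approach (names and structure are illustrative assumptions, not this PR's code):

```cpp
// The encoder only collects temporaries instead of enqueuing a
// keep-alive task per add_temporary call.
struct CommandEncoder {
  std::vector<array> temporaries;
  void add_temporary(array t) {
    temporaries.push_back(std::move(t));
  }
};

// At the end of the eval, a single task retains all of them: its capture
// keeps the buffers alive until the stream's thread drains prior work.
void eval(Stream stream, CommandEncoder& encoder) {
  // ... primitives have already run setup and dispatched their kernels,
  // calling add_temporary wherever it makes intuitive sense ...
  scheduler::enqueue(
      stream, [temps = std::move(encoder.temporaries)]() {});
}
```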

@awni (Member, Author) replied:

I think it's a very good suggestion. It's really not good that add_temporary has to be called at just the right spot; otherwise it can fail silently.

@awni force-pushed the faster_cpu_gpu_synch branch from 720b7ac to f995477 on February 26, 2025
@awni force-pushed the faster_cpu_gpu_synch branch from f995477 to 64cf095 on February 26, 2025
@awni (Member, Author) commented Feb 26, 2025

This is ready for review. I left some notes at the top on the important changes; those should guide you to the files that need special attention. Namely:

  • fence.h and implementations
  • transforms.cpp (eval loop)
  • cpu/eval.*
  • cpu/encoder.*

I'll share some benchmarks shortly.

@awni requested review from barronalex and jagrit06 on February 26, 2025
@jagrit06 (Member) left a comment:

Looks great! I still need to go through the fence and event parts in detail, will do that soon!

Comment on lines +148 to +159
```cpp
wt_stride_O = wt.strides()[0],
wt_stride_H = wt.strides()[1],
wt_stride_W = wt.strides()[2],
wt_stride_C = wt.strides()[3],

out_stride_N = out.strides()[0],
out_stride_H = out.strides()[1],
out_stride_W = out.strides()[2],
out_stride_O = out.strides()[3],

padding,
wt_strides,
```
@jagrit06 (Member) commented:

It looks like we end up making 2 copies of wt_strides here. These are the slower CPU conv kernels, so it's not an urgent change though.

@awni (Member, Author) replied:

It's not two copies; the naming here is a bit unfortunate:

  • wt_strides is the strides for applying the kernel.
  • wt.strides() and wt_stride_* are the strides of the underlying array.

This is as it was, though it's due for a bit of a cleanup.
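
A hypothetical example of the distinction (the concrete values are made up for illustration):

```cpp
// For a contiguous weight array wt with shape (O, H, W, C):
//   wt.strides()            -> {H * W * C, W * C, C, 1}  // memory strides of the buffer
//   wt_strides (conv param) -> e.g. {1, 1}               // how far the filter steps per output position
```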

@awni marked this pull request as ready for review on February 27, 2025
@awni (Member, Author) commented Feb 27, 2025

Fast synchronization works well with the redesign. On an M1 Max, synchronization latencies are 10x lower, which matches what we got when doing custom synchronization within the primitive.

```
mpirun -np 2 python benchmarks/python/synchronize_bench.py

All Reduce: time per iteration 0.232273 (ms)
All gather: time per iteration 0.234606 (ms)

MLX_METAL_FAST_SYNCH=1 mpirun -np 2 python benchmarks/python/synchronize_bench.py

All Reduce: time per iteration 0.019380 (ms)
All gather: time per iteration 0.021714 (ms)
```

@awni (Member, Author) commented Feb 27, 2025

For the GPU, benchmarks are not noticeably different.

Generation speed is unchanged:

```
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein" -m 512

Pre:  Generation: 512 tokens, 129.661 tokens-per-sec
Post: Generation: 512 tokens, 129.765 tokens-per-sec
```

Transformer training speed / memory use is unchanged:

```
Pre:  Iter 30: Train loss 7.969, It/sec 6.254, Peak memory 5.525 (GB)
Post: Iter 30: Train loss 7.981, It/sec 6.253, Peak memory 5.525 (GB)
```

@awni (Member, Author) commented Feb 27, 2025

I added a limit on the number of outstanding tasks (set to 100; we could possibly go lower with minimal perf loss) in the CPU task encoder to avoid memory blowup. Basically it keeps the eval from running too far ahead of the actual work while still allowing some pipelining (see the sketch at the end of this comment). For example, for the following:

```python
import mlx.core as mx

mx.set_default_device(mx.cpu)
a = mx.ones((2048, 2048))
b = mx.ones((2048, 2048))
for _ in range(1000):
    a = mx.matmul(a, b)
mx.eval(a)
print(mx.metal.get_peak_memory() / 2**20)
```

Instead of using 16GB it uses 1.6GB with the limit. (Each 2048x2048 float32 output is 16MB, so 1000 in-flight outputs is roughly 16GB, while capping at ~100 outstanding tasks keeps it around 1.6GB.)

It's pretty similar now to the way we limit the number of outstanding command buffers.
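
As a sketch of the mechanism, here is a minimal, self-contained illustration using a condition variable. TaskLimiter and all of its names are assumptions for illustration, not the actual encoder code.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the outstanding-task limit. The dispatching
// (main) thread blocks once `max_` tasks are in flight; each task
// releases its slot when it finishes on the stream's thread.
class TaskLimiter {
 public:
  explicit TaskLimiter(int max_outstanding) : max_(max_outstanding) {}

  // Called on the main thread before enqueuing a task.
  void acquire() {
    std::unique_lock<std::mutex> lk(mtx_);
    cv_.wait(lk, [this] { return outstanding_ < max_; });
    ++outstanding_;
  }

  // Called from the task itself once it completes.
  void release() {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      --outstanding_;
    }
    cv_.notify_one();
  }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  int outstanding_ = 0;
  int max_;
};
```

A counting semaphore would work equally well; the condition variable form just makes the cap explicit.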
