Use a heap for small sizes #1911

awni · 2025-02-27T22:53:44Z

Allocate small buffers from a heap until the heap is full. The heap size is 1 MB, which holds up to 4096 buffers; that seems like a reasonable number to start with.
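For readers unfamiliar with the pattern, here is a minimal Python sketch of "use a heap until it is full, then fall back". It is not the allocator code from this PR. The class and function names are hypothetical, and the 256-byte slot size is an assumption derived from dividing the 1 MB heap by 4096 buffers.

```python
HEAP_SIZE = 1 << 20  # 1 MB, as stated in the PR description
SMALL_SIZE = 256     # assumed slot size: 1 MB / 4096 buffers


class SmallBufferHeap:
    """Toy fixed-slot heap (hypothetical): hands out slots until none remain."""

    def __init__(self, heap_size=HEAP_SIZE, slot_size=SMALL_SIZE):
        self.slot_size = slot_size
        self.free_slots = list(range(heap_size // slot_size))

    def alloc(self, nbytes):
        # Only small requests qualify, and only while a slot is free.
        if nbytes > self.slot_size or not self.free_slots:
            return None
        return self.free_slots.pop()

    def free(self, slot):
        # Return the slot so a later small allocation can reuse it.
        self.free_slots.append(slot)


def allocate(heap, nbytes):
    """Try the heap first; otherwise use a stand-in for the regular path."""
    slot = heap.alloc(nbytes)
    if slot is not None:
        return ("heap", slot)
    return ("fallback", nbytes)
```

With these assumptions, the first 4096 scalar-sized requests are served from the heap, and any further request (or any request larger than a slot) takes the fallback path until a slot is freed.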

A benchmark:

import time

import mlx.core as mx

# Warm up
for _ in range(5):
    arrs = [mx.array(1.0) for _ in range(4096)]
    del arrs

# Time 10 rounds of allocating and freeing 4096 scalar arrays
tic = time.time()
for _ in range(10):
    arrs = [mx.array(1.0) for _ in range(4096)]
    del arrs
toc = time.time()
print(toc - tic)

With:

mx.metal.set_wired_limit(2**30)

Pre: 0.0332 s
Post: 0.0108 s

With:

mx.metal.set_cache_limit(0)

Pre: 0.0935 s
Post: 0.0673 s
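As a quick reading aid, the speedups implied by the timings above can be computed directly. The helper name `speedup` is only for illustration; the numbers are the ones reported in this comment.

```python
def speedup(pre_seconds, post_seconds):
    """Ratio of the pre-change time to the post-change time."""
    return pre_seconds / post_seconds


# Timings reported above, in seconds.
wired_limit = speedup(0.0332, 0.0108)  # with set_wired_limit(2**30)
no_cache = speedup(0.0935, 0.0673)     # with set_cache_limit(0)

print(f"wired limit: {wired_limit:.2f}x")  # prints "wired limit: 3.07x"
print(f"cache limit 0: {no_cache:.2f}x")   # prints "cache limit 0: 1.39x"
```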

angeloskath

Nice!

Just to be clear the benefit in the first case comes from not inserting and removing to/from the residency set. And in the second case also from possibly faster de-/allocation of new buffers.

Presumably if benchmarked with <=256 instead of 4096 the difference would be even bigger in the second case.

awni · 2025-02-28T19:32:52Z

> Just to be clear the benefit in the first case comes from not inserting and removing to/from the residency set.

Exactly, it turns out it's kind of expensive to add stuff to the residency set (it's mostly the committing / requesting residency which we do eagerly, as opposed to just adding it to the set).

> And in the second case also from possibly faster de-/allocation of new buffers.

Exactly.

angeloskath · 2025-02-28T19:36:08Z

Do you think it makes sense to just do it once every N buffer creations? And simply make sure we have done it before running eval?
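To make the suggestion concrete, here is a toy Python sketch of committing once every N insertions and flushing before an eval. It does not use the Metal residency-set API. The class, its methods, and the default of N = 64 are all hypothetical, and `commit` is only a stand-in for the expensive commit / request-residency step.

```python
class BatchedResidencySet:
    """Toy residency set (hypothetical) that commits every `commit_every` inserts."""

    def __init__(self, commit_every=64):
        self.commit_every = commit_every
        self.resident = set()  # buffers that have been committed
        self.pending = set()   # buffers added but not yet committed
        self.num_commits = 0

    def insert(self, buf):
        # Adding to the set is cheap; the commit is deferred.
        self.pending.add(buf)
        if len(self.pending) >= self.commit_every:
            self.commit()

    def commit(self):
        # Stand-in for the expensive commit / request-residency step.
        if self.pending:
            self.resident |= self.pending
            self.pending.clear()
            self.num_commits += 1

    def flush_before_eval(self):
        # Make sure everything inserted so far is resident before running.
        self.commit()
```

In this toy model, 4096 insertions trigger 64 commits instead of 4096, and the flush guarantees that nothing is left pending when an eval starts. Whether that trade-off pays off in practice is exactly the open question in this thread.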

awni · 2025-02-28T19:46:50Z

> Do you think it makes sense to just do it once every N buffer creations? And simply make sure we have done it before running eval?

I think there are some things to investigate there. But it's not entirely obvious if/what to optimize. The residency set in practice works like so:

1. Ahead of time we request the whole model to be resident. This part could probably be faster with a 1/N strategy, but the latency of requesting residency is minor since a) it only gets included in time-to-first-token and b) committing the residency set should be a small fraction of the overall model-loading / prompt-processing time (should be confirmed with more careful measurement).
2. During inference we move stuff in and out of the residency set as it gets allocated/freed. We don't do a lot of allocations during inference, but in some cases there are a lot of scalars (like the RoPE offset). For small models (0.5B) we actually notice the latency of moving them in and out of the residency set in this case (the motivation for this PR). Potentially another way to solve this is by keeping the allocator cache resident (at least during an eval or something).
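The alternative mentioned at the end (keeping the allocator cache resident) can be sketched with a toy model. This is only an illustration of the idea, not MLX's allocator: the class, its fields, and the tuple used as a buffer handle are hypothetical, and the counter simply tallies how often the residency set would have to change.

```python
class ResidentCache:
    """Toy allocator cache (hypothetical) whose cached buffers stay resident."""

    def __init__(self):
        self.cache = {}             # size -> list of free buffers
        self.resident = set()       # buffers currently in the residency set
        self.residency_updates = 0  # how often the residency set changed
        self._next_id = 0

    def malloc(self, size):
        free_list = self.cache.get(size)
        if free_list:
            # Reuse a cached buffer: it is already resident, so no update.
            return free_list.pop()
        buf = (self._next_id, size)
        self._next_id += 1
        self.resident.add(buf)
        self.residency_updates += 1
        return buf

    def free(self, buf):
        # Return the buffer to the cache but keep it in the residency set.
        self.cache.setdefault(buf[1], []).append(buf)
```

Under this model, repeatedly allocating and freeing a scalar such as the RoPE offset touches the residency set once rather than twice per iteration. The cost is that cached buffers count against resident memory for as long as they stay in the cache.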

use a heap for small sizes

4d6a002

awni requested review from angeloskath, barronalex and jagrit06 February 27, 2025 22:54

check if VM

ac5d5f0

awni force-pushed the heap_for_small_buffers branch from ca76c4a to ac5d5f0 Compare February 28, 2025 17:05

angeloskath approved these changes Feb 28, 2025
