Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a heap for small sizes #1911

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Use a heap for small sizes #1911

wants to merge 2 commits into from

Conversation

awni
Copy link
Member

@awni awni commented Feb 27, 2025

Use a heap for small sizes until it is full. The heap size is 1MB which holds up to 4096 buffers which seems like a pretty reasonable number to start with.

A benchmark:

for _ in range(5):
    arrs = []
    arrs = [mx.array(1.0) for _ in range(4096)]
    del arrs
tic = time.time()
for _ in range(10):
    arrs = [mx.array(1.0) for _ in range(4096)]
    del arrs
toc = time.time()
print(toc - tic)

With:

mx.metal.set_wired_limit(2**30)

Pre: 0.0332 s
Post: 0.0108 s

With:

mx.metal.set_cache_limit(0)

Pre: 0.0935 s
Post: 0.0673 s

@awni awni force-pushed the heap_for_small_buffers branch from ca76c4a to ac5d5f0 Compare February 28, 2025 17:05
Copy link
Member

@angeloskath angeloskath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Just to be clear the benefit in the first case comes from not inserting and removing to/from the residency set. And in the second case also from possibly faster de-/allocation of new buffers.

Presumably if benchmarked with <=256 instead of 4096 the difference would be even bigger in the second case.

@awni
Copy link
Member Author

awni commented Feb 28, 2025

Just to be clear the benefit in the first case comes from not inserting and removing to/from the residency set.

Exactly, it turns out it's kind of expensive to add stuff to the residency set (it's mostly the committing / requesting residency which we do eagerly, as opposed to just adding it to the set).

And in the second case also from possibly faster de-/allocation of new buffers.

Exactly.

@angeloskath
Copy link
Member

Do you think it makes sense to just do it once every N buffer creations? And simply make sure we have done it before running eval?

@awni
Copy link
Member Author

awni commented Feb 28, 2025

Do you think it makes sense to just do it once every N buffer creations? And simply make sure we have done it before running eval?

I think there are some things to investigate there. But it's not entirely obvious if/what to optimize. The residency set in practice works like so:

  • Ahead of time we request the whole model to be resident. This part could probably be faster with a 1/N strategy but the latency of requesting residency is minor since a) it only gets included in time-to-first token and b) committing the residency set should be a small fraction of the overall model-loading / prompt processing (should be confirmed with more careful measurement).

  • During inference we move stuff in and out of the residency set as it gets allocated/freed. We don't do a lot of allocations during inference. But in some cases there are a lot of scalars (like the RoPE offset). For small models (0.5B) we actually notice the latency of moving them in and out of the residency in this case (the motivation for this PR). Potentially another way to solve this is by keeping the allocator cache resident (at least during an eval or something).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants