
[Attention] WIP MLA with chunked prefill #12639

Open
LucasWilkinson wants to merge 8 commits into main from lwilkinson/chunked-mla
Conversation

@LucasWilkinson (Contributor) commented Feb 1, 2025

Merge #12807 first

Note: this implementation uses a lot of runtime memory due to up-projecting the full context, so you may need to turn down --gpu-memory-utilization.
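
For example, a hypothetical invocation turning that knob down for an MLA model with chunked prefill enabled (the model name and the 0.85 value are placeholders, not values recommended by this PR):

# Hypothetical usage sketch, not part of this PR's changes.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # example MLA model
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.85,           # turned down from the usual 0.9 default
)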

More benchmarking is needed to know if this should be on by default (due to the memory concerns I'm leaning towards no).

Shout out to @pathorn for the assistance with this PR.


github-actions bot commented Feb 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@LucasWilkinson changed the title from "[Attention] WIP MLA with chunked prefill" to "[WIP][Attention] WIP MLA with chunked prefill" on Feb 1, 2025

mergify bot commented Feb 6, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Feb 6, 2025
@LucasWilkinson force-pushed the lwilkinson/chunked-mla branch 2 times, most recently from 463e453 to c542cc4, on February 6, 2025 05:24
@mergify bot added the v1 label and removed the needs-rebase label on Feb 6, 2025
@LucasWilkinson changed the title from "[WIP][Attention] WIP MLA with chunked prefill" to "[Attention] WIP MLA with chunked prefill" on Feb 6, 2025
@LucasWilkinson marked this pull request as ready for review on February 6, 2025 05:49

mergify bot commented Feb 7, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Feb 7, 2025
LucasWilkinson and others added 7 commits February 7, 2025 16:42
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Patrick Horn <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
@mergify bot removed the needs-rebase label on Feb 7, 2025
Comment on lines +1128 to +1135
# Default to `gpu_memory_utilization` of 0.9 if not specified
gpu_memory_utilization = self.gpu_memory_utilization if \
    self.gpu_memory_utilization is not None else 0.9
# For models using MLA and chunked prefill, lower the default to 0.85
# to account for the extra memory required to up-project the MLA cache
if self.gpu_memory_utilization is None and \
        (self.enable_chunked_prefill and model_config.use_mla):
    gpu_memory_utilization = 0.85
Collaborator

My understanding of gpu_memory_utilization is that all of vLLM's memory usage, including weights, activations, KV cache, and any extra space needed for MLA, should fit within this budget.

If a user is explicitly specifying a gpu_memory_utilization, they wouldn't want an MLA model to exceed that limit. I think a better way to handle the extra memory utilization due to MLA could be in the worker's determine_num_available_blocks method.
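
A minimal sketch of that alternative, with illustrative names only (this is not vLLM's actual determine_num_available_blocks signature): subtract a reserved MLA up-projection headroom before sizing the KV cache, so everything still fits inside the user-specified budget.

def num_gpu_blocks_with_mla_headroom(total_gpu_memory: int,
                                     gpu_memory_utilization: float,
                                     profiled_peak_memory: int,
                                     cache_block_size: int,
                                     mla_upproj_headroom: int) -> int:
    # Keep weights, activations, KV cache, and the MLA up-projection buffers
    # inside the user-specified budget instead of silently lowering the default.
    budget = int(total_gpu_memory * gpu_memory_utilization)
    free_for_kv_cache = budget - profiled_peak_memory - mla_upproj_headroom
    return max(free_for_kv_cache // cache_block_size, 0)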

Do you know why the profile_run doesn't already account for this memory footprint?

Contributor Author

Do you know why the profile_run doesn't already account for this memory footprint?

Because it depends on the context length in the cache for each sequence in the request, and we don't profile with max-context-length requests.

I'm trying to cap the amount of memory used by chunking contexts longer than a certain length.
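
A minimal sketch of that chunking idea, under heavy assumptions (a compressed latent KV cache of shape [context_len, kv_lora_rank], illustrative up-projection weights w_kc / w_vc, and no rope handling; this is not this PR's actual kernel code): up-project the cached context one chunk at a time and merge the partial attention outputs via log-sum-exp, so the temporary up-projected buffers never hold more than chunk_size tokens.

import torch

def chunked_context_attention(q, kv_c_cache, w_kc, w_vc, chunk_size=4096, scale=1.0):
    """Illustrative only. q: [num_q, heads, dim]; kv_c_cache: [ctx_len, rank];
    w_kc / w_vc: [rank, heads, dim] up-projection weights."""
    out, lse = None, None
    for start in range(0, kv_c_cache.shape[0], chunk_size):
        kv_c = kv_c_cache[start:start + chunk_size]
        # Up-project only this chunk, capping temporary memory at chunk_size tokens.
        k = torch.einsum("sr,rhd->shd", kv_c, w_kc)
        v = torch.einsum("sr,rhd->shd", kv_c, w_vc)
        scores = torch.einsum("qhd,shd->qhs", q, k) * scale
        chunk_lse = torch.logsumexp(scores, dim=-1)                    # [num_q, heads]
        chunk_out = torch.einsum("qhs,shd->qhd", scores.softmax(-1), v)
        if out is None:
            out, lse = chunk_out, chunk_lse
        else:
            # Merge partial softmax results across chunks via log-sum-exp.
            new_lse = torch.logaddexp(lse, chunk_lse)
            out = (out * (lse - new_lse).exp().unsqueeze(-1)
                   + chunk_out * (chunk_lse - new_lse).exp().unsqueeze(-1))
            lse = new_lse
    return out

With this structure the peak temporary memory scales with chunk_size rather than with the full context length.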

Comment on lines +22 to 24
logger = logging.getLogger(__name__)

logger = logging.getLogger(__name__)
Collaborator

Fix up the duplicate logger = logging.getLogger(__name__)

Comment on lines +16 to +24

namespace cuda_utils {

template <typename T>
HOST_DEVICE_INLINE constexpr std::enable_if_t<std::is_integral_v<T>, T>
ceil_div(T a, T b) {
return (a + b - 1) / b;
}

Collaborator

This is already in csrc/core/math.hpp without the HOST_DEVICE_INLINE. Does it make sense for it to be one function?

Signed-off-by: Lucas Wilkinson <[email protected]>