[Attention] WIP MLA with chunked prefill #12639
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from f939824 to 77be9af.
Force-pushed from 77be9af to bf6a400.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 463e453 to c542cc4.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Patrick Horn <[email protected]>
Force-pushed from 727b265 to c2d5468.
```python
# Default to `gpu_memory_utilization` of 0.9 if not specified
gpu_memory_utilization = self.gpu_memory_utilization if \
    self.gpu_memory_utilization is not None else 0.9
# For models using MLA and chunked prefill, lower the default to 0.85
# to account for the extra memory required to up-project the MLA cache
if self.gpu_memory_utilization is None and \
        (self.enable_chunked_prefill and model_config.use_mla):
    gpu_memory_utilization = 0.85
```
My understanding of `gpu_memory_utilization` is that all of vLLM's memory usage, including weights, activations, the KV cache, and any extra space needed for MLA, should fit within this budget. If a user explicitly specifies a `gpu_memory_utilization`, they wouldn't want an MLA model to exceed that limit. I think a better place to handle the extra memory used by MLA would be the worker's `determine_num_available_blocks` method.

Do you know why `profile_run` doesn't already account for this memory footprint?
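For illustration only, a minimal sketch of what the reviewer's suggestion might look like; this is not vLLM's actual implementation, and every name except `determine_num_available_blocks` is hypothetical.

```python
# Hypothetical sketch: reserve the MLA up-projection workspace when computing how
# many KV-cache blocks fit in the memory budget, instead of lowering the default
# gpu_memory_utilization. All names here are made up for illustration.
def estimate_num_gpu_blocks(
    total_gpu_memory: int,        # bytes reported for the device
    gpu_memory_utilization: float,
    non_kv_memory: int,           # weights + peak activations measured by the profile run
    block_size_bytes: int,        # bytes per KV-cache block for this model
    mla_workspace_bytes: int,     # estimated peak memory for up-projecting the MLA cache
) -> int:
    """Return how many GPU KV-cache blocks fit once the MLA workspace is reserved."""
    budget = int(total_gpu_memory * gpu_memory_utilization)
    available = budget - non_kv_memory - mla_workspace_bytes
    return max(available, 0) // block_size_bytes
```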
> Do you know why `profile_run` doesn't already account for this memory footprint?

Because it depends on the context length in the cache for each sequence in the request, and we don't profile with max-context-length requests.

I'm trying to cap the amount of memory used by chunking contexts longer than a certain amount.
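As a rough illustration of that idea (a sketch under assumptions, not this PR's code): process the cached latent KV in fixed-size chunks, up-project only one chunk at a time, and merge the per-chunk attention outputs via their log-sum-exp, so the up-projection workspace is bounded by the chunk size. `up_project` and the tensor shapes are assumptions.

```python
import torch

def chunked_mla_context_attention(q, latent_kv, up_project, chunk_size=4096, scale=1.0):
    """q: [heads, q_len, head_dim]; latent_kv: [ctx_len, latent_dim].

    up_project(chunk) is assumed to return (k, v), each [heads, chunk_len, head_dim].
    """
    out, lse = None, None
    for start in range(0, latent_kv.shape[0], chunk_size):
        # Only this chunk is up-projected, capping the temporary K/V footprint.
        k, v = up_project(latent_kv[start:start + chunk_size])
        scores = torch.einsum("hqd,hkd->hqk", q, k) * scale
        chunk_lse = torch.logsumexp(scores, dim=-1)          # [heads, q_len]
        chunk_out = torch.softmax(scores, dim=-1) @ v        # [heads, q_len, head_dim]
        if out is None:
            out, lse = chunk_out, chunk_lse
        else:
            # Standard log-sum-exp merge of partial softmax results.
            new_lse = torch.logaddexp(lse, chunk_lse)
            out = (out * torch.exp(lse - new_lse).unsqueeze(-1)
                   + chunk_out * torch.exp(chunk_lse - new_lse).unsqueeze(-1))
            lse = new_lse
    return out
```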
```python
logger = logging.getLogger(__name__)

logger = logging.getLogger(__name__)
```
Fix up the duplicate `logger = logging.getLogger(__name__)`.
```cpp
namespace cuda_utils {

template <typename T>
HOST_DEVICE_INLINE constexpr std::enable_if_t<std::is_integral_v<T>, T>
ceil_div(T a, T b) {
  return (a + b - 1) / b;
}
```
This is already in `csrc/core/math.hpp`, just without the `HOST_DEVICE_INLINE`. Does it make sense to consolidate them into one function?
Merge #12807 first
Note: this implementation uses a lot of runtime memory due to up-projecting the full context; you may need to turn down `--gpu-memory-utilization`.

More benchmarking is needed to know whether this should be on by default (due to the memory concerns, I'm leaning towards no).
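For example, a user could pass a lower value explicitly (a usage sketch; the model name is just an example of an MLA model, and 0.85 simply mirrors the default this PR proposes, not a recommendation):

```python
from vllm import LLM

# Leave extra headroom for the MLA up-projection workspace when chunked prefill
# is enabled.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.85,
)
```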
Shout-out to @pathorn for the assistance with this PR.