Question about cascade inference #789
Comments
It only refers to the decode attention kernel, not end-to-end results.
Thank you. Is this optimization mainly aimed at the decoding stage? How much benefit is there for the prefill stage?
Yes, and it doesn't work for attention variants such as MLA (even for decoding), which exhibit very high operational intensity (128) in the decoding stage.
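(A back-of-envelope sketch of why, with assumed DeepSeek-style MLA numbers rather than anything stated in the thread: after weight absorption, every query head reads the same compressed KV entry, so each byte of KV traffic feeds all heads instead of one.)

```python
# Rough reuse-factor comparison, assumed numbers (DeepSeek-V2/V3-style MLA).
h = 128                       # query heads per layer sharing the compressed KV
mha_kv_reuse = 1              # MHA decode: each KV head serves a single query head
mla_kv_reuse = h              # MLA decode: one compressed KV entry feeds all h heads
print(mla_kv_reuse / mha_kv_reuse)  # ~128x more arithmetic per byte of KV loaded
```

With that much arithmetic per byte, MLA decoding is already close to compute-bound, so batching KV reads across a shared prefix buys little.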
Does it have any benefit for prefill MHA? It seems the effect would be small, because if the sequence is long enough the kernel is compute-bound?
What about when many requests sharing the same long prefix need to do prefill? It seems that could save a lot of computation, because the shared part only needs to be computed once. I notice that in sglang this feature is not used; in that case, sglang processes one prefill request first and then processes the rest.
It depends on the query length: once the query length (and thus the operational intensity) reaches the ridge point of the GPU's roofline model (usually not large, ~300), the benefit is gone.
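(For concreteness, a rough calculation with assumed H100-class numbers, not taken from the thread: the ridge point is peak compute divided by memory bandwidth.)

```python
# Roofline ridge point = peak FLOP/s / memory bandwidth (FLOPs per byte).
# Assumed figures, roughly H100 SXM dense FP16/BF16 tensor-core throughput.
peak_flops = 989e12        # ~989 TFLOP/s
hbm_bw     = 3.35e12       # ~3.35 TB/s
ridge = peak_flops / hbm_bw
print(f"ridge point ~ {ridge:.0f} FLOPs/byte")   # ~295

# Once the per-request query length pushes attention's operational intensity
# past this value, the kernel is compute-bound and the memory-traffic savings
# from cascade inference no longer help.
```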
I think prefix caching already does this kind of optimization. Did you enable prefix caching in sglang?
Yes, sglang prefix caching is enabled. I mean the prefill stage, when there are 4 sequences in the same batch and they share the same long prefix.
I have another question: the cascade API launches 5 kernels for a batch decoding.
My question is whether this implementation is efficient enough. The two stages are executed sequentially; will this lead to insufficient SM occupancy? Also, can the two stages be fused and executed in one kernel? Sorry for so many questions, hoping for your reply. Thank you! @yzh119
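(For readers following along, here is a minimal sketch of the two-stage structure being discussed, built on flashinfer's documented `single_prefill_with_kv_cache` and `merge_state` helpers. The shapes and the call pattern are my own illustration, not the exact five kernels the cascade wrapper launches.)

```python
import torch
import flashinfer

num_heads, head_dim = 32, 128
batch, shared_len, unique_len = 4, 8192, 16

# One decode query per request.
q = torch.randn(batch, num_heads, head_dim, dtype=torch.half, device="cuda")

# Stage 1: all queries attend to the single shared prefix (its KV is read once).
k_shared = torch.randn(shared_len, num_heads, head_dim, dtype=torch.half, device="cuda")
v_shared = torch.randn_like(k_shared)
o_shared, lse_shared = flashinfer.single_prefill_with_kv_cache(
    q, k_shared, v_shared, causal=False, return_lse=True)

# Stage 2: each request attends to its own unique suffix (shown for request 0
# only; the real batch kernel handles all requests' suffixes at once).
k_unique = torch.randn(unique_len, num_heads, head_dim, dtype=torch.half, device="cuda")
v_unique = torch.randn_like(k_unique)
o_unique, lse_unique = flashinfer.single_prefill_with_kv_cache(
    q[0:1], k_unique, v_unique, causal=False, return_lse=True)

# Merge: combine the two partial softmax states via their log-sum-exp values,
# which is what makes running the stages separately mathematically exact.
o, lse = flashinfer.merge_state(o_shared[0:1], lse_shared[0:1], o_unique, lse_unique)
```

Because stage 2 and the merge wait on stage 1, short stages can leave SMs idle between launches, which is the occupancy concern raised above and the motivation for fusing the stages into one kernel.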
Apparently not, and we should optimize it. Actually, all of them can be fused into a single kernel; if you're interested, I can guide you to implement this (I don't have enough bandwidth at the moment).
I'd be very glad to do that! I just took a glance at the code. The main idea seems to be as follows:
Is that the right way to do it? Could you give some instructions when you're free? Thank you.
https://flashinfer.ai/2024/02/02/cascade-inference.html
Hi, I notice this blog was posted a year ago. I wonder what situation the "Evaluations" part refers to. Is it for the prefill stage, the decoding stage, or both?