
Question about cascade inference #789

Open
sleepwalker2017 opened this issue Feb 5, 2025 · 11 comments

@sleepwalker2017

https://flashinfer.ai/2024/02/02/cascade-inference.html

Hi, I noticed this blog was posted a year ago.

I wonder what setup the Evaluations section refers to.

Is it for the prefill stage, the decoding stage, or both?

yzh119 (Collaborator) commented Feb 5, 2025

It only refers to the decode attention kernel, not end-to-end results.

sleepwalker2017 (Author) commented Feb 6, 2025

> It only refers to the decode attention kernel, not end-to-end results.

Thank you.

Is this optimization mainly aimed at the decoding stage?

How much benefit does it bring to the prefill stage?

yzh119 (Collaborator) commented Feb 6, 2025

> Is this optimization mainly aimed at the decoding stage?

Yes, and it doesn't work for attention variants such as MLA (even for decoding), which exhibit very high operational intensity (128) in the decoding stage.

sleepwalker2017 (Author) commented Feb 10, 2025

> > Is this optimization mainly aimed at the decoding stage?
>
> Yes, and it doesn't work for attention variants such as MLA (even for decoding), which exhibit very high operational intensity (128) in the decoding stage.

Does it have any benefit for prefill MHA? It seems the effect would be small, because if the sequence is long enough the kernel is compute-bound?

sleepwalker2017 (Author) commented

> > Is this optimization mainly aimed at the decoding stage?
>
> Yes, and it doesn't work for attention variants such as MLA (even for decoding), which exhibit very high operational intensity (128) in the decoding stage.

What about the case where many requests sharing the same long prefix need to do prefill? It seems this could save a lot of computation, because the shared part only needs to be computed once.

I notice that sglang does not use this feature. In that case, sglang processes one prefill request first and then processes the rest.

yzh119 (Collaborator) commented Feb 10, 2025

> Does it have any benefit for prefill MHA? It seems the effect would be small, because if the sequence is long enough the kernel is compute-bound?

It depends on the query length: once the query length (which determines the operational intensity) reaches the ridge point of the GPU's roofline model (usually not large, ~300), the benefit is gone.
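
For intuition, a rough back-of-the-envelope version of that roofline argument (the hardware numbers below are my own illustrative assumptions, roughly H100-class, not measurements from the blog):

```python
# Rough roofline estimate (illustrative numbers, not measured).
# For fp16 MHA over a long KV, each KV position costs ~4*d FLOPs per query
# row (QK^T + PV) and ~4*d bytes of K/V reads, so the operational intensity
# is roughly the query length q, in FLOP/byte.

peak_flops = 989e12   # assumed dense fp16 tensor-core peak, FLOP/s
hbm_bw = 3.35e12      # assumed HBM bandwidth, byte/s

ridge = peak_flops / hbm_bw  # FLOP/byte at which the kernel turns compute-bound
print(f"ridge point ~ {ridge:.0f} FLOP/byte, i.e. query length ~ {ridge:.0f}")
# -> ~300: beyond this query length attention is compute-bound, so sharing the
#    KV reads across requests no longer speeds it up.
```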

yzh119 (Collaborator) commented Feb 10, 2025

> What about the case where many requests sharing the same long prefix need to do prefill? It seems this could save a lot of computation, because the shared part only needs to be computed once.
>
> I notice that sglang does not use this feature. In that case, sglang processes one prefill request first and then processes the rest.

I think prefix caching already does this optimization. Did you enable prefix caching in sglang?

sleepwalker2017 (Author) commented

> > What about the case where many requests sharing the same long prefix need to do prefill? It seems this could save a lot of computation, because the shared part only needs to be computed once.
> >
> > I notice that sglang does not use this feature. In that case, sglang processes one prefill request first and then processes the rest.
>
> I think prefix caching already does this optimization. Did you enable prefix caching in sglang?

Yes, sglang prefix caching is enabled.

I mean the prefill stage, when there are 4 sequences in the same batch that share the same long prefix.
In that case, sglang does prefill for one request and caches its KV cache, then does prefill for the other 3 requests, so the computation for the shared prefix is saved.

sleepwalker2017 (Author) commented

I have another question: the cascade API launches 5 kernels for a batch decoding step.

  • stage 1: 2 kernels for the unique parts: attention + merge
  • stage 2: 2 kernels for the shared part: attention + merge
  • stage 3: 1 kernel to merge the shared and unique results (a rough sketch of this merge is below)
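
For my own understanding, here is a rough PyTorch sketch of what I assume the merge step computes; the shapes, names, and the assumption that each attention kernel also returns a per-head log-sum-exp are mine, not the actual flashinfer code:

```python
import torch

def merge_states(o_a, lse_a, o_b, lse_b):
    """Merge two partial attention results computed over disjoint KV ranges.

    o_*:   [num_heads, head_dim]  partial attention outputs
    lse_*: [num_heads]            log-sum-exp of the attention scores over
                                  the KV range each partial result covers
    """
    m = torch.maximum(lse_a, lse_b)          # shift for numerical stability
    w_a = torch.exp(lse_a - m)               # un-normalized weight of part a
    w_b = torch.exp(lse_b - m)               # un-normalized weight of part b
    o = (w_a[:, None] * o_a + w_b[:, None] * o_b) / (w_a + w_b)[:, None]
    lse = m + torch.log(w_a + w_b)           # LSE over the union of both ranges
    return o, lse
```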

My question is: is this implementation efficient enough?

These two stages are executed sequentially. Will this lead to insufficient SM occupancy?

Also, can these two stages be fused and executed in one kernel?
Is the fusion not done because it is too complex to implement?
If it is too difficult to implement, could multiple streams be used to run the two stages at the same time?
They seem to be independent stages.

Sorry for so many questions. Looking forward to your reply. Thank you! @yzh119

yzh119 (Collaborator) commented Feb 13, 2025

> My question is: is this implementation efficient enough?

Apparently not, and we should optimize it. Actually, all of them can be fused into a single kernel. If you are interested, I can guide you to implement this (I don't have enough bandwidth at the moment).

sleepwalker2017 (Author) commented Feb 13, 2025

> > My question is: is this implementation efficient enough?
>
> Apparently not, and we should optimize it. Actually, all of them can be fused into a single kernel. If you are interested, I can guide you to implement this (I don't have enough bandwidth at the moment).

I'd be very glad to do that!

I just took a glance at the BatchPrefillWithPagedKVCacheKernel kernel; the code seems clear.
(I still need some time to fully understand it.)

The main idea seems to be as follows:

  1. calculate the required grid size and block size for both parts
  2. do kernel-level synchronization, which seems to use "cooperative groups"?
  3. merge the partial results after that

Is that the right way to do it?

You can give some instructions when you're free. Thank you.
