[Roadmap] FlashInfer v0.2 to v0.3 #675

Open · 2 of 15 tasks
yzh119 opened this issue Dec 17, 2024 · 3 comments

@yzh119 (Collaborator) commented Dec 17, 2024

Milestones

Our tentative roadmap includes the following milestones:


We welcome your feedback and suggestions!
Let us know what features you'd like to see in FlashInfer.

@johnnynunez commented Jan 21, 2025

Initial support for Blackwell: #747
Compute capability 10.0: Blackwell B100/B200
Compute capability 12.0: Blackwell RTX 50 series
Also, it would be super to have FlexAttention.
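
For reference, the two compute capabilities above can be detected at runtime with PyTorch. A minimal sketch; the mapping table is an illustrative assumption, not FlashInfer's actual dispatch logic:

import torch

# Map CUDA compute capability (major, minor) to the Blackwell variants
# mentioned above. This table is an assumption for illustration only.
BLACKWELL_ARCHS = {
    (10, 0): "Blackwell B100/B200 (SM 10.0)",
    (12, 0): "Blackwell RTX 50 series (SM 12.0)",
}

def describe_device(device_index: int = 0) -> str:
    """Report the GPU name and whether it matches one of the Blackwell archs."""
    major, minor = torch.cuda.get_device_capability(device_index)
    name = torch.cuda.get_device_name(device_index)
    arch = BLACKWELL_ARCHS.get((major, minor), f"compute capability {major}.{minor}")
    return f"{name}: {arch}"

if __name__ == "__main__":
    if torch.cuda.is_available():
        print(describe_device())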

@AgrawalAmey

Looking forward to Pod-Attention support!

@AgrawalAmey

To add more context, we have the following piece of code in the Mnemosyne codebase:

def _arrange_sequences_for_execution(
    self,
    seq_schedule_metadata_list: List[SequenceScheduleMetadata],
) -> List[SequenceScheduleMetadata]:
    """
    We need to arrange sequences in a way that allows us to perform
    attention computation efficiently. Because attention kernels handle
    mixed batches poorly, we first need to split the sequences into
    prefill and decode:
    | prefill seqs | decode seqs |

    Secondly, when we mix sequences of different lengths, the attention
    kernel's parallelization heuristics fail and result in high latency.
    Thus, we need to further split the sequences:
    | long seqs | short seqs |

    Furthermore, within each group, we can have kvp sequences. Some of
    these kvp sequences might not require the kv cache to be saved. So,
    within each group, we need to further organize sequences as follows:
    | non kvp seqs | kvp seqs w/ save_kv_cache | kvp seqs w/o save_kv_cache |
    """

In essence, we make 4 different instances of the flashinfer prefill attention wrapper and call the kernel 4 times 😢 cc @yzh119
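
For illustration, here is a minimal sketch of the arrangement the docstring describes. The SequenceScheduleMetadata fields (is_prefill, is_long, is_kvp, save_kv_cache) are hypothetical stand-ins for the actual Mnemosyne attributes, and unlike the real method this sketch returns the four groups explicitly, one per attention wrapper call:

from dataclasses import dataclass
from typing import List

@dataclass
class SequenceScheduleMetadata:
    # Hypothetical fields; the real Mnemosyne class is richer than this.
    seq_id: int
    is_prefill: bool
    is_long: bool
    is_kvp: bool
    save_kv_cache: bool

def arrange_sequences_for_execution(
    seqs: List[SequenceScheduleMetadata],
) -> List[List[SequenceScheduleMetadata]]:
    """Partition sequences into
    | prefill long | prefill short | decode long | decode short |
    and, within each group, order them as
    | non kvp | kvp w/ save_kv_cache | kvp w/o save_kv_cache |.
    """
    groups: List[List[SequenceScheduleMetadata]] = [[], [], [], []]
    for seq in seqs:
        phase = 0 if seq.is_prefill else 2   # prefill groups come first
        length = 0 if seq.is_long else 1     # long before short
        groups[phase + length].append(seq)

    def kvp_rank(seq: SequenceScheduleMetadata) -> int:
        if not seq.is_kvp:
            return 0
        return 1 if seq.save_kv_cache else 2

    # sorted() is stable, so relative order within each kvp class is preserved.
    return [sorted(group, key=kvp_rank) for group in groups]

Each of the four groups is then handled by its own flashinfer prefill attention wrapper instance and kernel call, which is where the 4 launches lamented above come from; fused handling of mixed batches along the lines of Pod-Attention would let these collapse into fewer calls.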
