[LLVMGPUVectorDistribute] VectorDistribution support for unaligned shapes #20144

Draft · wants to merge 11 commits into main
Conversation

@Groverkss (Contributor) commented on Mar 3, 2025

This PR adds support for performing statically tiled codegen on dynamic shapes in the vector distribute pipeline. In short, it honors lowering configs on dynamic shapes by using masking.
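As a rough illustration of the masking idea (a minimal sketch, not IR taken from this PR; the function name, the 64x128 tile size, and the tensor type are made up for the example), a statically sized vector tile can be read from a dynamically sized tensor by masking off the rows past the dynamic extent instead of requiring the extent to be a multiple of the tile size:

```mlir
// Sketch: read one static 64x128 tile from a tensor whose leading dimension
// is dynamic, masking the rows that fall past the dynamic extent.
func.func @masked_tile_read(%src: tensor<?x128xf32>) -> vector<64x128xf32> {
  %c0 = arith.constant 0 : index
  %c128 = arith.constant 128 : index
  %pad = arith.constant 0.0 : f32
  // Number of valid rows in this (possibly partial) tile.
  %rows = tensor.dim %src, %c0 : tensor<?x128xf32>
  // Mask off rows beyond the dynamic extent; the static columns are fully valid.
  %mask = vector.create_mask %rows, %c128 : vector<64x128xi1>
  // Masked read still produces a full static 64x128 vector; masked-off lanes
  // take the padding value.
  %tile = vector.transfer_read %src[%c0, %c0], %pad, %mask
            {in_bounds = [true, true]}
            : tensor<?x128xf32>, vector<64x128xf32>
  return %tile : vector<64x128xf32>
}
```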

Some side-effect changes:

  • Currently, the block dynamic dimensions pass changes the dimensionality of the generics without projecting the lowering config that was provided higher up in the pipeline. Moreover, the need for that pass is reduced with the changes here, since we can now tile the dynamic dimension directly -- unless I'm missing something.

This builds on the following PRs -- hence keeping it in draft:

Future work:

Original author: @manupak

manupak and others added 11 commits on February 26, 2025:

* Also, keep it disabled by default until lowering config projection is fixed.
* Enable masking in generic vectorization.
* Add two runs of resolve type to fold tensor.dim in rank-reducing type.
* Masked compute.
* Masked cases.
* Only enable masking in vectorization in vector distribute.
* Add code not to run on ops where the lowering config is set.

Signed-off-by: Manupa Karunaratne <[email protected]>
@AmosLewis (Contributor)

Just tested the llama3 benchmark at input sequence lengths 128/2048 after locally rebasing this PR on 3.3.0rc20250310. It improves performance by around 1 ms:
128: 37.2 ms -> 36.4 ms
2048: 174 ms -> 173 ms

@Groverkss (Contributor, Author) commented Mar 10, 2025

> Just tested the llama3 benchmark at input sequence lengths 128/2048 after locally rebasing this PR on 3.3.0rc20250310. It improves performance by around 1 ms: 128: 37.2 ms -> 36.4 ms. 2048: 174 ms -> 173 ms.

This should have no effect on any benchmarks...
