
Misaligned AOTI input; potential perf gains by fixing? #1424

Open
Jack-Khuu opened this issue Dec 14, 2024 · 1 comment
Labels
actionable: Items in the backlog waiting for an appropriate impl/fix
bug: Something isn't working
Compile / AOTI: Issues related to AOT Inductor and torch compile
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@Jack-Khuu
Contributor

🐛 Describe the bug

As noticed in #1367 and worked around via pytorch/pytorch#143236, it appears the input to the torchchat AOTI runner is not 16-byte aligned.

While the pytorch/pytorch PR relaxes this constraint, the underlying misalignment may still be costing performance, as is common with misaligned data.

Hat tip to @malfet for suggesting this line of investigation.

Versions

bb72b09

@Jack-Khuu added the bug, actionable, and Compile / AOTI labels on Dec 14, 2024
@mikekgfb
Contributor

I think most tensors are allocated aligned, but slicing and views may break that in Python/PyTorch. One solution might be to literally clone every input in flat_inputs. As an experiment, logits = model(x_sliced.clone(), ip_sliced.clone()) would likely do the trick, but that would also saddle the cases where AOTI is not used with the cost of the clone().
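
For anyone who wants to poke at this, here is a minimal sketch (tensor names and shapes are illustrative, not torchchat's actual inputs) of how a slice can break 16-byte alignment and how clone() restores it:

```python
import torch

# Minimal sketch: check whether a tensor's storage pointer is 16-byte aligned,
# and show that a slice can break alignment while clone() restores it.
def is_aligned(t: torch.Tensor, alignment: int = 16) -> bool:
    return t.data_ptr() % alignment == 0

x = torch.randn(8, 130)       # fresh allocation: aligned by the allocator
sliced = x[:, 1:]             # view shifted by one fp32 element (4 bytes) into the same storage
print(is_aligned(x), is_aligned(sliced))   # expected: True False

fixed = sliced.clone()        # clone() materializes into newly allocated (aligned) storage
print(is_aligned(fixed))      # expected: True
```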

This might at least give a sense of whether there's a misalignment tax in this instance. (We slice the vector, so we're only off by one element? The cost may not be that horrible, unless we change the code for all dimensions and generate worse code? But then we would not know the misalignment tax either, because just removing the alignment attribute will do the damage regardless of the actual alignment?)

Long term, the right place may be just around the actual control transfer to C/C++ -- you'd want to do this in the Python caller (or in a C/C++ service function that clones all PyTorch tensors passed as arguments). If we push it into the callee, then every caller, even those that already provide aligned arguments, will pay. (Or at least has to evaluate an alignment check and a conditional...)
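
Something along these lines in the Python caller, as a sketch (align_for_aoti is a hypothetical helper, not an existing torchchat or PyTorch API; flat_inputs and aoti_runner are assumed names):

```python
import torch

# Hypothetical helper (illustrative only): copy an input right before the AOTI call,
# and only when its storage is misaligned or non-contiguous, so callers that already
# pass aligned tensors pay nothing beyond the pointer check.
def align_for_aoti(t: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    if t.is_contiguous() and t.data_ptr() % alignment == 0:
        return t
    # contiguous() copies non-contiguous views; clone() covers the contiguous-but-
    # misaligned case. Fresh allocations are typically aligned by the PyTorch allocator.
    return t.contiguous() if not t.is_contiguous() else t.clone()

# Illustrative call site, assuming flat_inputs is the list handed to the AOTI runner:
# logits = aoti_runner(*[align_for_aoti(t) for t in flat_inputs])
```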

Either way, the payoff may be significant for large matrices -- copy is O(n^2), matmul is O(n^3) for an [n,n] * [n,n] GEMM.
(Some BLAS libraries proactively copy their input arguments to get nice alignment, etc.)
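
As a back-of-envelope illustration (my numbers, not measured):

```python
# Back-of-envelope (illustrative, not measured): relative cost of defensively
# copying both operands of a square GEMM vs. the GEMM arithmetic itself.
n = 4096
copy_elems = 2 * n * n            # elements touched when cloning both [n, n] inputs
matmul_flops = 2 * n ** 3         # multiply-add count for C = A @ B
print(copy_elems / matmul_flops)  # 1/n, i.e. ~2.4e-4 at n=4096, so the copy is negligible
```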

This might be one of those code-optimization things that @swolchok and @desertfire might crush with their Inductor expertise. (Also, how would torch.compile with the JIT Inductor handle these cases?)

@Jack-Khuu added the triaged label on Dec 17, 2024