
Misaligned AOTI input; potential perf gains by fixing? #1424

Open
Jack-Khuu opened this issue Dec 14, 2024 · 1 comment
Labels
actionable: Items in the backlog waiting for an appropriate impl/fix
bug: Something isn't working
Compile / AOTI: Issues related to AOT Inductor and torch compile
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@Jack-Khuu
Contributor

🐛 Describe the bug

As noticed in #1367 and worked around via pytorch/pytorch#143236, it appears the input to the torchchat AOTI runner is not 16-byte aligned.

While the pytorch/pytorch PR relaxes this constraint, the underlying misalignment may still be costing performance, as is common with misaligned data.

Hat tip to @malfet for suggesting this line of investigation.

Versions

bb72b09

@Jack-Khuu added the bug, actionable, and Compile / AOTI labels on Dec 14, 2024
@mikekgfb
Contributor

I think most tensors are allocated aligned, but slicing and views may break that in Python/PyTorch. One solution might be to literally clone every input in flat_inputs. As an experiment, logits = model(x_sliced.clone(), ip_sliced.clone()) would likely do the trick, but that would also saddle the cases where AOTI is not used with the cost of the clone().
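
For anyone who wants to poke at this, here is a minimal sketch (tensor names and shapes are illustrative, not torchchat's actual inputs) of how a slice can break 16-byte alignment and how clone() restores it:

```python
import torch

# Minimal sketch: check whether a tensor's storage pointer is 16-byte aligned,
# and show that a slice can break alignment while clone() restores it.
def is_aligned(t: torch.Tensor, alignment: int = 16) -> bool:
    return t.data_ptr() % alignment == 0

x = torch.randn(8, 130)       # fresh allocation: aligned by the allocator
sliced = x[:, 1:]             # view shifted by one fp32 element (4 bytes) into the same storage
print(is_aligned(x), is_aligned(sliced))   # expected: True False

fixed = sliced.clone()        # clone() materializes into newly allocated (aligned) storage
print(is_aligned(fixed))      # expected: True
```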

This might at least give a sense of whether there's a misalignment tax in this instance. (We slice the vector, so we're only off by one element? The cost may not be that horrible, unless we change the code for all dimensions and generate worse code? But then we would not know the misalignment tax either, because just removing the alignment attribute will do the damage regardless of the actual alignment?)

Long term, the right place may be just around the actual control transfer to C/C++ -- you'd want to do this in the Python caller (or in a C/C++ service function that clones all PyTorch tensors passed as arguments). If we push it into the callee, then every caller, even those that already provide aligned arguments, will pay. (Or at least has to evaluate an alignment check and a conditional...)
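
Something along these lines in the Python caller, as a sketch (align_for_aoti is a hypothetical helper, not an existing torchchat or PyTorch API; flat_inputs and aoti_runner are assumed names):

```python
import torch

# Hypothetical helper (illustrative only): copy an input right before the AOTI call,
# and only when its storage is misaligned or non-contiguous, so callers that already
# pass aligned tensors pay nothing beyond the pointer check.
def align_for_aoti(t: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    if t.is_contiguous() and t.data_ptr() % alignment == 0:
        return t
    # contiguous() copies non-contiguous views; clone() covers the contiguous-but-
    # misaligned case. Fresh allocations are typically aligned by the PyTorch allocator.
    return t.contiguous() if not t.is_contiguous() else t.clone()

# Illustrative call site, assuming flat_inputs is the list handed to the AOTI runner:
# logits = aoti_runner(*[align_for_aoti(t) for t in flat_inputs])
```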

Either way, the payoff may be significant for large matrices -- copy is O(n^2), matmul is O(n^3) for an [n,n] * [n,n] GEMM.
(Some BLAS libraries proactively copy their input arguments to get nice alignment, etc.)
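
As a back-of-envelope illustration (my numbers, not measured):

```python
# Back-of-envelope (illustrative, not measured): relative cost of defensively
# copying both operands of a square GEMM vs. the GEMM arithmetic itself.
n = 4096
copy_elems = 2 * n * n            # elements touched when cloning both [n, n] inputs
matmul_flops = 2 * n ** 3         # multiply-add count for C = A @ B
print(copy_elems / matmul_flops)  # 1/n, i.e. ~2.4e-4 at n=4096, so the copy is negligible
```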

This might be one of those code-optimization things that @swolchok and @desertfire might crush with their Inductor expertise. (Also, how would torch.compile with the JIT Inductor handle these cases?)

@Jack-Khuu added the triaged label on Dec 17, 2024