BarrierOp in AssertOp lowering can cause deadlock #5632
Comments
Is this a manually constructed IR? I don't think we support tensors in reduce regions.
It was obtained through Inductor. I have attached the extracted kernel to this comment. On running with:
Okay, that's strange because the code that added a splat before calling
Are you referring to 92a4fad? That's the change where the splat before the assert was removed. However, in this case, the splat before the AssertOp is the result of the ReorderBroadcast pass. Before:
After:
I think it's due to:
Oh I see, the tensor is coming from here: The good news is the assert cannot fail, and it gets optimized out by LLVM. We should probably have a verifier that there are no tensor ops inside reduce or scan regions, but it's not really a priority atm. cc @davidberard98 for the overflow sanitizer running at all, should have been fixed by pytorch/pytorch#139502
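The verifier suggested above could walk ops nested under reduce/scan regions and reject any op producing a tensor. Here is a minimal sketch of that check in Python; the toy `Op` class, the op names (`tt.reduce`, `tt.scan`, `tt.splat`), and the function name are illustrative stand-ins, not Triton's actual MLIR API.

```python
# Hypothetical sketch: flag tensor-producing ops nested inside
# reduce/scan regions. Toy IR, not Triton's real op classes.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    result_type: str = "scalar"                   # "scalar" or "tensor"
    regions: list = field(default_factory=list)   # nested ops

def verify_no_tensor_ops_in_regions(op, inside_reduce=False, errors=None):
    if errors is None:
        errors = []
    if inside_reduce and op.result_type == "tensor":
        errors.append(f"tensor op '{op.name}' inside reduce/scan region")
    # Once we enter a reduce/scan region, everything below is checked.
    nested = inside_reduce or op.name in ("tt.reduce", "tt.scan")
    for child in op.regions:
        verify_no_tensor_ops_in_regions(child, nested, errors)
    return errors

# Example: a splat producing a tensor inside a reduce region is flagged,
# matching the pattern in the attached TTIR.
module = Op("module", regions=[
    Op("tt.reduce", regions=[
        Op("tt.splat", result_type="tensor"),
        Op("tt.assert"),
    ]),
])
errors = verify_no_tensor_ops_in_regions(module)
print(errors)
```

In a real implementation this would live in the op verifier so malformed IR is rejected before lowering ever emits the barrier.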
Describe the bug
Currently, in the lowering of AssertOp, a BarrierOp is emitted at the end.
However, this is malformed if the AssertOp is part of a basic block that is executed by only some threads, since it is incorrect to place a BarrierOp/__syncthreads in divergent control flow.
As an example, I have attached a TTIR file with an AssertOp under the region of a ReduceOp.
The ReduceOp lowering predicates this basic block on a thread varying condition:
Value threadIsNeeded = icmp_slt(threadId, i32_val(elems));
This can cause a deadlock, since not all threads may be present at the site of the BarrierOp.
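The failure mode can be sketched with CPU threads as an analogy: `threading.Barrier` below plays the role of `__syncthreads`, and the `tid < elems` branch mirrors the thread-varying predicate from the ReduceOp lowering. On a GPU the threads that reach the barrier would wait forever; here a timeout is used so the script terminates and reports the deadlock instead. This is an illustrative model, not Triton code.

```python
import threading

NUM_THREADS = 4   # stand-in for threads in a block
elems = 2         # only threads with tid < elems take the assert path

# Barrier expecting ALL threads, mirroring __syncthreads() semantics.
barrier = threading.Barrier(NUM_THREADS)
results = {}

def worker(tid):
    if tid < elems:
        # Divergent branch: only some threads reach the barrier.
        try:
            # On a GPU this wait never completes; the timeout here
            # only exists so the demo terminates.
            barrier.wait(timeout=0.5)
            results[tid] = "passed"
        except threading.BrokenBarrierError:
            results[tid] = "deadlock"
    else:
        results[tid] = "skipped"

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Threads 2 and 3 never arrive at the barrier, so threads 0 and 1 can never be released: exactly the hang described above.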
specimen_2.ttir.txt
Environment details
Triton: pytorch-triton package
Version: 3.2.0+git0d4682f0
GPU: A100