Add TMA support for circular buffering pass #2833
Awesome, detailed PR description. Thank you.
* Add support for Hopper::electSync
* Create ElectSync PredicateType
* Make mbarrier synchronous
* mbarrier waits for all threads in CTA
* All threads issue arriveExpectTx to get mbarrier_token
Just some minor comments from a first pass. I haven't looked at tests yet.
// mbarrier_inval(mbarrier[loop_index]);
// }
// }
//
Nice improvement to the comment at line 34.
csrc/device_lower/pass/predicate.cpp
Outdated
@@ -209,6 +209,27 @@ class ConditionalFromPredicateModifier : public kir::ExprMutator {
// here.
return IrBuilder::create<Val>(true, DataType::Bool);
}
case PredicateType::ElectSync: {
Do we need a separate PredicateType::ElectSync predicate type? Should we just keep whatever predicate type it originally has, and if the conditional happens to be tidx == 0 && tidy == 0 && tidz == 0, convert it to the elec_sync() conditional? What do you think? @naoyam
FYI, I separated the ElectSync PredicateType changes into #2923.
!build
Summary
This PR adds support for TMA circular buffering. It is stacked on #2824 and #2825.
Tracking branch: #2773
Description
arrive_expected_tx for next stage and mbarrier_wait for current stage.
mbarrier_wait for remaining stages in the pipeline.
Lowering Details
Description of changes in lowering passes.
Prologue, Main, and Epilogue loops are created by TmaCircularBufferLoopCloner, which is a child class of CircularBufferLoopCloner.
PrePrologue and PostEpilogue loops are created by createCpAsyncBulkFixtures.
cuTensorMapEncodeTiled restricts the size of each box dimension to be <= 256. You need to launch multiple load operations to load larger tiles.
mbarriers for each stage, so the expected_transaction bytes is multiplied by the number of TMA loads per stage.
mbarrier_wait for the stage.
Loop Structure
Description of for-loop structure for circular buffering.
Overview Circular Buffer Structure:
Pre-prologue loop:
Prologue loop:
Main loop:
Epilogue loop:
Post-epilogue loop:
Detailed Pseudo-Code:
Pre-Prologue loop:
Prologue loop:
Main loop:
Epilogue loop:
Post-Epilogue loop:
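To make the five-loop structure above concrete, here is a minimal host-side C++ sketch that only records the order of operations. It is not the PR's pseudo-code or the generated kernel. The names `Event` and `simulateTmaCircularBuffer`, the `stages - 1` prefetch distance, and the `tile % stages` slot mapping are assumptions made for illustration. What follows the description is that the main loop issues arrive_expect_tx for a later stage and waits on the current one, the epilogue only waits on the remaining stages, and the expected transaction bytes scale with the number of TMA loads per stage.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event record used only by this sketch (not an nvFuser type).
struct Event {
  std::string loop;  // which of the five loops emitted the event
  std::string op;    // init, arrive_expect_tx, tma_load, wait, compute, inval
  int64_t stage;     // circular-buffer slot, also used as the mbarrier index
  int64_t tile;      // tile index, or -1 when not applicable
  int64_t bytes;     // bytes associated with the event, or 0
};

// Records the operation order for `num_tiles` tiles and `num_stages` slots.
// Assumes a prefetch distance of num_stages - 1 and one mbarrier per stage.
std::vector<Event> simulateTmaCircularBuffer(
    int64_t num_tiles,
    int64_t num_stages,
    int64_t loads_per_stage,
    int64_t bytes_per_load) {
  assert(num_stages >= 1 && num_tiles >= num_stages - 1);
  assert(loads_per_stage >= 1 && bytes_per_load >= 0);

  std::vector<Event> log;
  const int64_t prefetch = num_stages - 1;
  // One mbarrier per stage covers every load of that stage, so the expected
  // byte count is the per-load size times the number of loads per stage.
  const int64_t expected_bytes = loads_per_stage * bytes_per_load;

  auto issueLoads = [&](const std::string& loop, int64_t tile) {
    const int64_t stage = tile % num_stages;
    log.push_back({loop, "arrive_expect_tx", stage, tile, expected_bytes});
    for (int64_t l = 0; l < loads_per_stage; ++l) {
      log.push_back({loop, "tma_load", stage, tile, bytes_per_load});
    }
  };
  auto waitAndCompute = [&](const std::string& loop, int64_t tile) {
    const int64_t stage = tile % num_stages;
    log.push_back({loop, "wait", stage, tile, 0});
    log.push_back({loop, "compute", stage, tile, 0});
  };

  // Pre-prologue: initialize one mbarrier per stage.
  for (int64_t s = 0; s < num_stages; ++s) {
    log.push_back({"pre-prologue", "init", s, -1, 0});
  }
  // Prologue: loads only, filling the first num_stages - 1 slots.
  for (int64_t tile = 0; tile < prefetch; ++tile) {
    issueLoads("prologue", tile);
  }
  // Main: issue the load for a later tile, then wait on and use the current one.
  for (int64_t tile = 0; tile < num_tiles - prefetch; ++tile) {
    issueLoads("main", tile + prefetch);
    waitAndCompute("main", tile);
  }
  // Epilogue: no more loads, only drain the stages still in flight.
  for (int64_t tile = num_tiles - prefetch; tile < num_tiles; ++tile) {
    waitAndCompute("epilogue", tile);
  }
  // Post-epilogue: invalidate every mbarrier.
  for (int64_t s = 0; s < num_stages; ++s) {
    log.push_back({"post-epilogue", "inval", s, -1, 0});
  }
  return log;
}
```

Calling `simulateTmaCircularBuffer(4, 3, 2, 1024)`, for example, models four tiles over three stages with two TMA loads of 1024 bytes each per stage, so every arrive_expect_tx event carries 2048 expected bytes. A real kernel also needs synchronization before a slot is overwritten, which this sketch leaves out.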
Testing Setup