Add TMA support for circular buffering pass #2833

rdspring1 · 2024-08-22T00:35:50Z

Summary

This PR adds support for TMA circular buffering. It is stacked on #2824 and #2825.
Tracking branch: #2773

Description

The circular buffer pass clones operations to create the pre-prologue, prologue, main, epilogue, and post-epilogue for-loops.
Pre-Prologue allocates shared memory and initializes mbarriers.
Prologue copies only the load operations.
Main loop copies the load and computation operations and adds arrive_expected_tx for the next stage and mbarrier_wait for the current stage.
Epilogue copies only the computation operations and adds mbarrier_wait for the remaining stages in the pipeline.
Post-Epilogue invalidates mbarriers.

Lowering Details

Description of changes in lowering passes.

Prologue, Main, and Epilogue loops are created by TmaCircularBufferLoopCloner which is a child class of CircularBufferLoopCloner.
PrePrologue and PostEpilogue loops are created by createCpAsyncBulkFixtures.
cuTensorMapEncodeTiled restricts the size of each box dimension to be <= 256, so loading a larger tile requires launching multiple load operations.
We allocate only one mbarrier per stage, so the expected transaction byte count is multiplied by the number of TMA loads per stage.
The for-loop cloner must account for the nested for-loop structure used to launch multiple TMA loads before adding the mbarrier_wait for the stage.
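
As a rough illustration of the arithmetic above, the following host-side sketch counts how many TMA loads one stage needs when every box dimension is capped at 256, and scales the expected transaction bytes accordingly. The helper names (`ceilDiv`, `numTmaLoadsPerStage`, `expectedTransactionBytes`) are hypothetical and are not part of nvFuser. The sketch also assumes that every load moves one full box of identical shape.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Box-dimension limit stated in the PR description for cuTensorMapEncodeTiled.
constexpr int64_t kMaxBoxDim = 256;

// Hypothetical helper: integer division rounding up.
int64_t ceilDiv(int64_t a, int64_t b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Hypothetical helper: number of TMA loads needed to cover one tile when each
// tile dimension is split into boxes of at most kMaxBoxDim elements.
int64_t numTmaLoadsPerStage(const std::vector<int64_t>& tile_shape) {
  int64_t loads = 1;
  for (int64_t extent : tile_shape) {
    loads *= ceilDiv(extent, kMaxBoxDim);
  }
  return loads;
}

// Hypothetical helper: bytes the single per-stage mbarrier must expect, i.e.
// the bytes moved by one load multiplied by the number of loads per stage.
// Assumes every load transfers a full box of shape box_shape.
int64_t expectedTransactionBytes(
    const std::vector<int64_t>& box_shape,
    int64_t loads_per_stage,
    int64_t dtype_size) {
  int64_t box_elems = 1;
  for (int64_t extent : box_shape) {
    box_elems *= extent;
  }
  return box_elems * dtype_size * loads_per_stage;
}
```

For example, a one-dimensional tile of 512 floats would need two loads of 256 elements each, so the stage's mbarrier would expect 2 * 256 * 4 bytes.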

Loop Structure

Description of for-loop structure for circular buffering.

Overview of Circular Buffer Structure:

Pre-prologue loop:

Allocate shared memory for mbarriers and mbarrier tokens
Initialize mbarrier for all stages

Prologue loop:

if selected_thread:
- Issue cp async bulks for all but last stage

Main loop:

if selected_thread:
- Issue next cp async bulk for available stage
All threads wait until the TMA operation arrives
Copy body without
- shared memory allocations
- mbarrier_init exprs
- mbarrier_inval exprs

Epilogue loop:

All threads wait until the TMA operation arrives
Copy body without
- shared memory allocations
- issuing cp async bulk operations
- mbarrier_init exprs
- mbarrier_inval exprs

Post-epilogue loop:

if selected_thread:
- Invalidate mbarriers for all stages

Detailed Pseudo-Code:

constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;

Pre-Prologue loop:

__shared__ __mbarrier_t mbarrier[stages];
__shared__ __mbarrier_token_t token[stages];
if (first_warp && hopper::electSync()) {
  for (int64_t loop_index : irange(stages)) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}

Prologue loop:

for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    token[loop_index] =
      mbarrier::arriveExpectTx(mbarrier[loop_index], expected_transaction_size);
    cpAsyncBulk(mbarrier[loop_index], ...);
  } else {
    token[loop_index] = mbarrier::arrive(mbarrier[loop_index]);
  }
}
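
The split between arriveExpectTx on the elected thread and a plain arrive on every other thread can be pictured with a simplified host-side model of a single mbarrier. This is only a sketch under stated assumptions: `MBarrierModel` and its methods are hypothetical names, not a CUDA or nvFuser API, and the model ignores memory ordering and hardware details. It assumes a phase completes once all expected arrivals have happened and all expected bytes have landed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of one mbarrier (illustration only).
struct MBarrierModel {
  int64_t expected_arrivals = 0;  // arrival count given to init
  int64_t pending_arrivals = 0;   // arrivals still outstanding in this phase
  int64_t pending_tx_bytes = 0;   // bytes still expected from async copies
  int64_t phase = 0;              // number of completed phases

  void init(int64_t arrival_count) {
    assert(arrival_count > 0);
    expected_arrivals = arrival_count;
    pending_arrivals = arrival_count;
  }

  // Elected thread: register the bytes the TMA loads will deliver, then arrive.
  int64_t arriveExpectTx(int64_t bytes) {
    pending_tx_bytes += bytes;
    return arrive();
  }

  // Any other thread: arrive and get a token naming the phase to wait on.
  int64_t arrive() {
    int64_t token = phase;
    --pending_arrivals;
    tryComplete();
    return token;
  }

  // Models an asynchronous copy reporting that its bytes have landed.
  void completeTx(int64_t bytes) {
    pending_tx_bytes -= bytes;
    tryComplete();
  }

  // A wait on `token` would return once this is true.
  bool isComplete(int64_t token) const { return phase > token; }

 private:
  // Advance the phase when no arrivals and no bytes remain outstanding.
  void tryComplete() {
    if (pending_arrivals == 0 && pending_tx_bytes == 0) {
      ++phase;
      pending_arrivals = expected_arrivals;
    }
  }
};
```

In this model, a wait on a token does not succeed while bytes are still in flight, even after every thread has arrived, which is why the expected byte count must cover all TMA loads issued for the stage.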

Main loop:

for (int64_t loop_index : irange(N-(stages-1))) {
  current_stage = loop_index % stages;
  load_stage = (loop_index + (stages - 1)) % stages;
  if (first_warp && hopper::electSync()) {
    token[load_stage] =
      mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarrier[load_stage], ...);
  } else {
    token[load_stage] = mbarrier::arrive(mbarrier[load_stage]);
  }
  mbarrier::wait(token[current_stage]);

  // Clone remaining operations
}
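
The stage arithmetic in the main loop can be checked in isolation. The sketch below uses hypothetical names (`StageIndices`, `mainLoopStages`) and simply evaluates the two modulo expressions from the pseudo-code, plus the index of the tile whose load is issued in that iteration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the indices used in one main-loop iteration.
struct StageIndices {
  int64_t current_stage;  // slot consumed in this iteration
  int64_t load_stage;     // slot refilled in this iteration
  int64_t load_tile;      // iteration whose data is being loaded
};

// Evaluates the main-loop index expressions for a given iteration.
StageIndices mainLoopStages(int64_t loop_index, int64_t stages) {
  assert(stages > 1 && loop_index >= 0);
  return {
      loop_index % stages,
      (loop_index + (stages - 1)) % stages,
      loop_index + (stages - 1)};
}
```

With three stages, iteration 0 consumes slot 0 and loads into slot 2, the one slot the prologue left unfilled. From iteration 1 onward, load_stage equals the slot consumed in the previous iteration, so each load reuses the buffer that was just freed.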

Epilogue loop:

for (int64_t loop_index : irange(N-(stages-1), N)) {
  current_stage = loop_index % stages;
  mbarrier::wait(token[current_stage]);

  // Clone remaining operations
}

Post-Epilogue loop:

if (first_warp && hopper::electSync()) {
  for (int64_t loop_index : irange(stages)) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}
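
To sanity-check how the prologue, main, and epilogue loops fit together, the schedule can be simulated on the host. The sketch below is a hypothetical model, not nvFuser code: `simulateCircularBuffer` tracks which tile occupies each stage slot, treats a load as instantaneous, and asserts that a slot is free before it is refilled and holds the expected tile when it is consumed. It assumes N >= stages - 1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the loop structure. Returns the tile
// indices in the order they are consumed.
std::vector<int64_t> simulateCircularBuffer(int64_t num_tiles, int64_t stages) {
  assert(stages >= 2 && num_tiles >= stages - 1);
  std::vector<int64_t> slot(stages, -1);  // tile held by each slot, -1 = empty
  std::vector<int64_t> consumed;

  auto load = [&](int64_t tile) {
    int64_t s = tile % stages;
    assert(slot[s] == -1);  // slot must be free before it is refilled
    slot[s] = tile;
  };
  auto consume = [&](int64_t tile) {
    int64_t s = tile % stages;
    assert(slot[s] == tile);  // data for this iteration must be resident
    consumed.push_back(tile);
    slot[s] = -1;
  };

  // Prologue: issue loads for all but the last stage.
  for (int64_t i = 0; i < stages - 1; ++i) {
    load(i);
  }
  // Main loop: issue the next load, then consume the current stage.
  for (int64_t i = 0; i < num_tiles - (stages - 1); ++i) {
    load(i + (stages - 1));
    consume(i);
  }
  // Epilogue: drain the remaining stages without issuing loads.
  for (int64_t i = num_tiles - (stages - 1); i < num_tiles; ++i) {
    consume(i);
  }
  return consumed;
}
```

Under these assumptions, every tile from 0 to N-1 is loaded once and consumed once, in order, for any stage count of two or more.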

Testing Setup

2 to 4 pipeline stages.
(128, 500, 1024) outer dimension.
(128, 1024) inner dimension.

Single Dim including Unroll and Unswitch parallelizations.
Multiple Dim
Pointwise
Reduction
InnerPersistent
Matmul
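
The parameter ranges listed above can be enumerated as follows. This is a sketch only: it assumes the stage counts and the outer and inner dimensions are combined as a full Cartesian product, which the description does not state, and the names `TmaCircularBufferCase` and `enumerateCases` are hypothetical.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical description of one test configuration.
struct TmaCircularBufferCase {
  int64_t stages;
  int64_t outer;
  int64_t inner;
};

// Enumerates every combination of the listed values (assumed full product).
std::vector<TmaCircularBufferCase> enumerateCases() {
  std::vector<TmaCircularBufferCase> cases;
  for (int64_t stages = 2; stages <= 4; ++stages) {
    for (int64_t outer : {128, 500, 1024}) {
      for (int64_t inner : {128, 1024}) {
        cases.push_back({stages, outer, inner});
      }
    }
  }
  return cases;
}
```

Under that assumption, each test category would run 3 * 3 * 2 = 18 configurations.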

csarofeen · 2024-09-08T16:19:13Z

Awesome, detailed PR description. Thank you.

jacobhinkle

Just some minor comments from a first pass. I haven't looked at tests yet.

csrc/device_lower/pass/allocation.cpp

csrc/device_lower/pass/circular_buffer.cpp

jacobhinkle · 2024-09-09T19:22:51Z

csrc/device_lower/pass/circular_buffer.cpp

+//     mbarrier_inval(mbarrier[loop_index]);
+//   }
+// }
+//


Nice improvement to the comment at line 34.

csrc/device_lower/pass/circular_buffer.cpp

csrc/executor.cpp

csrc/device_lower/pass/circular_buffer.cpp

zasdfgbnm · 2024-09-10T04:48:16Z

csrc/device_lower/pass/predicate.cpp

@@ -209,6 +209,27 @@ class ConditionalFromPredicateModifier : public kir::ExprMutator {
        // here.
        return IrBuilder::create<Val>(true, DataType::Bool);
      }
+      case PredicateType::ElectSync: {


Do we need a separate PredicateType::ElectSync predicate type? Should we just keep whatever predicate type it originally has and, if the conditional happens to be tidx == 0 && tidy == 0 && tidz == 0, convert it to the elect_sync() conditional? What do you think? @naoyam

FYI, I separated the ElectSync PredicateType changes into #2923.

rdspring1 · 2024-09-20T03:56:02Z

!build

rdspring1 added 2 commits August 21, 2024 18:25

Add allocation changes

12db3ee

Add Indexing changes

2491171

rdspring1 force-pushed the tma_cb_index branch from e223b8b to 2491171 Compare August 22, 2024 01:26

rdspring1 changed the title ~~Add TMA support for circular buffering pass and testing~~ Add TMA support for circular buffering pass Aug 22, 2024

rdspring1 requested review from zasdfgbnm, jacobhinkle and naoyam August 22, 2024 16:29

Add circular buffering pass and testing

6d8ad5f

rdspring1 force-pushed the tma_cb branch from 26aceb9 to 6d8ad5f Compare August 23, 2024 19:18

rdspring1 mentioned this pull request Aug 23, 2024

Indexing changes for TMA Circular Buffering #2825

Merged

rdspring1 mentioned this pull request Sep 3, 2024

Allocation changes for TMA Circular Buffering #2824

Merged

Base automatically changed from tma_cb_index to main September 5, 2024 02:22

rdspring1 added 5 commits September 8, 2024 15:35

Merge branch 'main' of https://github.com/nvidia/fuser into tma_cb

25c482d

predicate and mbarrier changes

2f8d9e9

* Add support for Hopper::electSync
* Create ElectSync PredicateType
* Make mbarrier synchronous
* mbarrier waits for all threads in CTA
* All threads issue arriveExpectTx to get mbarrier_token

add mbarrier_wait immediately

2a06157

skip expressions_allocated_in_main_loop

0c8858f

Ensure a full warp exists if there is elect sync predicate

f8123af

jacobhinkle reviewed Sep 9, 2024

rdspring1 added 2 commits September 9, 2024 13:17

comments

508d674

Merge branch 'main' of https://github.com/nvidia/fuser into tma_cb

d4c7938

zasdfgbnm reviewed Sep 10, 2024

zasdfgbnm reviewed Sep 10, 2024

naoyam mentioned this pull request Sep 11, 2024

Create ElectSync predicate type #2923

Merged

rdspring1 added 6 commits September 16, 2024 13:26

Add compatibility check for elect sync

ccfedfc

add test for elect sync compatibility

f29aa22

Use MBarrierArrive

f685a9b

comments

b84eb96

Add has_elect_sync_predicate to kernel_summary

c1fdec5

Merge branch 'main' of https://github.com/nvidia/fuser into tma_cb

4f011b5

rdspring1 force-pushed the tma_cb branch from 0c239af to 4f011b5 Compare September 20, 2024 03:55
