
only do grid split when needed #2965

Merged: 9 commits merged into main on Oct 7, 2024

Conversation

liqiangxl (Collaborator)

Issue: In the inner-outer persistent scheduler, the last step is an outer reduction whose inner dim is parallelized by vectorization, bdimx, and gdimy. The current main branch always does three splits (by vectorization, bdimx, and gdimy); however, the last split is not needed if vectorization * bdimx * gdimy >= inner dim. For example:

T0 
logical domain : (iS264{gridDim.y}, iS265{i1})
 contiguity: t t
  Split: iS265{i1} by factor 4
  Split: iS997{( ceilDiv(i1, 4) )} by factor blockDim.x 
  Split: iS999{( ceilDiv(( ceilDiv(i1, 4) ), blockDim.x) )} by factor gridDim.y

The last split is redundant if 4 * blockDim.x * gridDim.y >= i1.
Fix:
Only split when vectorization * bdimx * gdimy < inner dim (see the sketch after this description).
Influence:
Removing this extra split saves one loop in the generated code.
Performance increases in some cases and decreases in others; all changes are within 10%. See the dashboard.
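
A minimal standalone sketch of the check (not the nvfuser scheduler code; the function name and the bdimx/gdimy values are illustrative):

#include <cstdint>
#include <iostream>

// Return true only when the already-parallelized extent
// (vectorization * bdimx * gdimy) cannot cover the inner dim,
// i.e. when the extra gridDim.y split is actually needed.
bool needsGridSplit(int64_t inner_dim, int64_t vect, int64_t bdimx, int64_t gdimy) {
  return vect * bdimx * gdimy < inner_dim;
}

int main() {
  // Vectorization factor 4 as in the example above; bdimx and gdimy
  // are made-up launch parameters used only for illustration.
  std::cout << needsGridSplit(/*inner_dim=*/23040, /*vect=*/4,
                              /*bdimx=*/256, /*gdimy=*/80)
            << "\n"; // prints 0: 4 * 256 * 80 = 81920 >= 23040, so no split is needed
  return 0;
}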

@liqiangxl (Collaborator Author)

!build --diff --pybench

@liqiangxl (Collaborator Author)

Hi @jjsjann123, can you help take a look at this CI failure? It didn't find any test error, but it exited with:

Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1

liqiangxl marked this pull request as ready for review on September 19, 2024 16:24
@jjsjann123 (Collaborator)

Hi @jjsjann123, can you help take a look at this CI failure? It didn't find any test error, but it exited with:

Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1

Since it mentioned a segfault and it's the Hopper matmul tests... I'm guessing it's fixed by this PR: #2963

@jjsjann123 (Collaborator)

!build --diff --pybench

@liqiangxl (Collaborator Author)

!build

csrc/scheduler/reduction_heuristic.h
csrc/scheduler/normalization_inner_outer.cpp
if (rparams.combined_split_grid_inner_dim) {
  outer_reduction_tv->split(
      axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
}
Collaborator:

ditto

Collaborator Author:

same as previous

Collaborator:

But this one is the same as line 837-841.

(Note, I'm only nitpicking, we don't have to change it. I wanted to point it out in case there's some logic issue here.)

Collaborator Author:

Thanks for pointing it out. Yes, they are the same and the logic is fine. I repeated them in the if-else branches simply because I want to keep the code blocks in both the if and else branches as complete schedule processes.
Current approach:

if(multiple reductions per block){
    schedule approach-1
} else {
    schedule approach-2
}

There are some common schedule steps (e.g. the BIDy split) in approach-1 and approach-2. However, it is not preferred to split them out and use the following structure (a concrete sketch of the preferred layout follows both blocks):
Other option:

if(multiple reductions per block){
    part of schedule approach-1
} else {
    part of schedule approach-2
}
common schedule processes shared by approach-1 & approach-2
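
A concrete sketch of the preferred (current) structure, assembled from the snippets quoted in this thread; the branch condition and the elided branch-specific steps are illustrative, not the actual scheduler code:

if (rparams.multiple_reds_per_blk) {
  // approach-1: a complete schedule for this branch,
  // ending with the (now conditional) BIDy split
  // ... approach-1 specific steps ...
  if (rparams.combined_split_grid_inner_dim) {
    outer_reduction_tv->split(
        axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
  }
} else {
  // approach-2: repeats the same conditional BIDy split so this branch
  // also reads as a complete schedule process
  // ... approach-2 specific steps ...
  if (rparams.combined_split_grid_inner_dim) {
    outer_reduction_tv->split(
        axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
  }
}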


outer_reduction_tv->split(
    axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
if (rparams.combined_split_grid_inner_dim) {
Collaborator:

Naive question: even though we are skipping the split, I thought we would still need to specify the current IterDomain with ParallelType::BIDy?

Collaborator Author:

Yes, we always specify it as ParallelType::BIDy using outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy);

Collaborator Author:

outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy); is outside the if statement.
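
Putting the two quoted snippets together, the scheduling step reads roughly as follows (a paraphrase of the code under review, not a verbatim excerpt):

// Split by gridDim.y only when vectorization * bdimx * gdimy cannot
// cover the inner dim (combined_split_grid_inner_dim).
if (rparams.combined_split_grid_inner_dim) {
  outer_reduction_tv->split(
      axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
}
// The parallelization itself is unconditional: the current IterDomain
// is always bound to BIDy, whether or not it was just split.
outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy);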

@liqiangxl (Collaborator Author)

!build --diff

@liqiangxl (Collaborator Author)

!build --diff

@jjsjann123 (Collaborator) left a comment:

LGTM, a minor question/comment.


@jjsjann123 (Collaborator)

Removing this extra split saves one loop in the generated code.
Performance increases in some cases and decreases in others; all changes are within 10%.

what's with the perf regression here? Are those just small kernels with fluctuation?
If we are just removing one trivial loop, what's the reason for a potential negative perf impact?

@liqiangxl (Collaborator Author)

Removing this extra split saves one loop in the generated code.
Performance increases in some cases and decreases in others; all changes are within 10%.

what's with the perf regression here? Are those just small kernels with fluctuation? If we are just removing one trivial loop, what's the reason for a potential negative perf impact?

Some are from large cases and are repeatable, e.g. 16384 x 23040 dropped from 47% SOL to 42% SOL. Not sure why; we did save one loop in both the CUDA and PTX code.
The PTX info is also the same:

ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 255 registers, used 1 barriers, 16 bytes smem

We can leave this PR open and recheck after warp reduction & heuristics.

@jjsjann123 (Collaborator) left a comment:

Hmmm, that perf regression is indeed strange. I think it's interesting to look at, but I'm not sure what priority we should give this. (Yet another weird compiler behavior?)

Since you did verify the behavior in the generated kernel, I'm stamping it to unblock you.

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build

liqiangxl merged commit 2b9e9d6 into main on Oct 7, 2024
10 of 11 checks passed
liqiangxl deleted the llu/ln_bwd_outer_remove_extra_split branch on October 7, 2024 13:50