warp reduction in x and y dims #2966

Merged: 8 commits merged from llu/warp_redu_xy into main on Oct 7, 2024

Conversation

@liqiangxl (Collaborator) commented Sep 19, 2024

Issue
The first part of the innerOuter persistent kernel is an inner reduction; its inner dim is parallelized by vectorization, bdimx, bdimy, and the persistent batch. Both bdimx and bdimy are used because we want to re-use bdimy to parallelize the outer dim and re-use bdimx to parallelize the inner dim in the second part of the kernel. However, this keeps us from using warp reduction, since the current runtime function only supports reduction in bdimx.
Fix
(1) Add another warp reduction runtime function that supports reduction in both the x and y dims (see the sketch after this list).
(2) bdimx and bdimy are explicitly set to static and passed to the warp reduction as template parameters.
(3) Since we are launching a cooperative kernel, gdimy is also static.
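
For illustration, here is a minimal sketch of a warp reduction that folds threadIdx.x and threadIdx.y into a single reduction domain, with the static block dimensions passed as template parameters. This is only a simplified sketch under stated assumptions, not the actual runtime function added by this PR: the real warpReduceTIDXY also takes a bool Aligned template parameter and uses nvFuser's block synchronization helpers, which are omitted here, and the helper name, the reduction_op signature void(T&, T), and the single-value interface are assumptions made for the example.

template <int BDIMX, int BDIMY, typename T, typename Func>
__device__ void warpReduceTIDXY_sketch(
    T& out,
    const T& inp,
    Func reduction_op,
    T* shared_mem) {
  constexpr int kWarpSize = 32;
  constexpr int kNumThreads = BDIMX * BDIMY;
  constexpr int kNumWarps = kNumThreads / kWarpSize;
  static_assert(
      kNumThreads % kWarpSize == 0, "BDIMX * BDIMY must be a multiple of 32");

  // Fold the 2D thread index into one linear index so that x and y form a
  // single reduction domain.
  const int tid = threadIdx.y * BDIMX + threadIdx.x;
  const int lane = tid % kWarpSize;
  const int warp_id = tid / kWarpSize;

  // Intra-warp tree reduction via register shuffles. Assumes T is a plain
  // arithmetic type supported by __shfl_down_sync and reduction_op has the
  // form void(T&, T).
  T val = inp;
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    T other = __shfl_down_sync(0xffffffff, val, offset);
    reduction_op(val, other);
  }

  // Combine the per-warp partial results through shared memory. Letting a
  // single thread accumulate them avoids needing an identity element.
  if (lane == 0) {
    shared_mem[warp_id] = val;
  }
  __syncthreads();
  if (tid == 0) {
    T result = shared_mem[0];
    for (int w = 1; w < kNumWarps; ++w) {
      reduction_op(result, shared_mem[w]);
    }
    out = result;
  }
  __syncthreads();
}

With the configuration from the Influence section below (bdimx=64, bdimy=4), this would be instantiated roughly as warpReduceTIDXY_sketch<64, 4>(out, in, add_op, smem) over 256 threads forming 8 warps, where add_op is something like a functor whose operator() accumulates b into a.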
Influence
(1) For a case with vectorization=8, bdimx=64, bdimy=4, and persistent=13, the inner reduction dim is scheduled as:

rS195{13}, rUS197{1}, rthreadIdx.z198{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(i1, 8) ), 64) ), 4) ), 13) ), 1) )}, rthreadIdx.y194{4}, rthreadIdx.x192{64}, rV190{8}

The heuristic ensures bdimz = inner dim size / vectorization / bdimx / bdimy / persistent == 1, which gives us a static warp reduction across bdimx and bdimy.
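
As a concrete (hypothetical) example of that arithmetic: if the inner dim were i1 = 26624, the bdimz extent above evaluates to ceilDiv(ceilDiv(ceilDiv(ceilDiv(ceilDiv(26624, 8), 64), 4), 13), 1) = ceilDiv(ceilDiv(ceilDiv(ceilDiv(3328, 64), 4), 13), 1) = ceilDiv(ceilDiv(ceilDiv(52, 4), 13), 1) = ceilDiv(ceilDiv(13, 13), 1) = 1, so the entire inner dim is covered by vectorization, bdimx, bdimy, and the persistent batch.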
Performance:
A100 layer norm backward: [benchmark plot]
A100 RMS norm backward: [benchmark plot]
H100 layer norm backward (local run, not included in CI): [benchmark plot]
Other hardware (local run, not included in CI): link
Other options
(1) We could set bdimy to dynamic; then there is no need to use bdimz and we save one split. However, to use warp reduction we would need to pad bdimy and ensure bdimx * padded_bdimy % 32 == 0. We would also need to revise the outer reduction part of the kernel, which assumes bdimy is static.

@liqiangxl (Collaborator Author)

!build --diff --pybench

@jjsjann123 jjsjann123 self-requested a review September 19, 2024 19:04
@liqiangxl (Collaborator Author)

!build

@liqiangxl liqiangxl marked this pull request as ready for review October 2, 2024 02:04
  std::pair<IterDomain*, IterDomain*> reduction_dims =
      std::make_pair(reduction_on_xdim, nullptr);
  if (reduction_on_xdim->hasPaddingToMultipleOfWarp()) {
    return std::optional<std::pair<IterDomain*, IterDomain*>>(reduction_dims);
Collaborator (jjsjann123):

nitpick: return std::make_pair(reduction_on_xdim, nullptr)

Collaborator Author (liqiangxl):

revised to

      return std::optional<std::pair<IterDomain*, IterDomain*>>(
          std::make_pair(reduction_on_xdim, nullptr));

I think we still want to keep the std::optional since the function is getMaybeWarpReductionDim; otherwise, we would need to remove Maybe from the function name and make other changes in its callers.

Collaborator (jjsjann123):

No, I wasn't suggesting changing the signature. It should automatically convert that to a std::optional. Is there any benefit to having that explicit?

Collaborator Author (liqiangxl):

Got you. I didn't realize it auto-converts to std::optional. Changed.
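
For reference, a minimal standalone example of the implicit conversion being discussed (not nvFuser code; maybe_dims and the values are made up): returning a std::pair from a function declared to return std::optional<std::pair<...>> constructs the optional automatically, so the explicit wrapper is unnecessary.

#include <iostream>
#include <optional>
#include <utility>

// Hypothetical stand-in for the return pattern in getMaybeWarpReductionDim.
std::optional<std::pair<int, int>> maybe_dims(bool found) {
  if (found) {
    // Implicitly converted to std::optional<std::pair<int, int>>.
    return std::make_pair(64, 4);
  }
  return std::nullopt;
}

int main() {
  if (auto dims = maybe_dims(true)) {
    std::cout << dims->first << " x " << dims->second << "\n"; // prints 64 x 4
  }
  return 0;
}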

  if (reduction_on_xdim->extent()->isConstInt()) {
    auto extent_value = reduction_on_xdim->extent()->evaluate();
    if (extent_value % at::cuda::warp_size() == 0) {
      return std::optional<std::pair<IterDomain*, IterDomain*>>(
Collaborator (jjsjann123):

ditto

Collaborator Author (liqiangxl):

same

  if ((extent_x_value * extent_y_value) % at::cuda::warp_size() == 0) {
    std::pair<IterDomain*, IterDomain*> reduction_dims =
        std::make_pair(reduction_on_xdim, reduction_on_ydim);
    return std::optional<std::pair<IterDomain*, IterDomain*>>(
Collaborator (jjsjann123):

ditto

Collaborator Author (liqiangxl):

same

@@ -611,6 +612,13 @@ std::unique_ptr<ReductionParams> innerOuterPersistentHeuristic(
    rparams->block_dim_iter_dom = ParallelType::TIDy;
  } else {
    rparams->block_dim_inner_reduction_extra = ParallelType::TIDy;
    rparams->static_bdimx = true;
    rparams->static_bdimy = true;
Collaborator (jjsjann123):

I think here we need to have all block dimensions be static, since bdimz == 1?

Collaborator Author (liqiangxl):

It is used to parallelize rthreadIdx.z198{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(i1, 8) ), 64) ), 4) ), 13) ), 1) )}, which is a dynamic dim that depends on i1. Although the heuristic ensures it equals 1, we still can't set it to static unless we use the value of i1, and then the whole kernel becomes a static kernel and we lose dynamic shape support.

Collaborator (jjsjann123):

Ha, I see it now. Thanks for bearing with me here.

  rparams->lparams = LaunchParams(
      LaunchParams::UNINITIALIZED_VAL,
      iop.gdimy,
      LaunchParams::UNINITIALIZED_VAL,
      iop.bdimx,
      iop.bdimy,
-     LaunchParams::UNINITIALIZED_VAL);
+     iop.bdimz);
Collaborator (jjsjann123):

Naive question regarding iop.bdimz: why do we add this in the heuristic if it's always going to be 1?

Collaborator Author (liqiangxl):

It is redundant since we already have NVF_ERROR(iop.bdimz == 1, "bdimz must be 1.");
Removed this change.

@@ -88,4 +88,58 @@ __device__ void warpReduceTIDX(
}
}

template <int BDIMX, int BDIMY, bool Aligned, typename T, typename Func>
Collaborator (jjsjann123):

Why do we need this template function with a slightly different implementation for the inter-warp reduction?

Collaborator Author (liqiangxl):

The existing version warpReduceTIDX does the reduction in the X dim; the Y dim is used for iteration. It requires bdimx % 32 == 0.
This new version warpReduceTIDXY does the reduction in both the X and Y dims and requires (bdimx * bdimy) % 32 == 0. So they are different. Also, bdimx and bdimy are static values, so we can pass them as template parameters.
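
For contrast, below is a similarly simplified sketch of the warpReduceTIDX pattern described above (again illustrative only; the actual runtime function's signature, Aligned parameter, and synchronization helpers differ): each threadIdx.y row performs an independent reduction across threadIdx.x, so only bdimx % 32 == 0 is required.

template <int BDIMX, typename T, typename Func>
__device__ void warpReduceTIDX_sketch(
    T& out,
    const T& inp,
    Func reduction_op,
    T* shared_mem) {
  constexpr int kWarpSize = 32;
  static_assert(BDIMX % kWarpSize == 0, "BDIMX must be a multiple of 32");
  constexpr int kWarpsPerRow = BDIMX / kWarpSize;

  // Lane and warp indices are formed from threadIdx.x only; threadIdx.y
  // selects which independent reduction (iteration row) this thread is in.
  const int lane = threadIdx.x % kWarpSize;
  const int warp_in_row = threadIdx.x / kWarpSize;

  // Intra-warp reduction along x. Each warp lies entirely within one y-row
  // because BDIMX is a multiple of 32.
  T val = inp;
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    T other = __shfl_down_sync(0xffffffff, val, offset);
    reduction_op(val, other);
  }

  // Combine the per-warp partials of this row; each row uses its own slice
  // of shared memory (shared_mem must hold blockDim.y * kWarpsPerRow values).
  if (lane == 0) {
    shared_mem[threadIdx.y * kWarpsPerRow + warp_in_row] = val;
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    T result = shared_mem[threadIdx.y * kWarpsPerRow];
    for (int w = 1; w < kWarpsPerRow; ++w) {
      reduction_op(result, shared_mem[threadIdx.y * kWarpsPerRow + w]);
    }
    out = result;
  }
  __syncthreads();
}

In warpReduceTIDXY the two dims are instead folded into one linear index (tid = threadIdx.y * BDIMX + threadIdx.x), which is why the requirement relaxes to (bdimx * bdimy) % 32 == 0, as in the sketch after the Fix section above.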

Collaborator (jjsjann123):

Ah, my bad. I didn't realize that in warpReduceTIDX we are still using threadIdx.y. Earlier I was suggesting that these two should be merged, but that doesn't look worth it.

@jjsjann123 (Collaborator)

The code change looks good to me.
One last question regarding the performance: with A100 layer norm backward, what's the worst regression we are looking at? The axes don't show that.

@liqiangxl (Collaborator Author)

> The code change looks good to me. One last question regarding the performance: with A100 layer norm backward, what's the worst regression we are looking at? The axes don't show that.

Double checked, it is 0.86x (link). [benchmark plot]

@jjsjann123 (Collaborator) left a review:

LGTM

@liqiangxl (Collaborator Author)

!build

@liqiangxl liqiangxl merged commit 615177d into main Oct 7, 2024
19 of 20 checks passed
@liqiangxl liqiangxl deleted the llu/warp_redu_xy branch October 7, 2024 13:29