
only do grid split when needed #2965

Merged: 9 commits merged into main on Oct 7, 2024

Conversation

liqiangxl (Collaborator)

Issue: In the inner-outer persistent scheduler, the last step is an outer reduction whose inner dim is parallelized by vectorization, bdimx, and gdimy. The current main branch always does three splits (by vectorization, bdimx, and gdimy); however, the last split is not needed if vectorization * bdimx * gdimy >= inner dim. For example:

T0 
logical domain : (iS264{gridDim.y}, iS265{i1})
 contiguity: t t
  Split: iS265{i1} by factor 4
  Split: iS997{( ceilDiv(i1, 4) )} by factor blockDim.x 
  Split: iS999{( ceilDiv(( ceilDiv(i1, 4) ), blockDim.x) )} by factor gridDim.y

The last split is redundant if 4 * blockDim.x * gridDim.y >= i1.
Fix:
Only split when vectorization * bdimx * gdimy < inner dim (see the sketch after this description).
Influence:
Removing this extra split saves one loop in the generated code.
Performance increases in some cases and decreases in others; all changes are within 10%. See the dashboard.
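
A minimal standalone sketch of the check (not the nvfuser scheduler code; the function name and the bdimx/gdimy values are illustrative):

#include <cstdint>
#include <iostream>

// Return true only when the already-parallelized extent
// (vectorization * bdimx * gdimy) cannot cover the inner dim,
// i.e. when the extra gridDim.y split is actually needed.
bool needsGridSplit(int64_t inner_dim, int64_t vect, int64_t bdimx, int64_t gdimy) {
  return vect * bdimx * gdimy < inner_dim;
}

int main() {
  // Vectorization factor 4 as in the example above; bdimx and gdimy
  // are made-up launch parameters used only for illustration.
  std::cout << needsGridSplit(/*inner_dim=*/23040, /*vect=*/4,
                              /*bdimx=*/256, /*gdimy=*/80)
            << "\n"; // prints 0: 4 * 256 * 80 = 81920 >= 23040, so no split is needed
  return 0;
}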

@liqiangxl (Collaborator Author)

!build --diff --pybench

@liqiangxl (Collaborator Author)

Hi @jjsjann123, can you help take a look at this CI failure? It didn't find any test error, but it exited with:

Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1

liqiangxl marked this pull request as ready for review on September 19, 2024 16:24
@jjsjann123 (Collaborator)

Hi @jjsjann123, can you help take a look at this CI failure? It didn't find any test error, but it exited with:

Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1

Since it mentioned a segfault and it's the Hopper matmul tests... I'm guessing it's fixed by this PR: #2963

@jjsjann123 (Collaborator)

!build --diff --pybench

@liqiangxl (Collaborator Author)

!build

csrc/scheduler/reduction_heuristic.h
csrc/scheduler/normalization_inner_outer.cpp
if (rparams.combined_split_grid_inner_dim) {
  outer_reduction_tv->split(
      axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
}
Collaborator:

ditto

Collaborator Author:

same as previous

Collaborator:

But this one is the same as line 837-841.

(Note, I'm only nitpicking, we don't have to change it. I wanted to point it out in case there's some logic issue here.)

Collaborator Author:

Thanks for pointing it out. Yes, they are the same and the logic is fine. I repeated them in the if-else branches simply because I want to keep the code blocks in both the if and else branches as complete schedule processes.
Current approach:

if(multiple reductions per block){
    schedule approach-1
} else {
    schedule approach-2
}

There are some common schedule steps (e.g. the BIDy split) in approach-1 and approach-2. However, it is not preferred to split them out and use the following structure (a concrete sketch of the preferred layout follows both blocks):
Other option:

if(multiple reductions per block){
    part of schedule approach-1
} else {
    part of schedule approach-2
}
common schedule processes shared by approach-1 & approach-2
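
A concrete sketch of the preferred (current) structure, assembled from the snippets quoted in this thread; the branch condition and the elided branch-specific steps are illustrative, not the actual scheduler code:

if (rparams.multiple_reds_per_blk) {
  // approach-1: a complete schedule for this branch,
  // ending with the (now conditional) BIDy split
  // ... approach-1 specific steps ...
  if (rparams.combined_split_grid_inner_dim) {
    outer_reduction_tv->split(
        axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
  }
} else {
  // approach-2: repeats the same conditional BIDy split so this branch
  // also reads as a complete schedule process
  // ... approach-2 specific steps ...
  if (rparams.combined_split_grid_inner_dim) {
    outer_reduction_tv->split(
        axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
  }
}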


outer_reduction_tv->split(
    axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
if (rparams.combined_split_grid_inner_dim) {
Collaborator:

Naive question: even though we are skipping the split, I thought we would still need to specify the current IterDomain with ParallelType::BIDy?

Collaborator Author:

Yes, we always specify it as ParallelType::BIDy using outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy);

Collaborator Author:

outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy); is outside the if statement.
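
Putting the two quoted snippets together, the scheduling step reads roughly as follows (a paraphrase of the code under review, not a verbatim excerpt):

// Split by gridDim.y only when vectorization * bdimx * gdimy cannot
// cover the inner dim (combined_split_grid_inner_dim).
if (rparams.combined_split_grid_inner_dim) {
  outer_reduction_tv->split(
      axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
}
// The parallelization itself is unconditional: the current IterDomain
// is always bound to BIDy, whether or not it was just split.
outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy);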

@liqiangxl (Collaborator Author)

!build --diff

@liqiangxl (Collaborator Author)

!build --diff

@jjsjann123 (Collaborator) left a comment:

LGTM, a minor question/comment.


@jjsjann123 (Collaborator)

Removing this extra split saves one loop in the generated code.
Performance increases in some cases and decreases in others; all changes are within 10%.

what's with the perf regression here? Are those just small kernels with fluctuation?
If we are just removing one trivial loop, what's the reason for a potential negative perf impact?

@liqiangxl (Collaborator Author)

Removing this extra split saves one loop in the generated code.
Performance increases in some cases and decreases in others; all changes are within 10%.

what's with the perf regression here? Are those just small kernels with fluctuation? If we are just removing one trivial loop, what's the reason for a potential negative perf impact?

Some are from large cases and are repeatable, e.g. 16384 x 23040 dropped from 47% SOL to 42% SOL. Not sure why; we did save one loop in both the CUDA and PTX code.
The PTX info is also the same:

ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 255 registers, used 1 barriers, 16 bytes smem

We can leave this PR open and recheck after warp reduction & heuristics.

@jjsjann123 (Collaborator) left a comment:

Hmmm, that perf regression is indeed strange. I think it's interesting to look at, but I'm not sure what priority we should give this. (Yet another weird compiler behavior?)

Since you did verify the behavior in the generated kernel, I'm stamping it to unblock you.

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build

liqiangxl merged commit 2b9e9d6 into main on Oct 7, 2024
10 of 11 checks passed
liqiangxl deleted the llu/ln_bwd_outer_remove_extra_split branch on October 7, 2024 13:50