only do grid split when needed #2965
Conversation
!build --diff --pybench
Hi @jjsjann123, can you help take a look at this CI failure? It didn't find any test errors but exited with:
|
Since it mentioned a segfault and it's the Hopper matmul tests, I'm guessing it's fixed by this PR: #2963
!build --diff --pybench
!build
if (rparams.combined_split_grid_inner_dim) {
  outer_reduction_tv->split(
      axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
}
ditto
same as previous
But this one is the same as line 837-841.
(Note, I'm only nitpicking, we don't have to change it. I wanted to point it out in case there's some logic issue here.)
Thanks for pointing it out. Yes, they are the same and the logic is fine. I repeated them in the if-else branches simply because I want to keep the code blocks in both the if and else branches as complete schedule processes.
Current approach:
if (multiple reductions per block) {
  schedule approach-1
} else {
  schedule approach-2
}
There are some common schedules (e.g. using BIDy) in approach-1 and approach-2. However, I prefer not to factor them out as in the following code:
Other option:
if (multiple reductions per block) {
  part of schedule approach-1
} else {
  part of schedule approach-2
}
common schedule processes shared by approach-1 & approach-2
outer_reduction_tv->split(
    axisID, NamedScalar::getParallelDim(ParallelType::BIDy));
if (rparams.combined_split_grid_inner_dim) {
Naive question: even though we are skipping the split, I thought we would still need to specify the current IterDomain with ParallelType::BIDy?
Yes, we always specify it as ParallelType::BIDy using outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy);
outer_reduction_tv->axis(axisID--)->parallelize(ParallelType::BIDy); is outside the if statement.
!build --diff
!build --diff
LGTM, a minor question/comment.
What's with the perf regression here? Are those just small kernels with fluctuation?
Hmmm, that perf regression is indeed strange. I think it's interesting to look at, but I'm not sure what priority we should give it. (Yet another weird compiler behavior?)
Since you did verify the behavior in generated kernel, I'm stamping it to unblock you.
!build
!build
Issue: In the inner-outer persistent scheduler, the last step is an outer reduction whose inner dim is parallelized by vectorization, bdimx, and gdimy. The current main branch always does three splits using vectorization, bdimx, and gdimy; however, the last split is not needed if vectorization * bdimx * gdimy >= inner dim. For example, the last split is redundant if 4 * blockDim.x * gridDim.y >= i1.
Fix:
Only split when vectorization * bdimx * gdimy < inner dim.
Influence:
Removing this extra split saves one loop in the generated code.
Performance increased in some cases but decreased in others; all changes are within 10%. See dashboard.