use aligned array for iter grouped reduction inputs #2934

Merged
merged 21 commits into main from llu/aligned_reg_array on Oct 3, 2024

Conversation

@liqiangxl (Collaborator) commented Sep 11, 2024

Fix #2930

Use an aligned array of registers when vectorized data transfer between registers and shared memory is needed.
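For context, a minimal sketch of what an "aligned array of registers" means here; the struct below is an illustrative stand-in modeled after the Array<type, size, align> declarations quoted later in this thread, not the exact nvFuser definition:

    // Illustrative only: a register array whose alignment is raised to
    // sizeof(element) * align factor, so a single vectorized (e.g. 16-byte)
    // transfer to/from shared memory is legal.
    template <typename T, int N, int Align = 1>
    struct alignas(sizeof(T) * Align) Array {
      T array[N];
    };

    __device__ void example() {
      float plain[4];          // only guaranteed alignof(float) = 4 bytes
      Array<float, 4, 4> reg;  // 16-byte aligned; safe target for float4 copies
      (void)plain;
      (void)reg;
    }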

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

Failed tests do not seem to be related to this PR: thunder.tests.test_grad.test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_thunder.dtypes.bfloat16

@liqiangxl marked this pull request as ready for review September 16, 2024 17:45
@naoyam (Collaborator) left a comment

Only about codegen.cpp for now

csrc/codegen.cpp Outdated
ir_utils::isConsumedByIterGroupedReduction(tv)) {
vect_factor = kernel_->summary().num_grouped_iterations;
}
if (vect_factor > 0) {
Collaborator

When can this be 0?

Collaborator Author

It is initialized to 0, so it stays at 0 when the tv is not vectorized with gmem or smem.

csrc/codegen.cpp Outdated
} else if (
kernel_->summary().num_grouped_iterations > 1 &&
ir_utils::isConsumedByIterGroupedReduction(tv)) {
vect_factor = kernel_->summary().num_grouped_iterations;
Collaborator

In general, Kernel should already have the correct vectorization factor and CudaCodeGen should just be a straightforward printer. Is there any specific reason for this case?

Collaborator Author

Kernel only has the vectorization factor for global memory access, not for shared memory access. I added comments for clarity.

Should use an aligned array of registers when:
(1) vectorized ld/st with global memory: the tv exists in the kernel summary's vectorized_accesses.
(2) vectorized ld/st with shared memory: the tv is an input to an iteration-grouped reduction and is vectorized in the runtime function.

Another option is adding these tvs with vectorized shared memory access to kernel_->summary().vectorized_accesses.
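A condensed sketch of that decision, pieced together from the diff excerpts above (the vectorized_accesses lookup is paraphrased, not the verbatim codegen.cpp code):

    int64_t vect_factor = 0;
    const auto& summary = kernel_->summary();
    // (1) Vectorized ld/st with global memory: tv is recorded in the kernel
    //     summary's vectorized_accesses map.
    if (auto it = summary.vectorized_accesses.find(tv);
        it != summary.vectorized_accesses.end()) {
      vect_factor = it->second;
    } else if (
        summary.num_grouped_iterations > 1 &&
        ir_utils::isConsumedByIterGroupedReduction(tv)) {
      // (2) Vectorized ld/st with shared memory: tv is an input to an
      //     iteration-grouped reduction and the runtime function vectorizes
      //     the register <-> smem transfer.
      vect_factor = summary.num_grouped_iterations;
    }
    if (vect_factor > 0) {
      // declare tv as an aligned register array, e.g. Array<dtype, N, vect_factor>
    }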

Collaborator

It seems strange to me that the kernel summary only has info about global memory tensors but not shared memory. Is there any reason for that?

Collaborator Author

> It seems strange to me that the kernel summary only has info about global memory tensors but not shared memory. Is there any reason for that?

To be more precise, it should be: the kernel summary doesn't have info about vectorized access in runtime functions, e.g. blockIterGroupedYdimReduce().

Collaborator

I understand that. Does that matter?

Collaborator Author

Oh, I mean this statement is not true. The kernel summary has info about vectorized r/w for both global and shared memory, if those vectorized r/w were generated by the scheduler.
It doesn't have the info about vectorized r/w implemented in the runtime functions.

Collaborator

I know that. I'm asking why.

Collaborator Author

Because it doesn't check the runtime functions. These runtime functions are just strings when the kernel is generated.

Collaborator

It isn't necessary to do so, right? Your PR doesn't do that either. You check some condition in CudaCodeGen. All I'm asking is why it cannot be done when generating the kernel summary. Let me say this again:

> In general, Kernel should already have the correct vectorization factor and CudaCodeGen should just be a straightforward printer.

Collaborator Author

So you are suggesting doing option-2?

> Another option is adding these tvs with vectorized shared memory access to kernel_->summary().vectorized_accesses.

No problem, we can do that.

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build --diff

@liqiangxl (Collaborator Author) commented Sep 27, 2024

Revised to use option-2, where the tvs used in iter grouped reductions are added to kernel_->summary().vectorized_accesses.
Step-1: mark ParallelType::Group as a special vectorization.

      // ParallelType::Group is used for both reduction & normalization.
      // When used to group iteration dims of outer reduction tvs, it has
      // vectorized access to shared memory and global memory.
      if (ptype == ParallelType::Group) {
        auto def = tv->definition();
        auto grop = dynamic_cast<GroupedReductionOp*>(def);
        if (grop && (!grop->isAllreduce())) {
          has_grouped_vectorize_dim = true;
        }
      }

Step-2: call VectorizeValidator, where the tv and its producer are added to vectorizedSetInfo and vectorizedAccesses.

   if (has_vectorize_dim || has_misaligned_vectorize_dim ||
       has_grouped_vectorize_dim) {
     VectorizeValidator::validate(tv);
   }
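The net effect on the generated kernel is roughly the following (the tensor name and sizes are hypothetical, for illustration only):

    // Before: plain register array, alignment only guaranteed to 4 bytes.
    float T14[4];

    // After: the grouped-reduction input is emitted as an aligned array, so a
    // single 16-byte transfer between registers and shared memory is legal.
    Array<float, 4, 4> T14;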

@liqiangxl (Collaborator Author)

!build --diff

@liqiangxl (Collaborator Author)

!build --diff

@liqiangxl (Collaborator Author)

!build --diff

@liqiangxl (Collaborator Author)

@naoyam, I think it's ready for another round of review. The code diff is expected; the distributed failed tests are not related to this PR.

csrc/codegen.cpp Outdated
@@ -2972,6 +2960,10 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
aligned_array_of_regs_.insert(tv);
}
}
// tv is aligned if alias is aligned
@naoyam (Collaborator) Oct 1, 2024

I'm not sure what we are doing here, not just the change by this PR. When we have an Allocate node, does it automatically mean it's an Array allocation if it's an alias of another tensor but with a different type? It seems that's what is indicated by line 2956 since it does reinterpret_cast to an Array type. Is this really safe? Don't we need to check the type of the original allocation of the alias tensor?

Collaborator

Looks like that was originally introduced in #665. What do you think, @jacobhinkle?

Collaborator

> Is this really safe? Don't we need to check the type of the original allocation of the alias tensor?

Isn't that safe if the two types have the same sizeof? That is what #665 is doing.

Collaborator

Sorry, my question was incomplete. If the original tensor is defined as a non-aligned tensor, is it safe to reinterpret-cast it to an aligned array type? Isn't that what could happen?

Collaborator

That's right, if the original tensor had a different alignment from the new tensor. The danger here would be, for example, if the original tensor had vectorized access of width 2 and the second tensor that tries to alias it has the same sizeof(dtype) but a vectorized access of width 8. I didn't think about that in #665, but yes, I think we should either guard against that when we're setting up the alias or accommodate the alias's vectorized accesses when we codegen the original allocation.

Collaborator

Yeah, as far as I can see, the original tensor doesn't even seem to be guaranteed to be aligned at any size.

This doesn't need to be fixed in this PR, but please create an issue.

Collaborator

Actually, I forgot about this since I haven't looked at the aliasing code in a while, but we already do this analysis to guarantee the alias vectorizations are at most the same width as the original:

// Vectorized allocations require correct alignment so if [this_tv]
// is vectorized, the [reuse_tv] must be vectorized with the same
// or smaller factor.
// No need to check shared memory since it is always aligned to 16
// Bytes which is also the maximum vectorization width.
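A minimal sketch of the guard that comment describes (the getVectorizeWidth helper below is hypothetical, standing in for the real lookup in the allocation-reuse analysis):

    // Reuse [this_tv]'s allocation for [reuse_tv] only if the alias does not
    // need a larger vectorized access width than the original allocation.
    const int64_t this_width = getVectorizeWidth(this_tv);
    const int64_t reuse_width = getVectorizeWidth(reuse_tv);
    const bool can_alias = reuse_width <= this_width;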

Collaborator

Oh, I see. Glad it's a false alarm.

Collaborator

Actually, I wonder if it should specify the alignment size with reinterpret_cast.

<< " = *reinterpret_cast<Array<" << buffer_dtype << ", "
<< genInline(size) << ">*>(&" << genVariableName(alias_tv)

If I read this correctly, it just uses an Array type with no alignment requirement, for example, Array<float, 8>. The default alignment size is 1, so it seems this would tell the compiler that we are using a non-aligned address with vector loads and stores. The address is indeed aligned, so it should cause no problem, unless the compiler does something when a given address is not marked as properly aligned. I think it'd be safer to always use a properly aligned type, even when we know it's properly aligned.
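For illustration, the generated cast without and with the alignment template argument might look like this (tensor names and sizes are hypothetical; the third Array template parameter is assumed to be the alignment factor, as in the Array<float, 4, 4> declarations quoted elsewhere in this thread):

    // Without an explicit alignment factor, Array<float, 4> defaults to 1,
    // i.e. no guarantee beyond alignof(float) as far as the compiler knows.
    auto& T29 = *reinterpret_cast<Array<float, 4>*>(&T32);

    // Passing the alias tensor's vectorization width as the alignment factor
    // also hands the 16-byte alignment guarantee to the compiler.
    auto& T29_aligned = *reinterpret_cast<Array<float, 4, 4>*>(&T32);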

Collaborator

Yeah, it's probably not a bad idea to add it: #3084

@@ -299,7 +299,8 @@ std::unique_ptr<caching::VectorizedTensorInfo> getVectorizedTensorValidationInfo

  auto vector_dim = vector_info.vectorized_loop_id;
  const auto is_aligned =
-     vector_dim->getParallelType() == ParallelType::Vectorize;
+     vector_dim->getParallelType() == ParallelType::Vectorize ||
+     vector_dim->getParallelType() == ParallelType::Group;
Collaborator

Why is this change necessary?

Collaborator Author

Because the grouped reduction tv is added to vectorized_set_info; if it is not aligned, it must be a fusion input or output.
If we don't directly use VectorizeValidator::validate(tv), it won't be added to vectorized_set_info and this change is not required.

@naoyam (Collaborator) commented Oct 1, 2024

I'm not sure why we also need to change the executor as well as the vectorization validator. I thought what we are missing is using aligned arrays inside some of the device functions and they are just local temporary arrays, so those are not something we would need to validate, right? What are we validating then?

csrc/codegen.cpp Outdated
@@ -2972,6 +2960,10 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
aligned_array_of_regs_.insert(tv);
}
}
// tv is aligned if alias is aligned
Collaborator

Why is this? I understand this is just a naming difference, but whether the original allocation is aligned or not shouldn't matter for the aliasing tensor, right? For example, this tv can be a tensor with no alignment requirement, right?

Collaborator Author

It was added to fix the test failure in CombinedSchedulerTest.LayerNormBackward/dtype_float_batch_216_hidden_65536, where we have

Array<float, 4, 4> T32;
auto& T29 = T32;

The compiler treats T29 as an aligned array instead of a regular array, so when passing T29 to a runtime function, we should use T29.array instead of T29.
So if the original allocation is aligned, its aliasing tv should also be aligned due to auto type deduction.
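A short sketch of the distinction (the runtime-function signature below is hypothetical, just to illustrate the .array member access):

    Array<float, 4, 4> T32;
    auto& T29 = T32;  // deduced as Array<float, 4, 4>&, not float (&)[4]

    // Hypothetical runtime function that expects a raw register array:
    // __device__ void blockReduceSketch(float* in);
    //
    // blockReduceSketch(T29.array);  // ok: pass the inner array member
    // blockReduceSketch(T29);        // error: Array<...> is not a float*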

Collaborator Author

On the other hand, T29 doesn't have to be aligned. We can use a dynamic cast to remove this, but then we need to change the code of the alias allocation.

@liqiangxl (Collaborator Author)

> I'm not sure why we also need to change the executor as well as the vectorization validator. I thought what we are missing is using aligned arrays inside some of the device functions and they are just local temporary arrays, so those are not something we would need to validate, right? What are we validating then?

You are right. There is no need to validate. I was using VectorizeValidator::validate(tv) because this function not only validates vectorization, it also collects the vectorization info and stores it in GpuLower::current()->vectorizedAccesses(). We need this info to correctly define the aligned array of registers in codegen.
I'll revise to directly add this info instead of reusing the overkill function VectorizeValidator::validate(tv).

@liqiangxl (Collaborator Author)

!build

@liqiangxl (Collaborator Author)

!build

@@ -598,7 +598,25 @@ void validateAndCollectVectorizeInfo(Fusion* fusion) {
"Only allow misaligned vectorization between global and local memory.");
has_misaligned_vectorize_dim = true;
}

// ParallelType::Group is used for both reduction & normalization.
Collaborator

Is this really necessary? Shared memory is always aligned. If the producer is in global memory, then that should be a temporary work buffer, so it should always be a contiguous, aligned buffer.

@liqiangxl (Collaborator Author) Oct 2, 2024

I mean between registers and shared memory, so we need to ensure the registers are aligned when doing vectorized reads/writes. Let me create an example.

Collaborator Author

In this case, T13 = GroupedReductionOf(T14); T14 (registers) should be aligned because the runtime function uses a vectorized copy from T14 (registers) to shared memory.

T13_l_float[ iblockIdx.x68{( ceilDiv(( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) ), 1) )}, ithreadIdx.x67{blockDim.x}, iUS69{1}, iG65{4}, rthreadIdx.y62{blockDim.y} ] ca_pos( 3 ) produce_pos( 2 )
   = reduction( T14_l_float[ iblockIdx.x54{( ceilDiv(( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) ), 1) )}, ithreadIdx.x53{blockDim.x}, rS60{( ceilDiv(( ceilDiv(( ceilDiv(i1, blockDim.y) ), 2) ), 1) )}rf, iUS55{1}, iS51{4}, ithreadIdx.y57{blockDim.y}rf, rUS61{1}rf, rS59{2}rf ] ca_pos( 2 ) produce_pos( 8 ), op = add, initial value = float(0), allreduce = false )
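Roughly what that vectorized transfer inside the runtime function amounts to (a simplified sketch with a hypothetical function name; the actual blockIterGroupedYdimReduce does more than this):

    // The grouped producer values live in registers; the runtime function
    // writes all 4 grouped elements to shared memory in one 16-byte
    // transaction, which requires the register array to be 16-byte aligned.
    __device__ void copyGroupToSmemSketch(float* smem, const float* regs) {
      *reinterpret_cast<float4*>(&smem[threadIdx.x * 4]) =
          *reinterpret_cast<const float4*>(regs);
    }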

Collaborator

I think I got it. The comment seems wrong, though.

> since they are register arrays defined in runtime function

This producer_tv is not defined in the runtime functions, right?

Collaborator Author

That's right. producer_tv is not defined in the runtime functions; it is just passed to the runtime function.

Collaborator Author

The comment is revised as:

      // ParallelType::Group is used for both reduction and normalization.
      // In grouped outer reduction, the runtime function uses vectorized data
      // transfers between registers and shared memory. The producer tensor is
      // stored in registers and loaded into shared memory in a vectorized
      // manner, so we add it to the vectorizedAccesses map to ensure register
      // alignment.

jacobhinkle added a commit that referenced this pull request Oct 2, 2024
See #2934 (comment)

PR #665 allowed us to re-use allocations that have different dtypes. We
already check that our aliased tensors do not have vectorized accesses
larger than those of the original tensors. However, when we have
different dtypes we `reinterpret_cast` it to a different `Array` type.
Previously we did not specify any alignment in that type's template
args, meaning it assumed an alignment of size 1. Since the actual
addresses will all still be aligned, this does not cause misaligned
accesses at runtime. This PR sets the template arg for alignment to be
that of the vectorized access width for the alias tensor, so that the
compiler could hypothetically do some optimizations knowing the address
is aligned.
@naoyam (Collaborator) left a comment

LGTM. Thanks for the fix.

@liqiangxl (Collaborator Author)

!build

@liqiangxl merged commit 94d4b70 into main Oct 3, 2024
11 of 12 checks passed
@liqiangxl deleted the llu/aligned_reg_array branch October 3, 2024 12:26
Successfully merging this pull request may close these issues.

FusionReductionWithTrivialReduction_CUDA fails with compute-sanitizer