Implement `DotOperandEncodingAttr::getSizePerThread` with block layout parent #5863

anmyachev · 2025-02-08T13:37:00Z

For XPU backend, the logic of the common code is slightly changed and some Triton lit tests encounter the problem of an unimplemented function.

…as parent Signed-off-by: Anatoly Myachev <[email protected]>

Jokeren · 2025-02-08T14:15:46Z

lib/Dialect/TritonGPU/IR/Dialect.cpp

@@ -2213,10 +2213,11 @@ SmallVector<unsigned> DotOperandEncodingAttr::getSizePerThread() const {
  assert(parentLayout && "DotOperandEncodingAttr must have a parent");
  if (auto parentMmaLayout = mlir::dyn_cast<MmaEncodingTrait>(parentLayout)) {
    return parentMmaLayout.getSizePerThreadForOperand(getKWidth(), getOpIdx());
+  } else if (auto blocked = mlir::dyn_cast<BlockedEncodingAttr>(parentLayout)) {
+    return expandMatrixShapeWithBatch(ArrayRef(blocked.getSizePerThread()));


@lezcano I thought this method is going to be deprecated?

lezcano

We plan to implement everything in terms of LinearLayouts, yes, but for now I'm happy to take this to unblock them. It'd be nice to add a test in DialectTest.cpp LinearEncodingTest, making sure that this implementation agrees, at least in principle, with the LinearEncoding implementation.

anmyachev · 2025-02-10T13:31:20Z

We plan to implement everything in terms of LinearLayouts, yes, but for now I'm happy to take this to unblock them. It'd be nice to add a test in DialectTest.cpp LinearEncodingTest, making sure that this implementation agrees, at least in principle, with the LinearEncoding implementation.

Hi @lezcano,

I'm trying to write a test but I get the following error. Could you tell me if this is my mistake in writing the test or if this is a current layout incompatibility?

/project/CMake-src/Source/CTest/cmCTestRunTest.cxx:43 31: Expected equality of these values:

/project/CMake-src/Source/CTest/cmCTestRunTest.cxx:43 31:   distributedEncoding.getSizePerThread()

/project/CMake-src/Source/CTest/cmCTestRunTest.cxx:43 31:     Which is: { 1, 4, 4 }

/project/CMake-src/Source/CTest/cmCTestRunTest.cxx:43 31:   linearEncoding.getSizePerThread()

/project/CMake-src/Source/CTest/cmCTestRunTest.cxx:43 31:     Which is: { 4, 1 }

The changes I ran the test with:

Patch

diff --git a/unittest/Dialect/TritonGPU/DialectTest.cpp b/unittest/Dialect/TritonGPU/DialectTest.cpp
index 797b55d43..1c7d16340 100644
--- a/unittest/Dialect/TritonGPU/DialectTest.cpp
+++ b/unittest/Dialect/TritonGPU/DialectTest.cpp
@@ -502,6 +502,7 @@ TEST_F(LinearEncodingTest, DistributedEncodingToLinearEncoding) {
       triton::gpu::CTALayoutAttr::get(&ctx, {4, 2}, {2, 2}, {1, 0}),
   };
   SmallVector<triton::gpu::DistributedEncodingTrait> distributedEncodings;
+  SmallVector<triton::gpu::DistributedEncodingTrait> distributedEncodings2;
 
   // Create BlockedEncodingAttr and SliceEncodingAttr
   {
@@ -516,6 +517,11 @@ TEST_F(LinearEncodingTest, DistributedEncodingToLinearEncoding) {
         distributedEncodings.push_back(blockedEncoding);
         distributedEncodings.push_back(
             triton::gpu::SliceEncodingAttr::get(&ctx, 0, blockedEncoding));
+        // Create an opIdx=0 and opIdx=1 encoding
+        for (unsigned opIdx = 0; opIdx < 2; ++opIdx) {
+          distributedEncodings2.push_back(
+            triton::gpu::DotOperandEncodingAttr::get(&ctx, opIdx, blockedEncoding, 0));
+        }
       }
     }
   }
@@ -538,6 +544,30 @@ TEST_F(LinearEncodingTest, DistributedEncodingToLinearEncoding) {
     }
   }
 
+  for (const auto &distributedEncoding : distributedEncodings2) {
+    for (auto shape : shapes) {
+      if (auto sliceEncoding =
+              dyn_cast<triton::gpu::SliceEncodingAttr>(distributedEncoding)) {
+        shape.erase(shape.begin() + sliceEncoding.getDim());
+      }
+
+      // Create LinearEncodingAttr from the LinearLayout
+      auto linearLayout = distributedEncoding.toLinearLayout(shape);
+      auto linearEncoding =
+          triton::gpu::LinearEncodingAttr::get(&ctx, linearLayout);
+
+        if (auto layout = dyn_cast<triton::gpu::DotOperandEncodingAttr>(distributedEncoding)) {
+          if (isa<triton::gpu::BlockedEncodingAttr>(layout.getParent())) {
+              // FIXME: This happens to be correct for SliceLayout because of the hack
+              // in SliceEncodingAttr::toLinearLayout(). We should remove the hack
+              // and the skips in the getWarpsPerCTA() and getThreadsPerWarp()
+              ASSERT_EQ(distributedEncoding.getSizePerThread(),
+                        linearEncoding.getSizePerThread());
+          }
+        }
+    }
+  }
+
   for (const auto &distributedEncoding : distributedEncodings) {
     for (auto shape : shapes) {
       if (auto sliceEncoding =

lezcano · 2025-02-10T14:10:15Z

At first sight, it looks like the LinearLayout is more correct to my view. For starters, it has the right rank, right? Difficult to tell without knowing the blocked layout used.

In general, look at the structure of the linear layout and see if it returns what it should, then compare with what the legacy path returns.

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev · 2025-02-10T15:20:03Z

For starters, it has the right rank, right?

Thanks for looking! Right, it was because of using expandMatrixShapeWithBatch (removed it). Now the difference is: { 4, 4 } vs { 4, 1 }.

lezcano · 2025-02-10T15:28:44Z