feat: implement SM-Constrained GEMM API #744

lanchongyizu · 2025-01-21T03:03:04Z

As requested in #591, this PR implements the `plan` function of GEMM
with `num_ctas` as an argument to specify the grid size.

@yzh119

yzh119

Hi @lanchongyizu, great work! Can you also support this for the SM80 template?

Reference: https://github.com/efeslab/Nanoflow/blob/22f0b48739d3a9ad1d8c82f956906b3bc58d519b/pipeline/include/cutlassGemmWrapperImpl.cuh#L92

For the SM80 API, we might support setting `num_ctas` as a 3D tuple (one entry each for the m, n, and k dimensions, respectively).
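
To make the 3D-tuple suggestion concrete, here is a minimal sketch of how a Python-side argument check might accept either a single integer or an `(m, n, k)` tuple. The helper name `normalize_num_ctas` is hypothetical and not part of FlashInfer, and mapping a bare integer to the m dimension is an assumption made only for illustration. The meaning of `0` is not settled in this thread, so the sketch passes it through without interpreting it.

```python
from typing import Tuple, Union

NumCtas = Union[int, Tuple[int, int, int]]


def normalize_num_ctas(num_ctas: NumCtas) -> Tuple[int, int, int]:
    """Return an (m, n, k) tuple of CTA counts (hypothetical helper).

    Assumption: a bare int is treated as a count along m, with n = k = 1.
    """
    if isinstance(num_ctas, int):
        num_ctas = (num_ctas, 1, 1)
    if len(num_ctas) != 3:
        raise ValueError("num_ctas must be an int or a 3-tuple (m, n, k)")
    for count in num_ctas:
        # Reject negative or non-integer entries; 0 is passed through
        # unchanged because its meaning is left open in the review.
        if not isinstance(count, int) or count < 0:
            raise ValueError("each num_ctas entry must be a non-negative int")
    return tuple(num_ctas)
```

With this convention, `normalize_num_ctas(16)` and `normalize_num_ctas((16, 1, 1))` describe the same grid, so the SM90 path could keep its scalar argument while the SM80 path reads all three entries.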

yzh119 · 2025-01-21T04:33:55Z

tests/test_group_gemm.py

@@ -33,6 +33,7 @@
 @pytest.mark.parametrize("dtype", DTYPES)
 @pytest.mark.parametrize("device", CUDA_DEVICES)
 @pytest.mark.parametrize("backend", ["auto", "sm90", "sm80"])
+@pytest.mark.parametrize("num_ctas", [0, 4, 16, 64])


What's the expected behavior of num_ctas=0?

yzh119 · 2025-01-24T03:04:25Z

include/flashinfer/gemm/group_gemm_sm90.cuh

@@ -121,8 +121,7 @@ cudaError_t CutlassSegmentGEMMSM90Run(void* float_buffer, size_t float_buffer_si

      cutlass::KernelHardwareInfo hw_info;
      cudaGetDevice(&hw_info.device_id);
-      hw_info.sm_count =
-          cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
+      hw_info.sm_count = num_ctas;


I profiled this with nsys, and it turns out that setting this value does not control the number of SMs the kernel uses.
A more fundamental approach might be to use CUDA green contexts.

feat: implement SM-Constrained GEMM API

eac553b

As requested in flashinfer-ai#591, this PR implements the `plan` function of GEMM with `num_ctas` as an argument to specify the grid size.

yzh119 force-pushed the sm_constrained_gemm branch from 841b423 to eac553b · January 24, 2025 02:28
