[BUG] Mixed Precision GEMM Correctness Regression in CUTLASS 3.7/3.8 #2070

Open
jwfromm opened this issue Jan 29, 2025 · 4 comments
Labels: ? - Needs Triage, bug

Comments

jwfromm (Contributor) commented Jan 29, 2025

Describe the bug
Since CUTLASS 3.7, mixed input dtype GEMMs have produced noticeably less accurate outputs than they did in CUTLASS 3.6. The loss of accuracy is substantial and makes mixed input impractical for real use cases.

Specifically, we have a collection of mixed input GEMMs in FBGEMM that work well on CUTLASS 3.6. While these kernels compile fine with newer versions of CUTLASS (after small API updates), they produce garbage outputs.

Directly copying the BF16 x INT4 GEMM from example 55 produces slightly better results, but the outputs are still far less accurate than the 3.6 equivalents.

Steps/Code to reproduce bug
We use this benchmarking script to measure the performance and accuracy of kernels. The script can be run with these sample arguments:

python quantize_bench.py --kernels=bf16_baseline,cutlass_bf16i4_rowwise --M=128 --N=2048 --K=2048

This produces output like the following:

bf16_baseline sim: 0.000.
bf16_baseline ms: 0.007.
bf16_baseline TFLOPS: 150.635.
bf16_baseline GB/s: 1323.942.
cutlass_bf16i4_rowwise sim: 28.375.
cutlass_bf16i4_rowwise ms: 0.013.
cutlass_bf16i4_rowwise TFLOPS: 79.561.
cutlass_bf16i4_rowwise GB/s: 233.089.

The sim metric is the L1 distance from the BF16 reference output (a sketch of such a metric appears below). After updating to CUTLASS 3.7, copying example 55, and running the same script, we get:

bf16_baseline sim: 0.000.
bf16_baseline ms: 0.007.
bf16_baseline TFLOPS: 150.563.
bf16_baseline GB/s: 1323.308.
cutlass_bf16i4_rowwise sim: 328.000.
cutlass_bf16i4_rowwise ms: 0.013.
cutlass_bf16i4_rowwise TFLOPS: 80.016.
cutlass_bf16i4_rowwise GB/s: 234.421.

This output is clearly much less accurate. The updated version of the kernel can be found in this PR.
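
For context, here is a minimal sketch of how an L1-style sim metric can be computed against the BF16 reference; the exact reduction and normalization used by quantize_bench.py may differ, so treat this as illustrative only:

import torch

def sim(out: torch.Tensor, ref: torch.Tensor) -> float:
    # L1 distance between the kernel output and the BF16 reference.
    # Illustrative sketch; the benchmark's actual normalization may differ.
    return (out.float() - ref.float()).abs().mean().item()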

Expected behavior
The accuracy of mixed input kernels should not regress across CUTLASS version updates.

Environment details
CUDA 12.4, driver version 535.154.05, on a Linux system with 8x H100 GPUs.

jwfromm added the ? - Needs Triage and bug labels on Jan 29, 2025
hwu36 (Collaborator) commented Feb 4, 2025

@IwakuraRein

IwakuraRein (Contributor) commented

@jwfromm Thanks for submitting this bug. Could you first try reverting every change in the PR except the CUTLASS 3.7 update and the mixed precision API update (i.e., removing MixedInput from the kernel schedule flag)? At first glance, I think some of the other changes, such as replacing int num_groups = w_scale.size(0); with int scale_k = w_scale.size(1);, may have caused the incorrect result.

I will try to clone and build the repo and reproduce the issue. If there are any tips to speed up the build, I'd greatly appreciate them. Thanks.

IwakuraRein (Contributor) commented

@jwfromm I am able to reproduce the issue. It looks like whenever the data type of the scale/zero tensors differs from the activation type (FP16 vs. BF16 in your original code base), the result is always incorrect.

I tried checking out CUTLASS 3.7, removing MixedInput from the kernel schedule flag, and enforcing the scale/zero type by changing quantize_ops.py:1033 from return x.to(torch.bfloat16), wq, w_scale, w_zp to return x.to(torch.bfloat16), wq, w_scale.to(torch.bfloat16), w_zp.to(torch.bfloat16) (see the sketch after the output below). This time the benchmark result looks normal:

TMA benchmarks will be running with experimental grid constant TMA descriptor.
Benchmarking B=1, M=128, N=2048, K=2048.
bf16_baseline sim: 0.000.
bf16_baseline ms: 0.007.
bf16_baseline TFLOPS: 159.461.
bf16_baseline GB/s: 1401.517.
cutlass_bf16i4_rowwise sim: 20.625.
cutlass_bf16i4_rowwise ms: 0.015.
cutlass_bf16i4_rowwise TFLOPS: 73.762.
cutlass_bf16i4_rowwise GB/s: 216.099.
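
For clarity, the workaround amounts to casting the returned scale and zero-point tensors to the activation dtype. Below is a hypothetical wrapper (the function name and signature are illustrative; only the return statement mirrors the actual change at quantize_ops.py:1033):

import torch

def quantize_int4_rowwise(x, wq, w_scale, w_zp):
    # Hypothetical name/signature for illustration. Casting w_scale and
    # w_zp to bf16 makes the scale/zero dtype match the activation dtype,
    # which avoids the CUTLASS 3.7 mixed-input accuracy issue.
    return x.to(torch.bfloat16), wq, w_scale.to(torch.bfloat16), w_zp.to(torch.bfloat16)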

Please see this commit for more info.

The scale/zero data types in your PR already match the activation type, yet the results are incorrect, so I suspect there are other bugs introduced by the remaining changes.

Thanks again for submitting the bug. I will fix this edge case ASAP.

IwakuraRein (Contributor) commented

@jwfromm I have located the source of the bug. To fix the issue, in include/cutlass/detail/collective/mixed_input_utils.hpp:72, change src.size() to src_vm.size().

The fix will be included in the CUTLASS 3.8 tag.
