[Misc][Kernel]: Add GPTQAllSpark Quantization #12931
This PR adds Ampere-specific optimizations for A16W8 quantization, supporting GPTQ-quantized models with group_size=-1 and desc_act=False. In this scenario its performance is better than Marlin. A usage sketch follows.
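For reference, a minimal offline-inference sketch using the new backend; the model path, prompt, and sampling settings are placeholders, and the checkpoint is assumed to already carry a GPTQ quantization_config with group_size=-1 and desc_act=False:

```python
# Minimal sketch: running a GPTQ A16W8 checkpoint with the AllSpark kernels from this PR.
# Model path and prompt are placeholders; adjust to your own checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen2-7B-Instruct-quantized.w8a16",  # example GPTQ w8a16 checkpoint
    quantization="gptq_allspark",               # select the AllSpark backend added here
    dtype="float16",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```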
Kernel-level performance (Marlin vs. AllSpark) can be compared with the following command:
python3 benchmarks/kernels/benchmark_marlin.py --limit-num-bits 8 --limit-act-order 0 --limit-k-full 1 --limit-group-size -1
The following figure compares Marlin and AllSpark on an A100 GPU across different M values for GEMM shapes commonly seen in the model. The blue line is the speedup of Marlin A16W8 GEMM over Torch FP16 GEMM, and the orange line is the speedup of AllSpark A16W8 GEMM over Torch FP16 GEMM. When N and K are small and M is large, AllSpark is significantly faster than Marlin; in the other scenarios the two are roughly on par.
Use the following command to run an end-to-end throughput test of the Qwen2-7B-Instruct-quantized.w8a16 model on a single A100:
CUDA_VISIBLE_DEVICES=1 python3 benchmarks/benchmark_throughput.py --backend=vllm --model Qwen2-7B-Instruct-quantized.w8a16/ --quantization gptq_allspark --input-len 2048 --output-len 256 --num-prompts=1000 --trust-remote-code --dtype=float16 --kv-cache-dtype=auto --device=cuda
(replace gptq_allspark with gptq_marlin to compare against Marlin)
The end-to-end performance results are as follows: