Quantization questions during LLM inference #3251

Open
Sodiride123 opened this issue Feb 20, 2025 · 1 comment
Labels
User (a question about how to use MNN, or about using MNN incorrectly and causing a bug)

Comments

@Sodiride123

Hello, I have gotten Qwen2-0.5B-Instruct inference working with llm_demo, but I am confused about the quantization options involved. As I understand it, there are three settings related to quantization (a sketch of how I set them follows the list):

  1. Enable the MNN_LOW_MEMORY macro when building MNN
  2. Specify --quant_bit (4 or 8) when exporting the .mnn model with llmexport.py
  3. Set precision=low in config.json at inference time
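
For reference, a minimal sketch of how I wire these three settings together. MNN_LOW_MEMORY, --quant_bit, and precision=low are as described above; the remaining cmake options, llmexport.py arguments, and config.json keys reflect my reading of the MNN docs and may differ by version:

```bash
# 1. Build MNN with the low-memory path enabled (MNN_BUILD_LLM / MNN_OPENCL are the
#    options I use to get llm_demo with an OpenCL backend; names may vary by version)
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_BUILD_LLM=ON -DMNN_OPENCL=ON
make -j8

# 2. Export the weight-quantized .mnn model (4- or 8-bit)
python llmexport.py --path Qwen2-0.5B-Instruct --export mnn --quant_bit 4

# 3. Select the backend and low precision in the runtime config.json, e.g.
#    { "backend_type": "opencl", "precision": "low" }
```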

I have two questions:

  1. When exporting an mnn model with llmexport.py, are 4 and 8 the only allowed values for quant_bit? And the exported model is then quantized at the corresponding precision, correct?
  2. When running inference with llm_demo, suppose my GPU is an RTX 4090, I pick OpenCL as the inference backend, and set precision to low; will the model actually be executed at the quantized precision?
@jxt1234
Collaborator

jxt1234 commented Feb 21, 2025

  1. Exporting directly to mnn with llmexport.py only supports 4 and 8, but you can add --lm_quant_bit to set a separate quant_bit value for the lm layer, and you can also add the --quant_block parameter to shrink the quantization block size and improve accuracy (the smaller quant_block is, the higher the accuracy, but also the larger the model and the slower the inference). If you need other bit widths, first export ONNX with llmexport.py, then convert the ONNX model to mnn with MNNConvert; see the documentation at https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html#id5 (a sketch of both routes follows below).

  2. On the GPU, inference runs with fp32/fp16 inputs against quant_bit weights. On the CPU, the inputs are first quantized to int8 and then computed against the quant_bit weights.
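
A minimal sketch of the two export routes from point 1; the model path and ONNX file name are placeholders, and the exact MNNConvert flags should be checked against the documentation linked above:

```bash
# Route A: direct MNN export with 4/8-bit weights; the lm head bit width and the
# quantization block size can be tuned separately (smaller quant_block means higher
# accuracy, a larger model, and slower inference)
python llmexport.py --path Qwen2-0.5B-Instruct --export mnn \
    --quant_bit 4 --lm_quant_bit 8 --quant_block 64

# Route B: other bit widths: export ONNX first, then convert with MNNConvert
python llmexport.py --path Qwen2-0.5B-Instruct --export onnx
./MNNConvert -f ONNX --modelFile llm.onnx --MNNModel llm_quant.mnn --weightQuantBits 4
```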

@jxt1234 jxt1234 added the User label on Feb 21, 2025