Quantization questions during LLM inference #3251

Open
Sodiride123 opened this issue Feb 20, 2025 · 1 comment
Labels
User (a question about how to use MNN, or about using MNN incorrectly and causing a bug)

Comments

@Sodiride123

Hello, I have gotten Qwen2-0.5B-Instruct inference working with llm_demo, but I am confused about the quantization options involved. As I understand it, there are three settings related to quantization (a sketch of how I set them follows the list):

  1. Enable the MNN_LOW_MEMORY macro when building MNN
  2. Specify --quant_bit (4 or 8) when exporting the .mnn model with llmexport.py
  3. Set precision=low in config.json at inference time
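
For reference, a minimal sketch of how I wire these three settings together. MNN_LOW_MEMORY, --quant_bit, and precision=low are as described above; the remaining cmake options, llmexport.py arguments, and config.json keys reflect my reading of the MNN docs and may differ by version:

```bash
# 1. Build MNN with the low-memory path enabled (MNN_BUILD_LLM / MNN_OPENCL are the
#    options I use to get llm_demo with an OpenCL backend; names may vary by version)
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_BUILD_LLM=ON -DMNN_OPENCL=ON
make -j8

# 2. Export the weight-quantized .mnn model (4- or 8-bit)
python llmexport.py --path Qwen2-0.5B-Instruct --export mnn --quant_bit 4

# 3. Select the backend and low precision in the runtime config.json, e.g.
#    { "backend_type": "opencl", "precision": "low" }
```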

I have two questions:

  1. When exporting an mnn model with llmexport.py, are 4 and 8 the only allowed values for quant_bit? And the exported model is then quantized at the corresponding precision, correct?
  2. When running inference with llm_demo, suppose my GPU is an RTX 4090, I pick OpenCL as the inference backend, and set precision to low; will the model actually be executed at the quantized precision?
@jxt1234
Collaborator

jxt1234 commented Feb 21, 2025

  1. Exporting directly to mnn with llmexport.py only supports 4 and 8, but you can add --lm_quant_bit to set a separate quant_bit value for the lm layer, and you can also add the --quant_block parameter to shrink the quantization block size and improve accuracy (the smaller quant_block is, the higher the accuracy, but also the larger the model and the slower the inference). If you need other bit widths, first export ONNX with llmexport.py, then convert the ONNX model to mnn with MNNConvert; see the documentation at https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html#id5 (a sketch of both routes follows below).

  2. On the GPU, inference runs with fp32/fp16 inputs against quant_bit weights. On the CPU, the inputs are first quantized to int8 and then computed against the quant_bit weights.
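
A minimal sketch of the two export routes from point 1; the model path and ONNX file name are placeholders, and the exact MNNConvert flags should be checked against the documentation linked above:

```bash
# Route A: direct MNN export with 4/8-bit weights; the lm head bit width and the
# quantization block size can be tuned separately (smaller quant_block means higher
# accuracy, a larger model, and slower inference)
python llmexport.py --path Qwen2-0.5B-Instruct --export mnn \
    --quant_bit 4 --lm_quant_bit 8 --quant_block 64

# Route B: other bit widths: export ONNX first, then convert with MNNConvert
python llmexport.py --path Qwen2-0.5B-Instruct --export onnx
./MNNConvert -f ONNX --modelFile llm.onnx --MNNModel llm_quant.mnn --weightQuantBits 4
```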

@jxt1234 jxt1234 added the User label on Feb 21, 2025