Hello, I am currently running Qwen2-0.5B-Instruct inference with llm_demo and am confused about the quantization settings. As I understand it, there are 3 quantization-related options:
I have 2 questions:
When exporting to mnn directly with llmexport.py, only 4 and 8 bit are supported. You can add lm_quant_bit to set a separate quant_bit value for the lm layer, and you can also add the --quant_block parameter to shrink the quantization block size and improve accuracy (the smaller quant_block is, the higher the accuracy, but correspondingly the model is larger and inference is slower). If you need other bit widths, first export onnx with llmexport.py, then convert the onnx to mnn with MNNConvert; see the documentation at https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html#id5
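For reference, here is a minimal sketch of the two export paths described above. The exact flag names and output paths (--path, --export, --quant_bit, --quant_block, --lm_quant_bit, --weightQuantBits, onnx/llm.onnx) are assumptions based on the linked docs, so please verify them with `python llmexport.py --help` and `MNNConvert --help` before use.

```python
import subprocess

# Path 1: export mnn directly with llmexport.py (4 / 8 bit only).
subprocess.run([
    "python", "llmexport.py",
    "--path", "Qwen2-0.5B-Instruct",   # local model directory (assumed layout)
    "--export", "mnn",
    "--quant_bit", "4",                 # 4 or 8 for direct mnn export
    "--quant_block", "64",              # smaller block -> higher accuracy, larger/slower model
    "--lm_quant_bit", "8",              # separate bit width for the lm layer
], check=True)

# Path 2: export onnx first, then convert with MNNConvert for other bit widths.
subprocess.run([
    "python", "llmexport.py",
    "--path", "Qwen2-0.5B-Instruct",
    "--export", "onnx",
], check=True)
subprocess.run([
    "MNNConvert", "-f", "ONNX",
    "--modelFile", "onnx/llm.onnx",     # assumed onnx output path
    "--MNNModel", "llm.mnn",
    "--weightQuantBits", "8",           # set the desired weight bit width here
], check=True)
```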
The GPU runs inference with fp32/fp16 inputs against the quant_bit weights. The CPU first quantizes the input to int8 and then computes against the quant_bit weights.
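To make the two compute modes concrete, below is a rough numpy sketch of the idea, not MNN's actual kernels: the GPU-style path dequantizes the weights and multiplies in floating point, while the CPU-style path quantizes the activations to int8, accumulates in integer, and rescales afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64)).astype(np.float32)        # fp32 activations
w_fp = rng.standard_normal((64, 64)).astype(np.float32)    # original fp32 weights

# Weight quantization (per-tensor symmetric int8, for simplicity).
w_scale = np.abs(w_fp).max() / 127.0
w_q = np.clip(np.round(w_fp / w_scale), -127, 127).astype(np.int8)

# GPU-style: dequantize weights, multiply in fp32/fp16.
y_gpu = x @ (w_q.astype(np.float32) * w_scale)

# CPU-style: quantize activations to int8, integer matmul, then rescale.
x_scale = np.abs(x).max() / 127.0
x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)           # int32 accumulation
y_cpu = acc.astype(np.float32) * (x_scale * w_scale)

# Small difference caused by the extra activation quantization on the CPU path.
print(np.max(np.abs(y_gpu - y_cpu)))
```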