Integrate kleidiAI release v0.1.0 into MNN 2.9.3 #2995

xhzheng1895 · 2024-08-16T01:06:46Z

The KleidiAI files are placed in source/backend/cpu/arm/kleidiAI/kai. They were downloaded from Arm's GitLab and are unchanged. These files may later be removed from the repository and downloaded at build time instead.

MNNKleidiAI.cpp is the interface between MNN and KleidiAI.

Functions in class DenseConvInt8TiledExecutor, in ConvInt8TiledExecutor.cpp, are rewritten to call KleidiAI functions. A separate execution may be implemented later.

The changes to GeometryConvUtils.cpp and ShapeTensorConvert.cpp make the input and output of DenseConvInt8TiledExecutor NCHW rather than NC4HW4, which avoids redundant pack/unpack steps and gives better performance.
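
To illustrate the pack step that the NCHW path avoids, here is a minimal sketch of an NCHW to NC4HW4 conversion for a single image. It assumes the usual NC4HW4 layout, in which channels are grouped into blocks of 4 and the 4 channels of a block are interleaved per pixel. The function name packNCHWToNC4HW4 is hypothetical and is not MNN's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of MNN): packs one NCHW image (N = 1)
// into NC4HW4, assuming the layout [ceil(C/4)][H*W][4].
// `area` is H * W. Channels past `channels` in the last block stay zero.
std::vector<float> packNCHWToNC4HW4(const std::vector<float>& src,
                                    std::size_t channels,
                                    std::size_t area) {
    assert(src.size() == channels * area);
    const std::size_t blocks = (channels + 3) / 4;  // round up to a multiple of 4
    std::vector<float> dst(blocks * area * 4, 0.0f);
    for (std::size_t c = 0; c < channels; ++c) {
        for (std::size_t i = 0; i < area; ++i) {
            // block index, then pixel index, then lane within the block
            dst[(c / 4) * area * 4 + i * 4 + (c % 4)] = src[c * area + i];
        }
    }
    return dst;
}
```

Every element is touched once on the way in and once more on the way out, which is the redundant work the description refers to when an operator could consume NCHW directly.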

CLAassistant · 2024-08-16T01:06:52Z

All committers have signed the CLA.

wangzhaode · 2024-08-20T03:19:04Z

I tested the following two models on an M3 chip, and the results are incorrect:

https://modelscope.cn/models/zhaode/Qwen2-7B-Instruct-MNN
https://modelscope.cn/models/zhaode/Qwen2-1.5B-Instruct-MNN

xhzheng1895 · 2024-08-21T01:28:23Z

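Hi, KleidiAI currently supports only symmetrically quantized models.
Asymmetrically quantized models fall through to some of DenseConvInt8TiledExecutor's original functions. In that case the KAI_CONV_NCHW_IN_OUT macro must be turned off; otherwise the input/output format will not match what DenseConvInt8TiledExecutor's native functions expect.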
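
As a rough illustration of the distinction, the sketch below uses the common definition of affine quantization, where a real value is recovered as scale * (q - zeroPoint) and "symmetric" means every zero point is 0. The struct and function names are hypothetical; this is not the check used in the PR or in KleidiAI, only a sketch of the idea under that assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-channel quantization parameters (not an MNN type).
struct QuantParams {
    float scale;
    std::int32_t zeroPoint;
};

// Standard affine dequantization: real = scale * (q - zeroPoint).
inline float dequantize(std::int8_t q, const QuantParams& p) {
    return p.scale * static_cast<float>(static_cast<std::int32_t>(q) - p.zeroPoint);
}

// A model is treated as symmetrically quantized here when every
// channel's zero point is 0 (assumption for illustration).
inline bool isSymmetric(const std::vector<QuantParams>& perChannel) {
    for (const QuantParams& p : perChannel) {
        if (p.zeroPoint != 0) {
            return false;
        }
    }
    return true;
}
```

A dispatcher could use a predicate like isSymmetric to choose between a KleidiAI path and the original path.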

wangzhaode · 2024-08-21T08:53:23Z

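OK, I tested a symmetrically quantized model and found no problems. Decode is faster than with MNN's original implementation.
Speeds for Qwen2-1.5B-int4 on an M3 Pro with 4 CPU threads are as follows: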

Implementation  prefill  decode
MNN             330      75
KleidiAI        295      85
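
The relative change implied by the table can be worked out with a small helper. This is a sketch that assumes the figures are speeds (higher is faster), as the comment describes them; the function name is hypothetical.

```cpp
// Hypothetical helper: percentage change of `candidate` relative to `baseline`.
// Positive means the candidate is faster, assuming both values are speeds.
inline double relativeChangePercent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}
```

With the numbers above, relativeChangePercent(75, 85) is about +13.3 for decode and relativeChangePercent(330, 295) is about -10.6 for prefill. On this M3 Pro run, decode speeds up while prefill slows down.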

yiyangfan01 · 2024-08-24T09:32:05Z

Here is the perf data I collected with the same model as @wangzhaode, on a Redmi K60 Ultra (MTK D9300 inside) with 16GB RAM and 4 threads.
Prefill improves by 57% and decode improves by 28%.

xhzheng1895 closed this Aug 21, 2024

xhzheng1895 reopened this Aug 21, 2024

wangzhaode added 2 commits September 2, 2024 17:35

Bugfix of thread workload.

6142111

Bugfix of ocIndex over ocUp4.

018eb7f
