MNN batch inference time not more efficient than single image #3184
I've run the modified test with dimensions closer to mine; the results are below. My image is a tensor of shape (3, 413, 413). I've also tried to quantize the model, which reduced its size from 6.9 MB to 1.8 MB, but inference time increased from 7.5 ms to 11 ms, which also seems strange to me. I used low precision in my model's BackendConfig. Do you have any other advice on how I can reduce inference time if I cannot use a GPU?

(base) daniel@Daniel-PC:~/Desktop/MNN/build$ ./run_test.out speed/MatMulBConst
Originally posted by @mingyunzzu in #673
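For reference, here is a minimal sketch of how low precision is typically enabled on the CPU backend via `BackendConfig`, assuming MNN's standard session API; the model path `model.mnn` and the thread count are placeholders, not values from this thread. One hedged note: `Precision_Low` mainly pays off on CPUs with native fp16 arithmetic, and int8 quantization is tuned primarily for ARM NEON, so on a desktop x86 machine both can fall back to slower paths, which might be one explanation for the quantized model being slower here.

```cpp
// Sketch of enabling low precision via BackendConfig on the CPU backend.
// Assumptions (not from the issue): "model.mnn" and numThread = 4 are
// placeholders; resource cleanup is omitted for brevity.
#include <MNN/Interpreter.hpp>

int main() {
    auto* net = MNN::Interpreter::createFromFile("model.mnn");

    MNN::ScheduleConfig schedule;
    schedule.type      = MNN_FORWARD_CPU;
    schedule.numThread = 4;  // usually best at or below the physical core count

    MNN::BackendConfig backend;
    backend.precision      = MNN::BackendConfig::Precision_Low;  // fp16 where the CPU supports it
    schedule.backendConfig = &backend;

    auto* session = net->createSession(schedule);
    net->runSession(session);
    return 0;
}
```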
Can somebody explain why batched inference isn't more efficient in MNN? When I run detection on a single image it takes 7 milliseconds, and when I run it on a batch of 32 images it takes 8 milliseconds per image. This is only the inference time, measured around runSession, excluding image preparation and postprocessing. What can I do to get better results?
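Below is a sketch of the timing methodology described above, assuming MNN's public Interpreter/Session API: resize the input tensor to a batch of 32 (shape taken from this thread, NCHW layout assumed), re-plan the session, and time only the runSession call after a warm-up run. `model.mnn` is again a placeholder.

```cpp
// Sketch: measure only runSession for a batch of 32 images.
// Assumptions: "model.mnn" is a placeholder path; input layout is NCHW;
// cleanup is omitted for brevity.
#include <MNN/Interpreter.hpp>
#include <chrono>
#include <cstdio>

int main() {
    auto* net = MNN::Interpreter::createFromFile("model.mnn");
    MNN::ScheduleConfig schedule;
    schedule.type = MNN_FORWARD_CPU;
    auto* session = net->createSession(schedule);

    // Grow the input to a batch of 32 and re-allocate the execution plan.
    auto* input = net->getSessionInput(session, nullptr);
    net->resizeTensor(input, {32, 3, 413, 413});
    net->resizeSession(session);

    net->runSession(session);  // warm-up run, excluded from the measurement

    auto t0 = std::chrono::high_resolution_clock::now();
    net->runSession(session);
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("batch=32: %.2f ms total, %.2f ms per image\n", ms, ms / 32.0);
    return 0;
}
```

If the per-image time barely improves with batch size, one common explanation is that a single image already saturates the available CPU cores, so batching on CPU mostly amortizes dispatch overhead rather than exposing extra parallelism.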