Why does llama.cpp inference with the Vulkan backend on an Android GPU have very bad performance? #9464
FranzKafkaYu
started this conversation in General
-
With the same model, same prompt, and same output: inference with pure CPU takes 1500 ms~1700 ms, while inference with Vulkan (GPU) takes 24000 ms~25000 ms.
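For context, a CPU-vs-GPU comparison like this is typically produced with llama-bench by varying how many layers are offloaded to the GPU. A minimal sketch, assuming a recent llama.cpp checkout where the Vulkan backend is enabled with the GGML_VULKAN cmake option; model.gguf is a placeholder path:

```sh
# Build llama.cpp with the Vulkan backend (requires Vulkan headers/loader)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# CPU-only baseline: offload 0 layers to the GPU
./build/bin/llama-bench -m model.gguf -ngl 0

# Vulkan run: offload all layers to the GPU (99 offloads everything)
./build/bin/llama-bench -m model.gguf -ngl 99
```

Running both and comparing the reported tokens/s is the usual way to quantify the gap described above.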
-
Related issue:
I have tried many times: when I enable the Vulkan backend with GPU acceleration, the performance is very bad. So far I have tried a Qualcomm Adreno GPU and an ARM Mali GPU. The Adreno GPU fails to load the model, and the Mali GPU can load the model and run inference, but the performance is very bad. A build-and-run sketch for reproducing this on device follows below.
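One way to reproduce the Adreno model-load failure and the Mali slowdown is to cross-compile with the Android NDK and run on the device over adb. This is a rough sketch only: the NDK path, ABI/platform values, and model path are placeholders, and an Android Vulkan build may need extra steps (e.g., the Vulkan loader and shader tooling available at build time), so the exact options may differ:

```sh
# Cross-compile for Android with the NDK toolchain file (path is a placeholder)
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DGGML_VULKAN=ON
cmake --build build-android --config Release

# Push the binary and model to the device and run with full GPU offload;
# a model-load failure on Adreno should surface in the startup output here
adb push build-android/bin/llama-cli /data/local/tmp/
adb push model.gguf /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama-cli -m model.gguf -p 'Hello' -n 32 -ngl 99"
```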
I tried other projects, such as MLC-LLM and MediaPipe; they work with the GPU and the performance is decent. Why can't llama.cpp compete with these projects? Can someone explain this to me?