llama.cpp allocates way more RAM than ollama #9414
Unanswered
commonuserlol asked this question in Q&A
Replies: 1 comment · 1 reply
-
By default, …
-
Hi, I want to say up front that I'm just an end user who wants the Vulkan/AVX features that ollama doesn't have.
I've compiled llama.cpp and tried to run https://huggingface.co/bartowski/LongWriter-llama3.1-8b-GGUF/blob/main/LongWriter-llama3.1-8b-IQ3_XS.gguf. On the CPU it did start and run, although I think it was slower than ollama (I'm comparing against https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, which might use less RAM thanks to the safetensors format?), and it used almost all of the available RAM. On the GPU, Vulkan was unable to allocate memory. I noticed that the K and V cache sizes are 8 GB each; does that mean I need more (V)RAM? (Rough math after my setup below.)
Setup:
Windows 11 24H2 (MSYS2 for Vulkan)
RX 570 4GB
16GB RAM
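
As a sanity check on those 8 GB figures, here is my own back-of-the-envelope math rather than anything llama.cpp printed: if the K/V cache is kept in fp16 and the run uses the model's full 131072-token context, 8 GB per cache is roughly what the usual Llama 3.1 8B shapes predict. The layer/head counts below are the commonly cited values for this architecture, assumed here rather than read from the GGUF metadata.

```python
# Back-of-the-envelope KV-cache size for a Llama-3.1-8B-style model.
# Assumed shapes (typical for Llama 3.1 8B; check the GGUF metadata for the real values):
#   32 transformer layers, 8 KV heads, head dim 128, fp16 cache (2 bytes per element),
#   and the model's full 131072-token context if no smaller context size is requested.

n_layers      = 32
n_kv_heads    = 8
head_dim      = 128
bytes_per_elt = 2          # fp16 K/V cache
n_ctx         = 131072     # Llama 3.1 default context length

# Size of the K cache alone; the V cache is the same size again.
k_cache_bytes = n_layers * n_kv_heads * head_dim * bytes_per_elt * n_ctx
print(f"K cache: {k_cache_bytes / 2**30:.1f} GiB")      # ~8.0 GiB
print(f"K + V  : {2 * k_cache_bytes / 2**30:.1f} GiB")  # ~16.0 GiB
```

The same math with a 4096-token context comes out to roughly 256 MiB per cache, which is why capping the context size (the `-c`/`--ctx-size` option) is the usual way to make a run like this fit in 16 GB.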