llama.cpp Windows/ROCm builds are broken? Using shared GPU memory instead of dedicated. #9960
Replies: 3 comments
-
After doing more testing, I've noticed two things. First: I was quantizing the models with … Second, closely related to the issue: some models work just fine. LLaMA 3.2 3B quantized to Q8_0, for example, works without problems and is loaded into dedicated GPU memory!
-
Small update: I've confirmed that this bug does not happen when using Vulkan as the backend.
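For anyone who wants to try the same workaround from source, here is a minimal sketch of selecting the Vulkan backend. The flag name varies by llama.cpp revision (`GGML_VULKAN` on recent trees, `LLAMA_VULKAN` on older ones) and the Vulkan SDK needs to be installed, so treat this as a sketch rather than the exact command:

```sh
# Minimal sketch of a Vulkan-backend build; assumes the Vulkan SDK is installed.
# The flag is GGML_VULKAN on recent llama.cpp revisions (LLAMA_VULKAN on older ones).
cmake -S . -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```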
-
Just ran into this issue on a Ryzen 5 5600X with a Radeon RX 6800 XT on Windows 11. I can confirm that the problem is with the latest ROCm runtime (1.1.13); switching to Vulkan or to the 1.1.10 runtime works.
-
I've been using llama.cpp with ROCm 6.1.2 on an up-to-date Windows 11 install for quite a while.
My hardware is an RX 7900 XT (gfx1100, 20GB VRAM) paired with 32GB of RAM and a Ryzen 9 5900X.
Recently (this month) I've noticed that the latest builds perform extremely poorly compared to previous ones - inference is an order of magnitude slower - and it happens only on Windows. I'm also using llama.cpp on Arch Linux with ROCm 6.0.2 on the same hardware without any performance issues whatsoever, so I assume it's a Windows-specific bug. Note that I haven't changed anything in my hardware setup or OS. I did modify the way I build llama.cpp recently, but not in a major way, and I've also checked with the "old" build method (attached below) that has worked previously.
I have noticed that any model I try to load gets pushed into shared GPU memory (per Task Manager) instead of dedicated memory, as shown in the screenshot below. Shared memory is reported as 16GB while the GPU has 20GB of VRAM (which matches the reported dedicated memory), and RAM usage spikes to its maximum, so I guess the model really ends up in system RAM, which makes inference incredibly slow due to memory bandwidth constraints.
I would like to present results from `llama-bench` here, but I'm unable to finish a run in reasonable time - it gives me ~500 t/s in pp512 (it's well above 1000 on the Linux build) and then just hangs trying to do the warmup for generation. I did manage to test inference speed with `llama-server`, but I got worse results than with plain CPU inference.

To check whether this is an issue with recent builds, I rolled back to llama.cpp from around a month ago, to builds I knew for sure had been working fine, but... nope, still the same thing! Inference happens on the GPU, but the model is loaded into shared memory!
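For reference, a minimal reproduction sketch along those lines, assuming a GGUF model on disk (the path is a placeholder); `llama-bench` runs pp512 and tg128 by default, and `-ngl 99` offloads all layers to the GPU:

```sh
# pp512 / tg128 are llama-bench's defaults; -ngl 99 offloads every layer to the GPU.
# The model path is a placeholder.
llama-bench -m models/model-q8_0.gguf -ngl 99

# The same model served with llama-server, to sanity-check generation speed interactively.
llama-server -m models/model-q8_0.gguf -ngl 99 -c 4096
```

Watching Task Manager's dedicated vs. shared GPU memory while the model loads during either command is enough to see where it lands.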
I have no idea what could cause this. I'm attaching the build script for Windows that I've been using successfully for a long time; I use the same CMake flags on Linux. Can anyone suggest how I could diagnose and fix this issue?
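For reference, a typical Windows ROCm/HIP build of llama.cpp looks roughly like the sketch below (not the attached script). It assumes the AMD HIP SDK is installed (the installer sets `HIP_PATH`), Ninja is on PATH, and `gfx1100` matches the RX 7900 XT; note that the backend flag name differs between llama.cpp revisions (`GGML_HIPBLAS` on builds from this period, `GGML_HIP` on newer ones):

```bat
:: Sketch of a Windows ROCm/HIP build, not the exact attached script.
:: Assumes the AMD HIP SDK is installed (it sets %HIP_PATH%) and Ninja is available.
:: The backend flag is version-dependent: GGML_HIPBLAS around these builds, GGML_HIP on newer trees.
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja ^
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DGGML_HIPBLAS=ON ^
  -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release
```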