llama.cpp Windows/ROCm builds are broken? Using shared GPU memory instead of dedicated. #9960
Replies: 3 comments
-
After doing more testing, I've noticed two things. First: I was quantizing the models with … Second, closely related to the issue: some models work just fine. LLaMA 3.2 3B quantized to Q8_0, for example, works without problems and is loaded into dedicated GPU memory!
-
Small update: I've confirmed that this bug does not happen when using Vulkan as the backend.
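For anyone who wants to try the same workaround from source, here is a minimal sketch of selecting the Vulkan backend. The flag name varies by llama.cpp revision (`GGML_VULKAN` on recent trees, `LLAMA_VULKAN` on older ones) and the Vulkan SDK needs to be installed, so treat this as a sketch rather than the exact command:

```sh
# Minimal sketch of a Vulkan-backend build; assumes the Vulkan SDK is installed.
# The flag is GGML_VULKAN on recent llama.cpp revisions (LLAMA_VULKAN on older ones).
cmake -S . -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```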
-
Just ran into this issue on a Ryzen 5 5600X with a Radeon RX 6800 XT on Windows 11. I can confirm that the problem is with the latest ROCm runtime (1.1.13); switching to Vulkan or to the 1.1.10 runtime works.
-
I've been using llama.cpp with ROCm 6.1.2 on an up-to-date Windows 11 install for quite a while.
My hardware is an RX 7900 XT (gfx1100, 20GB VRAM) paired with 32GB of RAM and a Ryzen 9 5900X.
Recently (this month) I've noticed that the latest builds perform extremely poorly compared to previous ones - inference is an order of magnitude slower - and it happens only on Windows. I'm also using llama.cpp on Arch Linux with ROCm 6.0.2 on the same hardware without any performance issues whatsoever, so I assume it's a Windows-specific bug. Note that I haven't changed anything in my hardware setup or OS. I did modify the way I build llama.cpp recently, but not in a major way, and I've also checked with the "old" build method (attached below) that has worked previously.
I have noticed that any model I try to load gets pushed into shared GPU memory (per Task Manager) instead of dedicated memory, as shown in the screenshot below. Shared memory is reported as 16GB while the GPU has 20GB of VRAM (which matches the reported dedicated memory), and RAM usage spikes to its maximum, so I guess the model really ends up in system RAM, which makes inference incredibly slow due to memory bandwidth constraints.
I would like to present results from `llama-bench` here, but I'm unable to finish a run in reasonable time - it gives me ~500 t/s in pp512 (it's well above 1000 on the Linux build) and then just hangs trying to do the warmup for generation. I did manage to test inference speed with `llama-server`, but I got worse results than with plain CPU inference.

To check whether this is an issue with recent builds, I rolled back to llama.cpp from around a month ago, to builds I knew for sure had been working fine, but... nope, still the same thing! Inference happens on the GPU, but the model is loaded into shared memory!
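For reference, a minimal reproduction sketch along those lines, assuming a GGUF model on disk (the path is a placeholder); `llama-bench` runs pp512 and tg128 by default, and `-ngl 99` offloads all layers to the GPU:

```sh
# pp512 / tg128 are llama-bench's defaults; -ngl 99 offloads every layer to the GPU.
# The model path is a placeholder.
llama-bench -m models/model-q8_0.gguf -ngl 99

# The same model served with llama-server, to sanity-check generation speed interactively.
llama-server -m models/model-q8_0.gguf -ngl 99 -c 4096
```

Watching Task Manager's dedicated vs. shared GPU memory while the model loads during either command is enough to see where it lands.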
I have no idea what could cause this. I'm attaching the build script for Windows that I've been using successfully for a long time; I use the same CMake flags on Linux. Can anyone suggest how I could diagnose and fix this issue?
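For reference, a typical Windows ROCm/HIP build of llama.cpp looks roughly like the sketch below (not the attached script). It assumes the AMD HIP SDK is installed (the installer sets `HIP_PATH`), Ninja is on PATH, and `gfx1100` matches the RX 7900 XT; note that the backend flag name differs between llama.cpp revisions (`GGML_HIPBLAS` on builds from this period, `GGML_HIP` on newer ones):

```bat
:: Sketch of a Windows ROCm/HIP build, not the exact attached script.
:: Assumes the AMD HIP SDK is installed (it sets %HIP_PATH%) and Ninja is available.
:: The backend flag is version-dependent: GGML_HIPBLAS around these builds, GGML_HIP on newer trees.
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja ^
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DGGML_HIPBLAS=ON ^
  -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release
```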