Hi, I think ROCm is supported through hipBLAS, which piggybacks on the CUDA backend. So I think you just need to build the project following the instructions for hipBLAS and add `GGML_RPC=1`. Let me know if this works for you. Also, it would be great if you shared your performance results.
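For reference, a CMake-based build would look roughly like this; this is a sketch, and the `GGML_HIPBLAS`/`GGML_RPC` option names are assumed to mirror the Makefile flags used in the reply below (they may differ between llama.cpp versions):

```sh
# Assumed CMake options mirroring the GGML_HIPBLAS / GGML_RPC Makefile flags;
# check the hipBLAS build docs for the exact names on your llama.cpp version.
cmake -B build -DGGML_HIPBLAS=ON -DGGML_RPC=ON
cmake --build build --config Release -j
```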
-
#9493
-
You can build it with:

```sh
make GGML_HIPBLAS=1 GGML_RPC=1 -j16
```
On the remote cluster you have to launch one RPC server per GPU:

```sh
HIP_VISIBLE_DEVICES=0 ./rpc-server --host 127.0.0.1 --port 9999 --mem 16000
HIP_VISIBLE_DEVICES=1 ./rpc-server --host 127.0.0.1 --port 9998 --mem 16000
```
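With more GPUs per node, the same pattern can be scripted; a small sketch, assuming the `rpc-server` flags shown above, one port per GPU counting down from 9999, and a placeholder GPU count:

```sh
#!/usr/bin/env bash
# Start one rpc-server per GPU, pinning each instance to a single device
# via HIP_VISIBLE_DEVICES and giving each its own port.
NUM_GPUS=2        # assumption: adjust to the number of GPUs on the node
BASE_PORT=9999    # first port; GPU i listens on BASE_PORT - i
MEM_MB=16000      # memory to expose per GPU, as in the commands above

for ((i = 0; i < NUM_GPUS; i++)); do
  HIP_VISIBLE_DEVICES=$i ./rpc-server \
    --host 127.0.0.1 \
    --port $((BASE_PORT - i)) \
    --mem "$MEM_MB" &
done
wait   # keep the script running until the servers exit
```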
On the host, use the `--rpc` parameter:

```sh
./llama-cli -m /.../LLama3-8b.gguf -ngl 99 -p "Tell a joke" --rpc 127.0.0.1:9999,127.0.0.1:9998
```
Keep in mind row split does not work with RPC, and you have to specify your available GPU memory with `--mem`.
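To spell out the row-split caveat: the split mode is normally chosen with `--split-mode` (flag name assumed from the standard llama-cli options); with RPC, keep the default layer split:

```sh
# Works with --rpc: default layer split (made explicit here for illustration)
./llama-cli -m /.../LLama3-8b.gguf -ngl 99 --rpc 127.0.0.1:9999,127.0.0.1:9998 --split-mode layer

# Not supported together with --rpc:
# ./llama-cli ... --rpc ... --split-mode row
```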
RPC can work between different backends; I have tested a ROCm + Linux host with a CUDA + Windows RPC server, and it worked.