Llama.cpp RPC over Ethernet strangely slow #9136
Unanswered
becky-soda
asked this question in Q&A
Replies: 1 comment
-
Could you provide some more information about the commands and models that you are using? It would be useful to see the exact invocations. During inference, only the hidden state is transferred across the network after each layer boundary. That state is very small (a few kB), so it is normal not to see huge network traffic. Whether the ~3 t/s speed is expected is hard to say without additional information. There is some pending work on optimizing the network overhead (#8032), but it is not yet ready for testing.
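A rough back-of-envelope makes the point concrete. The 4096-dimensional hidden state, fp16 activations, and a single network split point below are all illustrative assumptions, not figures taken from this thread:

```python
# Estimate the steady-state RPC traffic per generated token.
HIDDEN_DIM = 4096        # hypothetical embedding size (a 7B/8B-class model)
BYTES_PER_VALUE = 2      # fp16 activations
SPLIT_POINTS = 1         # times the layer stack crosses the network per token

bytes_per_token = HIDDEN_DIM * BYTES_PER_VALUE * SPLIT_POINTS
print(f"hidden state per crossing: {bytes_per_token} bytes "
      f"(~{bytes_per_token / 1024:.0f} KiB)")

tokens_per_second = 3
bandwidth_bits = bytes_per_token * tokens_per_second * 8
print(f"steady-state payload traffic: ~{bandwidth_bits / 1e3:.0f} kb/s")
```

Even with generous protocol overhead on top, the payload is orders of magnitude below a 2.5 Gb/s link, which is why the link sits mostly idle during token generation.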
-
Hey everyone,
I was hoping to get some help with the RPC service in Llama.cpp. I'm running a pair of systems with the latest Llama.cpp, which I compiled myself (`-DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON` passed to cmake). Each system has two GPUs, all recent discrete GeForce cards (a 4090 and 4060 Tis).
The trouble is that I recently upgraded that segment of my network to 2.5 Gb/s Ethernet, having read Reddit posts suggesting the network would be the limiting factor for Llama.cpp inference over RPC. Someone in one thread was even talking about using USB4/Thunderbolt 4 for a theoretical 40 Gb/s. The strange thing is that I'm only seeing about 30-50 Mb/s (megabits, not gigabits) on the link while an inference task is running. That's nowhere near the maximum speed I've seen for other tasks, such as loading the model (about 1.9-2.1 Gb/s). As a result, the token rate is much slower than I expected, at around 3 t/s.
If I remove one of the cards from the RPC cluster, inference speeds up a little, but the network still transfers no faster than around 30-50 Mb/s. The slight speedup makes sense, but with less total VRAM the models can't be as large.
(I've also tried several models, with no notable difference in that network speed or tokens per second.)
Given that the GPUs' memory interfaces are far faster (18 Gbps GDDR6, roughly 288 GB/s total, on a 4060 Ti) and PCIe is faster (about 7.88 GB/s for PCIe 4.0 x4), I don't understand why the network isn't being saturated and why inference runs so much slower. I can't work out what's causing this bottleneck.
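For comparison with the bus speeds above, one can check how long the per-token hidden-state transfer would actually occupy the wire. The 8 KiB transfer size (a 4096-dimensional fp16 hidden state) is an assumed figure for illustration:

```python
# Compare the wire time of one hidden-state transfer against the
# observed per-token budget at ~3 t/s on a 2.5 Gb/s link.
token_time_s = 1 / 3            # ~3 t/s -> ~333 ms per token
transfer_bytes = 8192           # assumed: 4096 fp16 values
link_bps = 2.5e9                # 2.5 Gb/s Ethernet

transfer_time_s = transfer_bytes * 8 / link_bps
print(f"wire time for the hidden state: {transfer_time_s * 1e6:.0f} us")
print(f"share of the token budget: {transfer_time_s / token_time_s:.4%}")
```

If the transfer occupies only tens of microseconds out of a ~333 ms token, the link cannot be the bottleneck; compute time and per-request round-trip latency dominate, which is consistent with the link never saturating.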
Are there any obvious things I can check to try to fix this please? Any help is greatly, greatly appreciated, thank you!