Llama.cpp RPC over Ethernet strangely slow #9136
Unanswered
becky-soda
asked this question in Q&A
Replies: 1 comment
-
Could you provide some more information about the commands and models that you are using? It would be useful to see the exact invocations. During inference, only the hidden state is transferred across the network after each layer boundary. That state is very small (a few kB), so it is normal not to see huge network traffic. Whether the ~3 t/s speed is expected is hard to say without additional information. There is some pending work on optimizing the network overhead (#8032), but it is not yet ready for testing.
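A rough back-of-envelope makes the point concrete. The 4096-dimensional hidden state, fp16 activations, and a single network split point below are all illustrative assumptions, not figures taken from this thread:

```python
# Estimate the steady-state RPC traffic per generated token.
HIDDEN_DIM = 4096        # hypothetical embedding size (a 7B/8B-class model)
BYTES_PER_VALUE = 2      # fp16 activations
SPLIT_POINTS = 1         # times the layer stack crosses the network per token

bytes_per_token = HIDDEN_DIM * BYTES_PER_VALUE * SPLIT_POINTS
print(f"hidden state per crossing: {bytes_per_token} bytes "
      f"(~{bytes_per_token / 1024:.0f} KiB)")

tokens_per_second = 3
bandwidth_bits = bytes_per_token * tokens_per_second * 8
print(f"steady-state payload traffic: ~{bandwidth_bits / 1e3:.0f} kb/s")
```

Even with generous protocol overhead on top, the payload is orders of magnitude below a 2.5 Gb/s link, which is why the link sits mostly idle during token generation.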
-
Hey everyone,
I was hoping to get some help with the RPC service in Llama.cpp. I'm running a pair of systems with the latest Llama.cpp, which I compiled myself (`-DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON` passed to cmake). Each system has two GPUs, all recent discrete GeForce cards (a 4090 and 4060 Tis).
The trouble is that I recently upgraded that segment of my network to 2.5 Gb/s Ethernet, having read Reddit posts suggesting the network would be the limiting factor for Llama.cpp inference over RPC. Someone in one thread was even talking about using USB4/Thunderbolt 4 for a theoretical 40 Gb/s. The strange thing is that I'm only seeing about 30-50 Mb/s (megabits, not gigabits) on the link while an inference task is running. That's nowhere near the maximum speed I've seen for other tasks, such as loading the model (about 1.9-2.1 Gb/s). As a result, the token rate is much slower than I expected, at around 3 t/s.
If I remove one of the cards from the RPC cluster, inference speeds up a little, but the network still transfers no faster than around 30-50 Mb/s. The slight speedup makes sense, but with less total VRAM the models can't be as large.
(I've also tried several models, with no notable difference in that network speed or tokens per second.)
Given that the GPUs' memory interfaces are far faster (18 Gbps GDDR6, roughly 288 GB/s total, on a 4060 Ti) and PCIe is faster (about 7.88 GB/s for PCIe 4.0 x4), I don't understand why the network isn't being saturated and why inference runs so much slower. I can't work out what's causing this bottleneck.
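For comparison with the bus speeds above, one can check how long the per-token hidden-state transfer would actually occupy the wire. The 8 KiB transfer size (a 4096-dimensional fp16 hidden state) is an assumed figure for illustration:

```python
# Compare the wire time of one hidden-state transfer against the
# observed per-token budget at ~3 t/s on a 2.5 Gb/s link.
token_time_s = 1 / 3            # ~3 t/s -> ~333 ms per token
transfer_bytes = 8192           # assumed: 4096 fp16 values
link_bps = 2.5e9                # 2.5 Gb/s Ethernet

transfer_time_s = transfer_bytes * 8 / link_bps
print(f"wire time for the hidden state: {transfer_time_s * 1e6:.0f} us")
print(f"share of the token budget: {transfer_time_s / token_time_s:.4%}")
```

If the transfer occupies only tens of microseconds out of a ~333 ms token, the link cannot be the bottleneck; compute time and per-request round-trip latency dominate, which is consistent with the link never saturating.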
Are there any obvious things I can check to try to fix this please? Any help is greatly, greatly appreciated, thank you!