Performance seems too good, what is going on ? #337

cmp-nct · 2023-03-20T19:19:34Z

cmp-nct
Mar 20, 2023

I've been testing the 8 bit 6B llama on my 3090 and my results were at best as fast as your CPU video.
I noticed the GPU does not get used much, so I assume the CPP is well optimized and the GPU implementation I ran was not.

Though maybe someone has additional insights ?

If CPU gives such a high speed, a 3090 should deliver hundreds or more tokens/sec. I was running at 4-5

BadisG · 2023-03-20T19:32:25Z

BadisG
Mar 20, 2023

There is no gpu use during the process. Only the CPU does the work... I know right it's black magic what he has achieved!

7 replies

cmp-nct Mar 20, 2023
Author

Because the GPU version (webui) is written in python, and python is slow as fuck lmao

transformers, pytorch that's all C++
I really need to find out the point that slows it by the factor of 100

BadisG Mar 20, 2023

Using C++ libraries is not enough to speed your code up. Python is often much slower than C++ because Python is interpreted, while C++ is compiled. This means that when using Python, the interpreter must first convert the code into an executable format which can then be executed. This process of interpretation and compilation is what slows down the execution of the program in comparison to directly writing the code in C++ which does not need to go through this intermediate step.

setzer22 Mar 20, 2023

@BadisG, this is not factually accurate. Interpreted python is slow, that much I agree with, but libraries like torch work in a way where you set up a computation by batching all the work and submitting it in one go. Tensor data never even touches the Python side and there's almost no work done by the python interpreter that would slow thigns down.

IMO, this is a valid question. I, too, have been wondering why is it that llama.cpp is capable of keeping up with GPU implementations 🤔 And it's certainly not because "Python".

cmp-nct Mar 20, 2023
Author

@setzer22 I really think it's important to find that out.
While I love a C implementation exists and value the future implications we can already see the real powerhorse of ML is a GPU.
I digged into my GPU usage while working with the 6B model and it was so low that it didn't even show up as task use.
Compare that to stable-diffusion .. the GPU is burning.

On one side, I would love to see a CUDA extension on this repository. To get away from python.
But first, to be practical, we should find out what slows it down on python implementations and fix that.

Given the speed on a PC, a good gaming GPU should deliver 100-500 tokens per second. not 5.

Tachyon5 Oct 22, 2024

The short answer is you need to compile llama.cpp for gpu usage and offload the layers to GPU using the appropriate arguments. llama.cpp has various backends and the default ggml will not even utilize the GPU. I can personally attest that the llama.cpp with GPU backend is much faster.

anzz1 · 2023-03-20T21:02:19Z

anzz1
Mar 20, 2023

Take into note that while named llama.cpp, in reality it's coded mostly in C and leans heavily towards lean and fast C-style code. C++ is hardly used at all and none of that slow "modern C++" stuff. Functions are lean and fast, data structures are lean and fast, memory accesses are fast, math is fast, everything is fast.

For instance, even the simplest C++ paradigms, classes, increase the cost of a function call tenfold since they will not be called directly anymore but routed through virtual function tables. You need constructors, destructors, extra (de)allocations of memory and all that nonsense. All these individually small things add up. C is about as fast as you can get without handcrafting assembly.

As an example, the latest compiled windows binary weights just 190KB for llama.exe and 85KB for quantize.exe with very few imports.

Obviously the used codepaths matter the most so size doesn't 100% directly correlate, but do me a favor and open the binary in your favourite disassembler (Ghidra/IDA Pro/etc), then open the disassembly of some other implementation and compare, and you'll instantly see what I'm talking about here.

I am very delighted to see such lean and fast code which is a rare sight these days. I think most people don't really grasp what an exceptional job @ggerganov has done here; condensing what is in essence a pretty complex thing in such a minimal, fast, close to hardware implementation without resorting to using outside libraries. This not only makes it very fast but also very portable as it lacks any dependencies. Most people would just slap a chonky library in top of another and then wonder why the code isn't fast.

That being said, in these types of workloads GPU's are exponentially faster so even less optimized code should run faster than a well-optimized CPU implementation like this. If a CUDA implementation would be made in the llama.cpp style, it would be exceptionally fast.

0 replies

bitRAKE · 2023-03-20T21:05:13Z

bitRAKE
Mar 20, 2023

Memory bandwidth is the bottleneck with these models. No one has several gigabytes of cache, yet. The ratio of instruction speed to memory speed is massive. Given this fact I would recommend using a 4-bit quantized model on the 3090 - there is little difference in output quality.

https://github.com/qwopqwop200/GPTQ-for-LLaMA/

If you want faster, you are going to need the whole model to fit on your graphics card - this means switching to the 30B model (4-bit).

Let us know how it goes.

4 replies

cmp-nct Mar 20, 2023
Author

I must be missing something.
My CPU to RAM bandwidth is around 95GB/sec. My GPU memory bandwidth is somewhere around 1200 GB/sec.
GPU RAM is at 24 gigs which is enough to hold the entire 6B weights.

I did not dig into the python code yet but I'm quite confident the first step is loading the weights into the GPU memory. That's why it takes 30 seconds to start.

bitRAKE Mar 21, 2023

Ha! I mis-read that you were using the 65B model. Sorry. You're correct - the whole model would be on the GPU and minimal bus bandwidth would be needed to transfer tokens.

What performance are you getting?

I can get 8-12 tokens/second, with the 7B parameter model on the CPU.

cmp-nct Mar 21, 2023
Author

Yea I have 4-7 tokens per second on the 3090 (i only tested the webui repository, it seems the most popular)
The GPU is idle in task manager, no real usage happens so something is really off

bitRAKE Mar 21, 2023

That doesn't sound right - I thought the tensor cores could do like 20 TFLOPS? If the GPU is idle, I'd first check the TensorFlow version installed (pip list). If it were using the CPU that would seem fairly obvious. So, it's probably something else.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance seems too good, what is going on ? #337

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 11 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Performance seems too good, what is going on ? #337

Replies: 3 comments · 11 replies

cmp-nct Mar 20, 2023 Author

cmp-nct Mar 20, 2023 Author

cmp-nct Mar 20, 2023 Author

cmp-nct Mar 21, 2023 Author

Replies: 3 comments 11 replies

cmp-nct Mar 20, 2023
Author

cmp-nct Mar 20, 2023
Author

cmp-nct Mar 20, 2023
Author

cmp-nct Mar 21, 2023
Author