
Support GGUF quantized models #413

Open
bartekupartek opened this issue Jan 28, 2025 · 5 comments

@bartekupartek

Bumblebee supports only hardcoded Hugging Face models. I found that e.g. Llama 3.2 might run ~2x faster with a 60% smaller memory footprint when using a quantized version, and Unsloth does this very well: https://huggingface.co/unsloth
I found GH issue #376, but it doesn't fully answer whether this is possible or what the problem is, so could Bumblebee support it?
Currently I only use official repos like Llama 3.2, but it's hard to fit more than one model on a single GPU. I still love using Elixir to interact with models over LiveView without coupling to Python.

@jonatanklosko
Member

Axon provides a way to quantize a model (as described in the quantization section of this article), however that requires loading the whole model first. Ideally we would be able to load the GGUF format, but that's not supported at the moment. Also related to #411.
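
For context, a minimal sketch of that workaround, assuming Bumblebee's `load_model/1` and Axon's `Axon.Quantization.quantize/2` (the repo name below is only an example); note that the full-precision params still have to fit in memory before quantization:

```elixir
# Minimal sketch: load the full-precision checkpoint first, then quantize.
# Assumes Axon >= 0.7 with Axon.Quantization; the repo name is illustrative.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "meta-llama/Llama-3.2-1B"})

# Quantizes supported layers and returns an updated model and model state.
{quantized_model, quantized_params} = Axon.Quantization.quantize(model, params)
```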

@jonatanklosko changed the title from "support for non official GGUF quantized models" to "Support GGUF quantized models" on Jan 29, 2025
@bartekupartek
Author

bartekupartek commented Jan 29, 2025

Thank you for the update!
It looks like some binary conversion functions for the desired 4-bit type were merged in elixir-nx/nx#1528, but from my quick research, native support in XLA/PyTorch is months away. It seems only jax-ml recently added unpacked 4-bit dtype support, so it will probably take quite some time until this is available across the Nx backends.
Also, GGUF is another format on top of that, which is a different story as far as I understand.
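
For what it's worth, GGUF itself is just a self-describing container (header, metadata key/values, then packed tensor data); the hard part is mapping its quantized block types onto Nx. A rough sketch of reading only the header, based on the public GGUF spec (v2/v3 layout assumed, module name made up):

```elixir
# Rough sketch of reading a GGUF header (GGUF v2/v3 layout: 4-byte "GGUF"
# magic, uint32 version, uint64 tensor count, uint64 metadata k/v count).
defmodule GGUFHeader do
  def read(path) do
    {:ok, io} = File.open(path, [:read, :binary])

    <<"GGUF", version::32-unsigned-little, tensor_count::64-unsigned-little,
      metadata_kv_count::64-unsigned-little>> = IO.binread(io, 24)

    File.close(io)
    %{version: version, tensor_count: tensor_count, metadata_kv_count: metadata_kv_count}
  end
end
```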

@bartekupartek
Author

bartekupartek commented Feb 2, 2025

Initially I was looking at GGUF, but actually many quantized models on Hugging Face (like Unsloth's optimized versions) use the bitsandbytes library rather than the GGUF format, which seems to be more of an Ollama thing.
I think I figured out how we could solve this and make Nvidia GPUs go BRRR: by porting bitsandbytes, which is essentially a Python wrapper around custom CUDA functions:

> in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.

They also claim they will support more than just CUDA in the future.

I made a PoC by porting one of bitsandbytes' CUDA functions, quantizeBlockwise_fp16_fp4, to Elixir:
https://github.com/bartekupartek/nx_bits_and_bytes/blob/main/c_src/elixirInterface.cpp
vs
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/csrc/pythonInterface.cpp#L129

I would love to hear thoughts on this approach; it seems like it could be a quicker win than waiting for native 4-bit support in the XLA/PyTorch backends. I was thinking about a new Nx backend implementation or keeping this as a separate library.
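
For reference, the Elixir side of such a port could be a thin NIF stub along these lines (module and function names are hypothetical, not the actual PoC API):

```elixir
# Hypothetical NIF stub for the ported kernel; the real implementation
# lives in the C/CUDA code and replaces these functions at load time.
defmodule NxBitsAndBytes do
  @on_load :load_nif

  def load_nif do
    :erlang.load_nif(~c"./priv/nx_bits_and_bytes", 0)
  end

  # Quantizes an f16 tensor binary into FP4 blocks of `block_size`,
  # returning {quantized_binary, per_block_absmax_binary}.
  def quantize_blockwise_fp16_fp4(_input, _block_size),
    do: :erlang.nif_error(:nif_not_loaded)
end
```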

@jonatanklosko
Member

For reference, here is a table of the different quantization libraries/techniques/formats that hf/transformers supports.

Axon implements a specific quantization method. I believe the idea is that we could use one of the GGUF checkpoint types to load the params as a matching Axon quantization state, but my interpretation may be wrong.

Using native kernels is an option. I don't think it makes sense as a separate backend, because it would implement only a few specific operations, so model inference would need to transfer data back and forth between backends. We would rather plug that into the EXLA computation, but how this should work is still an open question (related to elixir-nx/nx#1519).
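
To illustrate the round trips a separate backend would force (real Nx calls on a toy tensor, purely for illustration):

```elixir
# Illustrative only: every op not implemented by the quantization backend
# would require moving the tensor to another backend and back.
x = Nx.iota({4, 4}, type: :f16, backend: EXLA.Backend)

# ...hand off to the (hypothetical) quantization backend for one op...
x = Nx.backend_transfer(x, Nx.BinaryBackend)

# ...then transfer back to EXLA for the rest of the model.
x = Nx.backend_transfer(x, EXLA.Backend)
```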

@seanmor5 may have more thoughts here :)

@bartekupartek
Author

That makes sense, more or less; the missing piece for me was the current Axon quantization implementation, so that it would integrate well with plugins supporting e.g. 4-bit quantization like bitsandbytes, or some other plugin that might support the GGUF format.
I like the idea of integrating this as an EXLA plugin rather than via custom NIFs, to possibly reduce memory transfers.

The open architectural question: I'm not sure how much of the bitsandbytes CUDA code is compatible with or redundant to XLA, or whether a cleaner approach would be to go all in on pure XLA or all in on pure bitsandbytes 🤷‍♂

@seanmor5 what's missing to merge this functionality? Will there be some documentation for it?
