
Support GGUF quantized models #413

Open
bartekupartek opened this issue Jan 28, 2025 · 5 comments

@bartekupartek

Bumblebee supports only hardcoded Hugging Face models. I found that e.g. Llama 3.2 might run ~2x faster with a 60% smaller memory footprint when using a quantized version, and Unsloth does this very well: https://huggingface.co/unsloth
I found GH issue #376, but it doesn't fully answer whether this is possible or what the problem is, so could Bumblebee support it?
Currently I only use official repos like Llama 3.2, but it's hard to fit more than one model on a single GPU. I still love using Elixir to interact with models over LiveView without coupling to Python.

@jonatanklosko
Member

Axon provides a way to quantize a model (as described in the quantization section of this article), however that requires loading the whole model first. Ideally we would be able to load the GGUF format, but that's not supported at the moment. Also related to #411.
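
For context, a minimal sketch of that workaround, assuming Bumblebee's `load_model/1` and Axon's `Axon.Quantization.quantize/2` (the repo name below is only an example); note that the full-precision params still have to fit in memory before quantization:

```elixir
# Minimal sketch: load the full-precision checkpoint first, then quantize.
# Assumes Axon >= 0.7 with Axon.Quantization; the repo name is illustrative.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "meta-llama/Llama-3.2-1B"})

# Quantizes supported layers and returns an updated model and model state.
{quantized_model, quantized_params} = Axon.Quantization.quantize(model, params)
```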

@jonatanklosko changed the title from "support for non official GGUF quantized models" to "Support GGUF quantized models" on Jan 29, 2025
@bartekupartek
Author

bartekupartek commented Jan 29, 2025

Thank you for the update!
It looks like some binary conversion functions for the desired 4-bit type were merged in elixir-nx/nx#1528, but from my quick research, native support in XLA/PyTorch is months away. It seems only jax-ml recently added unpacked 4-bit dtype support, so it will probably take quite some time until this is available across the Nx backends.
Also, GGUF is another format on top of that, which is a different story as far as I understand.
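
For what it's worth, GGUF itself is just a self-describing container (header, metadata key/values, then packed tensor data); the hard part is mapping its quantized block types onto Nx. A rough sketch of reading only the header, based on the public GGUF spec (v2/v3 layout assumed, module name made up):

```elixir
# Rough sketch of reading a GGUF header (GGUF v2/v3 layout: 4-byte "GGUF"
# magic, uint32 version, uint64 tensor count, uint64 metadata k/v count).
defmodule GGUFHeader do
  def read(path) do
    {:ok, io} = File.open(path, [:read, :binary])

    <<"GGUF", version::32-unsigned-little, tensor_count::64-unsigned-little,
      metadata_kv_count::64-unsigned-little>> = IO.binread(io, 24)

    File.close(io)
    %{version: version, tensor_count: tensor_count, metadata_kv_count: metadata_kv_count}
  end
end
```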

@bartekupartek
Author

bartekupartek commented Feb 2, 2025

Initially I was looking at GGUF, but actually many quantized models on Hugging Face (like Unsloth's optimized versions) use the bitsandbytes library rather than the GGUF format, which seems to be more of an Ollama thing.
I think I figured out how we could solve this and make Nvidia GPUs go BRRR: by porting bitsandbytes, which is essentially a Python wrapper around custom CUDA functions:

> in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.

They also claim they will support more than just CUDA in the future.

I made a PoC by porting one of bitsandbytes' CUDA functions, quantizeBlockwise_fp16_fp4, to Elixir:
https://github.com/bartekupartek/nx_bits_and_bytes/blob/main/c_src/elixirInterface.cpp
vs
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/csrc/pythonInterface.cpp#L129

I would love to hear thoughts on this approach; it seems like it could be a quicker win than waiting for native 4-bit support in the XLA/PyTorch backends. I was thinking about a new Nx backend implementation or keeping this as a separate library.
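
For reference, the Elixir side of such a port could be a thin NIF stub along these lines (module and function names are hypothetical, not the actual PoC API):

```elixir
# Hypothetical NIF stub for the ported kernel; the real implementation
# lives in the C/CUDA code and replaces these functions at load time.
defmodule NxBitsAndBytes do
  @on_load :load_nif

  def load_nif do
    :erlang.load_nif(~c"./priv/nx_bits_and_bytes", 0)
  end

  # Quantizes an f16 tensor binary into FP4 blocks of `block_size`,
  # returning {quantized_binary, per_block_absmax_binary}.
  def quantize_blockwise_fp16_fp4(_input, _block_size),
    do: :erlang.nif_error(:nif_not_loaded)
end
```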

@jonatanklosko
Member

For reference, here is a table of the different quantization libraries/techniques/formats that hf/transformers supports.

Axon implements a specific quantization method. I believe the idea is that we could use one of the GGUF checkpoint types to load the params as a matching Axon quantization state, but my interpretation may be wrong.

Using native kernels is an option. I don't think it makes sense as a separate backend, because it would implement only a few specific operations, so model inference would need to transfer data back and forth between backends. We would rather plug that into the EXLA computation, but how this should work is still an open question (related to elixir-nx/nx#1519).
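
To illustrate the round trips a separate backend would force (real Nx calls on a toy tensor, purely for illustration):

```elixir
# Illustrative only: every op not implemented by the quantization backend
# would require moving the tensor to another backend and back.
x = Nx.iota({4, 4}, type: :f16, backend: EXLA.Backend)

# ...hand off to the (hypothetical) quantization backend for one op...
x = Nx.backend_transfer(x, Nx.BinaryBackend)

# ...then transfer back to EXLA for the rest of the model.
x = Nx.backend_transfer(x, EXLA.Backend)
```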

@seanmor5 may have more thoughts here :)

@bartekupartek
Author

That makes sense, more or less; the missing piece for me was the current Axon quantization implementation, so that it would integrate well with plugins supporting e.g. 4-bit quantization like bitsandbytes, or some other plugin that might support the GGUF format.
I like the idea of integrating this as an EXLA plugin rather than via custom NIFs, to possibly reduce memory transfers.

The open architectural question: I'm not sure how much of the bitsandbytes CUDA code is compatible with or redundant to XLA, or whether a cleaner approach would be to go all in on pure XLA or all in on pure bitsandbytes 🤷‍♂

@seanmor5 what's missing to merge this functionality? Will there be some documentation for it?
