Support GGUF quantized models #413
Comments
Axon provides a way to quantize a model (as described in the quantization section in this article), however that requires loading the whole model first. Ideally we would be able to load the GGUF format directly, but that's not supported at the moment. Also related to #411.
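For context, the current load-then-quantize flow looks roughly like the sketch below. This is a minimal illustration only: the repo name is just an example and the exact `Axon.Quantization` call is an assumption based on the Axon docs.

```elixir
# Minimal sketch of the current flow: the full-precision params are loaded
# first and only then quantized in memory. Repo name is an example; the exact
# Axon.Quantization API is assumed from the Axon docs.
{:ok, model_info} = Bumblebee.load_model({:hf, "meta-llama/Llama-3.2-1B"})

# Quantizes supported layers to int8, but the full fp32/bf16 checkpoint has
# already been materialized at this point, which is the limitation above.
{model, params} = Axon.Quantization.quantize(model_info.model, model_info.params)
```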
Thank you for the update!
Initially I was looking at GGUF, but many quantized models on Hugging Face (like unsloth's optimized versions) actually use the bitsandbytes library rather than the GGUF format, which seems to be more of an Ollama thing.
They also claim they plan to support backends other than CUDA in the future. I made a PoC by porting one of bitsandbytes' CUDA functions, quantizeBlockwise_fp16_fp4, to Elixir. Would love to hear thoughts on this approach; it seems like it could be a quicker win than waiting for native 4-bit support in the XLA/PyTorch backends. I was thinking about either a new Nx backend implementation or keeping this as a separate library.
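To make the blockwise idea concrete, something of this shape can be expressed in plain Nx, as sketched below. This is illustrative only: the module name, block size, and codebook values are placeholders, not the actual bitsandbytes FP4 tables or the PoC code.

```elixir
defmodule BlockwiseQuant do
  import Nx.Defn

  # Hypothetical block size and codebook, for illustration only. A faithful
  # port would use bitsandbytes' block sizes and its FP4 value table.
  @block_size 64
  @codebook Nx.tensor([
    -1.0, -0.6667, -0.5, -0.3333, -0.25, -0.1667, -0.0833, 0.0,
    0.0833, 0.1667, 0.25, 0.3333, 0.5, 0.6667, 0.8333, 1.0
  ])

  # Quantize: scale each block by its absmax and snap every value to the
  # nearest of the 16 codebook entries (a 4-bit code per weight).
  defn quantize(x) do
    blocks = Nx.reshape(x, {:auto, @block_size})
    absmax = Nx.abs(blocks) |> Nx.reduce_max(axes: [1], keep_axes: true) |> Nx.max(1.0e-12)
    normalized = blocks / absmax

    dist = Nx.abs(Nx.new_axis(normalized, 2) - Nx.reshape(@codebook, {1, 1, 16}))
    codes = Nx.argmin(dist, axis: 2) |> Nx.as_type(:u8)

    {codes, Nx.squeeze(absmax, axes: [1])}
  end

  # Dequantize: look the codes up in the codebook and rescale per block.
  defn dequantize(codes, absmax) do
    Nx.take(@codebook, codes) * Nx.new_axis(absmax, 1)
  end
end
```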
For reference, here is a table of the different quantization libraries/techniques/formats that hf/transformers supports. Axon implements a specific quantization method. I believe the idea is that we could use one of the GGUF checkpoint types to load the params into a matching Axon quantization state, but my interpretation may be wrong. Using native kernels is an option. I don't think it makes sense as a separate backend, because it would implement only a few specific operations, so model inference would need to transfer data back and forth between backends. We would rather want to plug that into the EXLA computation, but it's still an open question how this should work (related to elixir-nx/nx#1519). @seanmor5 may have more thoughts here :)
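As a rough illustration of the "load GGUF params into a matching quantization state" idea: GGUF's Q8_0 format stores each block of 32 weights as a little-endian fp16 scale followed by 32 int8 values, so decoding a tensor is essentially the sketch below (the module name is made up, it assumes an OTP version with 16-bit floats in the bit syntax, and the real work would be mapping the result onto Axon's quantized state rather than dequantizing eagerly).

```elixir
defmodule GGUFQ80 do
  # Q8_0 layout: per block, a little-endian fp16 scale followed by 32 int8
  # quantized values; dequantized weight = scale * q.
  @values_per_block 32

  # Eagerly decodes a binary of Q8_0 blocks into an f32 Nx tensor.
  def dequantize(binary) do
    for <<scale::little-float-size(16), qs::binary-size(@values_per_block) <- binary>> do
      Nx.multiply(scale, Nx.from_binary(qs, :s8))
    end
    |> Nx.concatenate()
  end
end
```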
That makes sense, more or less; the missing part for me was how the current Axon quantization implementation would integrate with plugins that support e.g. 4-bit quantization like bitsandbytes, or some other plugin that supports the GGUF format. And the architectural question: I'm not sure how much of the bitsandbytes CUDA code is compatible or redundant with XLA, or whether a cleaner approach wouldn't be to go all in on pure XLA or all in on pure bitsandbytes 🤷♂ @seanmor5 what's missing for merging this functionality? Will this come with some documentation?
Bumblebee supports only hardcoded Hugging Face models. I found that e.g. Llama 3.2 might be roughly 2x faster with a 60% smaller memory footprint when using a quantized version, and unsloth does this well: https://huggingface.co/unsloth
I found GH issue #376, but it doesn't fully answer whether this is possible or what the problem is, so: could this be done with Bumblebee?
Currently I'm using only official repos like Llama 3.2, but it's hard to fit more than one model on a single GPU. Still, I love using Elixir to interact with models over LiveView without coupling to Python.