Hugging Face Inference Endpoints now supports GGUF out of the box! #9669
ngxson started this conversation in Show and tell
You can now deploy any GGUF model on your own endpoint, in just a few clicks!

Simply select GGUF, pick a hardware configuration, and you're done! An endpoint powered by llama-server (built from the `master` branch) will be deployed automatically. It works with all llama.cpp-compatible models of all sizes, from 0.1B up to 405B parameters.

Try it now --> https://ui.endpoints.huggingface.co/
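Since the endpoint is llama-server under the hood, it exposes llama-server's OpenAI-compatible `/v1/chat/completions` route. Here is a minimal sketch of querying a deployed endpoint from Python; the endpoint URL and `hf_...` token are placeholders for your own deployment:

```python
import requests

# Placeholders: substitute your own endpoint URL and Hugging Face token
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

# llama-server serves an OpenAI-compatible chat completions API
response = requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Write a haiku about GGUF."}],
        "max_tokens": 128,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```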
A huge thanks to @ggerganov, @slaren, and the @huggingface team for making this possible!
Demo video: llama.hfe.ok.mp4
Replies: 1 comment

The Hermes 405B model can be deployed on 2xA100. The generation speed is around 8 t/s, which is not bad!
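If you want to sanity-check a throughput figure like that from the client side, here is a rough sketch that streams a chat completion over the same OpenAI-compatible API and counts chunks per second. Each streamed chunk carries roughly one generated token; the URL and token are placeholders, and network latency plus prompt processing will skew the estimate somewhat:

```python
import json
import time

import requests

# Placeholders: substitute your own endpoint URL and Hugging Face token
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

start = time.time()
n_tokens = 0

# Stream the response as server-sent events and count content chunks;
# each chunk is roughly one generated token, so chunks/second ~ t/s.
with requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "max_tokens": 256,
        "stream": True,
    },
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload.strip() == b"[DONE]":
            break
        chunk = json.loads(payload)
        if chunk["choices"][0]["delta"].get("content"):
            n_tokens += 1

elapsed = time.time() - start
print(f"~{n_tokens / elapsed:.1f} tokens/second (wall clock, including prompt processing)")
```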