
[Misc]: How are quantized models loaded compared to non-quantized models? #8632

Closed
gnpinkert opened this issue Sep 19, 2024 · 3 comments

@gnpinkert
Contributor

gnpinkert commented Sep 19, 2024

Hi,

I am researching MoE layer memory optimizations and am using vLLM to do so. I have added custom logging code to the initializers/model code of the Mixtral model, but when I load a quantized model, none of that logging code is executed. Even simple print statements in MixtralModel.__init__ are never printed to screen. Is this intentional? Where are the MoE kernels getting executed?
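
The change is essentially the following (a simplified sketch, not vLLM's actual signature; the real class lives in vllm/model_executor/models/mixtral.py and takes more arguments):

```python
# Simplified sketch of the debug print described above, added at the top of
# MixtralModel.__init__. Illustrative only; the real vLLM class takes more
# arguments (config objects, quant_config, etc.).
import torch.nn as nn

class MixtralModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        print("MixtralModel.__init__ called")  # never shows up for the AWQ checkpoint
        self.config = config
```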

Thanks for any help, I have been stuck on this for a while.

For reference, I have tried https://huggingface.co/TheBloke/mixtral-8x7b-v0.1-AWQ, and I have also quantized my own models with AutoAWQ and bitsandbytes; the same behavior occurs in every case.
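
The models were loaded roughly like this (a minimal sketch; the engine arguments I actually used are omitted):

```python
# Minimal sketch of loading the AWQ checkpoint linked above with vLLM.
# Exact engine arguments (tensor parallelism, memory limits, etc.) are omitted.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/mixtral-8x7b-v0.1-AWQ", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```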

gnpinkert added the misc label on Sep 19, 2024
@robertgshaw2-neuralmagic
Collaborator

For Mixtral, there is a hack for AWQ right now, since we do not have a fused MoE kernel for AWQ. Look at mixtral_quant.py.
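
One way to confirm from the outside which implementation was picked (the attribute path below is an assumption and varies between vLLM versions) is to check the module of the instantiated model class:

```python
# Sketch: inspect which module the loaded model class comes from. The attribute
# chain below is an assumption and may differ across vLLM versions.
from vllm import LLM

llm = LLM(model="TheBloke/mixtral-8x7b-v0.1-AWQ", quantization="awq")
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
print(type(model).__module__)
# For AWQ this is expected to point at mixtral_quant rather than mixtral.
```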

@gnpinkert
Contributor Author

Great, thanks for the heads up!

@gnpinkert
Contributor Author

gnpinkert commented Sep 23, 2024

@robertgshaw2-neuralmagic would it be worth submitting a PR to rename the quantized Mixtral implementation? It wasn't very clear to me that it uses the same class name as the non-quantized implementation.
