
[Misc]: How are quantized models loaded compared to non-quantized models? #8632

Closed
gnpinkert opened this issue Sep 19, 2024 · 3 comments

@gnpinkert
Contributor

gnpinkert commented Sep 19, 2024

Hi,

I am researching MoE layer memory optimizations and am using vLLM to do so. I have added custom logging code to the initializers/model code of the Mixtral model, but when I load a quantized model, none of that logging code is executed. Even simple print statements in MixtralModel.__init__ are never printed to screen. Is this intentional? Where are the MoE kernels getting executed?
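
The change is essentially the following (a simplified sketch, not vLLM's actual signature; the real class lives in vllm/model_executor/models/mixtral.py and takes more arguments):

```python
# Simplified sketch of the debug print described above, added at the top of
# MixtralModel.__init__. Illustrative only; the real vLLM class takes more
# arguments (config objects, quant_config, etc.).
import torch.nn as nn

class MixtralModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        print("MixtralModel.__init__ called")  # never shows up for the AWQ checkpoint
        self.config = config
```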

Thanks for any help, I have been stuck on this for a while.

For reference, I have tried https://huggingface.co/TheBloke/mixtral-8x7b-v0.1-AWQ, and I have also quantized my own models with AutoAWQ and bitsandbytes; the same behavior occurs in every case.
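
The models were loaded roughly like this (a minimal sketch; the engine arguments I actually used are omitted):

```python
# Minimal sketch of loading the AWQ checkpoint linked above with vLLM.
# Exact engine arguments (tensor parallelism, memory limits, etc.) are omitted.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/mixtral-8x7b-v0.1-AWQ", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```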

gnpinkert added the misc label on Sep 19, 2024
@robertgshaw2-neuralmagic
Collaborator

For Mixtral, there is a hack for AWQ right now, since we do not have a fused MoE kernel for AWQ. Look at mixtral_quant.py.
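
One way to confirm from the outside which implementation was picked (the attribute path below is an assumption and varies between vLLM versions) is to check the module of the instantiated model class:

```python
# Sketch: inspect which module the loaded model class comes from. The attribute
# chain below is an assumption and may differ across vLLM versions.
from vllm import LLM

llm = LLM(model="TheBloke/mixtral-8x7b-v0.1-AWQ", quantization="awq")
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
print(type(model).__module__)
# For AWQ this is expected to point at mixtral_quant rather than mixtral.
```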

@gnpinkert
Contributor Author

Great, thanks for the heads up!

@gnpinkert
Contributor Author

gnpinkert commented Sep 23, 2024

@robertgshaw2-neuralmagic would it be worth submitting a PR to rename the quantized Mixtral implementation? It wasn't very clear to me that it uses the same class name as the non-quantized implementation.
