[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control
#7086
Hi @Qubitium, thanks for sharing your interesting work! We have the notion of variable quantization already in vLLM through our
Regarding merged layers, I think the performance and complexity cost of needing to support possibly unmerging layers like QKV or GateUp is too high. I want to recommend keeping the quantization level of merged layers the same so we (and several other inference engines) don't run into this issue. If you are still open to editing your format, I also think
@mgoin Wow, I totally missed this PR. After a cursory check of the https://github.com/vllm-project/vllm/pull/6515/files PR, our PR is entirely redundant. The core concept is similar, including the regex matching. The only advantage of this PR, and a very small one at this point, is the minimal code change needed to bootstrap gptq flexible layer/module quant. I will need to digest the vllm PR/unit tests to test with the gptqmodel export. If gptq model can integrate with
Yes, this is our finding as well. Merged layers should retain the same scheme.
I want the config to be compatible with vllm/sglang, since sglang for the most part re-uses/imports vllm model weights/model layers. I do not want another protocol parser, so if vllm
@mgoin We are working on the lm_head portion to get it ready for today. gptqmodel has been merged into hf transformers, optimum, and peft, and hf has agreed in principle with us to start deprecating autogptq in the near future. We also have a pending PR, slated for next week, that we are actively working on with Nvidia staff and which will introduce another gptq config for lora layers optimized for gptq. What I am trying to say is that a gptq model quantized via hf optimum or gptqmodel will use the gptqmodel standard config, and that config is expanding as features expand. Hf and gptqmodel are not the only tools to generate gptq, but we are the only actively maintained project that is exclusively gptq. Can we be free to sync gptqconfig in vllm to gptqmodel format and not have deviations such as
For ctx, the dynamic field was renamed to dynamic_cfg in an earlier phase of the pr due to review feedback, as dynamic doesn't actually mean runtime dynamism (if there is such a word). In the ctx of vllm inference there is nothing dynamic about it; it is more of a static override of configs per module.
…nsistent with LinearBase.
Signed-off-by: ZX-ModelCloud <[email protected]>
@mgoin PR is ready for re-review but we are unable to fix the missing
DCO has passed
How did you get it to pass?
@Qubitium the committer can fix the DCO failure directly
Yeah don't worry about the DCO, we can signoff right before merging and it doesn't block anything
-    assert isinstance(lm_head_layer.linear_method,
+    assert isinstance(lm_head_layer.quant_method,
I think it would be best to keep linear_method for both lm_head and embedding layer, since calling it quant_method doesn't make sense for the base case of unquantized methods. While I agree linear isn't perfect for embeddings, there isn't a strong reason to change it in this PR.
@mgoin Sorry we didn't make it clear in our notes, but there is a reason for this change now that you mention it.
Please check https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader/loader.py#L398
We believe this is a bug fix for the existing lm_head quantized init code. The ci-test was written by me and @robertgshaw2-redhat, but it uses a model that I quantized with sym=False. Marlin doesn't support sym=False, so that model can't route to the Marlin kernel, and the ci-test passes only because it gets routed to the fall-back cuda kernel. Without this fix, a quantized lm_head that is compatible with the Marlin kernel code will crash, since the linked loader code checks for the quant_method attribute to do the correct Marlin init. So we synced the lm_head attr to the same name used by other modules to fix the following crash:
Model: https://huggingface.co/ModelCloud/TinyLlama-1.1B-Chat-v1.0-GPTQ-4bits-dynamic-cfg-with-lm_head
tests/conftest.py:682: in __init__
self.model = LLM(
vllm/utils.py:1051: in inner
return fn(*args, **kwargs)
vllm/entrypoints/llm.py:242: in __init__
self.llm_engine = self.engine_class.from_engine_args(
vllm/engine/llm_engine.py:484: in from_engine_args
engine = cls(
vllm/engine/llm_engine.py:276: in __init__
self._initialize_kv_caches()
vllm/engine/llm_engine.py:416: in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
vllm/executor/executor_base.py:101: in determine_num_available_blocks
results = self.collective_rpc("determine_num_available_blocks")
vllm/executor/uniproc_executor.py:51: in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
vllm/utils.py:2220: in run_method
return func(*args, **kwargs)
/root/miniconda3/envs/gp/lib/python3.11/site-packages/torch/utils/_contextlib.py:116: in decorate_context
return func(*args, **kwargs)
vllm/worker/worker.py:229: in determine_num_available_blocks
self.model_runner.profile_run()
/root/miniconda3/envs/gp/lib/python3.11/site-packages/torch/utils/_contextlib.py:116: in decorate_context
return func(*args, **kwargs)
vllm/worker/model_runner.py:1235: in profile_run
self._dummy_run(max_num_batched_tokens, max_num_seqs)
vllm/worker/model_runner.py:1346: in _dummy_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
/root/miniconda3/envs/gp/lib/python3.11/site-packages/torch/utils/_contextlib.py:116: in decorate_context
return func(*args, **kwargs)
vllm/worker/model_runner.py:1765: in execute_model
logits = self.model.compute_logits(hidden_or_intermediate_states,
vllm/model_executor/models/qwen2.py:496: in compute_logits
logits = self.logits_processor(self.lm_head, hidden_states,
/root/miniconda3/envs/gp/lib/python3.11/site-packages/torch/nn/modules/module.py:1736: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
/root/miniconda3/envs/gp/lib/python3.11/site-packages/torch/nn/modules/module.py:1747: in _call_impl
return forward_call(*args, **kwargs)
vllm/model_executor/layers/logits_processor.py:74: in forward
logits = self._get_logits(hidden_states, lm_head, embedding_bias)
vllm/model_executor/layers/logits_processor.py:111: in _get_logits
logits = lm_head.quant_method.apply(lm_head,
vllm/model_executor/layers/quantization/gptq_marlin.py:406: in apply
return self.kernel.apply_weights(layer, x, bias)
vllm/model_executor/layers/quantization/kernels/mixed_precision/marlin.py:129: in apply_weights
g_idx_sort_indices=layer.g_idx_sort_indices,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = ParallelLMHead(num_embeddings=151936, embedding_dim=2048, org_vocab_size=151936, num_embeddings_padded=151936, tp_size=1), name = 'g_idx_sort_indices'
    def __getattr__(self, name: str) -> Any:
        if "_parameters" in self.__dict__:
            _parameters = self.__dict__["_parameters"]
            if name in _parameters:
                return _parameters[name]
        if "_buffers" in self.__dict__:
            _buffers = self.__dict__["_buffers"]
            if name in _buffers:
                return _buffers[name]
        if "_modules" in self.__dict__:
            modules = self.__dict__["_modules"]
            if name in modules:
                return modules[name]
>       raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{name}'"
        )
E       AttributeError: 'ParallelLMHead' object has no attribute 'g_idx_sort_indices'
So if we revert to linear_method, we can bypass the bug by calling process_weights_after_loading manually. But using the quant_method attribute would allow existing code to do this without extra interventions. It seems like a cleaner way, since lm_head is treated like other modules instead of doing its own thing.
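To make the failure mode above concrete, here is a minimal sketch with toy classes. ToyMarlinMethod, ToyLayer and run_post_load_hooks are hypothetical stand-ins, not vLLM code; the only behavior taken from this thread is that the post-load step is found through the quant_method attribute and that skipping it leaves g_idx_sort_indices missing.

```python
class ToyMarlinMethod:
    """Hypothetical stand-in for a quantized method that needs a post-load step."""

    def process_weights_after_loading(self, layer):
        # The real kernel would repack weights; here we only create the
        # attribute that apply() later depends on.
        layer.g_idx_sort_indices = [0, 1, 2]

    def apply(self, layer):
        # Raises AttributeError if the post-load step never ran.
        return layer.g_idx_sort_indices


class ToyLayer:
    """Toy module that stores its method under a configurable attribute name."""

    def __init__(self, attr_name):
        setattr(self, attr_name, ToyMarlinMethod())


def run_post_load_hooks(layers):
    """Mimic a loader that only looks for an attribute named quant_method."""
    for layer in layers:
        method = getattr(layer, "quant_method", None)
        if method is not None:
            method.process_weights_after_loading(layer)


old_head = ToyLayer("linear_method")  # attribute name before this PR
new_head = ToyLayer("quant_method")   # attribute name after this PR
run_post_load_hooks([old_head, new_head])
```

In this toy setup only new_head receives g_idx_sort_indices; calling apply on old_head fails with the same kind of AttributeError shown in the traceback.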
-    def __init__(self, quant_config: GPTQMarlinConfig) -> None:
-        self.quant_config = quant_config
+    def __init__(self, quant_config: GPTQMarlinConfig, prefix: str) -> None:
+        self.quant_config = deepcopy(quant_config)
+        self.prefix = prefix
+
+        if len(self.quant_config.dynamic_cfg) > 0 and self.prefix:
+            # gptqmodel per module/layer dynamic_cfg may override/change base
+            # model quant config
+            self.quant_config.override_config(prefix=self.prefix)
It seems you have enough information to perform this same quant_config copy and override in get_quant_method, so why not keep the dynamism within that function where you already have the prefix?
@mgoin Yes! We will push this override outside of __init__.
@mgoin Fixed.
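For readers following the thread, the requested shape can be sketched roughly as below. ToyGPTQConfig and ToyLinearMethod are hypothetical classes, the use of re.match is an assumption, and this is not the code that was committed; only the names dynamic_cfg, override_config, get_quant_method, prefix and the deepcopy come from the discussion above.

```python
import re
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, Union


class ToyLinearMethod:
    """Hypothetical linear method that just stores the config it was given."""

    def __init__(self, quant_config: "ToyGPTQConfig") -> None:
        self.quant_config = quant_config


@dataclass
class ToyGPTQConfig:
    """Hypothetical config with a base setting plus per-module overrides."""

    bits: int = 4
    sym: bool = True
    dynamic_cfg: Dict[str, Dict[str, Union[int, bool]]] = field(default_factory=dict)

    def override_config(self, prefix: str) -> None:
        # Apply the first rule whose pattern matches the module prefix.
        for pattern, overrides in self.dynamic_cfg.items():
            if re.match(pattern, prefix):
                for key, value in overrides.items():
                    setattr(self, key, value)
                return

    def get_quant_method(self, prefix: str) -> ToyLinearMethod:
        config = self
        if self.dynamic_cfg and prefix:
            # Copy first so the shared base config is never mutated.
            config = deepcopy(self)
            config.override_config(prefix)
        return ToyLinearMethod(config)


base = ToyGPTQConfig(dynamic_cfg={r".*lm_head.*": {"bits": 8}})
head_method = base.get_quant_method("lm_head")
proj_method = base.get_quant_method("model.layers.0.mlp.down_proj")
```

Keeping the copy and override inside get_quant_method means the method's __init__ can keep taking a plain config, and the base config object stays unchanged for every other module.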
Signed-off-by: ZX-ModelCloud <[email protected]>
@mgoin Ready for re-review. Since your last review we committed some small clarity fixes to logic expressions, var names, and comments, in addition to the requested change about moving the config override to outside of marlin
GPTQModel v0.9.10-dev0 main branch has merged dynamic, per layer/module support for different gptq bits, sym, and desc_act using a regex-style definition. This is a work in progress and we are awaiting feedback before release. We are targeting both vllm and sglang compat with the quant, so we would like to work with vllm to see what the best way forward is.
Previously, a gptq model had a single config that applied to all layers and all modules within nested layers. This change allows pinpoint targeting of different gptq quantization configs for specific layers and/or specific modules within specific layers for better optimization.
Sample model: https://huggingface.co/ModelCloud/TinyLlama-1.1B-Chat-v1.0-dynamic-GPTQ-2024-8-3
full quant config for sample:
Dynamic config explained:
Sample code to quantize using dynamic control: https://github.com/ModelCloud/GPTQModel/blob/main/tests/test_dynamic.py
Design choices:
- A regex: str key is mapped to dict[str, int or bool] for both quantization and model inference/loading. Multiple regex/dynamic pairs can be defined; for matching, the rules are looped over and the first one that matches is applied.
- Each layer/module is checked for a dynamic (override) match, and if one matches, it overrides the static quant config values for that layer/module.
Compat Notes:
- The dynamic config requires that model inference does not re-merge layers with different dynamic/quant param values. MergedColumnParallel in the Llama model in vllm, for example, merges mlp.gate and mlp.up. Dynamic override works, but in this case, because they are fused/merged, these two layers must have the exact same quant config values. You can't have one with 4 bits and the other with 8 bits.
TODO:
- dynamic layer/module quant override config via quantize_config json
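The first-match rule and the merged-layer constraint described above can be sketched as follows. This is an illustration under stated assumptions, not GPTQModel's or vllm's implementation: BASE_CONFIG, DYNAMIC_RULES, resolve_config and check_fused_group are hypothetical names, the patterns and values are made up, and the use of re.match is assumed.

```python
import re
from typing import Dict, List, Union

Override = Dict[str, Union[int, bool]]

# Hypothetical base config and override rules, for illustration only.
BASE_CONFIG: Override = {"bits": 4, "group_size": 128, "sym": True, "desc_act": False}
DYNAMIC_RULES: Dict[str, Override] = {
    r".*\.18\..*(gate|up).*": {"bits": 8, "group_size": 64},
    r".*\.18\..*": {"bits": 8},
}


def resolve_config(module_name: str, base: Override,
                   rules: Dict[str, Override]) -> Override:
    """Return a copy of the base config with the first matching override applied."""
    resolved = dict(base)
    for pattern, override in rules.items():  # dicts keep insertion order
        if re.match(pattern, module_name):
            resolved.update(override)
            break  # first match wins; later rules are ignored
    return resolved


def check_fused_group(names: List[str], base: Override,
                      rules: Dict[str, Override]) -> Override:
    """Raise if modules that will be merged resolve to different configs."""
    configs = [resolve_config(name, base, rules) for name in names]
    if any(cfg != configs[0] for cfg in configs[1:]):
        raise ValueError(f"fused modules {names} need identical quant configs")
    return configs[0]


gate_cfg = resolve_config("model.layers.18.mlp.gate_proj", BASE_CONFIG, DYNAMIC_RULES)
attn_cfg = resolve_config("model.layers.18.self_attn.q_proj", BASE_CONFIG, DYNAMIC_RULES)
other_cfg = resolve_config("model.layers.3.mlp.gate_proj", BASE_CONFIG, DYNAMIC_RULES)
```

With these hypothetical rules, gate_proj and up_proj in layer 18 resolve to the same override and pass check_fused_group, while a rule set that targets only gate would make the check raise.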