add gptq and awq int4 support in intel platform #2444
Conversation
Signed-off-by: Wang, Yi A <[email protected]>
@ErikKaum could you help review the PR?
Hi @sywangyi 👋 Yes, let me run the tests in a separate branch so that we don't get the permission errors 👍 I should have time to do it today or tomorrow 👍
Signed-off-by: Wang, Yi A <[email protected]>
@sywangyi there still seems to be an error in the Dockerfile:
@ErikKaum Could you help retrigger the CI build for intel-cpu? We did not see this build error in the previous CI, and I have not made any changes to Dockerfile_intel in the new commits.
I will rework it after #2517 is merged, since Python is upgraded from 3.10 to 3.11.
@ErikKaum rebase is done, please retrigger the CI, review, and merge it.
Signed-off-by: Wang, Yi A <[email protected]>
It seems the failure is not related to this PR.
@ErikKaum could you help retrigger it?
This PR is also needed to make mllama output correct on ipex-cpu, since it will upgrade IPEX. Could anyone help merge it?
@ErikKaum @Narsil, please help. @yao-matrix
Signed-off-by: Wang, Yi A <[email protected]>
@@ -321,7 +322,7 @@ def get_weights_row(self, weights: Weights, prefix: str):
         if g_idx is not None:
             if (
                 not torch.equal(
-                    g_idx.cpu(),
+                    (g_idx - g_idx[0]).cpu(),
Can you explain why this is needed exactly? Probably add it as a comment in the code too.
Your code seems correct, because this should be about sharding alignment.
However, since desc_act should be checked before this point, the fact that this pathway is failing seems to indicate that something may be wrong with the target model.
If desc_act is False, exllama should be used. But in the sharding case, if g_idx is not in ascending order, use_exllama is set to False.
That means a model like https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ would not use exllama in the TP case under the previous logic. IPEX now implements functionality similar to exllama, but the exllama kernel itself does not support Intel CPU/XPU.
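For illustration, here is a small self-contained sketch (toy values, not the TGI code) of why normalizing by g_idx[0] matters for a row-parallel shard:

```python
import torch

# With desc_act=False, the full g_idx is simply arange(in_features) // groupsize.
groupsize = 2
full_g_idx = torch.arange(8) // groupsize          # tensor([0, 0, 1, 1, 2, 2, 3, 3])

# A row-parallel shard for rank 1 of 2 receives the second half of the rows,
# so its g_idx slice starts at group 2 instead of group 0.
shard_g_idx = full_g_idx[4:]                       # tensor([2, 2, 3, 3])

# Comparing the shard directly against the trivial pattern fails...
trivial = torch.arange(shard_g_idx.shape[0]) // groupsize   # tensor([0, 0, 1, 1])
print(torch.equal(shard_g_idx, trivial))                    # False

# ...while subtracting the first element recovers the trivial pattern,
# which is what the (g_idx - g_idx[0]) change in this diff detects.
print(torch.equal(shard_g_idx - shard_g_idx[0], trivial))   # True
```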
@@ -350,16 +352,16 @@ def get_weights_row(self, weights: Weights, prefix: str):
         else:
             log_once(logger.info, f"Using exllama kernels v{HAS_EXLLAMA}")

-        if use_exllama and self.groupsize != -1:
+        if not desc_act and self.groupsize != -1:
We should keep use_exllama here.
Exllama is a really specific kernel; purely in terms of semantics, it's easier for us to know that this code is exllama-specific.
use_exllama is False since exllama only supports CUDA, but the IPEX quantization runtime kernel implements logic similar to exllama, so we need the same sharded logic for qzeros/scales/g_idx when desc_act is False as well.
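A rough self-contained sketch of that sharding argument (toy values and made-up helper name, not the TGI implementation): when desc_act is False, group i always covers the contiguous rows [i*groupsize, (i+1)*groupsize), so the per-group qzeros/scales can be sliced per rank together with that rank's rows of the quantized weight.

```python
import torch

# Toy setup: 8 input rows, groupsize 2 -> 4 groups, sharded over 2 ranks.
in_features, groupsize, world_size = 8, 2, 2
scales = torch.arange(in_features // groupsize, dtype=torch.float32)  # one scale per group

def shard_group_tensor(t: torch.Tensor, rank: int) -> torch.Tensor:
    # Because groups are contiguous when desc_act is False, slicing the
    # group-wise tensor row-wise stays aligned with the rank's weight rows.
    groups_per_rank = t.shape[0] // world_size
    return t[rank * groups_per_rank : (rank + 1) * groups_per_rank]

print(shard_group_tensor(scales, 0))  # tensor([0., 1.]) -> groups for rows 0..3
print(shard_group_tensor(scales, 1))  # tensor([2., 3.]) -> groups for rows 4..7
```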
Wouldn't keeping use_exllama and simply fixing the TP case (with - g_idx[0]) in the conditional fix the issues on IPEX?
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/layers/gptq/__init__.py#L134: this line sets use_exllama to False, since exllama is not installed on the Intel platform.
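As a hedged sketch of the dispatch being described (the function and argument names below are assumptions for illustration, not the actual TGI code): the exllama kernels are CUDA-only, so on Intel CPU/XPU use_exllama ends up False and the IPEX int4 kernels handle the quantized matmul instead.

```python
# Hypothetical backend selection, only for illustration.
def pick_gptq_backend(device_type: str, has_exllama: bool) -> str:
    if device_type == "cuda" and has_exllama:
        return "exllama"   # CUDA-only kernels
    if device_type in ("cpu", "xpu"):
        return "ipex"      # intel_extension_for_pytorch int4 kernels
    return "fallback"      # e.g. a plain dequantize-and-matmul path

print(pick_gptq_backend("xpu", has_exllama=False))   # ipex
print(pick_gptq_backend("cuda", has_exllama=True))   # exllama
```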
Signed-off-by: Wang, Yi A <[email protected]>
Signed-off-by: Wang, Yi A <[email protected]>
It's merged from an updated PR I prepared for CI (#2665) (only minor fixes to the control flow and a few added comments).
What does this PR do?
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.