add gptq and awq int4 support in intel platform #2444

Closed · wants to merge 8 commits

Conversation

sywangyi (Contributor)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sywangyi (Contributor Author)

@Narsil @danieldk, please help review.

@sywangyi (Contributor Author) commented Sep 3, 2024

@ErikKaum could you help review the PR?

@ErikKaum (Member) commented Sep 3, 2024

Hi @sywangyi 👋

Yes, let me run the tests in a separate branch so that we don't get the permission errors 👍 I should have time to do it today or tomorrow 👍

@sywangyi (Contributor Author)

@ErikKaum @Narsil I uploaded a fix for CI; please rerun the CI.

@ErikKaum (Member)

@sywangyi there still seems to be an error in the Dockerfile:

Dockerfile_intel:154
--------------------
 152 |     RUN git clone https://github.com/intel/intel-extension-for-pytorch && cd intel-extension-for-pytorch && git checkout f86e93e4890dc2c989024d148d415c9aa8a1649f
 153 |     RUN git clone https://github.com/intel/torch-ccl.git && cd torch-ccl && git checkout v2.4.0+cpu+rc0
 154 | >>> RUN cd intel-extension-for-pytorch && git submodule sync && git submodule update --init --recursive && python setup.py install
 155 |     RUN cd torch-ccl && git submodule sync && git submodule update --init --recursive && pip install .
 156 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c cd intel-extension-for-pytorch && git submodule sync && git submodule update --init --recursive && python setup.py install" did not complete successfully: exit code: 1

@sywangyi (Contributor Author) commented Sep 10, 2024

@ErikKaum Could you help retrigger the CI build for intel-cpu? We did not see this build error in the previous CI, and I have not made any changes to Dockerfile_intel in the new commits.

@sywangyi (Contributor Author)

I will rework it after #2517 is merged, since Python is upgraded from 3.10 to 3.11.

@sywangyi (Contributor Author)

@ErikKaum rebase done; please retrigger the CI, then review and merge.

@sywangyi (Contributor Author)

It seems the failure is not related to the PR:
ERROR integration-tests/models/test_flash_medusa.py::test_flash_medusa_simple - RuntimeError: Launcher crashed
ERROR integration-tests/models/test_flash_medusa.py::test_flash_medusa_all_params - RuntimeError: Launcher crashed
ERROR integration-tests/models/test_flash_medusa.py::test_flash_medusa_load - RuntimeError: Launcher crashed

@sywangyi (Contributor Author)

@ErikKaum could you help retrigger it?

@sywangyi (Contributor Author) commented Oct 8, 2024

This PR is also needed to make mllama output correct on ipex-cpu, since it upgrades ipex. Could anyone help merge it?

@sywangyi (Contributor Author) commented Oct 8, 2024

@ErikKaum @Narsil, please help. @yao-matrix

@@ -321,7 +322,7 @@ def get_weights_row(self, weights: Weights, prefix: str):
         if g_idx is not None:
             if (
                 not torch.equal(
-                    g_idx.cpu(),
+                    (g_idx - g_idx[0]).cpu(),
(Collaborator)

Can you explain why this is needed exactly? Probably add it as a comment in the code too.

Your code seems correct, because this should be about sharding alignment. However, since desc_act should be checked before this point, the fact that this pathway was failing seems to indicate that something may be wrong with the target model.

@sywangyi (Contributor Author) commented Oct 15, 2024

If desc_act is False, exllama should be used. But in the sharding case, if g_idx is not in ascending order, use_exllama is set to False. That means that for a model like https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ, the previous logic would not use exllama in the TP case. ipex now implements an exllama-like function as well, but the exllama kernel itself does not support Intel CPU/XPU.
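
To illustrate (a minimal, hypothetical sketch, not the TGI loader; the tensor names below are invented), this is why re-basing with - g_idx[0] matters for a row-sharded GPTQ weight when desc_act is False:

import torch

# Hypothetical illustration, not the actual TGI code. With desc_act=False,
# row i belongs to group i // groupsize, so the full g_idx looks like
# [0,0,...,1,1,...]. A TP shard that owns the second half of the rows starts
# at a non-zero group, so its g_idx only matches the trivial layout after
# re-basing with - g_idx[0].

groupsize = 4
in_features = 16
full_g_idx = torch.arange(in_features, dtype=torch.int32) // groupsize

# Shard owned by TP rank 1 of 2: the second half of the rows.
shard_g_idx = full_g_idx[in_features // 2 :]

trivial = torch.arange(shard_g_idx.shape[0], dtype=torch.int32) // groupsize

print(torch.equal(shard_g_idx, trivial))                   # False: shard starts at group 2
print(torch.equal(shard_g_idx - shard_g_idx[0], trivial))  # True once re-based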

@@ -350,16 +352,16 @@ def get_weights_row(self, weights: Weights, prefix: str):
             else:
                 log_once(logger.info, f"Using exllama kernels v{HAS_EXLLAMA}")

-        if use_exllama and self.groupsize != -1:
+        if not desc_act and self.groupsize != -1:
(Collaborator)

We should keep use_exllama here.

Exllama is a really specific kernel. Purely in terms of semantics, it's easier for us to know that this code path is exllama-specific.

(Contributor Author)

use_exllama is False because exllama only supports CUDA, but the ipex quantization runtime kernel implements logic similar to exllama, so we need this sharding logic for qzeros/scales/g_idx when desc_act is False as well.
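
As a rough sketch of that sharding rule (purely illustrative; the function name, packing assumption, and toy shapes are mine, not the TGI loader), when desc_act is False and groupsize != -1, all of qweight/qzeros/scales/g_idx can be sliced along the input dimension per TP rank:

import torch

def shard_gptq_rowwise(qweight, qzeros, scales, g_idx, groupsize, rank, world_size):
    # Illustrative only. Assumes 4-bit GPTQ packing (8 values per int32 along
    # the input dimension) and that shard boundaries fall on group boundaries,
    # so each rank's qzeros/scales rows depend only on the rows it owns.
    in_features = g_idx.shape[0]
    assert in_features % world_size == 0
    rows_per_rank = in_features // world_size
    assert rows_per_rank % groupsize == 0

    row_start = rank * rows_per_rank
    row_end = row_start + rows_per_rank
    group_start = row_start // groupsize
    group_end = row_end // groupsize

    return (
        qweight[row_start // 8 : row_end // 8],       # packed int4 rows
        qzeros[group_start:group_end],                # one row per group
        scales[group_start:group_end],                # one row per group
        g_idx[row_start:row_end] - g_idx[row_start],  # re-based, as in this PR
    )

# Tiny usage example with toy shapes: 2-way TP, 16 input rows, groupsize 4.
qw = torch.zeros(16 // 8, 4, dtype=torch.int32)
qz = torch.zeros(16 // 4, 1, dtype=torch.int32)
sc = torch.ones(16 // 4, 4)
gi = torch.arange(16, dtype=torch.int32) // 4
print([t.shape for t in shard_gptq_rowwise(qw, qz, sc, gi, 4, 1, 2)])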

(Collaborator)

Wouldn't keeping use_exllama and simply fixing the TP sharding (with - g_idx[0]) in the conditional be enough to fix the issues on IPEX?

(Contributor Author)

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/layers/gptq/__init__.py#L134 sets use_exllama to False, since exllama is not installed on the Intel platform.
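
For context, a simplified sketch of the dispatch being described (a hypothetical helper, not the code in layers/gptq/__init__.py; the availability checks are stand-ins): exllama is only a candidate on CUDA, so on Intel CPU/XPU use_exllama stays False and the IPEX path has to make the same desc_act-based sharding decision:

import torch

def pick_gptq_backend(desc_act: bool, groupsize: int) -> str:
    # Hypothetical helper. The checks below are stand-ins for
    # "exllama kernels importable" and "IPEX installed".
    has_exllama = torch.cuda.is_available()
    has_ipex = not has_exllama

    if has_exllama and not desc_act and groupsize != -1:
        return "exllama"
    if has_ipex and not desc_act and groupsize != -1:
        # Uses the same sharded qzeros/scales/g_idx handling as exllama.
        return "ipex"
    return "fallback"  # e.g. a plain dequantize + matmul path

print(pick_gptq_backend(desc_act=False, groupsize=128))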

server/text_generation_server/models/flash_causal_lm.py (outdated review thread, resolved)
@Narsil (Collaborator) commented Oct 18, 2024

It's merged from an updated PR I prepared for CI (#2665); only minor fixes were made to the control flow, plus a few added comments.

@sywangyi closed this Oct 20, 2024