support autoTP with weight only quantization in DS inference path #4750
base: master
Conversation
@ftian1 If an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilders/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?
Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124
It would be better to detect custom kernel existence by checking an attribute of the loaded ops and call the custom kernel accordingly, so any accelerator that implements these kernels could be plugged in.
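A minimal sketch of what that attribute-based dispatch could look like; the builder name `QuantizerBuilder` and the kernel name `dequantize_int4_to_half_experimental` mirror the linked utils.py, while `get_dequantize_kernel` and the fallback are hypothetical:

```python
from deepspeed.accelerator import get_accelerator

def _dequantize_fallback(*args, **kwargs):
    # Hypothetical pure-PyTorch reference path, used when the current
    # accelerator does not ship a custom dequantization kernel.
    raise NotImplementedError("no custom kernel; use reference dequant")

def get_dequantize_kernel():
    # Ask the active accelerator for its quantizer op builder; any
    # accelerator that implements this builder plugs in automatically.
    builder = get_accelerator().create_op_builder("QuantizerBuilder")
    if builder is not None and builder.is_compatible():
        ops = builder.load()
        # Detect the kernel by attribute instead of assuming CUDA.
        if hasattr(ops, "dequantize_int4_to_half_experimental"):
            return ops.dequantize_int4_to_half_experimental
    return _dequantize_fallback
```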
```python
ds_output = pipe(query, **inf_kwargs)
#print(local_rank, "baseline", bs_output)
print(local_rank, "deepspeed", ds_output)
```
Hi @ftian1, I have run this test, but the result I got is `deepspeed [{'generated_text': 'DeepSpeed is the greatest,,,,,,,,,,,,,,,'}]`. This result is not right. Can you figure out what's wrong with this test? BTW, I can pass all tests in test_intX_quantization.py.
@baodii May I know which device you are running on, CUDA or CPU?
@ftian1 Is the usage of WoQ with AutoTP similar to that with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?
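For reference, the two paths differ mainly in the arguments passed to `deepspeed.init_inference`; a minimal sketch, with the checkpoint name and dtypes purely illustrative:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Kernel-injection path: DeepSpeed replaces modules with fused kernels.
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": world_size},
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

# AutoTP path: modules are auto-sharded with no kernel injection;
# this is the path this PR extends with weight-only quantization.
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": world_size},
                                  dtype=torch.bfloat16,
                                  replace_with_kernel_inject=False)
```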
@loadams I have resolved the merge conflicts. Please check.
Signed-off-by: Feng Tian <[email protected]>
This PR makes weight-only quantization (WOQ) work with autoTP. Sample code is shown below:
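A minimal sketch of the intended usage, assuming the group-wise quantization helper and config layout from `deepspeed/inference/quantization`; the module matchers, checkpoint name, and quantization values below are illustrative:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from deepspeed.inference.quantization.quantization import \
    _init_group_wise_weight_quantization

world_size = int(os.getenv("WORLD_SIZE", "1"))
local_rank = int(os.getenv("LOCAL_RANK", "0"))
model_name = "facebook/opt-1.3b"  # illustrative checkpoint

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16)

# Weight-only quantization config: keys match module names, values set
# the bit width and quantization group layout (values illustrative).
ds_config = {
    "weight_quantization": {
        "post_init_quant": {
            "self_attn": {"num_bits": 4, "group_size": 32,
                          "group_dim": 1, "symmetric": False},
            "fc": {"num_bits": 4, "group_size": 32,
                   "group_dim": 1, "symmetric": False},
        }
    }
}

# Quantize the weights first, then let autoTP shard the quantized model.
model = _init_group_wise_weight_quantization(model, ds_config)
model = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": world_size},
                                 dtype=torch.bfloat16,
                                 replace_with_kernel_inject=False)

tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model.module,
                tokenizer=tokenizer, device=local_rank)
print(local_rank, "deepspeed", pipe("DeepSpeed is", max_new_tokens=20))
```

Launched with the usual runner, e.g. `deepspeed --num_gpus 2 woq_autotp_sample.py` (script name hypothetical), the quantized model is sharded across the available devices.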
In this way, users can enable WOQ on multiple cards.