support autoTP with weight only quantization in DS inference path #4750
base: master
Conversation
@ftian1 If an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilders/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?
Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124
It would be better to detect custom kernel existence by checking an attribute of the loaded ops and call the custom kernel accordingly, so any accelerator that implements these kernels could be plugged in.
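A minimal sketch of what that attribute-based dispatch could look like; the builder name `QuantizerBuilder` and the kernel name `dequantize_int4_to_half_experimental` mirror the linked utils.py, while `get_dequantize_kernel` and the fallback are hypothetical:

```python
from deepspeed.accelerator import get_accelerator

def _dequantize_fallback(*args, **kwargs):
    # Hypothetical pure-PyTorch reference path, used when the current
    # accelerator does not ship a custom dequantization kernel.
    raise NotImplementedError("no custom kernel; use reference dequant")

def get_dequantize_kernel():
    # Ask the active accelerator for its quantizer op builder; any
    # accelerator that implements this builder plugs in automatically.
    builder = get_accelerator().create_op_builder("QuantizerBuilder")
    if builder is not None and builder.is_compatible():
        ops = builder.load()
        # Detect the kernel by attribute instead of assuming CUDA.
        if hasattr(ops, "dequantize_int4_to_half_experimental"):
            return ops.dequantize_int4_to_half_experimental
    return _dequantize_fallback
```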
```python
ds_output = pipe(query, **inf_kwargs)
#print(local_rank, "baseline", bs_output)
print(local_rank, "deepspeed", ds_output)
```
Hi @ftian1, I have run this test, but the result I got is `deepspeed [{'generated_text': 'DeepSpeed is the greatest,,,,,,,,,,,,,,,'}]`. This result is not right. Can you figure out what's wrong with this test? BTW, I can pass all tests in test_intX_quantization.py.
@baodii May I know which device you are running on, CUDA or CPU?
@ftian1 Is the usage of WoQ with AutoTP similar to that with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?
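For reference, the two paths differ mainly in the arguments passed to `deepspeed.init_inference`; a minimal sketch, with the checkpoint name and dtypes purely illustrative:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Kernel-injection path: DeepSpeed replaces modules with fused kernels.
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": world_size},
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

# AutoTP path: modules are auto-sharded with no kernel injection;
# this is the path this PR extends with weight-only quantization.
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": world_size},
                                  dtype=torch.bfloat16,
                                  replace_with_kernel_inject=False)
```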
@loadams I have resolved the merge conflicts. Please check.
Signed-off-by: Feng Tian <[email protected]>
This PR makes weight-only quantization (WOQ) work with autoTP. Sample code is shown below:
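A minimal sketch of the intended usage, assuming the group-wise quantization helper and config layout from `deepspeed/inference/quantization`; the module matchers, checkpoint name, and quantization values below are illustrative:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from deepspeed.inference.quantization.quantization import \
    _init_group_wise_weight_quantization

world_size = int(os.getenv("WORLD_SIZE", "1"))
local_rank = int(os.getenv("LOCAL_RANK", "0"))
model_name = "facebook/opt-1.3b"  # illustrative checkpoint

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16)

# Weight-only quantization config: keys match module names, values set
# the bit width and quantization group layout (values illustrative).
ds_config = {
    "weight_quantization": {
        "post_init_quant": {
            "self_attn": {"num_bits": 4, "group_size": 32,
                          "group_dim": 1, "symmetric": False},
            "fc": {"num_bits": 4, "group_size": 32,
                   "group_dim": 1, "symmetric": False},
        }
    }
}

# Quantize the weights first, then let autoTP shard the quantized model.
model = _init_group_wise_weight_quantization(model, ds_config)
model = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": world_size},
                                 dtype=torch.bfloat16,
                                 replace_with_kernel_inject=False)

tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model.module,
                tokenizer=tokenizer, device=local_rank)
print(local_rank, "deepspeed", pipe("DeepSpeed is", max_new_tokens=20))
```

Launched with the usual runner, e.g. `deepspeed --num_gpus 2 woq_autotp_sample.py` (script name hypothetical), the quantized model is sharded across the available devices.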
In this way, users can enable WOQ on multiple cards.