
[Bug]: vllm deploy medusa, draft acceptance rate: 0.000 #8620

Closed
xhjcxxl opened this issue Sep 19, 2024 · 3 comments
Labels: bug Something isn't working

xhjcxxl commented Sep 19, 2024

Your current environment

vllm==0.6.1

Model Input Dumps

When I train Medusa, the medusa0, medusa1, and medusa2 heads all reach about 0.95 accuracy, so the training result looks fine.

I then deploy Medusa with vLLM, and the deployment itself succeeds,

but the test samples show no speedup, and the draft acceptance rate is 0.0.

🐛 Describe the bug

Speculative metrics: Draft acceptance rate: 0.000, System efficiency: 0.250, Number of speculative tokens: 3, Number of accepted tokens: 0, Number of draft tokens: 483, Number of emitted tokens: 161.
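
(These numbers are internally consistent: 483 draft tokens at 3 speculative tokens per step means 161 proposal steps; with zero draft tokens accepted, each step emits only the single token produced by the target model, so 161 tokens are emitted and the system efficiency is 161 / (161 × 4) = 0.25.)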

xhjcxxl added the bug label on Sep 19, 2024

xhjcxxl commented Sep 20, 2024

I tried Medusa on TGI and it works fine, but with vLLM it doesn't: the draft acceptance rate is 0.0. I want to know where the error is. The command is:

CUDA_VISIBLE_DEVICES=2 python3 -m vllm.entrypoints.openai.api_server --port 8010 \
  --served-model-name qwen2-7b \
  --model /mnt/user/deploy/qwen15_14b_finetuning_chatbot_v1_0914_deploy --dtype auto -tp 1 \
  --max-model-len 2048 --gpu-memory-utilization 0.9 \
  --max-num-seqs 1 \
  --speculative-model /mnt/user/deploy/qwen15_14b_finetuning_chatbot_v1_0914_deploy/medusa \
  --speculative-draft-tensor-parallel-size 1 \
  --num-speculative-tokens 3 \
  --use-v2-block-manager \
  --spec-decoding-acceptance-method typical_acceptance_sampler
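
For reference, the server is then queried through the standard OpenAI-compatible API; a minimal test request (illustrative prompt, using the served model name and port from the command above) looks like:

curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2-7b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'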

LiuXiaoxuanPKU (Collaborator) commented:

Hi, since I don't have the Qwen model, I tested Medusa locally with the following command:

vllm serve lmsys/vicuna-7b-v1.3 \
    --disable-log-requests \
    --tensor-parallel-size 1 \
    --speculative-model abhigoyal/vllm-medusa-vicuna-7b-v1.3 \
    --num-speculative-tokens 3 \
    --use-v2-block-manager

It seems to work, and the acceptance rate is > 0.

Could you double-check that your Medusa model config is compatible with vLLM's requirements? As shown here, the expected config differs from the original model's config.
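
A vLLM-compatible Medusa config.json is expected to look roughly like the sketch below (illustrative values only, not your exact config; hidden_size and vocab_size must match the base model, and num_heads must match the number of trained Medusa heads):

{
  "architectures": ["MedusaModel"],
  "model_type": "medusa",
  "hidden_size": 4096,
  "vocab_size": 32000,
  "num_heads": 3,
  "num_hidden_layers": 1
}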


xhjcxxl commented Sep 23, 2024


Thanks, I tried again with a command like yours and it works. I found that removing --spec-decoding-acceptance-method typical_acceptance_sampler (i.e. using the default rejection_sampler) makes it work fine.
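
In other words, the launch command that now works for me is the original one with the acceptance-method flag dropped, so the default rejection sampler is used:

CUDA_VISIBLE_DEVICES=2 python3 -m vllm.entrypoints.openai.api_server --port 8010 \
  --served-model-name qwen2-7b \
  --model /mnt/user/deploy/qwen15_14b_finetuning_chatbot_v1_0914_deploy --dtype auto -tp 1 \
  --max-model-len 2048 --gpu-memory-utilization 0.9 \
  --max-num-seqs 1 \
  --speculative-model /mnt/user/deploy/qwen15_14b_finetuning_chatbot_v1_0914_deploy/medusa \
  --speculative-draft-tensor-parallel-size 1 \
  --num-speculative-tokens 3 \
  --use-v2-block-manager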

xhjcxxl closed this as completed on Sep 23, 2024