
[Bug]: Wrong "completion_tokens" counts in streaming usage #8625

Open
1 task done
yuhon0528 opened this issue Sep 19, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@yuhon0528

Your current environment

The output of `python collect_env.py`
Collecting environment information...
WARNING 09-19 16:50:11 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.99
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0.dev0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9ba0817ff1eb514f51cc6de9cb8e16c98d6ee44f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

Model Input Dumps

No response

🐛 Describe the bug

I used vllm==0.6.1.post2 to host Qwen2-VL-72B-Instruct with the following command:

python -m vllm.entrypoints.openai.api_server \
  --served-model-name Qwen2-VL-72B-Instruct \
  --model /models/Qwen2-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

However, when I curl the endpoint with "stream": true and "stream_options": {"include_usage": true}, the completion_tokens count in every chunk's usage stays at 1 until the final usage-only chunk.

My curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-VL-72B-Instruct",
    "max_tokens": 1024,
    "temperature": 0.1,
    "stream": true,
    "stream_options": {"include_usage": true},
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hi."}
    ]
  }'

Output:

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":21,"completion_tokens":0}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":" can"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":" I"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":" assist"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":" you"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":" today"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":"?"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":22,"completion_tokens":1}}

data: {"id":"chat-a440da300fcc4022947ccfe424f9c8ae","object":"chat.completion.chunk","created":1726735957,"model":"Qwen2-VL-72B-Instruct","choices":[],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

data: [DONE]
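For reference, the same behavior reproduces with the OpenAI Python client (a minimal sketch, not part of the original report; it assumes the default vLLM port above and requires openai>=1.26 for stream_options support):

from openai import OpenAI

# vLLM's OpenAI-compatible server; any non-empty api_key is accepted by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen2-VL-72B-Instruct",
    max_tokens=1024,
    temperature=0.1,
    stream=True,
    stream_options={"include_usage": True},
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi."},
    ],
)

for chunk in stream:
    # usage.completion_tokens prints as 1 for every content chunk, and only
    # the final usage-only chunk (the one with an empty `choices` list)
    # carries the correct total.
    if chunk.usage is not None:
        print(chunk.usage.completion_tokens)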

Is this a bug, or is something wrong with my settings? Note that the final usage-only chunk does report the correct completion_tokens: 10 (total_tokens: 31), but every intermediate chunk reports completion_tokens: 1 instead of the running count.
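As a client-side workaround in the meantime (my own sketch, not from the original report), one can ignore the per-chunk usage entirely and take the token counts only from the final usage-only chunk, which is correct:

# Continuing from the `stream` created in the sketch above.
final_usage = None
pieces = []

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        pieces.append(chunk.choices[0].delta.content)
    if chunk.usage is not None:
        final_usage = chunk.usage  # only the last (usage-only) chunk is reliable

print("".join(pieces))
print(final_usage)  # e.g. prompt_tokens=21, completion_tokens=10, total_tokens=31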

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.