Your current environment
The output of `python collect_env.py`:
Collecting environment information...
WARNING 09-19 16:50:11 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.99
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0.dev0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9ba0817ff1eb514f51cc6de9cb8e16c98d6ee44f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
Model Input Dumps
No response
🐛 Describe the bug
I used vllm==0.6.1.post2 to host Qwen2-VL-72B-Instruct with the command:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-72B-Instruct --model /models/Qwen2-VL-72B-Instruct --tensor-parallel-size 4 --gpu-memory-utilization 0.9
However, when I curl the endpoint with streaming enabled and include_usage=True, the completion_tokens field in the returned usage is always 1 until the last chunk.
My curl:
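A minimal sketch of such a request, assuming the server's default port 8000 and a plain-text prompt (the prompt and host are illustrative, not the reporter's exact command):

```bash
# Stream a chat completion and ask for per-chunk usage stats
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2-VL-72B-Instruct",
        "messages": [{"role": "user", "content": "Describe this image."}],
        "stream": true,
        "stream_options": {"include_usage": true}
      }'
```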
Output:
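Per the report, every streamed usage block before the final one carries completion_tokens: 1. Illustratively (N and M are placeholders, not captured values):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}],"usage":{"prompt_tokens":N,"completion_tokens":1,"total_tokens":N+1}}
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"}}],"usage":{"prompt_tokens":N,"completion_tokens":1,"total_tokens":N+1}}
...
data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":N,"completion_tokens":M,"total_tokens":N+M}}
```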
Is this a bug, or is something wrong with my settings?
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.