Your current environment
The output of `python collect_env.py`:
Collecting environment information...
WARNING 09-19 16:50:11 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.99
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0.dev0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9ba0817ff1eb514f51cc6de9cb8e16c98d6ee44f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
Model Input Dumps
No response
🐛 Describe the bug
I used vllm==0.6.1.post2 to host Qwen2-VL-72B-Instruct with the command:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-72B-Instruct --model /models/Qwen2-VL-72B-Instruct --tensor-parallel-size 4 --gpu-memory-utilization 0.9
However, when I curl the endpoint with streaming enabled and include_usage=True, the completion_tokens field in the returned usage is always 1 until the last chunk.
My curl:
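A minimal sketch of such a request, assuming the server's default port 8000 and a plain-text prompt (the prompt and host are illustrative, not the reporter's exact command):

```bash
# Stream a chat completion and ask for per-chunk usage stats
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2-VL-72B-Instruct",
        "messages": [{"role": "user", "content": "Describe this image."}],
        "stream": true,
        "stream_options": {"include_usage": true}
      }'
```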
Output:
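Per the report, every streamed usage block before the final one carries completion_tokens: 1. Illustratively (N and M are placeholders, not captured values):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}],"usage":{"prompt_tokens":N,"completion_tokens":1,"total_tokens":N+1}}
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"}}],"usage":{"prompt_tokens":N,"completion_tokens":1,"total_tokens":N+1}}
...
data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":N,"completion_tokens":M,"total_tokens":N+M}}
```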
Is this a bug, or is something wrong with my settings?
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.