How to benchmark vLLM: a short tutorial #7181
samos123 started this conversation in Show and tell
Let me know if part of this tutorial should be in the public docs.
source: https://substratus.ai/blog/how-to-benchmark-vllm
Learn to benchmark vLLM so you can optimize the performance of your models. In my experience, performance can improve by up to 20x depending on the configuration and use case, so learning how to benchmark is crucial.
vLLM provides a simple benchmarking script that measures serving performance against the OpenAI-compatible API. The script also supports other backends, but this post focuses on the OpenAI API.
The benchmarking script, benchmark_serving.py, lives in the benchmarks/ directory of the vLLM repository.
For this tutorial, we will deploy vLLM on a Kubernetes cluster using the vLLM helm chart. We then run the benchmark from within the vLLM container, which keeps client-side network latency from skewing the results.
Deploying Llama 3.1 8B Instruct in FP8 mode
This assumes you have a Kubernetes cluster with at least one 24 GB GPU available.
Run the following command to deploy vLLM:
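A minimal sketch of such a command, assuming the substratusai Helm repository hosts the chart under the name `vllm` and exposes a `model` value — the repo URL, chart name, and value keys are assumptions, so check the chart's README before running:

```bash
# Assumed repo URL and chart name -- verify against the vLLM helm chart docs
helm repo add substratusai https://substratusai.github.io/helm
helm repo update

# Deploy Llama 3.1 8B Instruct in FP8; the model and value key are illustrative
helm install llama-3-1-8b substratusai/vllm \
  --set model=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
```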
After a few minutes the pod should report `Running`, and you can proceed to the next step.
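For example, you can watch the pod status like this (release and pod names depend on your deployment):

```bash
# Watch pod status until the vLLM pod reaches Running (Ctrl-C to stop)
kubectl get pods -w
```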
Running the benchmark
First get an interactive shell in the vLLM container:
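A sketch, assuming the chart created a Deployment named after the `llama-3-1-8b` release used above (the deployment name is an assumption; substitute your own):

```bash
# Open an interactive shell in the first pod of the vLLM deployment
# (deployment name is an assumption; substitute yours)
kubectl exec -it deploy/llama-3-1-8b -- /bin/bash
```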
Now that you are in the container itself, download the benchmark script:
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Pin to the commit this tutorial was written against
git checkout 16a1cc9bb2b4bba82d78f329e5a89b44a5523ac8
cd benchmarks
```
The easiest way to run the benchmark is to use the random dataset.
However, this dataset may not be representative of your use case.
You can now run the benchmark using the following command:
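A representative invocation, assuming the server listens on localhost:8000 inside the container and serves the model deployed above; the input/output lengths and prompt count are illustrative, not the exact values from the original post:

```bash
# Benchmark the OpenAI-compatible endpoint with synthetic random prompts
python3 benchmark_serving.py \
  --backend openai \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --host localhost \
  --port 8000 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 100 \
  --num-prompts 100
```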
This was the output I got when running the benchmark on an L4 GPU:
Conclusion
You have now learned the basics of benchmarking vLLM with the random dataset. You can also use the ShareGPT dataset to benchmark against a more realistic workload.
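A sketch of the ShareGPT variant (the dataset URL is the one commonly used with vLLM's benchmarks; the prompt count is illustrative):

```bash
# Download the ShareGPT dataset commonly used with vLLM's benchmarks
# (use curl -O if wget is unavailable in the container)
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Re-run the benchmark on real conversation data
python3 benchmark_serving.py \
  --backend openai \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 200
```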