How to benchmark vLLM: a short tutorial #7181
samos123 started this conversation in Show and tell
Let me know if part of this tutorial should be in the public docs.
source: https://substratus.ai/blog/how-to-benchmark-vllm
Learn to benchmark vLLM so you can optimize the performance of your models. In my experience, performance can improve by up to 20x depending on the configuration and use case, so learning how to benchmark is crucial.
vLLM provides a simple benchmarking script that measures serving performance against the OpenAI-compatible API. The script also supports other backends, but this post focuses on the OpenAI API.
The benchmarking script, benchmark_serving.py, lives in the benchmarks/ directory of the vLLM repository.
For this tutorial, we will deploy vLLM on a Kubernetes cluster using the vLLM helm chart. We then run the benchmark from within the vLLM container, which keeps client-side network latency from skewing the results.
Deploying Llama 3.1 8B Instruct in FP8 mode
This assumes you have a Kubernetes cluster with at least one 24 GB GPU available.
Run the following command to deploy vLLM:
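A minimal sketch of such a command, assuming the substratusai Helm repository hosts the chart under the name `vllm` and exposes a `model` value — the repo URL, chart name, and value keys are assumptions, so check the chart's README before running:

```bash
# Assumed repo URL and chart name -- verify against the vLLM helm chart docs
helm repo add substratusai https://substratusai.github.io/helm
helm repo update

# Deploy Llama 3.1 8B Instruct in FP8; the model and value key are illustrative
helm install llama-3-1-8b substratusai/vllm \
  --set model=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
```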
After a few minutes the pod should report `Running`, and you can proceed to the next step.
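For example, you can watch the pod status like this (release and pod names depend on your deployment):

```bash
# Watch pod status until the vLLM pod reaches Running (Ctrl-C to stop)
kubectl get pods -w
```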
Running the benchmark
First get an interactive shell in the vLLM container:
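A sketch, assuming the chart created a Deployment named after the `llama-3-1-8b` release used above (the deployment name is an assumption; substitute your own):

```bash
# Open an interactive shell in the first pod of the vLLM deployment
# (deployment name is an assumption; substitute yours)
kubectl exec -it deploy/llama-3-1-8b -- /bin/bash
```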
Now that you are in the container itself, download the benchmark script:
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Pin to the commit this tutorial was written against
git checkout 16a1cc9bb2b4bba82d78f329e5a89b44a5523ac8
cd benchmarks
```
The easiest way to run the benchmark is to use the random dataset.
However, this dataset may not be representative of your use case.
You can now run the benchmark using the following command:
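A representative invocation, assuming the server listens on localhost:8000 inside the container and serves the model deployed above; the input/output lengths and prompt count are illustrative, not the exact values from the original post:

```bash
# Benchmark the OpenAI-compatible endpoint with synthetic random prompts
python3 benchmark_serving.py \
  --backend openai \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --host localhost \
  --port 8000 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 100 \
  --num-prompts 100
```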
This was the output I got when running the benchmark on an L4 GPU:
Conclusion
You have now learned the basics of benchmarking vLLM with the random dataset. You can also use the ShareGPT dataset to benchmark against a more realistic workload.
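A sketch of the ShareGPT variant (the dataset URL is the one commonly used with vLLM's benchmarks; the prompt count is illustrative):

```bash
# Download the ShareGPT dataset commonly used with vLLM's benchmarks
# (use curl -O if wget is unavailable in the container)
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Re-run the benchmark on real conversation data
python3 benchmark_serving.py \
  --backend openai \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 200
```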