Releases · vllm-project/vllm
v0.5.3
Highlights
Model Support
- vLLM now supports Meta Llama 3.1! Please check out our blog here for initial details on running the model.
- Please check out this thread for any known issues related to the model.
- The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (#6606, #6547, #6487, #6593, #6511, #6515, #6552)
- The BF16 version of the model should run on multiple nodes using pipeline parallelism (docs). If you have a fast network interconnect, you might want to consider full tensor parallelism as well; see the sketch after this list. (#6599, #6598, #6529, #6569)
- In order to support long context, a new RoPE extension method has been added and chunked prefill has been turned on by default for the Meta Llama 3.1 series of models. (#6666, #6553, #6673)
- Support Mistral-Nemo (#6548)
- Support Chameleon (#6633, #5770)
- Pipeline parallel support for Mixtral (#6516)
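As a reference point, here is a minimal offline sketch of running the FP8 checkpoint on a single 8-GPU node with the Python API. The checkpoint name is an assumption; adjust it to the exact Llama 3.1 FP8 release you use, and for the BF16 weights across multiple nodes set the pipeline-parallel size per the docs linked above.

```python
from vllm import LLM, SamplingParams

# Assumed FP8 checkpoint name; one 8xH100 or 8xA100 node.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize pipeline parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```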
Hardware Support
Performance Enhancements
- Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) perf improvements to existing AWQ models! (#6612)
- Progress towards refactoring for SPMD worker execution. (#6032)
- Progress in improving prepare inputs procedure. (#6164, #6338, #6596)
- Memory optimization for pipeline parallelism. (#6455)
Production Engine
- Correctness testing for pipeline parallel and CPU offloading (#6410, #6549)
- Support dynamically loading LoRA adapters from HuggingFace (#6234)
- Pipeline Parallel using stdlib multiprocessing module (#6130)
Others
- A CPU offloading implementation: you can now use `--cpu-offload-gb` to control how much CPU RAM is used to "extend" GPU memory (see the sketch below). (#6496)
- The new `vllm` CLI is now ready for testing. It comes with three commands: `serve`, `complete`, and `chat`. Feedback and improvements are greatly welcomed! (#6431)
- The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
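A minimal sketch of the CPU offloading knob through the Python API; the CLI flag `--cpu-offload-gb` maps to the same engine argument. The model name and offload size below are placeholders.

```python
from vllm import LLM, SamplingParams

# Placeholder model; cpu_offload_gb "extends" GPU memory with ~4 GiB of host RAM.
# Assumed CLI equivalent: vllm serve <model> --cpu-offload-gb 4
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", cpu_offload_gb=4)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```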
What's Changed
- [Docs] Add Google Cloud to sponsor list by @WoosukKwon in #6450
- [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by @WoosukKwon in #6289
- [CI/Build][TPU] Add TPU CI test by @WoosukKwon in #6277
- Pin sphinx-argparse version by @khluu in #6453
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by @mzusman in #6425
- [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by @g-eoj in #6419
- [Docs] Announce 5th meetup by @WoosukKwon in #6458
- [CI/Build] vLLM cache directory for images by @DarkLight1337 in #6444
- [Frontend] Support for chat completions input in the tokenize endpoint by @sasha0552 in #5923
- [Misc] Fix typos in spec. decode metrics logging. by @tdoublep in #6470
- [Core] Use numpy to speed up padded token processing by @peng1999 in #6442
- [CI/Build] Remove "boardwalk" image asset by @DarkLight1337 in #6460
- [doc][misc] remind users to cancel debugging environment variables after debugging by @youkaichao in #6481
- [Hardware][TPU] Support MoE with Pallas GMM kernel by @WoosukKwon in #6457
- [Doc] Fix the lora adapter path in server startup script by @Jeffwan in #6230
- [Misc] Log spec decode metrics by @comaniac in #6454
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` by @mgoin in #6081
- [ci][distributed] add pipeline parallel correctness test by @youkaichao in #6410
- [misc][distributed] improve tests by @youkaichao in #6488
- [misc][distributed] add seed to dummy weights by @youkaichao in #6491
- [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by @wushidonguc in #6455
- [ROCm] Cleanup Dockerfile and remove outdated patch by @hongxiayang in #6482
- [Misc][Speculative decoding] Typos and typing fixes by @ShangmingCai in #6467
- [Doc][CI/Build] Update docs and tests to use `vllm serve` by @DarkLight1337 in #6431
- [Bugfix] Fix for multinode crash on 4 PP by @andoorve in #6495
- [TPU] Remove multi-modal args in TPU backend by @WoosukKwon in #6504
- [Misc] Use `torch.Tensor` for type annotation by @WoosukKwon in #6505
- [Core] Refactor _prepare_model_input_tensors - take 2 by @comaniac in #6164
- [DOC] - Add docker image to Cerebrium Integration by @milo157 in #6510
- [Bugfix] Fix Ray Metrics API usage by @Yard1 in #6354
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by @alexm-neuralmagic in #6338
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by @varun-sundar-rabindranath in #6511
- [Model] Pipeline parallel support for Mixtral by @comaniac in #6516
- [ Kernel ] Fp8 Channelwise Weight Support by @robertgshaw2-neuralmagic in #6487
- [core][model] yet another cpu offload implementation by @youkaichao in #6496
- [BugFix] Avoid secondary error in ShmRingBuffer destructor by @njhill in #6530
- [Core] Introduce SPMD worker execution using Ray accelerated DAG by @ruisearch42 in #6032
- [Misc] Minor patch for draft model runner by @comaniac in #6523
- [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by @njhill in #6227
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by @noamgat in #6501
- [TPU] Refactor TPU worker & model runner by @WoosukKwon in #6506
- [ Misc ] Improve Min Capability Checking in `compressed-tensors` by @robertgshaw2-neuralmagic in #6522
- [ci] Reword Github bot comment by @khluu in #6534
- [Model] Support Mistral-Nemo by @mgoin in #6548
- Fix PR comment bot by @khluu in #6554
- [ci][test] add correctness test for cpu offloading by @youkaichao in #6549
- [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by @tlrmchlsmth in #6552
- [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by @tlrmchlsmth in #6517
- Add support for a rope extension method by @simon-mo in #6553
- [Core] Multiprocessing Pipeline Parallel support by @njhill in #6130
- [Bugfix] Make spec. decode respect per-request seed. by @tdoublep in #6034
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` by @robertgshaw2-neuralmagic in #6515
- [Bugfix][Frontend] Fix missing `/metrics` endpoint by @DarkLight1337 in #6463
- [BUGFIX] Raise an error for no draft token case when draft_tp>1 by @wooyeonlee0 in #6369
- [Model] RowParallelLinear: pass bias to quant_method.apply by @tdoublep in #6327
- [Bugfix][Frontend] remove duplicate init logger by @dtrifiro in #6581
- [Misc] Small perf improvements by @Yard1 in #6520
- [Docs] Update docs for wheel location by @simon-mo in #6580
- [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by @tdoublep in #6578
- [bugfix][distributed] fix multi-node bug for shared memory by @youkaichao in #6597
- [ Kernel ] Enable Dynamic Per Token `fp8` by @robertgshaw2-neuralmagic in #6547
- [Docs] Update PP docs by @andoorve in #6598
- [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by @youkaichao in #6599
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by @varun-sundar-rabindranath in #6593
- [Core] Allow specifying custom Executor by @Yard1 in #6557
- [Bugfix][Core]: Guard for KeyErrors that...
v0.5.2
Major Changes
- ❗Planned breaking change❗: we plan to remove beam search (see more in #6226) in the next few releases. This release comes with a warning when beam search is enabled for a request. Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
- The release has moved to a Python-version-agnostic wheel (#6394): a single wheel can be installed across all Python versions vLLM supports.
Highlights
Model Support
- Add PaliGemma (#5189), Fuyu-8B (#3924)
- Support for soft tuned prompts (#4645)
- A new guide for adding multi-modal plugins (#6205)
Hardware
- AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)
Performance
- ZeroMQ fallback for broadcasting large objects (#6183)
- Simplify code to support pipeline parallel (#6406)
- Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
- Use CUTLASS kernels for the FP8 layers with Bias (#6270)
Features
- Enabling bonus tokens in speculative decoding for KV-cache-based models (#5765); see the sketch after this list
- Medusa Implementation with Top-1 proposer (#4978)
- An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090)
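As a rough illustration of the speculative-decoding configuration surface (the bonus-token change applies under the hood), a hedged offline sketch with a small draft model; the model names and the exact argument set are assumptions for this release.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: target model plus a tiny draft model for speculation.
# Model names are placeholders; block manager v2 was required for spec decode at the time.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=4,
    use_v2_block_manager=True,
)

print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```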
Others
- Add support for multi-node on CI (#5955)
- Benchmark: add H100 suite (#6047)
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
- Build some nightly wheels (#6380)
What's Changed
- Update wheel builds to strip debug by @simon-mo in #6161
- Fix release wheel build env var by @simon-mo in #6162
- Move release wheel env var to Dockerfile instead by @simon-mo in #6163
- [Doc] Reorganize Supported Models by Type by @ywang96 in #6167
- [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in #6168
- [Model] Add PaliGemma by @ywang96 in #5189
- add benchmark for fix length input and output by @haichuan1221 in #5857
- [ Misc ] Support Fp8 via `llm-compressor` by @robertgshaw2-neuralmagic in #6110
- [misc][frontend] log all available endpoints by @youkaichao in #6195
- do not exclude `object` field in CompletionStreamResponse by @kczimm in #6196
- [Bugfix] FIx benchmark args for randomly sampled dataset by @haichuan1221 in #5947
- [Kernel] reloading fused_moe config on the last chunk by @avshalomman in #6210
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in #4888
- [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in #6203
- [Bugfix] Mamba cache Cuda Graph padding by @tomeras91 in #6214
- Add FlashInfer to default Dockerfile by @simon-mo in #6172
- [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in #6216
- [core][distributed] fix ray worker rank assignment by @youkaichao in #6235
- [Bugfix][TPU] Add missing None to model input by @WoosukKwon in #6245
- [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in #6256
- Add support for multi-node on CI by @khluu in #5955
- [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in #4645
- [Docs] Docs update for Pipeline Parallel by @andoorve in #6222
- [Bugfix]fix and needs_scalar_to_array logic check by @qibaoyuan in #6238
- [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in #4978
- [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in #6183
- [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in #6279
- [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in #6205
- [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in #6287
- [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in #6303
- [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in #6268
- [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in #6296
- [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in #5765
- [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in #6313
- [Doc] Remove comments incorrectly copied from another project by @daquexian in #6286
- [Doc] Update description of vLLM support for CPUs by @DamonFool in #6003
- [BugFix]: set outlines pkg version by @xiangyang-95 in #6262
- [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in #6318
- [Misc] refactor(config): clean up unused code by @aniaan in #6320
- [BugFix]: fix engine timeout due to request abort by @pushan01 in #6255
- [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in #6326
- [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in #6266
- [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in #6082
- Benchmark: add H100 suite by @simon-mo in #6047
- [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in #6339
- [doc] update pipeline parallel in readme by @youkaichao in #6347
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in #5362
- [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in #6223
- [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in #6351
- [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in #6346
- [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in #6349
- [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in #6343
- [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in #6350
- [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in #6352
- [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in #6353
- [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in #6364
- [ Misc ] Support Models With Bias in `compressed-tensors` integration by @robertgshaw2-neuralmagic in #6356
- [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in #6367
- [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in #6379
- [Misc] add fixture to guided processor tests by @kevinbu233 in #6341
- [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in #6365
- [ci] Add GHA workflows to enable full CI run by @khluu in #6381
- [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in #5327
- Build some nightly wheels by default by @simon-mo in #6380
- Fix release-pipeline.yaml by @simon-mo in #6388
- Fix interpolation in release pipeline by @simon-mo in #6389
- Fix release pipeline's -e flag by @simon-mo in #6390
- [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in #6382
- [Misc] Add generated git commit hash as `vllm.__commit__` by @mgoin in #6386
- Fix release pipeline's dir permission by @simon-mo in #6391
- [Bugfix][TPU] Fix megacore setting...
v0.5.1
Highlights
- vLLM now has pipeline parallelism! (#4412, #5408, #6115, #6120) You can now run the API server with `--pipeline-parallel-size` (see the sketch below). This feature is in an early stage; please let us know your feedback.
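A hedged sketch of launching the OpenAI-compatible server with the new flag; the model and parallel sizes are assumptions for illustration.

```python
import subprocess

# Launch the OpenAI-compatible API server with pipeline parallelism across 2 GPUs.
# Model name and sizes are placeholders; adjust to your hardware.
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--pipeline-parallel-size", "2",
    "--tensor-parallel-size", "1",
])
```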
Model Support
- Support Gemma 2 (#5908, #6051). Please note that for correctness, Gemma should run with FlashInfer backend which supports logits soft cap. The wheels for FlashInfer can be downloaded here
- Support Jamba (#4115). This is vLLM's first state space model!
- Support Deepseek-V2 (#4650). Please note that MLA (Multi-head Latent Attention) is not implemented; we are looking for contributions!
- Vision-language models: added support for Phi3-Vision, dynamic image sizes, and a registry for processing model inputs (#4986, #5276, #5214)
  - Notably, this is a breaking change: all VLM-specific arguments are now removed from the engine APIs, so you no longer need to set them globally via the CLI. You now only need to pass `<image>` into the prompt instead of using complicated prompt formatting. See more here.
  - There is also a new guide on adding VLMs! We would love your contributions for new models!
Hardware Support
Production Service
- Support for sharded tensorized models (#4990)
- Continuous streaming of OpenAI response token stats (#5742)
Performance
- Enhancement in distributed communication via shared memory (#5399)
- Latency enhancement in block manager (#5584)
- Enhancements to `compressed-tensors` supporting Marlin and W4A16 (#5435, #5385)
- Faster FP8 quantize kernel (#5396), FP8 on Ampere (#5975)
- Option to use FlashInfer for prefill, decode, and CUDA Graph for decode (#4628); see the environment-variable sketch after this list
- Speculative Decoding
- Draft Model Runner (#5799)
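One way to exercise the FlashInfer option (and the Gemma 2 logits-soft-cap requirement noted above) is via the attention-backend environment variable; a hedged sketch, assuming the FlashInfer wheels are installed and using a placeholder model.

```python
import os

# Select the FlashInfer attention backend before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams  # import after setting the env var

llm = LLM(model="google/gemma-2-9b-it")  # placeholder; Gemma 2 needs logits soft cap
print(llm.generate(["Hi"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```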
Development Productivity
- Post merge benchmark is now available at perf.vllm.ai!
- Addition of A100 in CI environment (#5658)
- Step towards nightly wheel publication (#5610)
What's Changed
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by @mgoin in #5253
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by @simon-mo in #5463
- [CI] Upgrade codespell version. by @rkooo567 in #5381
- [Hardware] Initial TPU integration by @WoosukKwon in #5292
- [Bugfix] Add device assertion to TorchSDPA by @bigPYJ1151 in #5402
- [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by @khluu in #5464
- [Kernel] Vectorized FP8 quantize kernel by @comaniac in #5396
- [Bugfix] TYPE_CHECKING for MultiModalData by @kimdwkimdw in #5444
- [Frontend] [Core] Support for sharded tensorized models by @tjohnson31415 in #4990
- [misc] add hint for AttributeError by @youkaichao in #5462
- [Doc] Update debug docs by @DarkLight1337 in #5438
- [Bugfix] Fix typo in scheduler.py (requeset -> request) by @mgoin in #5470
- [Frontend] Add "input speed" to tqdm postfix alongside output speed by @mgoin in #5425
- [Bugfix] Fix wrong multi_modal_input format for CPU runner by @Isotr0py in #5451
- [Core][Distributed] add coordinator to reduce code duplication in tp and pp by @youkaichao in #5293
- [ci] Use sccache to build images by @khluu in #5419
- [Bugfix]if the content is started with ":"(response of ping), client should i… by @sywangyi in #5303
- [Kernel] `w4a16` support for `compressed-tensors` by @dsikka in #5385
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by @mgoin in #5466
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by @wenyujin333 in #5497
- [Hardware][Intel] Optimize CPU backend and add more performance tips by @bigPYJ1151 in #4971
- [Docs] Add 4th meetup slides by @WoosukKwon in #5509
- [Misc] Add vLLM version getter to utils by @DarkLight1337 in #5098
- [CI/Build] Simplify OpenAI server setup in tests by @DarkLight1337 in #5100
- [Doc] Update LLaVA docs by @DarkLight1337 in #5437
- [Kernel] Factor out epilogues from cutlass kernels by @tlrmchlsmth in #5391
- [MISC] Remove FP8 warning by @comaniac in #5472
- Seperate dev requirements into lint and test by @Yard1 in #5474
- Revert "[Core] Remove unnecessary copies in flash attn backend" by @Yard1 in #5478
- [misc] fix format.sh by @youkaichao in #5511
- [CI/Build] Disable test_fp8.py by @tlrmchlsmth in #5508
- [Kernel] Disable CUTLASS kernels for fp8 by @tlrmchlsmth in #5505
- Add `cuda_device_count_stateless` by @Yard1 in #5473
- [Hardware][Intel] Support CPU inference with AVX2 ISA by @DamonFool in #5452
- [Bugfix]typofix by @AllenDou in #5507
- bump version to v0.5.0.post1 by @simon-mo in #5522
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in #5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in #5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in #5516
- [Misc] Fix arg names by @AllenDou in #5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in #5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in #5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in #5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in #5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in #5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in #5548
- [Doc] Update documentation on Tensorizer by @sangstar in #5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in #5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in #5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in #5528
- Add ccache to amd by @simon-mo in #5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in #5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in #5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in #5538
- add gptq_marlin test for bug report #5088 by @alexm-neuralmagic in #5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in #5570
- [Fix] Correct OpenAI batch response format by @zifeitong in #5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in #5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in #5577
- [build][misc] limit numpy version by @youkaichao in #5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in #5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in #5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in #5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in #5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in https://...
v0.5.0.post1
Highlights
- Add initial TPU integration (#5292)
- Fix crashes when using FlashAttention backend (#5478)
- Fix issues when using num_devices < num_available_devices (#5473)
What's Changed
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by @mgoin in #5253
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by @simon-mo in #5463
- [CI] Upgrade codespell version. by @rkooo567 in #5381
- [Hardware] Initial TPU integration by @WoosukKwon in #5292
- [Bugfix] Add device assertion to TorchSDPA by @bigPYJ1151 in #5402
- [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by @khluu in #5464
- [Kernel] Vectorized FP8 quantize kernel by @comaniac in #5396
- [Bugfix] TYPE_CHECKING for MultiModalData by @kimdwkimdw in #5444
- [Frontend] [Core] Support for sharded tensorized models by @tjohnson31415 in #4990
- [misc] add hint for AttributeError by @youkaichao in #5462
- [Doc] Update debug docs by @DarkLight1337 in #5438
- [Bugfix] Fix typo in scheduler.py (requeset -> request) by @mgoin in #5470
- [Frontend] Add "input speed" to tqdm postfix alongside output speed by @mgoin in #5425
- [Bugfix] Fix wrong multi_modal_input format for CPU runner by @Isotr0py in #5451
- [Core][Distributed] add coordinator to reduce code duplication in tp and pp by @youkaichao in #5293
- [ci] Use sccache to build images by @khluu in #5419
- [Bugfix]if the content is started with ":"(response of ping), client should i… by @sywangyi in #5303
- [Kernel] `w4a16` support for `compressed-tensors` by @dsikka in #5385
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by @mgoin in #5466
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by @wenyujin333 in #5497
- [Hardware][Intel] Optimize CPU backend and add more performance tips by @bigPYJ1151 in #4971
- [Docs] Add 4th meetup slides by @WoosukKwon in #5509
- [Misc] Add vLLM version getter to utils by @DarkLight1337 in #5098
- [CI/Build] Simplify OpenAI server setup in tests by @DarkLight1337 in #5100
- [Doc] Update LLaVA docs by @DarkLight1337 in #5437
- [Kernel] Factor out epilogues from cutlass kernels by @tlrmchlsmth in #5391
- [MISC] Remove FP8 warning by @comaniac in #5472
- Seperate dev requirements into lint and test by @Yard1 in #5474
- Revert "[Core] Remove unnecessary copies in flash attn backend" by @Yard1 in #5478
- [misc] fix format.sh by @youkaichao in #5511
- [CI/Build] Disable test_fp8.py by @tlrmchlsmth in #5508
- [Kernel] Disable CUTLASS kernels for fp8 by @tlrmchlsmth in #5505
- Add `cuda_device_count_stateless` by @Yard1 in #5473
- [Hardware][Intel] Support CPU inference with AVX2 ISA by @DamonFool in #5452
- [Bugfix]typofix by @AllenDou in #5507
- bump version to v0.5.0.post1 by @simon-mo in #5522
New Contributors
- @kimdwkimdw made their first contribution in #5444
- @sywangyi made their first contribution in #5303
Full Changelog: v0.5.0...v0.5.0.post1
v0.5.0
Highlights
Production Features
- FP8 support is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference speed gets a 1.5x boost. Please try it out and let us know your thoughts (see the sketch after this list)! (#5352, #5388, #5159, #5238, #5294, #5183, #5144, #5231)
- Add OpenAI Vision API support. Currently only LLaVA and LLaVA-NeXT are supported. We are working on adding more models in the next release. (#5237, #5383, #4199, #5374, #4197)
- Speculative Decoding and Automatic Prefix Caching are also ready for testing; we plan to turn them on by default in upcoming releases. (#5400, #5157, #5137, #5324)
- Default to multiprocessing backend for single-node distributed case (#5230)
- Support bitsandbytes quantization and QLoRA (#4776)
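A hedged sketch of trying the experimental FP8 path by quantizing an unquantized checkpoint at load time; the model and parallel size are placeholders, and an FP8-capable GPU (e.g. H100) is assumed.

```python
from vllm import LLM, SamplingParams

# Dynamic FP8 quantization of the weights at load time (experimental).
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    quantization="fp8",
    tensor_parallel_size=2,
)
print(llm.generate(["FP8 is"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```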
Hardware Support
- Improvements to the Intel CPU CI (#4113, #5241)
- Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)
Others
- Debugging tips documentation (#5409, #5430)
- Dynamic Per-Token Activation Quantization (#5037)
- Customizable RoPE theta (#5197)
- Enable passing multiple LoRA adapters at once to generate() (#5300)
- OpenAI `tools` support named functions (#5032); see the client sketch after this list
- Support `stream_options` for the OpenAI protocol (#5319, #5135)
- Update Outlines Integration from `FSM` to `Guide` (#4109)
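A hedged client-side sketch of the named-function `tools` support and `stream_options`, using the standard `openai` Python client against a locally running vLLM server; the base URL, model name, and the `get_weather` function are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed server address

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function for illustration
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

# Force the named function via tool_choice.
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(resp.choices[0].message.tool_calls)

# stream_options lets a streamed request also report token usage.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    pass  # the final chunk carries the usage statistics
```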
What's Changed
- [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
- [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
- [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
- [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
- [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
- [BugFix] Prevent `LLM.encode` for non-generation Models by @robertgshaw2-neuralmagic in #5184
- Update test_ignore_eos by @simon-mo in #4898
- [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
- [Misc] Simplify code and fix type annotations in `conftest.py` by @DarkLight1337 in #5118
- [Core] Support image processor by @DarkLight1337 in #4197
- [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
- [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
- [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
- New CI template on AWS stack by @khluu in #5110
- [FRONTEND] OpenAI `tools` support named functions by @br3no in #5032
- [Bugfix] Support `prompt_logprobs==0` by @toslunar in #5217
- [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
- [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
- [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
- [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
- [CI/Build] Add inputs tests by @DarkLight1337 in #5215
- [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
- [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
- [CI/Build] Simplify model loading for `HfRunner` by @DarkLight1337 in #5251
- [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
- [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
- [Misc] Add transformers version to collect_env.py by @mgoin in #5259
- [Misc] update collect env by @youkaichao in #5261
- [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
- [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
- [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
- [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) by @tomeras91 in #5278
- [CI] Add nightly benchmarks by @simon-mo in #5260
- [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
- [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
- [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
- [Docs] Add Sequoia as sponsors by @simon-mo in #5287
- [Speculative Decoding] Add `ProposerWorkerBase` abstract class by @njhill in #5252
- [BugFix] Fix log message about default max model length by @njhill in #5284
- [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
- [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
- [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
- [Docs] Add Ray Summit CFP by @simon-mo in #5295
- [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` by @br3no in #4109
- [CI/Build] Update vision tests by @DarkLight1337 in #5307
- Bugfix: fix broken of download models from modelscope by @liuyhwangyh in #5233
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
- [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
- [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
- [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
- [Misc] Missing error message for custom ops import by @DamonFool in #5282
- [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` by @Etelis in #5135
- [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
- [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
- Remove Ray health check by @Yard1 in #4693
- Addition of lacked ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
- [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
- [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
- [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
- fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in https://github.com/vllm-...
v0.4.3
Highlights
Model Support
LLM
- Added support for Falcon (#5069)
- Added support for IBM Granite Code models (#4636)
- Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
- Added Snowflake arctic model implementation (#4652, #4889, #4690)
- Supported Dynamic RoPE scaling (#4638)
- Support for long-context LoRA (#4787)
Embedding Models
- Initial support for the Embedding API with e5-mistral-7b-instruct (#3734); see the sketch after this list
- Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
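A hedged sketch of the initial Embedding API through the offline `LLM` interface; the output field layout shown is an assumption based on the embedding support added in this release.

```python
from vllm import LLM

# e5-mistral-7b-instruct served in embedding mode.
llm = LLM(model="intfloat/e5-mistral-7b-instruct")

outputs = llm.encode(["query: how do I bake bread?"])
embedding = outputs[0].outputs.embedding  # assumed field layout: a list of floats
print(len(embedding))
```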
Vision Language Model
- Add base class for vision-language models (#4809)
- Consolidate prompt arguments to LLM engines (#4328)
- LLaVA model refactor (#4910)
Hardware Support
AMD
- Add fused_moe Triton configs (#4951)
- Add support for Punica kernels (#3140)
- Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
Production Engine
Batch API
- Support OpenAI batch file format (#4794)
Making Ray Optional
- Add `MultiprocessingGPUExecutor` (#4539)
- Eliminate parallel worker per-step task scheduling overhead (#4894)
Automatic Prefix Caching
- Accelerating the hashing function by avoiding deep copies (#4696)
Speculative Decoding
- CUDA graph support (#4295)
- Enable TP>1 speculative decoding (#4840)
- Improve n-gram efficiency (#4724)
Performance Optimization
Quantization
- Add GPTQ Marlin 2:4 sparse structured support (#4790)
- Initial Activation Quantization Support (#4525)
- Marlin prefill performance improvement (better on average) (#4983)
- Automatically Detect SparseML models (#5119)
Better Attention Kernel
- Use flash-attn for decoding (#3648)
FP8
- Improve FP8 linear layer performance (#4691)
- Add w8a8 CUTLASS kernels (#4749)
- Support for CUTLASS kernels in CUDA graphs (#4954)
- Load FP8 kv-cache scaling factors from checkpoints (#4893)
- Make static FP8 scaling more robust (#4570)
- Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
Optimize Distributed Communication
- change python dict to pytorch tensor (#4607)
- change python dict to pytorch tensor for blocks to swap (#4659)
- improve p2p access check (#4992)
- remove vllm-nccl (#5091)
- support both cpu and device tensor in broadcast tensor dict (#4660)
Extensible Architecture
Pipeline Parallelism
- refactor custom allreduce to support multiple tp groups (#4754)
- refactor pynccl to hold multiple communicators (#4591)
- Support PP PyNCCL Groups (#4988)
What's Changed
- Disable cuda version check in vllm-openai image by @zhaoyang-star in #4530
- [Bugfix] Fix `asyncio.Task` not being subscriptable by @DarkLight1337 in #4623
- [CI] use ccache actions properly in release workflow by @simon-mo in #4629
- [CI] Add retry for agent lost by @cadedaniel in #4633
- Update lm-format-enforcer to 0.10.1 by @noamgat in #4631
- [Kernel] Make static FP8 scaling more robust by @pcmoritz in #4570
- [Core][Optimization] change python dict to pytorch tensor by @youkaichao in #4607
- [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. by @Alexei-V-Ivanov-AMD in #4642
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora by @FurtherAI in #4609
- [Core][Optimization] change copy-on-write from dict[int, list] to list by @youkaichao in #4648
- [Bug fix][Core] fixup ngram not setup correctly by @leiwen83 in #4551
- [Core][Distributed] support both cpu and device tensor in broadcast tensor dict by @youkaichao in #4660
- [Core] Optimize sampler get_logprobs by @rkooo567 in #4594
- [CI] Make mistral tests pass by @rkooo567 in #4596
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi by @DefTruth in #4573
- [Misc] Add `get_name` method to attention backends by @WoosukKwon in #4685
- [Core] Faster startup for LoRA enabled models by @Yard1 in #4634
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap by @youkaichao in #4659
- [CI/Test] fix swap test for multi gpu by @youkaichao in #4689
- [Misc] Use vllm-flash-attn instead of flash-attn by @WoosukKwon in #4686
- [Dynamic Spec Decoding] Auto-disable by the running queue size by @comaniac in #4592
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs by @cadedaniel in #4672
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin by @alexm-neuralmagic in #4626
- [Frontend] add tok/s speed metric to llm class when using tqdm by @MahmoudAshraf97 in #4400
- [Frontend] Move async logic outside of constructor by @DarkLight1337 in #4674
- [Misc] Remove unnecessary ModelRunner imports by @WoosukKwon in #4703
- [Misc] Set block size at initialization & Fix test_model_runner by @WoosukKwon in #4705
- [ROCm] Add support for Punica kernels on AMD GPUs by @kliuae in #3140
- [Bugfix] Fix CLI arguments in OpenAI server docs by @DarkLight1337 in #4709
- [Bugfix] Update grafana.json by @robertgshaw2-neuralmagic in #4711
- [Bugfix] Add logs for all model dtype casting by @mgoin in #4717
- [Model] Snowflake arctic model implementation by @sfc-gh-hazhang in #4652
- [Kernel] [FP8] Improve FP8 linear layer performance by @pcmoritz in #4691
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support by @comaniac in #4535
- [Core][Distributed] refactor pynccl to hold multiple communicators by @youkaichao in #4591
- [Misc] Keep only one implementation of the create_dummy_prompt function. by @AllenDou in #4716
- chunked-prefill-doc-syntax by @simon-mo in #4603
- [Core]fix type annotation for `swap_blocks` by @jikunshang in #4726
- [Misc] Apply a couple g++ cleanups by @stevegrubb in #4719
- [Core] Fix circular reference which leaked llm instance in local dev env by @rkooo567 in #4737
- [Bugfix] Fix CLI arguments in OpenAI server docs by @AllenDou in #4729
- [Speculative decoding] CUDA graph support by @heeju-kim2 in #4295
- [CI] Nits for bad initialization of SeqGroup in testing by @robertgshaw2-neuralmagic in #4748
- [Core][Test] fix function name typo in custom allreduce by @youkaichao in #4750
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API by @CatherineSue in #3734
- [Model] Add support for IBM Granite Code models by @yikangshen in #4636
- [CI/Build] Tweak Marlin Nondeterminism Issues In CI by @robertgshaw2-neuralmagic in #4713
- [CORE] Improvement in ranks code by @SwapnilDreams100 in #4718
- [Core][Distributed] refactor custom allreduce to support multiple tp groups by @youkaichao in #4754
- [CI/Build] Move `test_utils.py` to `tests/utils.py` by @DarkLight1337 in #4425
- [Scheduler] Warning upon preemption and Swapping by @rkooo567 in #4647
- [Misc] Enhance attention selector by @WoosukKwon in #4751
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 by @sangstar in #4208
- [Speculative decoding] Improve n-gram efficiency by @comaniac in #4724
- [Kernel] Use flash-attn for decoding by @skrider in #3648
- [Bugfix] Fix dynamic FP8 quantization for Mixtral by @pcmoritz in #4793
- [Doc] Shorten README by removing supported model list by @zhuohan123 in #4796
- [Doc] Add API reference for offline inference by @DarkLight1337 in #4710
- [Doc] Add meetups to the doc by @zhuohan123 in #4798
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies by @KuntaiDu in #4696
- [Bugfix][Doc] Fix CI failure in...
v0.4.2
Highlights
Features
- Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode (see the sketch after this list). (#4580)
- Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
- Support FlashInfer as attention backend (#4353)
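A hedged sketch of turning chunked prefill on for testing; the model is a placeholder and `max_num_batched_tokens` is an optional knob for the per-step token budget.

```python
from vllm import LLM, SamplingParams

# Enable chunked prefill so long prompts are processed in chunks and decode is prioritized.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # optional: token budget per scheduler step
)
print(llm.generate(["Chunked prefill"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```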
Models and Enhancements
- Add support for Phi-3-mini (#4298, #4372, #4380)
- Add more histogram metrics (#2764, #4523)
- Full tensor parallelism for LoRA layers (#3524)
- Expanding Marlin kernel to support all GPTQ models (#3922, #4466, #4533)
Dependency Upgrade
- Upgrade to `torch==2.3.0` (#4454)
- Upgrade to `tensorizer==2.9.0` (#4467)
- Expansion of AMD test suite (#4267)
Progress and Dev Experience
- Centralize and document all environment variables (#4548, #4574)
- Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
- Progress towards pipeline parallelism (#4512, #4444, #4566)
- Progress towards multiprocessing based executors (#4348, #4402, #4419)
- Progress towards FP8 support (#4343, #4332, #4527)
What's Changed
- [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in #4318
- [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in #4279
- [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in #4218
- [Doc] Add note for docker user by @youkaichao in #4340
- [Misc] Use public API in benchmark_throughput by @zifeitong in #4300
- [Model] Adds Phi-3 support by @caiom in #4298
- [Core] Move ray_utils.py from `engine` to `executor` package by @njhill in #4347
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in #4324
- [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in #4213
- [Doc] README Phi-3 name fix. by @caiom in #4372
- [Core]refactor aqlm quant ops by @jikunshang in #4351
- [Mypy] Typing lora folder by @rkooo567 in #4337
- [Misc] Optimize flash attention backend log by @esmeetu in #4368
- [Core] Add `shutdown()` method to `ExecutorBase` by @njhill in #4349
- [Core] Move function tracing setup to util function by @njhill in #4352
- [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in #4376
- [Bugfix] Fix parameter name in `get_tokenizer` by @DarkLight1337 in #4107
- [Frontend] Add --log-level option to api server by @normster in #4377
- [CI] Disable non-lazy string operation on logging by @rkooo567 in #4326
- [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in #4309
- [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in #4373
- [Misc] add RFC issue template by @youkaichao in #4401
- [Core] Introduce `DistributedGPUExecutor` abstract class by @njhill in #4348
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in #4343
- [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in #4355
- [Misc] Fix logger format typo by @esmeetu in #4396
- [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in #4406
- [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in #3524
- [Model] Phi-3 4k sliding window temp. fix by @caiom in #4380
- [Bugfix][Core] Fix get decoding config from ray by @esmeetu in #4335
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in #4363
- [BugFix] Fix `min_tokens` when `eos_token_id` is None by @njhill in #4389
- ✨ support local cache for models by @prashantgupta24 in #4374
- [BugFix] Fix return type of executor execute_model methods by @njhill in #4402
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in #4418
- [Misc] fix typo in llm_engine init logging by @DefTruth in #4428
- Add more Prometheus metrics by @ronensc in #2764
- [CI] clean docker cache for neuron by @simon-mo in #4441
- [mypy][5/N] Support all typing on model executor by @rkooo567 in #4427
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in #3922
- [CI] hotfix: soft fail neuron test by @simon-mo in #4458
- [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in #4444
- [Misc] Upgrade to `torch==2.3.0` by @mgoin in #4454
- [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in #4463
- [Core]Refactor gptq_marlin ops by @jikunshang in #4466
- [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in #4165
- [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in #4456
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in #4332
- [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in #3467
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by @alpayariyak in #4467
- [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in #4468
- fix_tokenizer_snapshot_download_bug by @kingljl in #4493
- Unable to find Punica extension issue during source code installation by @kingljl in #4494
- [Core] Centralize GPU Worker construction by @njhill in #4419
- [Misc][Typo] type annotation fix by @HarryWu99 in #4495
- [Misc] fix typo in block manager by @Juelianqvq in #4453
- Allow user to define whitespace pattern for outlines by @robcaulk in #4305
- [Misc]Add customized information for models by @jeejeelee in #4132
- [Test] Add ignore_eos test by @rkooo567 in #4519
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in #4173
- [Bugfix] Fix 307 Redirect for `/metrics` by @robertgshaw2-neuralmagic in #4523
- [Doc] update(example model): for OpenAI compatible serving by @fpaupier in #4503
- [Bugfix] Use random seed if seed is -1 by @sasha0552 in #4531
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in #4534
- [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in #4237
- [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in #4142
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by @njhill in #4357
- [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in #4457
- [Bugfix] Add validation for seed by @sasha0552 in #4529
- [Bugfix][Core] Fix and refactor logging stats by @esmeetu in #4336
- [Core][Distributed] fix pynccl del error by @youkaichao in #4508
- [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in #4543
- [Misc] Fix expert_ids shape in MoE by @WoosukKwon in #4517
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in #4273
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens i...
v0.4.1
Highlights
Features
- Support and enhance CommandR+ (#3829), minicpm (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22b (#4073, #4002)
- Support private model registration, and updating our support policy (#3871, #3948)
- Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
- Add option for using LM Format Enforcer for guided decoding (#3868)
- Add option to optionally initialize the tokenizer and detokenizer (#3748); see the sketch after this list
- Add option to load models using `tensorizer` (#3476)
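A hedged sketch of the optional tokenizer/detokenizer initialization (#3748): with `skip_tokenizer_init=True` the engine expects pre-tokenized prompts and can return raw token IDs. The token IDs and model below are placeholders.

```python
from vllm import LLM, SamplingParams

# Skip tokenizer/detokenizer initialization; supply prompts as token IDs.
llm = LLM(model="facebook/opt-125m", skip_tokenizer_init=True)  # placeholder model

out = llm.generate(
    prompt_token_ids=[[2, 100, 50, 7]],  # placeholder token IDs
    sampling_params=SamplingParams(max_tokens=8, detokenize=False),
)
print(out[0].outputs[0].token_ids)
```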
Enhancements
- vLLM is now mostly type checked by `mypy` (#3816, #4006, #4161, #4043)
- Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
- Progress towards speculative decoding (#3250, #3706, #3894)
- Initial FP8 support with dynamic per-tensor scaling (#4118)
Hardware
- Intel CPU inference backend is added (#3993, #3634)
- AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)
What's Changed
- [Kernel] Layernorm performance optimization by @mawong-amd in #3662
- [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
- [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
- [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
- [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
- [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
- [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
- [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
- [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
- [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
- [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
- [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
- [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
- Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
- [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by @mgoin in #3798
- [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
- [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
- [BugFix] Use different mechanism to get vllm version in `is_cpu()` by @njhill in #3804
- [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
- [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
- [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
- [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
- Fixes the argument for local_tokenizer_group by @sighingnow in #3754
- [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
- [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
- [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
- [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
- [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
- [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
- [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
- [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
- [Core] improve robustness of pynccl by @youkaichao in #3860
- [Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
- [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
- [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
- [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
- [Bugfix] Fixing requirements.txt by @noamgat in #3865
- [Misc] Define common requirements by @WoosukKwon in #3841
- Add option to completion API to truncate prompt tokens by @tdoublep in #3144
- [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
- [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
- [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
- [Core] enable out-of-tree model register by @youkaichao in #3871
- [WIP][Core] latency optimization by @youkaichao in #3890
- [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
- [Model] add minicpm by @SUDA-HLT-ywfang in #3893
- [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
- [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration by @Ki6an in #3767
- [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
- [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
- [Core] separate distributed_init from worker by @youkaichao in #3904
- [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
- [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
- [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
- [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
- [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
- [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
- [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
- [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
- [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
- [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
- [Doc] Add doc to state our model support policy by @youkaichao in #3948
- [Bugfix] Remove key sorting for `guided_json` parameter in OpenAi compatible Server by @dmarasco in #3945
- [Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
- [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
- [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
- [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
- [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
- [Test] Add xformer and flash attn tests by @rkooo567 in #3961
- [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
- [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
- [Ke...
v0.4.0.post1, restore sm70/75 support
Highlight
v0.4.0 lacked sm70/75 support; we did a hotfix to restore it.
What's Changed
- [Kernel] Layernorm performance optimization by @mawong-amd in #3662
- [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
- [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
- [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
- [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
- [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
- [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
- [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
- [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
- [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
- [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
- [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
- [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
- Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
- [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by @mgoin in #3798
- [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
New Contributors
- @mawong-amd made their first contribution in #3662
- @Qubitium made their first contribution in #3689
- @bigPYJ1151 made their first contribution in #3634
- @A-Mahla made their first contribution in #3788
Full Changelog: v0.4.0...v0.4.0.post1
v0.4.0
Major changes
Models
- New models: Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183).
- New vision language model: LLaVA (#3042)
Production features
- Automatic prefix caching (#2762, #3703), allowing long system prompts to be automatically cached across requests. Use the flag `--enable-prefix-caching` to turn it on (see the sketch after this list).
- Support `json_object` in the OpenAI server for arbitrary JSON, a `--use-delay` flag to improve time to first token across many requests, and `min_tokens` for EOS suppression.
- Progress in chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
- Custom all reduce kernel has been re-enabled after more robustness fixes.
- Replaced cupy dependency due to its bugs.
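A hedged sketch of automatic prefix caching through the Python API (the `--enable-prefix-caching` flag maps to the same engine argument); the model and prompts are placeholders. The server's `json_object` mode is used through the standard OpenAI `response_format={"type": "json_object"}` request field.

```python
from vllm import LLM, SamplingParams

# With prefix caching enabled, the long shared system prefix is cached and reused
# across the two requests below. Model and prompts are placeholders.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)

system = "You are a terse assistant that answers in one sentence. " * 20
prompts = [
    system + "Question: what is a KV cache?",
    system + "Question: what is paged attention?",
]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```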
Hardware
- Improved Neuron support for AWS Inferentia.
- CMake based build system for extensibility.
Ecosystem
What's Changed
- allow user chose log level by --log-level instead of fixed 'info'. by @AllenDou in #3109
- Reorder kv dtype check to avoid nvcc not found error on AMD platform by @cloudhan in #3104
- Add Automatic Prefix Caching by @SageMoore in #2762
- Add vLLM version info to logs and openai API server by @jasonacox in #3161
- [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark by @zhuohan123 in #3158
- Make it easy to profile workers with nsight by @pcmoritz in #3162
- [DOC] add setup document to support neuron backend by @liangfu in #2777
- [Minor Fix] Remove unused code in benchmark_prefix_caching.py by @gty111 in #3171
- Add document for vllm paged attention kernel. by @pian13131 in #2978
- enable --gpu-memory-utilization in benchmark_throughput.py by @AllenDou in #3175
- [Minor fix] The domain dns.google may cause a socket.gaierror exception by @ttbachyinsda in #3176
- Push logprob generation to LLMEngine by @Yard1 in #3065
- Add health check, make async Engine more robust by @Yard1 in #3015
- Fix the openai benchmarking requests to work with latest OpenAI apis by @wangchen615 in #2992
- [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by @hongxiayang in #3123
- Store `eos_token_id` in `Sequence` for easy access by @njhill in #3166
- [Fix] Avoid pickling entire LLMEngine for Ray workers by @njhill in #3207
- [Tests] Add block manager and scheduler tests by @rkooo567 in #3108
- [Testing] Fix core tests by @cadedaniel in #3224
- A simple addition of `dynamic_ncols=True` by @chujiezheng in #3242
- Add GPTQ support for Gemma by @TechxGenus in #3200
- Update requirements-dev.txt to include package for benchmarking scripts. by @wangchen615 in #3181
- Separate attention backends by @WoosukKwon in #3005
- Measure model memory usage by @mgoin in #3120
- Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) by @jacobthebanana in #3263
- Fix auto prefix bug by @ElizaWszola in #3239
- Connect engine healthcheck to openai server by @njhill in #3260
- Feature add lora support for Qwen2 by @whyiug in #3177
- [Minor Fix] Fix comments in benchmark_serving by @gty111 in #3252
- [Docs] Fix Unmocked Imports by @ywang96 in #3275
- [FIX] Make `flash_attn` optional by @WoosukKwon in #3269
- Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir by @mgoin in #3241
- [FIX] Fix prefix test error on main by @zhuohan123 in #3286
- [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by @cadedaniel in #3103
- Enhance lora tests with more layer and rank variations by @tterrysun in #3243
- [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by @dllehr-amd in #3262
- [BugFix] Fix get tokenizer when using ray by @esmeetu in #3301
- [Fix] Fix best_of behavior when n=1 by @njhill in #3298
- Re-enable the 80 char line width limit by @zhuohan123 in #3305
- [docs] Add LoRA support information for models by @pcmoritz in #3299
- Add distributed model executor abstraction by @zhuohan123 in #3191
- [ROCm] Fix warp and lane calculation in blockReduceSum by @kliuae in #3321
- Support Mistral Model Inference with transformers-neuronx by @DAIZHENWEI in #3153
- docs: Add BentoML deployment doc by @Sherlock113 in #3336
- Fixes #1556 double free by @br3no in #3347
- Add kernel for GeGLU with approximate GELU by @WoosukKwon in #3337
- [Fix] fix quantization arg when using marlin by @DreamTeamWangbowen in #3319
- add hf_transfer to requirements.txt by @RonanKMcGovern in #3031
- fix bias in if, ambiguous by @hliuca in #3259
- [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by @chenxu2048 in #3256
- Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by @orsharir in #3350
- Add batched RoPE kernel by @tterrysun in #3095
- Fix lint by @Yard1 in #3388
- [FIX] Simpler fix for async engine running on ray by @zhuohan123 in #3371
- [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by @simon-mo in #3383
- allow user to chose which vllm's merics to display in grafana by @AllenDou in #3393
- [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by @youkaichao in #3389
- Install `flash_attn` in Docker image by @tdoublep in #3396
- Add args for mTLS support by @declark1 in #3410
- [issue templates] add some issue templates by @youkaichao in #3412
- Fix assertion failure in Qwen 1.5 with prefix caching enabled by @chenxu2048 in #3373
- fix marlin config repr by @qeternity in #3414
- Feature: dynamic shared mem moe_align_block_size_kernel by @akhoroshev in #3376
- [Misc] add HOST_IP env var by @youkaichao in #3419
- Add chat templates for Falcon by @Dinghow in #3420
- Add chat templates for ChatGLM by @Dinghow in #3418
- Fix `dist.broadcast` stall without group argument by @GindaChen in #3408
- Fix tie_word_embeddings for Qwen2. by @fyabc in #3344
- [Fix] Add args for mTLS support by @declark1 in #3430
- Fixes the misuse/mixuse of time.time()/time.monotonic() by @sighingnow in #3220
- [Misc] add error message in non linux platform by @youkaichao in #3438
- Fix issue templates by @hmellor in #3436
- fix document error for value and v_vec illustration by @laneeeee in #3421
- Asynchronous tokenization by @Yard1 in #2879
- Removed Extraneous Print Message From OAI Server by @robertgshaw2-neuralmagic in #3440
- [Misc] PR templates by @youkaichao in #3413
- Fixes the incorrect argument in the prefix-prefill test cases by @sighingnow in #3246
- Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning by @ronensc in #2958
- Fix Baichuan chat template by @Dinghow in #3340
- ...