Releases · vllm-project/vllm

20 Feb 17:08

github-actions

v0.7.3

ed6e907

v0.7.3 Latest

Latest

Highlights

🎉 253 commits from 93 contributors, including 29 new contributors!

Deepseek enhancements:
- Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755)
- AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199)
- Using FlashAttention3 for MLA (#12807)
- Align the expert selection code path with official implementation (#13474)
- Optimize moe_align_block_size for deepseek_v3 (#12850)
V1 Engine:
- LoRA Support (#10957, #12883)
- Logprobs and prompt logprobs support (#9880), min_p sampling support (#13191), logit_bias in v1 Sampler (#13079)
- Use msgpack for core request serialization (#12918)
- Pipeline parallelism support (#12996, #13353, #13472, #13417, #13315)
- Metrics enhancements: GPU prefix cache hit rate % gauge (#12592), iteration_tokens_total histogram (#13288), several request timing histograms (#12644)
- Initial speculative decoding support with ngrams (#12193, #13365)

Model Support

Enhancement to Qwen2.5-VL: BNB support (#12944), LoRA (#13261), Optimizations (#13155)
Support Unsloth Dynamic 4bit BnB quantization (#12974)
IBM/NASA Prithvi Geospatial model (#12830)
Support Mamba2 (Codestral Mamba) (#9292), Bamba Model (#10909)
Ultravox Model: Support v0.5 Release (#12912)
transformers backend
- Enable quantization support for transformers backend (#12960)
- Set torch_dtype in TransformersModel (#13088)
VLM:
- Implement merged multimodal processor for Mllama (#11427), GLM4V (#12449), Molmo (#12966)
- Separate text-only and vision variants of the same model architecture (#13157)

Hardware Support

Pluggable platform-specific scheduler (#13161)
NVIDIA: Support nvfp4 quantization (#12784)
AMD:
- Per-Token-Activation Per-Channel-Weight FP8 (#12501)
- Tuning for Mixtral on MI325 and Qwen MoE on MI300 (#13503), Mixtral8x7B on MI300 (#13577)
- Add intial ROCm support to V1 (#12790)
TPU: V1 Support (#13049)
Neuron: Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
Gaudi:
- Support Contiguous Cache Fetch (#12139)
- Enable long-contexts + LoRA support (#12812)

Engine Feature

Add sleep and wake up endpoint and v1 support (#12987)
Add /v1/audio/transcriptions OpenAI API endpoint (#12909)

Performance

Reduce TTFT with concurrent partial prefills (#10235)
LoRA - Refactor sgmv kernels (#13110)

Others

Make vLLM compatible with veRL (#12824)
Fixes for cases of FA2 illegal memory access error (#12848)
choice-based structured output with xgrammar (#12632)
Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)

What's Changed

[Misc] Update w2 scale loading for GPTQMarlinMoE by @dsikka in #12757
[Docs] Add Google Cloud Slides by @simon-mo in #12814
[Attention] Use FA3 for MLA on Hopper by @LucasWilkinson in #12807
[misc] Reduce number of config file requests to HuggingFace by @khluu in #12797
[Misc] Remove unnecessary decode call by @DarkLight1337 in #12833
[Kernel] Make rotary_embedding ops more flexible with input shape by @Isotr0py in #12777
[torch.compile] PyTorch 2.6 and nightly compatibility by @youkaichao in #12393
[Doc] double quote cmake package in build.inc.md by @jitseklomp in #12840
[Bugfix] Fix unsupported FA version check for Turing GPU by @Isotr0py in #12828
[V1] LoRA Support by @varun-sundar-rabindranath in #10957
Add Bamba Model by @fabianlim in #10909
[MISC] Check space in the file names in the pre commit checks by @houseroad in #12804
[misc] Revert # 12833 by @khluu in #12857
[Bugfix] FA2 illegal memory access by @LucasWilkinson in #12848
Make vllm compatible with verl by @ZSL98 in #12824
[Bugfix] Missing quant_config in deepseek embedding layer by @SzymonOzog in #12836
Prevent unecessary requests to huggingface hub by @maxdebayser in #12837
[MISC][EASY] Break check file names into entry and args in the pre-commit hooks by @houseroad in #12880
[Misc] Remove unnecessary detokenization in multimodal processing by @DarkLight1337 in #12868
[Model] Add support for partial rotary embeddings in Phi3 model by @garg-amit in #12718
[V1] Logprobs and prompt logprobs support by @afeldman-nm in #9880
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing by @tjtanaa in #12501
[V1] LM Eval With Streaming Integration Tests by @robertgshaw2-redhat in #11590
[Bugfix] Fix disagg hang caused by the prefill and decode communication issues by @houseroad in #12723
[V1][Minor] Remove outdated comment by @WoosukKwon in #12928
[V1] Move KV block hashes from Request to KVCacheManager by @WoosukKwon in #12922
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping by @jeejeelee in #12905
[Misc] Fix typo in the example file by @DK-DARKmatter in #12896
[Bugfix] Fix multi-round chat error when mistral tokenizer is used by @zifeitong in #12859
[bugfix] respect distributed_executor_backend in world_size=1 by @youkaichao in #12934
[Misc] Add offline test for disaggregated prefill by @Shaoting-Feng in #12418
[V1][Minor] Move cascade attn logic outside _prepare_inputs by @WoosukKwon in #12943
[Build] Make pypi install work on CPU platform by @wangxiyuan in #12874
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi by @SanjuCSudhakaran in #12812
[misc] Add LoRA to benchmark_serving by @varun-sundar-rabindranath in #12898
[Misc] Log time consumption on weight downloading by @waltforme in #12926
[CI] Resolve transformers-neuronx version conflict by @liangfu in #12925
[Doc] Correct HF repository for TeleChat2 models by @waltforme in #12949
[Misc] Add qwen2.5-vl BNB support by @Isotr0py in #12944
[CI/Build] Auto-fix Markdown files by @DarkLight1337 in #12941
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU by @ShangmingCai in #12935
[bugfix] fix early import of flash attention by @youkaichao in #12959
[VLM] Merged multi-modal processor for GLM4V by @jeejeelee in #12449
[V1][Minor] Remove outdated comment by @WoosukKwon in #12968
[RFC] [Mistral] FP8 format by @patrickvonplaten in #10130
[V1] Cache uses_mrope in GPUModelRunner by @WoosukKwon in #12969
[core] port pynvml into vllm codebase by @youkaichao in #12963
[MISC] Always import version library first in the vllm package by @houseroad in #12979
[core] improve error handling when wake up from sleep mode by @youkaichao in #12981
[core][rlhf] add colocate example for RLHF by @youkaichao in #12984
[V1] Use msgpack for core request serialization by @njhill in #12918
[Bugfix][Platform] Check whether selected backend is None in get_attn_backend_cls() by @terrytangyuan in #12975
[core] fix sleep mode and pytorch checkpoint compatibility by @youkaichao in #13001
[Doc] Add link to tool_choice tracking issue in tool_calling.md by @terrytangyuan in #13003
[misc] Add retries with exponential backoff for HF file existence check by @khluu in #13008
[Bugfix] Clean up and fix multi-modal processors by @DarkLight1337 in #13012
Fix seed parameter behavior in vLLM by @SmartManoj in #13007
[Model] Ultravox Model: Support v0.5 Release by @farzadab in #12912
[misc] Fix setup.py condition to avoid AMD from being mistaken with CPU by @khluu in #13022
[V1][Minor] Move scheduler outputs to a separate file by @WoosukKwon in https://github.com/vllm-project/vllm...

Contributors

markmc, rasmith, and 91 other contributors

Assets 5

06 Feb 07:30

github-actions

v0.7.2

0408efc

v0.7.2

Highlights

Qwen2.5-VL is now supported in vLLM. Please note that it requires a source installation from Hugging Face transformers library at the moment (#12604)
Add transformers backend support via --model-impl=transformers. This allows vLLM to be ran with arbitrary Hugging Face text models (#11330, #12785, #12727).
Performance enhancement to DeepSeek models.
- Align KV caches entries to start 256 byte boundaries, yielding 43% throughput enhancement (#12676)
- Apply torch.compile to fused_moe/grouped_topk, yielding 5% throughput enhancement (#12637)
- Enable MLA for DeepSeek VL2 (#12729)
- Enable DeepSeek model on ROCm (#12662)

Core Engine

Use VLLM_LOGITS_PROCESSOR_THREADS to speed up structured decoding in high batch size scenarios (#12368)

Security Update

Improve hash collision avoidance in prefix caching (#12621)
Add SPDX-License-Identifier headers to python source files (#12628)

Other

Enable FusedSDPA support for Intel Gaudi (HPU) (#12359)

What's Changed

Apply torch.compile to fused_moe/grouped_topk by @mgoin in #12637
doc: fixing minor typo in readme.md by @vicenteherrera in #12643
[Bugfix] fix moe_wna16 get_quant_method by @jinzhen-lin in #12648
[Core] Silence unnecessary deprecation warnings by @russellb in #12620
[V1][Minor] Avoid frequently creating ConstantList by @WoosukKwon in #12653
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager by @ShawnD200 in #12608
[Hardware][Intel GPU] add XPU bf16 support by @jikunshang in #12392
[Misc] Add SPDX-License-Identifier headers to python source files by @russellb in #12628
[doc][misc] clarify VLLM_HOST_IP for multi-node inference by @youkaichao in #12667
[Doc] Deprecate Discord by @zhuohan123 in #12668
[Kernel] port sgl moe_align_block_size kernels by @chenyang78 in #12574
make sure mistral_common not imported for non-mistral models by @youkaichao in #12669
Properly check if all fused layers are in the list of targets by @eldarkurtic in #12666
Fix for attention layers to remain unquantized during moe_wn16 quant by @srikanthsrnvs in #12570
[cuda] manually import the correct pynvml module by @youkaichao in #12679
[ci/build] fix gh200 test by @youkaichao in #12681
[Model]: Add transformers backend support by @ArthurZucker in #11330
[Misc] Fix improper placement of SPDX header in scripts by @russellb in #12694
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm by @tlrmchlsmth in #12696
Squelch MLA warning for Compressed-Tensors Models by @kylesayrs in #12704
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 by @kushanam in #12707
[MISC] Remove model input dumping when exception by @comaniac in #12582
[V1] Revert uncache_blocks and support recaching full blocks by @comaniac in #12415
[Core] Improve hash collision avoidance in prefix caching by @russellb in #12621
Support Pixtral-Large HF by using llava multimodal_projector_bias config by @mgoin in #12710
[Doc] Replace ibm-fms with ibm-ai-platform by @tdoublep in #12709
[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs by @kylesayrs in #12711
[AMD][ROCm] Enable DeepSeek model on ROCm by @hongxiayang in #12662
[Misc] Add BNB quantization for Whisper by @jeejeelee in #12381
[VLM] Merged multi-modal processor for InternVL-based models by @DarkLight1337 in #12553
[V1] Remove constraints on partial requests by @WoosukKwon in #12674
[VLM] Implement merged multimodal processor and V1 support for idefics3 by @Isotr0py in #12660
[Model] [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small by @mgtk77 in #12689
Avoid unnecessary multi-modal input data copy when len(batch) == 1 by @imkero in #12722
[Build] update requirements of no-device for plugin usage by @sducouedic in #12630
[Bugfix] Fix CI failures for InternVL and Mantis models by @DarkLight1337 in #12728
[V1][Metrics] Add request_success_total counter, labelled with finish reason by @markmc in #12579
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) by @LucasWilkinson in #12676
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS by @akeshet in #12368
[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling by @maleksan85 in #12713
Refactor Linear handling in TransformersModel by @hmellor in #12727
[VLM] Add MLA with pure RoPE support for deepseek-vl2 models by @Isotr0py in #12729
[Misc] Bump the compressed-tensors version by @dsikka in #12736
[Model][Quant] Fix GLM, Fix fused module mappings for quantization by @kylesayrs in #12634
[Doc] Update PR Reminder with link to Developer Slack by @mgoin in #12748
[Bugfix] Fix OpenVINO model runner by @hmellor in #12750
[V1][Misc] Shorten FinishReason enum and use constant strings by @njhill in #12760
[Doc] Remove performance warning for auto_awq.md by @mgoin in #12743
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 by @Akashcodes732 in #12546
[core][distributed] exact ray placement control by @youkaichao in #12732
[Kernel] Use self.kv_cache and forward_context.attn_metadata in Attention.forward by @heheda12345 in #12536
[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) by @SanjuCSudhakaran in #12359
Add: Support for Sparse24Bitmask Compressed Models by @rahul-tuli in #12097
[VLM] Use shared field to pass token ids to model by @DarkLight1337 in #12767
[Docs] Drop duplicate [source] links by @russellb in #12780
[VLM] Qwen2.5-VL by @ywang96 in #12604
[VLM] Update compatibility with transformers 4.49 by @DarkLight1337 in #12781
Quantization and MoE configs for GH200 machines by @arvindsun in #12717
[ROCm][Kernel] Using the correct warp_size value by @gshtras in #12789
[Bugfix] Better FP8 supported defaults by @LucasWilkinson in #12796
[Misc][Easy] Remove the space from the file name by @houseroad in #12799
[Model] LoRA Support for Ultravox model by @thedebugger in #11253
[Bugfix] Fix the test_ultravox.py's license by @houseroad in #12806
Improve TransformersModel UX by @hmellor in #12785
[Misc] Remove duplicated DeepSeek V2/V3 model definition by @mgoin in #12793
[Misc] Improve error message for incorrect pynvml by @youkaichao in #12809

New Contributors

@vicenteherrera made their first contribution in #12643
@chenyang78 made their first contribution in #12574
@srikanthsrnvs made their first contribution in #12570
@ArthurZucker made their first contribution in #11330
@mgtk77 made their first contribution in #12689
@sducouedic made their first contribution in #12630
@akeshet made their first contribution in #12368
@arvindsun made their first contribution in #12717
@thedebugger made their first contribution in ht...

Contributors

markmc, russellb, and 39 other contributors

Assets 5

01 Feb 18:02

github-actions

v0.7.1

4f4d427

v0.7.1

Highlights

This release features MLA optimization for Deepseek family of models. Compared to v0.7.0 released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism

MLA Kernel (#12601, #12642,#12528).
FP8 Kernels (#11589, #11868, #12587)

V1

For the V1 architecture, we

Added a new design document for zero overhead prefix caching here (#12598)
Add metrics and enhance logging for V1 engine (#12569, #12561, #12416, #12516, #12530, #12478)

Models

New Model: MiniCPM-o (text outputs only) (#12069)

Hardwares

Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
AMD: llama 3.2 support upstreaming (#12421)

Others

Support override generation config in engine arguments (#12409)
Support reasoning content in API for deepseek R1 (#12473)

What's Changed

[Bugfix] Fix missing seq_start_loc in xformers prefill metadata by @Isotr0py in #12464
[V1][Minor] Minor optimizations for update_from_output by @WoosukKwon in #12454
[Bugfix] Fix gpt2 GGUF inference by @Isotr0py in #12467
[Build] Only build 9.0a for scaled_mm and sparse kernels by @LucasWilkinson in #12339
[V1][Metrics] Add initial Prometheus logger by @markmc in #12416
[V1][CI/Test] Do basic test for top-p & top-k sampling by @WoosukKwon in #12469
[FlashInfer] Upgrade to 0.2.0 by @abmfy in #11194
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill by @NickLucche in #10132
Update pre-commit hooks by @hmellor in #12475
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache by @liangfu in #11277
Fix bad path in prometheus example by @mgoin in #12481
[CI/Build] Fixed the xla nightly issue report in #12451 by @hosseinsarshar in #12453
[FEATURE] Enables offline /score for embedding models by @gmarinho2 in #12021
[CI] fix pre-commit error by @MengqingCao in #12494
Update README.md with V1 alpha release by @ywang96 in #12495
[V1] Include Engine Version in Logs by @robertgshaw2-redhat in #12496
[Core] Make raw_request optional in ServingCompletion by @schoennenbeck in #12503
[VLM] Merged multi-modal processor and V1 support for Qwen-VL by @DarkLight1337 in #12504
[Doc] Fix typo for x86 CPU installation by @waltforme in #12514
[V1][Metrics] Hook up IterationStats for Prometheus metrics by @markmc in #12478
Replace missed warning_once for rerank API by @mgoin in #12472
Do not run suggestion pre-commit hook multiple times by @hmellor in #12521
[V1][Metrics] Add per-request prompt/generation_tokens histograms by @markmc in #12516
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels by @fenghuizhang in #12482
[TPU] Add example for profiling TPU inference by @mgoin in #12531
[Frontend] Support reasoning content for deepseek r1 by @gaocegege in #12473
[Doc] Convert docs to use colon fences by @hmellor in #12471
[V1][Metrics] Add TTFT and TPOT histograms by @markmc in #12530
Bugfix for whisper quantization due to fake k_proj bias by @mgoin in #12524
[V1] Improve Error Message for Unsupported Config by @robertgshaw2-redhat in #12535
Fix the pydantic logging validator by @maxdebayser in #12420
[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense by @tjohnson31415 in #12347
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM by @HwwwwwwwH in #12069
[Frontend] Support override generation config in args by @liuyanyi in #12409
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. by @pavanimajety in #11787
[Kernel] add triton fused moe kernel for gptq/awq by @jinzhen-lin in #12185
Revert "[Build/CI] Fix libcuda.so linkage" by @tlrmchlsmth in #12552
[V1][BugFix] Free encoder cache for aborted requests by @WoosukKwon in #12545
[Misc][MoE] add Deepseek-V3 moe tuning support by @divakar-amd in #12558
[V1][Metrics] Add GPU cache usage % gauge by @markmc in #12561
Set ?device={device} when changing tab in installation guides by @hmellor in #12560
[Misc] fix typo: add missing space in lora adapter error message by @Beim in #12564
[Kernel] Triton Configs for Fp8 Block Quantization by @robertgshaw2-redhat in #11589
[CPU][PPC] Updated torch, torchvision, torchaudio dependencies by @npanpaliya in #12555
[V1][Log] Add max request concurrency log to V1 by @mgoin in #12569
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling by @LucasWilkinson in #11868
[ROCm][AMD][Model] llama 3.2 support upstreaming by @maleksan85 in #12421
[Attention] MLA decode optimizations by @LucasWilkinson in #12528
[Bugfix] Gracefully handle huggingface hub http error by @ywang96 in #12571
Add favicon to docs by @hmellor in #12611
[BugFix] Fix Torch.Compile For DeepSeek by @robertgshaw2-redhat in #12594
[Git] Automatically sign-off commits by @comaniac in #12595
[Docs][V1] Prefix caching design by @comaniac in #12598
[v1][Bugfix] Add extra_keys to block_hash for prefix caching by @heheda12345 in #12603
[release] Add input step to ask for Release version by @khluu in #12631
[Bugfix] Revert MoE Triton Config Default by @robertgshaw2-redhat in #12629
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 by @tlrmchlsmth in #12587
[Feature] Fix guided decoding blocking bitmask memcpy by @xpbowler in #12563
[Doc] Improve installation signposting by @hmellor in #12575
[Doc] int4 w4a16 example by @brian-dellabetta in #12585
[V1] Bugfix: Validate Model Input Length by @robertgshaw2-redhat in #12600
[BugFix] fix wrong output when using lora and num_scheduler_steps=8 by @sleepwalker2017 in #11161
Fix target matching for fused layers with compressed-tensors by @eldarkurtic in #12617
[ci] Upgrade transformers to 4.48.2 in CI dependencies by @khluu in #12599
[Bugfix/CI] Fixup benchmark_moe.py by @tlrmchlsmth in #12562
Fix: Respect sparsity_config.ignore in Cutlass Integration by @rahul-tuli in #12517
[Attention] Deepseek v3 MLA support with FP8 compute by @LucasWilkinson in #12601
[CI/Build] Add label automation for structured-output, speculative-decoding, v1 by @russellb in #12280
Disable chunked prefill and/or prefix caching when MLA is enabled by @simon-mo in #12642

New Contributors

@abmfy made their first contribution in #11194
@hosseinsarshar made their first contribution in #12453
@gmarinho2 made their first contribution in #12021
@waltforme made their first contribution in #12514
@fenghuizhang made their first contribution in #12482
@gaocegege made their first contribution in #12473
@Beim made their first contribution in https://github.com/vllm-pro...

Contributors

markmc, russellb, and 38 other contributors

Assets 5

27 Jan 05:50

github-actions

v0.7.0

5204ff5

v0.7.0

Highlights

vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting environment variable VLLM_USE_V1=1. See our blog for more details. (44 commits).
New methods (LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache) in vLLM for the post training frameworks! (#12361, #12084, #12284).
torch.compile is now fully integrated in vLLM, and enabled by default in V1. You can turn it on via -O3 engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246).

This release features

400 commits from 132 contributors, including 57 new contributors.
- 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmark (#10704).
- 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
- more than 161 bug fixes and miscellaneous enhancements

Features

Models

New generative models: CogAgent (#11742), Deepseek-VL2 (#11578, #12068, #12169), fairseq2 Llama (#11442), InternLM3 (#12037), Whisper (#11280)
New pooling models: Qwen2 PRM (#12202), InternLM2 reward models (#11571)
VLM: Merged multi-modal processor is now ready for model developers! (#11620, #11900, #11682, #11717, #11669, #11396)
- Any model that implements merged multi-modal processor and the get_*_embeddings methods according to this guide is automatically supported by V1 engine.

Hardwares

Apple: Native support for macOS Apple Silicon (#11696)
AMD: MI300 FP8 format for block_quant (#12134), Tuned MoE configurations for multiple models (#12408, #12049), block size heuristic for avg 2.8x speedup for int8 models (#11698)
TPU: support for W8A8 (#11785)
x86: Multi-LoRA (#11100) and MoE Support (#11831)
Progress in out-of-tree hardware support (#12009, #11981, #11948, #11609, #12264, #11516, #11503, #11369, #11602)

Features

Distributed:
- Support torchrun and SPMD-style offline inference (#12071)
- New collective_rpc abstraction (#12151, #11256)
API Server: Jina- and Cohere-compatible Rerank API (#12376)
Kernels:
- Flash Attention 3 Support (#12093)
- Punica prefill kernels fusion (#11234)
- For Deepseek V3: optimize moe_align_block_size for cuda graph and large num_experts (#12222)

Others

Benchmark: new script for CPU offloading (#11533)
Security: Set weights_only=True when using torch.load() (#12366)

What's Changed

[Docs] Document Deepseek V3 support by @simon-mo in #11535
Update openai_compatible_server.md by @robertgshaw2-redhat in #11536
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
[V1] Fix yapf by @WoosukKwon in #11538
[CI] Fix broken CI by @robertgshaw2-redhat in #11543
[misc] fix typing by @youkaichao in #11540
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-redhat in #11534
[BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-redhat in #11547
[Platform] Move model arch check to platform by @MengqingCao in #11503
Update deploying_with_k8s.md with AMD ROCm GPU example by @AlexHe99 in #11465
[Bugfix] Fix TeleChat2ForCausalLM weights mapper by @jeejeelee in #11546
[Misc] Abstract out the logic for reading and writing media content by @DarkLight1337 in #11527
[Doc] Add xgrammar in doc by @Chen-0210 in #11549
[VLM] Support caching in merged multi-modal processor by @DarkLight1337 in #11396
[MODEL] Update LoRA modules supported by Jamba by @ErezSC42 in #11209
[Misc]Add BNB quantization for MolmoForCausalLM by @jeejeelee in #11551
[Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix by @Isotr0py in #11566
[Bugfix] Fix for ROCM compressed tensor support by @selalipop in #11561
[Doc] Update mllama example based on official doc by @heheda12345 in #11567
[V1] [4/N] API Server: ZMQ/MP Utilities by @robertgshaw2-redhat in #11541
[Bugfix] Last token measurement fix by @rajveerb in #11376
[Model] Support InternLM2 Reward models by @Isotr0py in #11571
[Model] Remove hardcoded image tokens ids from Pixtral by @ywang96 in #11582
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version by @hj-wei in #11515
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor by @WoosukKwon in #11581
[Doc] Minor documentation fixes by @DarkLight1337 in #11580
[bugfix] interleaving sliding window for cohere2 model by @youkaichao in #11583
[V1] [5/N] API Server: unify Detokenizer and EngineCore input by @robertgshaw2-redhat in #11545
[Doc] Convert list tables to MyST by @DarkLight1337 in #11594
[v1][bugfix] fix cudagraph with inplace buffer assignment by @youkaichao in #11596
[Misc] Use registry-based initialization for KV cache transfer connector. by @KuntaiDu in #11481
Remove print statement in DeepseekScalingRotaryEmbedding by @mgoin in #11604
[v1] fix compilation cache by @youkaichao in #11598
[Docker] bump up neuron sdk v2.21 by @liangfu in #11593
[Build][Kernel] Update CUTLASS to v3.6.0 by @tlrmchlsmth in #11607
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels by @bigPYJ1151 in #11618
[platforms] enable platform plugins by @youkaichao in #11602
[VLM] Abstract out multi-modal data parsing in merged processor by @DarkLight1337 in #11620
[V1] [6/N] API Server: Better Shutdown by @robertgshaw2-redhat in #11586
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel by @whyiug in #11631
[benchmark] Remove dependency for H100 benchmark step by @khluu in #11572
[Model][LoRA]LoRA support added for MolmoForCausalLM by @ayylemao in #11439
[Bugfix] Fix OpenAI parallel sampling when using xgrammar by @mgoin in #11637
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) by @JohnGiorgi in #6909
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. by @sakunkun in #11565
[V1] Simpify vision block hash for prefix caching by removing offset from hash by @heheda12345 in #11646
[V1][VLM] V1 support for selected single-image models. by @ywang96 in #11632
[Benchmark] Add benchmark script for CPU offloading by @ApostaC in #11533
[Bugfix][Refactor] Unify model management in frontend by @joerunde in #11660
[VLM] Add max-count checking in data parser for single image models by @DarkLight1337 in #11661
[Misc] Optimize Qwen2-VL LoRA test by @jeejeelee in #11663
[Misc] Replace space with - in the file names by @houseroad in #11667
[Doc] Fix typo by @serihiro in #11666
[V1] Implement Cascade Attention by @WoosukKwon in #11635
[VLM] Move supported limits and max tokens to merged multi-modal processor by @DarkLight1337 in #11669
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input by @DarkLight1337 in #11674
[mypy] Pass type checking in vllm/inputs by @CloseChoice in #11680
[VLM] Merged multi-modal processor for LLaVA-NeXT by @DarkLight1337 in #11682
According to vllm.EngineArgs, the name should be distributed_executor_backend by @chunyang-wen in #11689
[Bugfix] Free cross attention block table for preempted-for-recompute sequence group. by @kathyyu-google in #10013
[V1]...

Contributors

zhouyuan, janimo, and 130 other contributors

Assets 4

27 Dec 06:24

github-actions

v0.6.6.post1

2339d59

v0.6.6.post1

This release restore functionalities for other quantized MoEs, which was introduced as part of initial DeepSeek V3 support 🙇 .

What's Changed

[Docs] Document Deepseek V3 support by @simon-mo in #11535
Update openai_compatible_server.md by @robertgshaw2-neuralmagic in #11536
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
[V1] Fix yapf by @WoosukKwon in #11538
[CI] Fix broken CI by @robertgshaw2-neuralmagic in #11543
[misc] fix typing by @youkaichao in #11540
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-neuralmagic in #11534
[BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-neuralmagic in #11547

Full Changelog: v0.6.6...v0.6.6.post1

Contributors

simon-mo, youkaichao, and 2 other contributors

Assets 3

27 Dec 00:12

github-actions

v0.6.6

f49777b

v0.6.6

Highlights

Support Deepseek V3 (#11523, #11502) model.
- On 8xH200s or MI300x: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192. The context length can be increased to about 32K beyond running into memory issue.
- For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference
- We are just getting started for enhancing the support and unlock more performance. See #11539 for planned work.
Last mile stretch for V1 engine refactoring: API Server (#11529, #11530), penalties for sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107,#11472)
Breaking change: X-Request-ID echoing is now opt-in instead of on by default for performance reason. Set --enable-request-id-headers to enable it.

Model Support

IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
Add QVQ and QwQ to the list of supported models (#11509)

Performance

Cutlass 2:4 Sparsity + FP8/INT8 Quant Support (#10995)

Production Engine

Support streaming model from S3 using RunAI Model Streamer as optional loader (#10192)
Online Pooling API (#11457)
Load video from base64 (#11492)

Others

Add pypi index for every commit and nightly build (#11404)

What's Changed

[Bugfix] Set temperature=0.7 in test_guided_choice_chat by @mgoin in #11264
[V1] Prefix caching for vision language models by @comaniac in #11187
[Bugfix] Restore support for larger block sizes by @kzawora-intel in #11259
[Bugfix] Fix guided decoding with tokenizer mode mistral by @wallashss in #11046
[MISC][XPU]update ipex link for CI fix by @yma11 in #11278
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support by @dsikka in #10995
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests by @Isotr0py in #11263
[CI][Misc] Remove Github Action Release Workflow by @simon-mo in #11274
[FIX] update openai version by @jikunshang in #11287
[Bugfix] fix minicpmv test by @joerunde in #11304
[V1] VLM - enable processor cache by default by @alexm-neuralmagic in #11305
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) by @tlrmchlsmth in https://github.com//pull/11311
[Model] IBM Granite 3.1 by @tjohnson31415 in #11307
[CI] Expand test_guided_generate to test all backends by @mgoin in #11313
[V1] Simplify prefix caching logic by removing num_evictable_computed_blocks by @heheda12345 in #11310
[VLM] Merged multimodal processor for Qwen2-Audio by @DarkLight1337 in #11303
[Kernel] Refactor Cutlass c3x by @varun-sundar-rabindranath in #10049
[Misc] Optimize ray worker initialization time by @ruisearch42 in #11275
[misc] benchmark_throughput : Add LoRA by @varun-sundar-rabindranath in #11267
[Feature] Add load generation config from model by @liuyanyi in #11164
[Bugfix] Cleanup Pixtral HF code by @DarkLight1337 in #11333
[Model] Add JambaForSequenceClassification model by @yecohn in #10860
[V1] Fix multimodal profiling for Molmo by @ywang96 in #11325
[Model] Refactor Qwen2-VL to use merged multimodal processor by @Isotr0py in #11258
[Misc] Clean up and consolidate LRUCache by @DarkLight1337 in #11339
[Bugfix] Fix broken CPU compressed-tensors test by @Isotr0py in #11338
[Misc] Remove unused vllm/block.py by @Ghjk94522 in #11336
[CI] Adding CPU docker pipeline by @zhouyuan in #11261
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 by @Akashcodes732 in #11331
[ci][gh200] dockerfile clean up by @youkaichao in #11351
[Misc] Add tqdm progress bar during graph capture by @mgoin in #11349
[Bugfix] Fix spec decoding when seed is none in a batch by @wallashss in #10863
[misc] add early error message for custom ops by @youkaichao in #11355
[doc] backward compatibility for 0.6.4 by @youkaichao in #11359
[V1] Fix profiling for models with merged input processor by @ywang96 in #11370
[CI/Build] fix pre-compiled wheel install for exact tag by @dtrifiro in #11373
[Core] Loading model from S3 using RunAI Model Streamer as optional loader by @omer-dayan in #10192
[Bugfix] Don't log OpenAI field aliases as ignored by @mgoin in #11378
[doc] explain nccl requirements for rlhf by @youkaichao in #11381
Add ray[default] to wget to run distributed inference out of box by @Jeffwan in #11265
[V1][Bugfix] Skip hashing empty or None mm_data by @WoosukKwon in #11386
[Bugfix] update should_ignore_layer by @horheynm in #11354
[V1] Make AsyncLLMEngine v1-v0 opaque by @rickyyx in #11383
[Bugfix] Fix issues for Pixtral-Large-Instruct-2411 by @ywang96 in #11393
[CI] Fix flaky entrypoint tests by @ywang96 in #11403
[cd][release] add pypi index for every commit and nightly build by @youkaichao in #11404
[cd][release] fix race conditions by @youkaichao in #11407
[Bugfix] Fix fully sharded LoRAs with Mixtral by @n1hility in #11390
[CI] Unboock H100 Benchmark by @simon-mo in #11419
[misc][perf] remove old code by @youkaichao in #11425
mypy type checking for vllm/worker by @lucas-tucker in #11418
[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF by @mgoin in #11389
[Bugfix] torch nightly version in ROCm installation guide by @terrytangyuan in #11423
[Misc] Add assertion and helpful message for marlin24 compressed models by @dsikka in #11388
[Misc] add w8a8 asym models by @dsikka in #11075
[CI] Expand OpenAI test_chat.py guided decoding tests by @mgoin in #11048
[Bugfix] Add kv cache scales to gemma2.py by @mgoin in #11269
[Doc] Fix typo in the help message of '--guided-decoding-backend' by @yansh97 in #11440
[Docs] Convert rST to MyST (Markdown) by @rafvasq in #11145
[V1] TP Ray executor by @ruisearch42 in #11107
[Misc]Suppress irrelevant exception stack trace information when CUDA… by @shiquan1988 in #11438
[Frontend] Online Pooling API by @DarkLight1337 in #11457
[Bugfix] Fix Qwen2-VL LoRA weight loading by @jeejeelee in #11430
[Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope by @Isotr0py in #11434
[OpenVINO] Fixed installation conflicts by @ilya-lavrenov in #11458
[attn][tiny fix] fix attn backend in MultiHeadAttention by @MengqingCao in #11463
[Misc] Move weights mapper by @jeejeelee in #11443
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 by @terrytangyuan in #11435
[Model] Automatic conversion of classification and reward models by @DarkLight1337 in #11469
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor by @ruisearch42 in #11472
[Misc] Update disaggregation benchmark scripts and test logs by @Jeffwan in #11456
[Frontend] Enable decord to load video from base64 by @DarkLight1337 in #11492
[Doc] Improve GitHub links by @DarkLight1337 in #11491
[Misc] Move ...

Contributors

zhouyuan, n1hility, and 39 other contributors

Assets 3

17 Dec 23:10

github-actions

v0.6.5

2d1b9ba

v0.6.5

Highlights

Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #10699, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242).
Major improvements in torch.compile integration: Support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance optimizations (#10558, #10613, #10121, #10383, #10399, #10406, #10437, #10460, #10552, #10622, #10722, #10620, #10906, #11108, #11059, #11005, #10838, #11081, #11110).
Expanded model support, including Aria, Cross Encoders, GLM-4, OLMo November 2024, Telechat2, LoRA improvements and multimodal Granite models (#10514, #10400, #10561, #10503, #10311, #10291, #9057, #10418, #5064).
Use xgrammar as the default guided decoding backend (#10785)
Improved hardware enablement for AMD ROCm, ARM AARCH64, TPU prefix caching, XPU AWQ/GPTQ, and various CPU/Gaudi/HPU/NVIDIA enhancements (#10254, #9228, #10307, #10107, #10667, #10565, #10239, #11016, #9735, #10355, #10700).
Note: Changed default temperature for ChatCompletionRequest from 0.7 to 1.0 to align with OpenAI (#11219)

Model Support

Added Aria (#10514), Cross Encoder (#10400), GLM-4 (#10561), OLMo (#10503), Telechat2 (#10311), Cohere R7B (#11203), GritLM embeddings (#10816)
LoRA support for Internlm2, glm-4v, Pixtral-HF (#5064, #10418, #10795).
Improved quantization (BNB, bitsandbytes) for multiple models (#10795, #10842, #10682, #10549)
Expanded multimodal support (#10291, #11142).

Hardware Support

AMD ROCm GGUF quantization (#10254), ARM AARCH64 enablement (#9228), TPU prefix caching (#10307), XPU AWQ/GPTQ (#10107), CPU/Gaudi/HPU enhancements (#10355, #10667, #10565, #10239, #11016, #9735, #10541, #10394, #10700).

Performance & Scheduling

Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).

Benchmark & Frontend

Benchmark structured outputs and vision datasets (#10804, #10557, #10880, #10547).
Frontend: Automatic chat format detection (#9919), input_audio support (#11027), CLI --version (#10369), extra fields in requests (#10463).

Documentation & Plugins

Architecture overview (#10368), Helm chart (#9199), KubeAI integration (#10837), plugin system docs (#10372), disaggregated prefilling (#11197), structured outputs (#9943), usage section (#10827).

Bugfixes & Misc

Updated defaults for chunked prefill (#10544)
Add GH200 support (#11212, #11244)

What's Changed

Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
[Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
[Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
[Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
[Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
[core][misc] keep compatibility for old-style classes by @youkaichao in #10356
[Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
[Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
[Misc] bump mistral common version by @simon-mo in #10367
[Docs] Add Nebius as sponsors by @simon-mo in #10371
[Frontend] Add --version flag to CLI by @russellb in #10369
[Doc] Move PR template content to docs by @russellb in #10159
[Docs] Misc updates to TPU installation instructions by @mikegre-google in #10165
[Frontend] Automatic detection of chat content format from AST by @DarkLight1337 in #9919
[doc] add doc for the plugin system by @youkaichao in #10372
[misc][plugin] improve log messages by @youkaichao in #10386
[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel by @rasmith in #10385
[Misc] Update benchmark to support image_url file or http by @kakao-steve-ai in #10287
[Misc] Medusa supports custom bias by @skylee-01 in #10361
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled by @imkero in #10388
[V1] Add code owners for V1 by @WoosukKwon in #10397
[2/N][torch.compile] make compilation cfg part of vllm cfg by @youkaichao in #10383
[V1] Refactor model executable interface for all text-only language models by @ywang96 in #10374
[CI/Build] Fix IDC hpu [Device not found] issue by @xuechendi in #10384
[Bugfix][Hardware][CPU] Fix CPU embedding runner with tensor parallel by @Isotr0py in #10394
[platforms] refactor cpu code by @youkaichao in #10402
[Hardware] [HPU]add mark_step for hpu by @jikunshang in #10239
[Bugfix] Fix mrope_position_delta in non-last prefill chunk by @imkero in #10403
[Misc] Enhance offline_inference to support user-configurable paramet… by @wchen61 in #10392
[Misc] Add uninitialized params tracking for AutoWeightsLoader by @Isotr0py in #10327
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU by @HollowMan6 in #10375
[4/N][torch.compile] clean up set_torch_compile_backend by @youkaichao in #10401
[VLM] Report multi_modal_placeholders in output by @lk-chen in #10407
[Model] Remove redundant softmax when using PoolingType.STEP by @Maybewuss in #10415
[Model][LoRA]LoRA support added for glm-4v by @B-201 in #10418
[Model] Remove transformers attention porting in VITs by @Isotr0py in #10414
[Doc] Update doc for LoRA support in GLM-4V by @B-201 in #10425
[5/N][torch.compile] torch.jit.script --> torch.compile by @youkaichao in #10406
[Doc] Add documentation for Structured Outputs by @ismael-dm in #9943
Fix open_collective value in FUNDING.yml by @andrew in #10426
[Model][Bugfix] Support TP for PixtralHF ViT by @mgoin in #10405
[Hardware][XPU] AWQ/GPTQ support for xpu backend by @yma11 in #10107
[Kernel] Explicitly specify other value in tl.load calls by @angusYuhao in #9014
[Kernel] Initial Machete W4A8 support + Refactors by @LucasWilkinson in #9855
[3/N][torch.compile] consolidate custom op logging by @youkaichao in #10399
[ci][bugfix] fix kernel tests by @youkaichao in #10431
[misc] Allow partial prefix benchmarking & random input generation for prefix benchmarking by @rickyyx in #9929
[ci/build] Have dependabot ignore all patch update by @khluu in #10436
[Bugfix]Fix Phi-3 BNB online quantization by @jeejeelee in #10417
[Platform][Refactor] Extract func get_default_attn_backend to Platform by @MengqingCao in #10358
Add openai.beta.chat.completions.parse example to structured_outputs.rst by @mgoin in #10433
[Bugfix] Guard for negative counter metrics to prevent crash by @tjohnson31415 in #10430
[Misc] Avoid misleading warning messages by @jeejeelee in #10438
[Doc] Add the start of an arch overview page by @russellb in #10368
[misc][plugin] improve plugin loading by @youkaichao in #10443
[CI][CPU] adding numa node number as container name suffix by @zhouyuan in #10441
[BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) by @xiyuan-lee in #10398
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter by @patrickvonplaten in #10449
Fix: Build error seen on Power Architecture by @mikejuliet13 in #10421
[Doc] fix link for page that was renamed by @russellb in #10455
[6/N] to...

Contributors

andrew, markmc, and 121 other contributors

Assets 3

15 Nov 17:50

github-actions

v0.6.4.post1

a6221a1

v0.6.4.post1

This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363), keep compatibility for vLLMConfig usage in out of tree models (#10356)

What's Changed

Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
[Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
[Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
[Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
[Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
[core][misc] keep compatibility for old-style classes by @youkaichao in #10356
[Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
[Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
[Misc] bump mistral common version by @simon-mo in #10367

New Contributors

@wchen61 made their first contribution in #10347

Full Changelog: v0.6.4...v0.6.4.post1

Contributors

gcalmettes, jeejeelee, and 7 other contributors

Assets 3

15 Nov 07:32

github-actions

v0.6.4

02dbf30

v0.6.4

Highlights

Significant progress in V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can checkout more details regarding the design and plan ahead in our recent meetup slides
Signficant progress in torch.compile support. Many models now support torch compile with TorchInductor. You can checkout our meetup slides for more details. (#9775, #9614, #9639, #9641, #9876, #9946, #9589, #9896, #9637, #9300, #9947, #9138, #9715, #9866, #9632, #9858, #9889)

Model Support

New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
New encoder-decoder embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944)
- Add user-configurable --task parameter for models that support both generation and embedding (#9424)
- Chat-based Embeddings API (#9759)
Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
Unified multi-modal processor for VLM (#10040, #10044)
Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)

Hardware Support

Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
CPU: Add embedding models support for CPU backend (#10193)
TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)

Performance

Combine chunked prefill with speculative decoding (#9291)
fused_moe Performance Improvement (#9384)

Engine Core

Override HF config.json via CLI (#5836)
Add goodput metric support (#9338)
Move parallel sampling out from vllm core, paving way for V1 engine (#9302)
Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)

Others

Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
Dropped support for Python 3.8 (#10038, #8464)
Basic Integration Test For TPU (#9968)
Document the class hierarchy in vLLM (#10240), explain the integration with Hugging Face (#10173).
Benchmark throughput now supports image input (#9851)

What's Changed

[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
[Frontend] merge beam search implementations by @LunrEclipse in #9296
[Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
[Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
[Frontend] Clarify model_type error messages by @stevegrubb in #9345
[Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
[Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
[BugFix] Fix chat API continuous usage stats by @njhill in #9357
pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
[Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
[Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
[Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
[Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
[Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
[CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
[Core] Rename input data types by @DarkLight1337 in #8688
[Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
[Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
Support mistral interleaved attn by @patrickvonplaten in #9414
[Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
[Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
[CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
Add notes on the use of Slack by @terrytangyuan in #9442
[Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
[Misc] Print stack trace using logger.exception by @DarkLight1337 in #9461
[misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
[Bugfix] Allow prefill of assistant response when using mistral_common by @sasha0552 in #9446
[TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
[Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
[CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
[Misc] Remove commit id file by @DarkLight1337 in #9470
[torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
[Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
[Bugfix] Print warnings related to mistral_common tokenizer only once by @sasha0552 in #9468
[Hardwware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
Support BERTModel (first encoder-only embedding model) by @robertgshaw2-neuralmagic in #9056
[BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
[Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
[CI/Build] Use commit hash references for github actions by @russellb in #9430
[BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
[Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
[BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
[CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
[Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
[Misc] benchmark: Add option to set max concurrency by @russellb in #9390
[Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
[CI/Build] Add error matching config for mypy by @russellb in #9512
[Model] Support Pixtral models ...

Contributors

zhouyuan, rasmith, and 147 other contributors

Assets 3

17 Oct 17:26

github-actions

v0.6.3.post1

a2c71c5

v0.6.3.post1

Highlights

New Models

Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
Support multiple and interleaved images for Llama3.2 (#9095)
Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)

Important bug fix

Fix chat API continuous usage stats (#9357)
Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
Fix Molmo text-only input bug (#9397)
Fix CUDA 11.8 Build (#9386)
Fix _version.py not found issue (#9375)

Other Enhancements

Remove block manager v1 and make block manager v2 default (#8704)
Spec Decode Optimize ngram lookup performance (#9333)

What's Changed

[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
[Frontend] merge beam search implementations by @LunrEclipse in #9296
[Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
[Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
[Frontend] Clarify model_type error messages by @stevegrubb in #9345
[Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
[Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
[BugFix] Fix chat API continuous usage stats by @njhill in #9357
pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
[Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
[Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
[Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
[Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
[Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
[CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
[Core] Rename input data types by @DarkLight1337 in #8688
[Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
[Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
Support mistral interleaved attn by @patrickvonplaten in #9414
[Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
[Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
[CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
Add notes on the use of Slack by @terrytangyuan in #9442
[Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
[Misc] Print stack trace using logger.exception by @DarkLight1337 in #9461
[misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
[Bugfix] Allow prefill of assistant response when using mistral_common by @sasha0552 in #9446
[TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
[Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
[CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375

New Contributors

@gracehonv made their first contribution in #9349
@streaver91 made their first contribution in #9396

Full Changelog: v0.6.3...v0.6.3.post1

Contributors

rasmith, russellb, and 24 other contributors

Assets 2

Releases: vllm-project/vllm

v0.7.3

Highlights

Model Support

Hardware Support

Engine Feature

Performance

Others

What's Changed

Contributors

v0.7.2

Highlights

Core Engine

Security Update

Other

What's Changed

New Contributors

Contributors

v0.7.1

Highlights

V1

Models

Hardwares

Others

What's Changed

New Contributors

Contributors

v0.7.0

Highlights

Features

Others

What's Changed

Contributors

v0.6.6.post1

What's Changed

Contributors

v0.6.6

Highlights

Model Support

Performance

Production Engine

Others

What's Changed

Contributors

v0.6.5

Highlights

Model Support

Hardware Support

Performance & Scheduling

Benchmark & Frontend

Documentation & Plugins

Bugfixes & Misc

What's Changed

Contributors

v0.6.4.post1

What's Changed

New Contributors

Contributors

v0.6.4

Highlights

Model Support

Hardware Support

Performance

Engine Core

Others

What's Changed

Contributors

v0.6.3.post1

Highlights

New Models

Important bug fix

Other Enhancements

What's Changed

New Contributors

Contributors