Releases: vllm-project/vllm
v0.6.3.post1
Highlights
New Models
- Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
- Support multiple and interleaved images for Llama3.2 (#9095)
- Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)
Important bug fix
- Fix chat API continuous usage stats (#9357)
- Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
- Fix Molmo text-only input bug (#9397)
- Fix CUDA 11.8 Build (#9386)
- Fix
_version.py
not found issue (#9375)
Other Enhancements
- Remove block manager v1 and make block manager v2 default (#8704)
- Spec Decode Optimize ngram lookup performance (#9333)
What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using
logger.exception
by @DarkLight1337 in #9461 - [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using
mistral_common
by @sasha0552 in #9446 - [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
New Contributors
- @gracehonv made their first contribution in #9349
- @streaver91 made their first contribution in #9396
Full Changelog: v0.6.3...v0.6.3.post1
v0.6.3
Highlights
Model Support
- New Models:
- Expansion in functionality:
- Out of tree support enhancements: Explicit interface for vLLM models and support OOT embedding models (#9108)
Documentation
- New compatibility matrix for mutual exclusive features (#8512)
- Reorganized installation doc, note that we publish a per-commit docker image (#8931)
Hardware Support:
- Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
- Support AWQ for CPU backend (#7515)
- Add async output processor for xpu (#8897)
- Add on-device sampling support for Neuron (#8746)
Architectural Enhancements
- Progress in vLLM's refactoring to a core core:
- Spec decode removing batch expansion (#8839, #9298).
- We have made block manager V2 the default. This is an internal refactoring for cleaner and more tested code path (#8678).
- Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
- Move guided decoding params into sampling params (#8252)
- Torch Compile:
- You can now set an env var
VLLM_TORCH_COMPILE_LEVEL
to controltorch.compile
various levels of compilation control and integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), usingVLLM_TORCH_COMPILE_LEVEL=3
can turn on Inductor's full graph compilation without vLLM's custom ops.
- You can now set an env var
Others
- Performance enhancements to turn on multi-step scheeduling by default (#8804, #8645, #8378)
- Enhancements towards priority scheduling (#8965, #8956, #8850)
What's Changed
- [Misc] Update config loading for Qwen2-VL and remove Granite by @ywang96 in #8837
- [Build/CI] Upgrade to gcc 10 in the base build Docker image by @tlrmchlsmth in #8814
- [Docs] Add README to the build docker image by @mgoin in #8825
- [CI/Build] Fix missing ci dependencies by @fyuan1316 in #8834
- [misc][installation] build from source without compilation by @youkaichao in #8818
- [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM by @khluu in #8872
- [Bugfix] Include encoder prompts len to non-stream api usage response by @Pernekhan in #8861
- [Misc] Change dummy profiling and BOS fallback warns to log once by @mgoin in #8820
- [Bugfix] Fix print_warning_once's line info by @tlrmchlsmth in #8867
- fix validation: Only set tool_choice
auto
if at least one tool is provided by @chiragjn in #8568 - [Bugfix] Fixup advance_step.cu warning by @tlrmchlsmth in #8815
- [BugFix] Fix test breakages from transformers 4.45 upgrade by @njhill in #8829
- [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility by @DarkLight1337 in #8764
- [Feature] Add support for Llama 3.1 and 3.2 tool use by @maxdebayser in #8343
- [Core] Rename
PromptInputs
andinputs
with backward compatibility by @DarkLight1337 in #8876 - [misc] fix collect env by @youkaichao in #8894
- [MISC] Fix invalid escape sequence '' by @panpan0000 in #8830
- [Bugfix][VLM] Fix Fuyu batching inference with
max_num_seqs>1
by @Isotr0py in #8892 - [TPU] Update pallas.py to support trillium by @bvrockwell in #8871
- [torch.compile] use empty tensor instead of None for profiling by @youkaichao in #8875
- [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method by @ProExpertProg in #7271
- [Bugfix] fix for deepseek w4a16 by @LucasWilkinson in #8906
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path by @varun-sundar-rabindranath in #8378
- [misc][distributed] add VLLM_SKIP_P2P_CHECK flag by @youkaichao in #8911
- [Core] Priority-based scheduling in async engine by @schoennenbeck in #8850
- [misc] fix wheel name by @youkaichao in #8919
- [Bugfix][Intel] Fix XPU Dockerfile Build by @tylertitsworth in #7824
- [Misc] Remove vLLM patch of
BaichuanTokenizer
by @DarkLight1337 in #8921 - [Bugfix] Fix code for downloading models from modelscope by @tastelikefeet in #8443
- [Bugfix] Fix PP for Multi-Step by @varun-sundar-rabindranath in #8887
- [CI/Build] Update models tests & examples by @DarkLight1337 in #8874
- [Frontend] Make beam search emulator temperature modifiable by @nFunctor in #8928
- [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 by @heheda12345 in #8891
- [doc] organize installation doc and expose per-commit docker by @youkaichao in #8931
- [Core] Improve choice of Python multiprocessing method by @russellb in #8823
- [Bugfix] Block manager v2 with preemption and lookahead slots by @sroy745 in #8824
- [Bugfix] Fix Marlin MoE act order when is_k_full == False by @ElizaWszola in #8741
- [CI/Build] Add test decorator for minimum GPU memory by @DarkLight1337 in #8925
- [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching by @tlrmchlsmth in #8930
- [Model] Support Qwen2.5-Math-RM-72B by @zhuzilin in #8896
- [Model][LoRA]LoRA support added for MiniCPMV2.5 by @jeejeelee in #7199
- [BugFix] Fix seeded random sampling with encoder-decoder models by @njhill in #8870
- [Misc] Fix typo in BlockSpaceManagerV1 by @juncheoll in #8944
- [Frontend] Added support for HF's new
continue_final_message
parameter by @danieljannai21 in #8942 - [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model by @mzusman in #8533
- [Model] support input embeddings for qwen2vl by @whyiug in #8856
- [Misc][CI/Build] Include
cv2
viamistral_common[opencv]
by @ywang96 in #8951 - [Model][LoRA]LoRA support added for MiniCPMV2.6 by @jeejeelee in #8943
- [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg by @Isotr0py in #8946
- [Core] Make scheduling policy settable via EngineArgs by @schoennenbeck in #8956
- [Misc] Adjust max_position_embeddings for LoRA compatibility by @jeejeelee in #8957
- [ci] Add CODEOWNERS for test directories by @khluu in #8795
- [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. by @LiuXiaoxuanPKU in #8975
- [Frontend][Core] Move guided decoding params into sampling params by @joerunde in #8252
- [CI/Build] Fix machete generated kernel files ordering by @khluu in #8976
- [torch.compile] fix tensor alias by @youkaichao in #8982
- [Misc] add process_weights_after_loading for DummyLoader by @divakar-amd in #8969
- [Bugfix] Fix Fuyu tensor parallel inference by @Isotr0py in #8986
- [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders by @alex-jw-brooks in #8991
- [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API by @schoennenbeck in #8965
- [Doc] Update list of supported models by @DarkLight1337 in #8987
- Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows by @vlsav in https://github.com...
v0.6.2
Highlights
Model Support
-
Support Llama 3.2 models (#8811, #8822)
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
-
Beam search have been soft deprecated. We are moving towards a version of beam search that's more performant and also simplifying vLLM's core. (#8684, #8763, #8713)
-
⚠️ You will see the following error now, this is breaking change!Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the
vllm.LLM.use_beam_search
method for dedicated beam search instead, or set the environment variableVLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1
to suppress this error. For more details, see #8306
-
-
Support for Solar Model (#8386), minicpm3 (#8297), LLaVA-Onevision model support (#8486)
-
Enhancements: pp for qwen2-vl (#8696), multiple images for qwen-vl (#8247), mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
Hardware Support
- TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
- CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
- AMD: custom paged attention kernel for rocm (#8310), and fp8 kv cache support (#8577)
Production Engine
- Initial support for priority sheduling (#5958)
- Support Lora lineage and base model metadata management (#6315)
- Batch inference for llm.chat() API (#8648)
Performance
- Introduce
MQLLMEngine
for API Server, boost throughput 30% in single step and 7% in multistep (#8157, #8761, #8584) - Multi-step scheduling enhancements
- Add cuda graph support during decoding for encoder-decoder models (#7631)
Others
- Support sample from HF datasets and image input for benchmark_serving (#8495)
- Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)
What's Changed
- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440
- [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
- [doc] recommend pip instead of conda by @youkaichao in #8446
- [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
- [misc][ci] fix quant test by @youkaichao in #8449
- [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
- [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
- [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
- [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
- [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
- bump version to v0.6.1.post2 by @simon-mo in #8473
- [Hardware][intel GPU] bump up ipex version to 2.3 by @jikunshang in #8365
- [Kernel][Hardware][Amd]Custom paged attention kernel for rocm by @charlifu in #8310
- [Model] support minicpm3 by @SUDA-HLT-ywfang in #8297
- [torch.compile] fix functionalization by @youkaichao in #8480
- [torch.compile] add a flag to disable custom op by @youkaichao in #8488
- [TPU] Implement multi-step scheduling by @WoosukKwon in #8489
- [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations by @chrisociepa in #8490
- [Bugfix][Kernel] Add
IQ1_M
quantization implementation to GGUF kernel by @Isotr0py in #8357 - [Kernel] Enable 8-bit weights in Fused Marlin MoE by @ElizaWszola in #8032
- [Frontend] Expose revision arg in OpenAI server by @lewtun in #8501
- [BugFix] Fix clean shutdown issues by @njhill in #8492
- [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel by @sasha0552 in #8506
- [Kernel] AQ AZP 3/4: Asymmetric quantization kernels by @ProExpertProg in #7270
- [doc] update doc on testing and debugging by @youkaichao in #8514
- [Bugfix] Bind api server port before starting engine by @kevin314 in #8491
- [perf bench] set timeout to debug hanging by @simon-mo in #8516
- [misc] small qol fixes for release process by @simon-mo in #8517
- [Bugfix] Fix 3.12 builds on main by @joerunde in #8510
- [refactor] remove triton based sampler by @simon-mo in #8524
- [Frontend] Improve Nullable kv Arg Parsing by @alex-jw-brooks in #8525
- [Misc][Bugfix] Disable guided decoding for mistral tokenizer by @ywang96 in #8521
- [torch.compile] register allreduce operations as custom ops by @youkaichao in #8526
- [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change by @ruisearch42 in #8509
- [Benchmark] Support sample from HF datasets and image input for benchmark_serving by @Isotr0py in #8495
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models by @sroy745 in #7631
- [Feature][kernel] tensor parallelism with bitsandbytes quantizati...
v0.6.1.post2
Highlights
- This release contains an important bugfix related to token streaming combined with stop string (#8468)
What's Changed
- [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
- [doc] recommend pip instead of conda by @youkaichao in #8446
- [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
- [misc][ci] fix quant test by @youkaichao in #8449
- [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
- [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
- [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
- [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
- [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
- bump version to v0.6.1.post2 by @simon-mo in #8473
Full Changelog: v0.6.1.post1...v0.6.1.post2
v0.6.1.post1
Highlights
This release features important bug fixes and enhancements for
- Pixtral models. (#8415, #8425, #8399, #8431)
- Chunked scheduling has been turned off for vision models. Please replace
--max_num_batched_tokens 16384
with--max-model-len 16384
- Chunked scheduling has been turned off for vision models. Please replace
- Multistep scheduling. (#8417, #7928, #8427)
- Tool use. (#8423, #8366)
Also
- support multiple images for qwen-vl (#8247)
- removes
engine_use_ray
(#8126) - add engine option to return only deltas or final output (#7381)
- add bitsandbytes support for Gemma2 (#8338)
What's Changed
- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440
New Contributors
- @blueyo0 made their first contribution in #8338
- @lnykww made their first contribution in #8403
- @vegaluisjose made their first contribution in #8423
Full Changelog: v0.6.1...v0.6.1.post1
v0.6.1
Highlights
Model Support
- Added support for Pixtral (
mistralai/Pixtral-12B-2409
). (#8377, #8168) - Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
- Multi-input support for LLaVA (#8238), InternVL2 models (#8201)
Performance Enhancements
- Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248)
Production Engine
- Support load and unload LoRA in api server (#6566)
- Add progress reporting to batch runner (#8060)
- Add support for NVIDIA ModelOpt static scaling checkpoints. (#6112)
Others
- Update the docker image to use Python 3.12 for small performance bump. (#8133)
- Added CODE_OF_CONDUCT.md (#8161)
What's Changed
- [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
- [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
- [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
- [ci] Mark LoRA test as soft-fail by @khluu in #8160
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
- [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
- Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
- [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
- Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
- [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
- [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
- [BugFix] Fix Granite model configuration by @njhill in #8216
- [Frontend] Add --logprobs argument to
benchmark_serving.py
by @afeldman-nm in #8191 - [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
- [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
- [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
- [Misc] Remove
SqueezeLLM
by @dsikka in #8220 - [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
- [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
- [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
- [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
- Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
- [tpu][misc] fix typo by @youkaichao in #8260
- [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
- [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
- [Model][VLM] Decouple weight loading logic for
Paligemma
by @Isotr0py in #8269 - ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
- [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
- [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
- [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
- [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
- [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
- [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
- [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
- Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
- [Bugfix] Fix missing
post_layernorm
in CLIP by @DarkLight1337 in #8155 - [CI/Build] enable ccache/scccache for HIP builds by @dtrifiro in #8327
- [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
- [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail by @alexeykondrat in #8130
- Fix ppc64le buildkite job by @sumitd2 in #8309
- [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
- [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
- [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
- [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
- [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
- [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
- [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
- [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
- [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
- [model] Support for Llava-Next-Video model by @TKONIY in #7559
- [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
- [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
- [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
- [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
- [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
- [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
- [Misc] Move device options to a single place by @akx in #8322
- [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
- Pixtral by @patrickvonplaten in #8377
- Bump version to v0.6.1 by @simon-mo in #8379
New Contributors
- @mmcelaney made their first contribution in #8161
- @elfiegg made their first contribution in #8173
- @Manikandan-Thangaraj-ZS0321 made their first contribution in #7860
- @sumitd2 made their first contribution in #8026
- @alugowski made their first contribution in #8060
- @vladislavkruglikov made their first contribution in #8292
- @kevin314 made their first contribution in #8224
- @TKONIY made their first contribution in #7559
- @akx made their first contribution in #8322
Full Changelog: v0.6.0...v0.6.1
v0.6.0
Highlights
Performance Update
- We are excited to announce a faster vLLM delivering 2x more throughput compared to v0.5.3. The default parameters should achieve great speed up, but we recommend also try out turning on multi step scheduling. You can do so by setting
--num-scheduler-steps 8
in the engine arguments. Please note that it still have some limitations and being actively hardened, see #7528 for known issues.- Multi-step scheduler now supports LLMEngine and log_probs (#7789, #7652)
- Asynchronous output processor overlaps the output data structures construction with GPU works, delivering 12% throughput increase. (#7049, #7911, #7921, #8050)
- Using FlashInfer backend for FP8 KV Cache (#7798, #7985), rejection sampling in Speculative Decoding (#7244)
Model Support
- Support bitsandbytes 8-bit and FP4 quantized models (#7445)
- New LLMs: Exaone (#7819), Granite (#7436), Phi-3.5-MoE (#7729)
- A new tokenizer mode for mistral models to use the native mistral-commons package (#7739)
- Multi-modality:
Hardware Support
- NVIDIA GPU: extend cuda graph size for H200 (#7894)
- AMD: Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)
- Intel GPU: pipeline parallel support (#7810)
- Neuron: context lengths and token generation buckets (#7885, #8062)
- TPU: single and multi-host TPUs on GKE (#7613), Async output processing (#8011)
Production Features
- OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models! (#5649)
- Add json_schema support from OpenAI protocol (#7654)
- Enable chunked prefill and prefix caching together (#7753, #8120)
- Multimodal support in offline chat (#8098), and multiple multi-modal items in the OpenAI frontend (#8049)
Misc
- Support benchmarking async engine in benchmark_throughput.py (#7964)
- Progress in integration with
torch.compile
: avoid Dynamo guard evaluation overhead (#7898), skip compile for profiling (#7796)
What's Changed
- [Core] Add multi-step support to LLMEngine by @alexm-neuralmagic in #7789
- [Bugfix] Fix run_batch logger by @pooyadavoodi in #7640
- [Frontend] Publish Prometheus metrics in run_batch API by @pooyadavoodi in #7641
- [Frontend] add json_schema support from OpenAI protocol by @rockwotj in #7654
- [misc][core] lazy import outlines by @youkaichao in #7831
- [ci][test] exclude model download time in server start time by @youkaichao in #7834
- [ci][test] fix RemoteOpenAIServer by @youkaichao in #7838
- [Bugfix] Fix Phi-3v crash when input images are of certain sizes by @zifeitong in #7840
- [Model][VLM] Support multi-images inputs for Phi-3-vision models by @Isotr0py in #7783
- [Misc] Remove snapshot_download usage in InternVL2 test by @Isotr0py in #7835
- [misc][cuda] improve pynvml warning by @youkaichao in #7852
- [Spec Decoding] Streamline batch expansion tensor manipulation by @njhill in #7851
- [Bugfix]: Use float32 for base64 embedding by @HollowMan6 in #7855
- [CI/Build] Avoid downloading all HF files in
RemoteOpenAIServer
by @DarkLight1337 in #7836 - [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule by @comaniac in #7822
- [Misc] Update
qqq
to use vLLMParameters by @dsikka in #7805 - [Misc] Update
gptq_marlin_24
to use vLLMParameters by @dsikka in #7762 - [misc] fix custom allreduce p2p cache file generation by @youkaichao in #7853
- [Bugfix] neuron: enable tensor parallelism by @omrishiv in #7562
- [Misc] Update compressed tensors lifecycle to remove
prefix
fromcreate_weights
by @dsikka in #7825 - [Core] Asynchronous Output Processor by @megha95 in #7049
- [Tests] Disable retries and use context manager for openai client by @njhill in #7565
- [core][torch.compile] not compile for profiling by @youkaichao in #7796
- Revert #7509 by @comaniac in #7887
- [Model] Add Mistral Tokenization to improve robustness and chat encoding by @patrickvonplaten in #7739
- [CI/Build][VLM] Cleanup multiple images inputs model test by @Isotr0py in #7897
- [Hardware][Intel GPU] Add intel GPU pipeline parallel support. by @jikunshang in #7810
- [CI/Build][ROCm] Enabling tensorizer tests for ROCm by @alexeykondrat in #7237
- [Bugfix] Fix phi3v incorrect image_idx when using async engine by @Isotr0py in #7916
- [cuda][misc] error on empty CUDA_VISIBLE_DEVICES by @youkaichao in #7924
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7766
- [benchmark] Update TGI version by @philschmid in #7917
- [Model] Add multi-image input support for LLaVA-Next offline inference by @zifeitong in #7230
- [mypy] Enable mypy type checking for
vllm/core
by @jberkhahn in #7229 - [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt by @petersalas in #7902
- [hardware][rocm] allow rocm to override default env var by @youkaichao in #7926
- [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. by @bnellnm in #7886
- [mypy][CI/Build] Fix mypy errors by @DarkLight1337 in #7929
- [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) by @alexm-neuralmagic in #7911
- [Performance] Enable chunked prefill and prefix caching together by @comaniac in #7753
- [ci][test] fix pp test failure by @youkaichao in #7945
- [Doc] fix the autoAWQ example by @stas00 in #7937
- [Bugfix][VLM] Fix incompatibility between #7902 and #7230 by @DarkLight1337 in #7948
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. by @pavanimajety in #7798
- [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ by @rasmith in #7386
- [TPU] Upgrade PyTorch XLA nightly by @WoosukKwon in #7967
- [Doc] fix 404 link by @stas00 in #7966
- [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM by @mzusman in #7651
- [Bugfix] Make torch registration of punica ops optional by @bnellnm in #7970
- [torch.compile] avoid Dynamo guard evaluation overhead by @youkaichao in #7898
- Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test by @mgoin in #7961
- [Frontend] Minor optimizations to zmq decoupled front-end by @njhill in #7957
- [torch.compile] remove reset by @youkaichao in #7975
- [VLM][Core] Fix exceptions on ragged NestedTensors by @petersalas in #7974
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." by @youkaichao in #7982
- [Bugfix] Unify rank computation across regular decoding and speculative decoding by @jmkuebler in #7899
- [Core] Combine async postprocessor and multi-step by @alexm-neuralmagic in #7921
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto by @pavanimajety in #7985
- extend cuda graph size for H200 by @kushanam in #7894
- [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism by @Isotr0py in #7954
- [misc] update tpu int8 to use new vLLM Parameters by @dsikka in #7973
- [Neuron] Adding support for context-lenght, token-gen buckets. by @hbikki in #7885
- support bitsandbytes 8-bit and FP4 quantized models by @chenqianfzh in #7445
- Add more percentiles and latencies by @...
v0.5.5
Highlights
Performance Update
- We introduced a new mode that schedule multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial result shows 20% improvements in QPS for a single GPU running 8B and 30B models. You can set
--num-scheduler-steps 8
as a parameter to the API server (viavllm serve
) orAsyncLLMEngine
. We are working on expanding the coverage toLLM
class and aiming to turning it on by default - Various enhancements:
Model Support
- Support Jamba 1.5 (#7415, #7601, #6739)
- Support for the first audio model
UltravoxModel
(#7615, #7446) - Improvements to vision models:
- Support loading GGUF model (#5191) with tensor parallelism (#7520)
- Progress in encoder decoder models: support for serving encoder/decoder models (#7258), and architecture for cross-attention (#4942)
Hardware Support
- AMD: Add fp8 Linear Layer for rocm (#7210)
- Enhancements to TPU support: load time W8A16 quantization (#7005), optimized rope (#7635), and support multi-host inference (#7457).
- Intel: various refactoring for worker, executor, and model runner (#7686, #7712)
Others
- Optimize prefix caching performance (#7193)
- Speculative decoding
- Entrypoints
- Quantizations
torch.compile
: register custom ops for kernels (#7591, #7594, #7536)
What's Changed
- [ci][frontend] deduplicate tests by @youkaichao in #7101
- [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in #7100
- [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in #7129
- [MISC] Use non-blocking transfer in prepare_input by @comaniac in #7172
- [Core] Support loading GGUF model by @Isotr0py in #5191
- [Build] Add initial conditional testing spec by @simon-mo in #6841
- [LoRA] Relax LoRA condition by @jeejeelee in #7146
- [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in #7153
- [BugFix] Fix DeepSeek remote code by @dsikka in #7178
- [ BugFix ] Fix ZMQ when
VLLM_PORT
is set by @robertgshaw2-neuralmagic in #7205 - [Bugfix] add gguf dependency by @kpapis in #7198
- [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in #7183
- [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in #5941
- [Core] Optimize evictor-v2 performance by @xiaobochen123 in #7193
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in #4942
- [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in #7225
- [BugFix] Overhaul async request cancellation by @njhill in #7111
- [Doc] Mock new dependencies for documentation by @ywang96 in #7245
- [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in #7227
- [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in #7238
- [distributed][misc] add specialized method for cuda platform by @youkaichao in #7249
- [Misc] Refactor linear layer weight loading; introduce
BasevLLMParameter
andweight_loader_v2
by @dsikka in #5874 - [ BugFix ] Move
zmq
frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in #7222 - Fixes typo in function name by @rafvasq in #7275
- [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in #7164
- [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in #7251
- [Doc] add online speculative decoding example by @stas00 in #7243
- [BugFix] Fix frontend multiprocessing hang by @maxdebayser in #7217
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in #7219
- [ci] Make building wheels per commit optional by @khluu in #7278
- [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in #7264
- [FrontEnd] Make
merge_async_iterators
is_cancelled
arg optional by @njhill in #7282 - [Doc] Update supported_hardware.rst by @mgoin in #7276
- [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in #7284
- [Misc] Fix typos in scheduler.py by @ruisearch42 in #7285
- [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in #7288
- [Bugfix] Fix LoRA with PP by @andoorve in #7292
- [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in #7273
- [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in #7305
- [Frontend] Kill the server on engine death by @joerunde in #6594
- [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in #6849
- [Doc] Put collect_env issue output in a block by @mgoin in #7310
- [CI/Build] Dockerfile.cpu improvements by @dtrifiro in #7298
- [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in #7269
- [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in #7308
- Add Skywork AI as Sponsor by @simon-mo in #7314
- [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in #7005
- [Core] Support serving encoder/decoder models by @DarkLight1337 in #7258
- [TPU] Fix dockerfile.tpu by @WoosukKwon in #7331
- [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in #7162
- [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in #7218
- [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in #6971
- [Core] Streamline stream termination in
AsyncLLMEngine
by @njhill in #7336 - [Model][Jamba] Mamba cache single buffer by @mzusman in #6739
- [VLM][Doc] Add
stop_token_ids
to InternVL example by @Isotr0py in #7354 - [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in #7364
- [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in #7360
- [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in #7132
- [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in #7372
- [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in #7089
- [Bugfix] Fix
PerTensorScaleParameter
weight loading for fused models by @dsikka in #7376 - [Misc] Add numpy implementation of
compute_slot_mapping
by @Yard1 in #7377 - [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in #7380
- [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in #7392
- [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in #7340
- Updating LM Format Enforcer version to v0.10.6 by @noamgat in https:/...
v0.5.4
Highlights
Model Support
- Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
- Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), MiniCPM-V (#4087, #7122).
- Added H2O Danube3-4b (#6451)
- Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)
Hardware Support
- TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
- Intel CPU: enable multiprocessing and tensor parallelism (#6125)
Performance
We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.
- Separated OpenAI Server's HTTP request handling and model inference loop with
zeromq
. This brought 20% speedup over time to first token and 2x speedup over inter token latency. (#6883) - Used Python's native array data structure speedup padding. This bring 15% throughput enhancement in large batch size scenarios. (#6779)
- Reduce unnecessary compute when logprobs=None. This reduced latency of get log probs from ~30ms to ~5ms in large batch size scenarios. (#6532)
- Optimize
get_seqs
function, bring 2% throughput enhancements. (#7051)
Production Features
- Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964)
- Refactor the punica kernel based on Triton (#5036)
- Support for guided decoding for offline LLM (#6878)
Quantization
- Support W4A8 quantization for vllm (#5218)
- Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
- Support reading bitsandbytes pre-quantized model (#5753)
What's Changed
- [Docs] Announce llama3.1 support by @WoosukKwon in #6688
- [doc][distributed] fix doc argument order by @youkaichao in #6691
- [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in #6694
- [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in #6693
- Bump version to 0.5.3.post1 by @simon-mo in #6696
- [Misc] Add ignored layers for
fp8
quantization by @mgoin in #6657 - [Frontend] Add Usage data in each chunk for chat_serving. #6540 by @yecohn in #6652
- [Model] Pipeline Parallel Support for DeepSeek v2 by @tjohnson31415 in #6519
- Bump
transformers
version for Llama 3.1 hotfix and patch Chameleon by @ywang96 in #6690 - [build] relax wheel size limit by @youkaichao in #6704
- [CI] Add smoke test for non-uniform AutoFP8 quantization by @mgoin in #6702
- [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by @tdoublep in #6645
- [bitsandbytes]: support read bnb pre-quantized model by @thesues in #5753
- [Bugfix] fix flashinfer cudagraph capture for PP by @SolitaryThinker in #6708
- [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by @njhill in #6714
- [Bugfix] Fix token padding for chameleon by @ywang96 in #6724
- [Docs][ROCm] Detailed instructions to build from source by @WoosukKwon in #6680
- [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by @Alexei-V-Ivanov-AMD in #6711
- [Bugfix]fix modelscope compatible issue by @liuyhwangyh in #6730
- Adding f-string to validation error which is missing by @luizanao in #6748
- [Bugfix] Fix speculative decode seeded test by @njhill in #6743
- [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. by @AllenDou in #6686
- [Frontend] split run_server into build_server and run_server by @dtrifiro in #6740
- [Kernels] Add fp8 support to
reshape_and_cache_flash
by @Yard1 in #6667 - [Core] Tweaks to model runner/input builder developer APIs by @Yard1 in #6712
- [Bugfix] Bump transformers to 4.43.2 by @mgoin in #6752
- [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by @hongxiayang in #6754
- [core][distributed] fix zmq hang by @youkaichao in #6759
- [Frontend] Represent tokens with identifiable strings by @ezliu in #6626
- [Model] Adding support for MiniCPM-V by @HwwwwwwwH in #4087
- [Bugfix] Fix decode tokens w. CUDA graph by @comaniac in #6757
- [Bugfix] Fix awq_marlin and gptq_marlin flags by @alexm-neuralmagic in #6745
- [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by @CatherineSue in #6755
- [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by @HwwwwwwwH in #6787
- [ Misc ]
fp8-marlin
channelwise viacompressed-tensors
by @robertgshaw2-neuralmagic in #6524 - [Bugfix] Fix
kv_cache_dtype=fp8
without scales for FP8 checkpoints by @mgoin in #6761 - [Bugfix] Add synchronize to prevent possible data race by @tlrmchlsmth in #6788
- [Doc] Add documentations for nightly benchmarks by @KuntaiDu in #6412
- [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by @LucasWilkinson in #6798
- [doc][distributed] improve multinode serving doc by @youkaichao in #6804
- [Docs] Publish 5th meetup slides by @WoosukKwon in #6799
- [Core] Fix ray forward_dag error mssg by @rkooo567 in #6792
- [ci][distributed] fix flaky tests by @youkaichao in #6806
- [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by @khluu in #6810
- Fix ReplicatedLinear weight loading by @qingquansong in #6793
- [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by @eaplatanios in #6770
- [Core] Use array to speedup padding by @peng1999 in #6779
- [doc][debugging] add known issues for hangs by @youkaichao in #6816
- [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by @mgoin in #6611
- [Bugfix][Kernel] Promote another index to int64_t by @tlrmchlsmth in #6838
- [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by @WoosukKwon in #6811
- [Misc][TPU] Support TPU in initialize_ray_cluster by @WoosukKwon in #6812
- [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by @bigPYJ1151 in #6125
- [Doc] Add Nemotron to supported model docs by @mgoin in #6843
- [Doc] Update SkyPilot doc for wrong indents and instructions for update service by @Michaelvll in #4283
- Update README.md by @gurpreet-dhami in #6847
- enforce eager mode with bnb quantization temporarily by @chenqianfzh in #6846
- [TPU] Support collective communications in XLA devices by @WoosukKwon in #6813
- [Frontend] Factor out code for running uvicorn by @DarkLight1337 in #6828
- [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by @LucasWilkinson in #6852
- [Bugfix]: Fix Tensorizer test failures by @sangstar in #6835
- [ROCm] Upgrade PyTorch nightly version by @WoosukKwon in #6845
- [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by @omrishiv in #6844
- [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by @tomeras91 in #6784
- [Model] H2O Danube3-4b by @g-eoj in #6451
- [Hardware][TPU] Implement tensor parallelism with Ray by @WoosukKwon in #5871
- [Doc] Add missing mock import to docs
conf.py
by @hmellor in #6834 - [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by @tjohnson31415 in https://github.com...
v0.5.3.post1
Highlights
- We fixed an configuration incompatibility between vLLM (which tested against pre-released version) and the published Meta Llama 3.1 weights (#6693)
What's Changed
- [Docs] Announce llama3.1 support by @WoosukKwon in #6688
- [doc][distributed] fix doc argument order by @youkaichao in #6691
- [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in #6694
- [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in #6693
- Bump version to 0.5.3.post1 by @simon-mo in #6696
Full Changelog: v0.5.3...v0.5.3.post1