Skip to content

Releases: vllm-project/vllm

v0.6.3.post1

17 Oct 17:26
a2c71c5
Compare
Choose a tag to compare

Highlights

New Models

  • Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
  • Support multiple and interleaved images for Llama3.2 (#9095)
  • Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)

Important bug fix

  • Fix chat API continuous usage stats (#9357)
  • Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
  • Fix Molmo text-only input bug (#9397)
  • Fix CUDA 11.8 Build (#9386)
  • Fix _version.py not found issue (#9375)

Other Enhancements

  • Remove block manager v1 and make block manager v2 default (#8704)
  • Spec Decode Optimize ngram lookup performance (#9333)

What's Changed

New Contributors

Full Changelog: v0.6.3...v0.6.3.post1

v0.6.3

14 Oct 20:20
fd47e57
Compare
Choose a tag to compare

Highlights

Model Support

  • New Models:
  • Expansion in functionality:
    • Add Gemma2 embedding model (#9004)
    • Support input embeddings for qwen2vl (#8856), minicpmv (#9237)
    • LoRA:
      • LoRA support for MiniCPMV2.5 (#7199), MiniCPMV2.6 (#8943)
      • Expand lora modules for mixtral (#9008)
    • Pipeline parallelism support to remaining text and embedding models (#7168, #9090)
    • Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (#9148)
    • Tool use:
      • Add support for Llama 3.1 and 3.2 tool use (#8343)
      • Support tool calling for InternLM2.5 (#8405)
  • Out of tree support enhancements: Explicit interface for vLLM models and support OOT embedding models (#9108)

Documentation

  • New compatibility matrix for mutual exclusive features (#8512)
  • Reorganized installation doc, note that we publish a per-commit docker image (#8931)

Hardware Support:

  • Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
  • Support AWQ for CPU backend (#7515)
  • Add async output processor for xpu (#8897)
  • Add on-device sampling support for Neuron (#8746)

Architectural Enhancements

  • Progress in vLLM's refactoring to a core core:
    • Spec decode removing batch expansion (#8839, #9298).
    • We have made block manager V2 the default. This is an internal refactoring for cleaner and more tested code path (#8678).
    • Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
    • Move guided decoding params into sampling params (#8252)
  • Torch Compile:
    • You can now set an env var VLLM_TORCH_COMPILE_LEVEL to control torch.compile various levels of compilation control and integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), using VLLM_TORCH_COMPILE_LEVEL=3 can turn on Inductor's full graph compilation without vLLM's custom ops.

Others

  • Performance enhancements to turn on multi-step scheeduling by default (#8804, #8645, #8378)
  • Enhancements towards priority scheduling (#8965, #8956, #8850)

What's Changed

Read more

v0.6.2

25 Sep 21:50
7193774
Compare
Choose a tag to compare

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822)

     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
    
  • Beam search have been soft deprecated. We are moving towards a version of beam search that's more performant and also simplifying vLLM's core. (#8684, #8763, #8713)

    • ⚠️ You will see the following error now, this is breaking change!

      Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306

  • Support for Solar Model (#8386), minicpm3 (#8297), LLaVA-Onevision model support (#8486)

  • Enhancements: pp for qwen2-vl (#8696), multiple images for qwen-vl (#8247), mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
  • AMD: custom paged attention kernel for rocm (#8310), and fp8 kv cache support (#8577)

Production Engine

  • Initial support for priority sheduling (#5958)
  • Support Lora lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648)

Performance

  • Introduce MQLLMEngine for API Server, boost throughput 30% in single step and 7% in multistep (#8157, #8761, #8584)
  • Multi-step scheduling enhancements
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add flashinfer backend (#7928)
  • Add cuda graph support during decoding for encoder-decoder models (#7631)

Others

  • Support sample from HF datasets and image input for benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

What's Changed

Read more

v0.6.1.post2

13 Sep 18:35
9ba0817
Compare
Choose a tag to compare

Highlights

  • This release contains an important bugfix related to token streaming combined with stop string (#8468)

What's Changed

Full Changelog: v0.6.1.post1...v0.6.1.post2

v0.6.1.post1

13 Sep 04:40
acda0b3
Compare
Choose a tag to compare

Highlights

This release features important bug fixes and enhancements for

  • Pixtral models. (#8415, #8425, #8399, #8431)
    • Chunked scheduling has been turned off for vision models. Please replace --max_num_batched_tokens 16384 with --max-model-len 16384
  • Multistep scheduling. (#8417, #7928, #8427)
  • Tool use. (#8423, #8366)

Also

  • support multiple images for qwen-vl (#8247)
  • removes engine_use_ray (#8126)
  • add engine option to return only deltas or final output (#7381)
  • add bitsandbytes support for Gemma2 (#8338)

What's Changed

New Contributors

Full Changelog: v0.6.1...v0.6.1.post1

v0.6.1

11 Sep 21:44
3fd2b0d
Compare
Choose a tag to compare

Highlights

Model Support

  • Added support for Pixtral (mistralai/Pixtral-12B-2409). (#8377, #8168)
  • Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
  • Multi-input support for LLaVA (#8238), InternVL2 models (#8201)

Performance Enhancements

  • Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248)

Production Engine

  • Support load and unload LoRA in api server (#6566)
  • Add progress reporting to batch runner (#8060)
  • Add support for NVIDIA ModelOpt static scaling checkpoints. (#6112)

Others

  • Update the docker image to use Python 3.12 for small performance bump. (#8133)
  • Added CODE_OF_CONDUCT.md (#8161)

What's Changed

New Contributors

Full Changelog: v0.6.0...v0.6.1

v0.6.0

04 Sep 23:35
32e7db2
Compare
Choose a tag to compare

Highlights

Performance Update

  • We are excited to announce a faster vLLM delivering 2x more throughput compared to v0.5.3. The default parameters should achieve great speed up, but we recommend also try out turning on multi step scheduling. You can do so by setting --num-scheduler-steps 8 in the engine arguments. Please note that it still have some limitations and being actively hardened, see #7528 for known issues.
    • Multi-step scheduler now supports LLMEngine and log_probs (#7789, #7652)
    • Asynchronous output processor overlaps the output data structures construction with GPU works, delivering 12% throughput increase. (#7049, #7911, #7921, #8050)
    • Using FlashInfer backend for FP8 KV Cache (#7798, #7985), rejection sampling in Speculative Decoding (#7244)

Model Support

  • Support bitsandbytes 8-bit and FP4 quantized models (#7445)
  • New LLMs: Exaone (#7819), Granite (#7436), Phi-3.5-MoE (#7729)
  • A new tokenizer mode for mistral models to use the native mistral-commons package (#7739)
  • Multi-modality:
    • multi-image input support for LLaVA-Next (#7230), Phi-3-vision models (#7783)
    • Ultravox support for multiple audio chunks (#7963)
    • TP support for ViTs (#7186)

Hardware Support

  • NVIDIA GPU: extend cuda graph size for H200 (#7894)
  • AMD: Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)
  • Intel GPU: pipeline parallel support (#7810)
  • Neuron: context lengths and token generation buckets (#7885, #8062)
  • TPU: single and multi-host TPUs on GKE (#7613), Async output processing (#8011)

Production Features

  • OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models! (#5649)
  • Add json_schema support from OpenAI protocol (#7654)
  • Enable chunked prefill and prefix caching together (#7753, #8120)
  • Multimodal support in offline chat (#8098), and multiple multi-modal items in the OpenAI frontend (#8049)

Misc

  • Support benchmarking async engine in benchmark_throughput.py (#7964)
  • Progress in integration with torch.compile: avoid Dynamo guard evaluation overhead (#7898), skip compile for profiling (#7796)

What's Changed

  • [Core] Add multi-step support to LLMEngine by @alexm-neuralmagic in #7789
  • [Bugfix] Fix run_batch logger by @pooyadavoodi in #7640
  • [Frontend] Publish Prometheus metrics in run_batch API by @pooyadavoodi in #7641
  • [Frontend] add json_schema support from OpenAI protocol by @rockwotj in #7654
  • [misc][core] lazy import outlines by @youkaichao in #7831
  • [ci][test] exclude model download time in server start time by @youkaichao in #7834
  • [ci][test] fix RemoteOpenAIServer by @youkaichao in #7838
  • [Bugfix] Fix Phi-3v crash when input images are of certain sizes by @zifeitong in #7840
  • [Model][VLM] Support multi-images inputs for Phi-3-vision models by @Isotr0py in #7783
  • [Misc] Remove snapshot_download usage in InternVL2 test by @Isotr0py in #7835
  • [misc][cuda] improve pynvml warning by @youkaichao in #7852
  • [Spec Decoding] Streamline batch expansion tensor manipulation by @njhill in #7851
  • [Bugfix]: Use float32 for base64 embedding by @HollowMan6 in #7855
  • [CI/Build] Avoid downloading all HF files in RemoteOpenAIServer by @DarkLight1337 in #7836
  • [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule by @comaniac in #7822
  • [Misc] Update qqq to use vLLMParameters by @dsikka in #7805
  • [Misc] Update gptq_marlin_24 to use vLLMParameters by @dsikka in #7762
  • [misc] fix custom allreduce p2p cache file generation by @youkaichao in #7853
  • [Bugfix] neuron: enable tensor parallelism by @omrishiv in #7562
  • [Misc] Update compressed tensors lifecycle to remove prefix from create_weights by @dsikka in #7825
  • [Core] Asynchronous Output Processor by @megha95 in #7049
  • [Tests] Disable retries and use context manager for openai client by @njhill in #7565
  • [core][torch.compile] not compile for profiling by @youkaichao in #7796
  • Revert #7509 by @comaniac in #7887
  • [Model] Add Mistral Tokenization to improve robustness and chat encoding by @patrickvonplaten in #7739
  • [CI/Build][VLM] Cleanup multiple images inputs model test by @Isotr0py in #7897
  • [Hardware][Intel GPU] Add intel GPU pipeline parallel support. by @jikunshang in #7810
  • [CI/Build][ROCm] Enabling tensorizer tests for ROCm by @alexeykondrat in #7237
  • [Bugfix] Fix phi3v incorrect image_idx when using async engine by @Isotr0py in #7916
  • [cuda][misc] error on empty CUDA_VISIBLE_DEVICES by @youkaichao in #7924
  • [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7766
  • [benchmark] Update TGI version by @philschmid in #7917
  • [Model] Add multi-image input support for LLaVA-Next offline inference by @zifeitong in #7230
  • [mypy] Enable mypy type checking for vllm/core by @jberkhahn in #7229
  • [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt by @petersalas in #7902
  • [hardware][rocm] allow rocm to override default env var by @youkaichao in #7926
  • [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. by @bnellnm in #7886
  • [mypy][CI/Build] Fix mypy errors by @DarkLight1337 in #7929
  • [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) by @alexm-neuralmagic in #7911
  • [Performance] Enable chunked prefill and prefix caching together by @comaniac in #7753
  • [ci][test] fix pp test failure by @youkaichao in #7945
  • [Doc] fix the autoAWQ example by @stas00 in #7937
  • [Bugfix][VLM] Fix incompatibility between #7902 and #7230 by @DarkLight1337 in #7948
  • [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. by @pavanimajety in #7798
  • [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ by @rasmith in #7386
  • [TPU] Upgrade PyTorch XLA nightly by @WoosukKwon in #7967
  • [Doc] fix 404 link by @stas00 in #7966
  • [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM by @mzusman in #7651
  • [Bugfix] Make torch registration of punica ops optional by @bnellnm in #7970
  • [torch.compile] avoid Dynamo guard evaluation overhead by @youkaichao in #7898
  • Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test by @mgoin in #7961
  • [Frontend] Minor optimizations to zmq decoupled front-end by @njhill in #7957
  • [torch.compile] remove reset by @youkaichao in #7975
  • [VLM][Core] Fix exceptions on ragged NestedTensors by @petersalas in #7974
  • Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." by @youkaichao in #7982
  • [Bugfix] Unify rank computation across regular decoding and speculative decoding by @jmkuebler in #7899
  • [Core] Combine async postprocessor and multi-step by @alexm-neuralmagic in #7921
  • [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto by @pavanimajety in #7985
  • extend cuda graph size for H200 by @kushanam in #7894
  • [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism by @Isotr0py in #7954
  • [misc] update tpu int8 to use new vLLM Parameters by @dsikka in #7973
  • [Neuron] Adding support for context-lenght, token-gen buckets. by @hbikki in #7885
  • support bitsandbytes 8-bit and FP4 quantized models by @chenqianfzh in #7445
  • Add more percentiles and latencies by @...
Read more

v0.5.5

23 Aug 18:37
09c7792
Compare
Choose a tag to compare

Highlights

Performance Update

  • We introduced a new mode that schedule multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial result shows 20% improvements in QPS for a single GPU running 8B and 30B models. You can set --num-scheduler-steps 8 as a parameter to the API server (via vllm serve) or AsyncLLMEngine. We are working on expanding the coverage to LLM class and aiming to turning it on by default
  • Various enhancements:
    • Use flashinfer sampling kernel when avaiable, leading to 7% decoding throughput speedup (#7137)
    • Reduce Python allocations, leading to 24% throughput speedup (#7162, 7364)
    • Improvements to the zeromq based decoupled frontend (#7570, #7716, #7484)

Model Support

  • Support Jamba 1.5 (#7415, #7601, #6739)
  • Support for the first audio model UltravoxModel (#7615, #7446)
  • Improvements to vision models:
    • Support image embeddings as input (#6613)
    • Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
  • Support loading GGUF model (#5191) with tensor parallelism (#7520)
  • Progress in encoder decoder models: support for serving encoder/decoder models (#7258), and architecture for cross-attention (#4942)

Hardware Support

  • AMD: Add fp8 Linear Layer for rocm (#7210)
  • Enhancements to TPU support: load time W8A16 quantization (#7005), optimized rope (#7635), and support multi-host inference (#7457).
  • Intel: various refactoring for worker, executor, and model runner (#7686, #7712)

Others

  • Optimize prefix caching performance (#7193)
  • Speculative decoding
    • Use target model max length as default for draft model (#7706)
    • EAGLE Implementation with Top-1 proposer (#6830)
  • Entrypoints
    • A new chat method in the LLM class (#5049)
    • Support embeddings in the run_batch API (#7132)
    • Support prompt_logprobs in Chat Completion (#7453)
  • Quantizations
    • Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
    • Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
  • torch.compile: register custom ops for kernels (#7591, #7594, #7536)

What's Changed

Read more

v0.5.4

05 Aug 22:38
4db5176
Compare
Choose a tag to compare

Highlights

Model Support

  • Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
  • Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), MiniCPM-V (#4087, #7122).
  • Added H2O Danube3-4b (#6451)
  • Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)

Hardware Support

  • TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
  • Intel CPU: enable multiprocessing and tensor parallelism (#6125)

Performance

We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.

  • Separated OpenAI Server's HTTP request handling and model inference loop with zeromq. This brought 20% speedup over time to first token and 2x speedup over inter token latency. (#6883)
  • Used Python's native array data structure speedup padding. This bring 15% throughput enhancement in large batch size scenarios. (#6779)
  • Reduce unnecessary compute when logprobs=None. This reduced latency of get log probs from ~30ms to ~5ms in large batch size scenarios. (#6532)
  • Optimize get_seqs function, bring 2% throughput enhancements. (#7051)

Production Features

  • Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964)
  • Refactor the punica kernel based on Triton (#5036)
  • Support for guided decoding for offline LLM (#6878)

Quantization

  • Support W4A8 quantization for vllm (#5218)
  • Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
  • Support reading bitsandbytes pre-quantized model (#5753)

What's Changed

Read more

v0.5.3.post1

23 Jul 17:09
38c4b7e
Compare
Choose a tag to compare

Highlights

  • We fixed an configuration incompatibility between vLLM (which tested against pre-released version) and the published Meta Llama 3.1 weights (#6693)

What's Changed

Full Changelog: v0.5.3...v0.5.3.post1