CoLearn-Dev/deserve


DeServe

DeServe is an offline-serving framework for decentralized inference of large language models. Thanks to the techniques below, DeServe achieves up to a 12.6x throughput improvement over vLLM with pipeline parallelism in high-latency networks. The following features are key to its performance:

  • KV Cache Swapping: Maximizes GPU compute utilization by enlarging the effective KV cache size, swapping microbatch memory between CPU and GPU.
  • Microbatch Scheduling: Allocates microbatches across the pipeline according to the network latency to maximize throughput.
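
The intuition behind microbatch scheduling can be illustrated with a toy steady-state model (an illustrative sketch under simplified assumptions, not DeServe's actual scheduler): each microbatch must traverse every pipeline stage and every inter-stage network link, and with enough microbatches in flight, compute overlaps with network transfer so the pipeline is bounded by the slower of the two.

```python
def pipeline_throughput(stage_ms, link_latency_ms, num_stages, num_microbatches):
    """Toy estimate of steady-state pipeline throughput (microbatches/s).

    stage_ms         -- compute time per stage for one microbatch
    link_latency_ms  -- one-way network latency between stages
    num_stages       -- number of pipeline stages
    num_microbatches -- microbatches kept in flight simultaneously
    """
    # Time for one microbatch to traverse all stages and links once.
    round_trip_ms = num_stages * (stage_ms + link_latency_ms)
    # Steady-state interval between completions: either the pipeline is
    # compute-bound, or it stalls waiting on the network because too few
    # microbatches are in flight to cover the round trip.
    interval_ms = max(stage_ms, round_trip_ms / num_microbatches)
    return 1000.0 / interval_ms
```

For example, with 10 ms of compute per stage, 64 ms links, and 4 stages, 4 in-flight microbatches yield roughly 13.5 microbatches/s, while 32 in-flight microbatches fully hide the latency and reach the compute-bound rate of 100 microbatches/s. Swapping KV caches to CPU memory is what frees enough GPU memory to keep that many microbatches in flight.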
Throughput comparison under different network latencies:

| Latency (ms)  | real-world centralized | real-world east-west | sim 16 | sim 32 | sim 64 | sim 256 |
|---------------|------------------------|----------------------|--------|--------|--------|---------|
| vLLM (tp)     | 253.0                  | failed               | /      | /      | /      | /       |
| vLLM (pp)     | 89.1                   | 37.3                 | 68.8   | 55.3   | 36.1   | /       |
| DeServe (pp)  | 194.6                  | 138.4                | 182.3  | 163.7  | 133.7  | /       |
| DeServe (opt) | 445.2                  | 434.1                | 458.5  | 457.3  | 456.8  | 442.9   |

To run the experiments, see deserve_exp/readme.md.