diff --git a/_posts/2025-02-16-lightllm.md b/_posts/2025-02-16-lightllm.md
index 26fd1d7..855225d 100644
--- a/_posts/2025-02-16-lightllm.md
+++ b/_posts/2025-02-16-lightllm.md
@@ -1,32 +1,34 @@
 ---
-title: LightLLM v4.0.0--Minimal Inter-Process Communication Overhead, Fastest DeepSeek-R1 Serving Performance on Single H200, and Prototype Support for PD seperation
+title: LightLLM v4.0.0--Minimal Inter-Process Communication Overhead, Fastest DeepSeek-R1 Serving Performance on Single H200, and Prototype Support for PD-Disaggregation
 categories:
 - By MTC Team
 excerpt: |
   We are delighted to announce the release of LightLLM v4.0.0.
 ---
 
-We are delighted to announce the release of LightLLM v4.0.0. After a year of continuous efforts, we have comprehensively upgraded the LightLLM architecture. We implemented a cross-process accessible request object, significantly reducing inter-process communication overhead, especially in high-concurrency scenarios. Meanwhile, we have conducted in-depth optimization for DeepSeek R1, achieving state-of-the-art performance among current open-source frameworks on a single H200 machine. Furthermore, we have innovatively proposed a prototype architecture implementation of PD-separation.
+We are delighted to announce the release of LightLLM v4.0.0. After a year of continuous efforts, we have comprehensively upgraded the LightLLM architecture. We have implemented a cross-process accessible request object, significantly reducing inter-process communication overhead, especially in high-concurrency scenarios. Meanwhile, we have conducted in-depth optimization for DeepSeek-R1, achieving state-of-the-art performance among current open-source frameworks on a single H200 machine. Furthermore, we have proposed a prototype architecture for PD disaggregation (prefill/decode separation).
 
 ### Framework
 
-The new framework of LightLLM is shown in the figure below. We have retained the previous three-process architecture design and designed a request object that can be accessed across processes via ctypes. We store the metadata of the requests in shared memory, ensuring that only a minimal amount of necessary data is communicated between processes. Additionally, we have implemented the folding of scheduling and model inference, and nearly eliminated the communication overhead between the scheduler and modelrpc in the previous version of the router. We implemented the CacheTensorManager class, which takes over the allocation and release of Torch tensors within the framework. This maximizes the cross-layer sharing of tensors at runtime, as well as memory sharing between different CUDA graphs. On an 8x80GB H100 machine, with the DeepSeek-v2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM). We will subsequently publish a series of blog posts introducing the architecture of LightLLM.
+The new framework of LightLLM is shown in the figure below. We have retained the previous three-process architecture design and designed a request object that can be accessed across processes via ctypes. We store the metadata of the requests in shared memory, ensuring that only a minimal amount of necessary data is communicated between processes. Additionally, we have folded scheduling and model inference together, and have nearly eliminated the communication overhead between the scheduler and model-rpc that existed in the previous version of the router.
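+
+To make this concrete, here is a minimal, illustrative sketch (the class and field names are hypothetical, not LightLLM's actual implementation) of how a ctypes structure laid over a shared-memory block lets several processes read and update request metadata in place, so that only a small slot index has to cross process boundaries:
+
+```python
+# Illustrative only: a ctypes view over shared memory for per-request metadata.
+import ctypes
+from multiprocessing import shared_memory
+
+class ReqMeta(ctypes.Structure):
+    _fields_ = [
+        ("req_id",        ctypes.c_int64),
+        ("prompt_len",    ctypes.c_int32),
+        ("generated_len", ctypes.c_int32),
+        ("is_finished",   ctypes.c_bool),
+    ]
+
+MAX_REQS = 1024
+shm = shared_memory.SharedMemory(create=True, name="req_meta_demo",
+                                 size=ctypes.sizeof(ReqMeta) * MAX_REQS)
+reqs = (ReqMeta * MAX_REQS).from_buffer(shm.buf)
+
+# The producer process (e.g., the HTTP server) fills a slot in place...
+reqs[0].req_id, reqs[0].prompt_len = 42, 1024
+# ...and only the slot index (0) is sent over a queue; the scheduler and inference
+# processes attach with shared_memory.SharedMemory(name="req_meta_demo") and
+# read/update the same fields without serializing whole request objects.
+
+del reqs       # release the exported ctypes buffer before cleanup
+shm.close()
+shm.unlink()
+```
+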
+We implemented the CacheTensorManager class, which takes over the allocation and release of Torch tensors within the framework. This maximizes the cross-layer sharing of tensors at runtime, as well as memory sharing between different CUDA graphs. On an 8x80GB H100 machine, with the DeepSeek-v2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM). We will subsequently publish a series of blog posts introducing the architecture of LightLLM.
 
 {% include relative-figure.html image="/assets/images/blogs/02-lightllm250216/lightllm.png" %}
 
-### PD seperation Prototype
+### PD-Disaggregation Prototype
 
 We have completed the prototype design of PD separation. For details, please refer to https://github.com/ModelTC/lightllm/tree/main/lightllm. We will subsequently provide a detailed structural explanation and performance data.
 
-### Optimzation on DeepSeek
+
+### Optimization on DeepSeek
 
 Due to the different computational characteristics of Prefill (compute intensive) and Decode (memory intensive), we have implemented distinct optimizations for the DeepSeek MLA. During the Prefill stage, we decompress the KV cache, while in the Decode stage, we compress the query (q) to achieve optimal performance. Additionally, we leveraged OpenAI's Triton to implement high-performance Decode MLA and fused MoE kernels.
 
 ### Performance
 
-The figure below shows the performance comparison of LightLLM, sglang==0.4.3, vllm==0.7.2, and trtllm==0.17.0 on a single H200 machine, using DeepSeek-R1 (num_clients = 100). The input length of the test data is 1024, and the output follows a Gaussian distribution with a mean of 128. LightLLM achieve the better performance.
+The figure below shows the performance comparison of LightLLM, sglang==0.4.3, vllm==0.7.2, and trtllm==0.17.0 on a single H200 machine, using DeepSeek-R1 (num_clients = 100). The input length of the test data is 1024, and the output length follows a Gaussian distribution with a mean of 128. LightLLM achieves the best performance.
 
 {% include relative-figure.html image="/assets/images/blogs/02-lightllm250216/lightllm-performance.png" %}
 
+
 ### Acknowledgment
 
-We learned a lot from the following projects when developing LightLLM, including [vLLM](https://github.com/vllm-project/vllm), [sglang](https://github.com/sgl-project/sglang), [OpenAI Triton](https://github.com/openai/triton). We also warmly welcome the open-source community to help improve LightLLM.
\ No newline at end of file
+We learned a lot from the following projects when developing LightLLM, including [vLLM](https://github.com/vllm-project/vllm), [sglang](https://github.com/sgl-project/sglang), and [OpenAI Triton](https://github.com/openai/triton). We also warmly welcome the open-source community to help improve LightLLM.
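+
+### Appendix: A Minimal MLA Sketch
+
+As a rough, self-contained illustration of the Prefill/Decode trade-off described above (this is not LightLLM's actual kernel code, and it ignores RoPE and multi-head details), attention scores over MLA's compressed KV cache can be computed either by first decompressing the keys, which suits the compute-intensive Prefill stage, or by folding the key up-projection into the query ("compressing q"), which suits the memory-bound Decode stage because far less cache data has to be read per step. Both orderings yield identical scores:
+
+```python
+# Illustrative only: two equivalent ways to score a query against MLA's compressed cache.
+import torch
+
+d_c, d_head, n_tok = 512, 128, 16                     # latent dim, head dim, cached tokens
+W_uk = torch.randn(d_head, d_c, dtype=torch.float64)  # up-projection from latent to key space
+c_kv = torch.randn(n_tok, d_c, dtype=torch.float64)   # compressed KV cache (one head, no RoPE part)
+q = torch.randn(d_head, dtype=torch.float64)          # query for the current step
+
+# Prefill-style: decompress the cache into full keys, then score.
+k_full = c_kv @ W_uk.T                                # (n_tok, d_head)
+scores_prefill = k_full @ q                           # (n_tok,)
+
+# Decode-style: absorb the up-projection into the query, then score the compact cache directly.
+q_latent = W_uk.T @ q                                 # (d_c,)
+scores_decode = c_kv @ q_latent                       # (n_tok,)
+
+assert torch.allclose(scores_prefill, scores_decode)
+```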