From a44352a6c97d369f118ecf13ca816954e9ad30e5 Mon Sep 17 00:00:00 2001
From: shihaobai
Date: Sun, 16 Feb 2025 20:27:53 +0800
Subject: [PATCH] redirect home to /blog

---
 _config.yml |   5 -
 index.html  |   3 +
 index.md    | 442 ----------------------------------------------------
 3 files changed, 3 insertions(+), 447 deletions(-)
 create mode 100644 index.html
 delete mode 100644 index.md

diff --git a/_config.yml b/_config.yml
index e59206f..5e53325 100644
--- a/_config.yml
+++ b/_config.yml
@@ -98,11 +98,6 @@ favicons: # Favicons are also used in the manifest file. Syntax is 'size: path'
   1024: '/assets/logos/logo_1024.png'
 
-# 9. Site navigation
-navigation_header:
-- title: Home
-  url: /blog/
-
 navigation_footer:
 - title: Powered by ModelTC

diff --git a/index.html b/index.html
new file mode 100644
index 0000000..6e53be2
--- /dev/null
+++ b/index.html
@@ -0,0 +1,3 @@
+---
+redirect_to: /blog
+---

diff --git a/index.md b/index.md
deleted file mode 100644
index ce436f1..0000000
--- a/index.md
+++ /dev/null
@@ -1,442 +0,0 @@

---
title: Lightllm Blog
feature_text: |
  ## Lightllm
excerpt: "LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention."
---

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.

{% include relative-figure.html image="/assets/images/lightllm.drawio.png" %}

[English Docs](https://lightllm-en.readthedocs.io/en/latest/) | [中文文档](https://lightllm-cn.readthedocs.io/en/latest/)

## Features

- Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
- Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
- Dynamic Batch: enables dynamic batch scheduling of requests.
- [FlashAttention](https://github.com/Dao-AILab/flash-attention): incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
- Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
- [Token Attention](./docs/TokenAttention.md): implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (see the sketch after this list).
- High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
- Int8KV Cache: increases the token capacity to almost twice as much; currently only LLaMA is supported.
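
A minimal, illustrative sketch of the token-wise KV cache idea behind Token Attention (this is not LightLLM's actual implementation; the class and the sizes below are made up for illustration): every in-flight token owns exactly one cache slot, slots do not need to be contiguous, and they are returned to the pool the moment a request finishes, so no memory is wasted on padding.

~~~python
# Illustrative token-wise KV cache pool (not LightLLM's real code).
import torch

class TokenKVPool:
    def __init__(self, max_total_tokens, num_layers, num_heads, head_dim):
        # One K/V slot per token across the whole batch, pre-allocated once.
        self.kv = torch.zeros(2, num_layers, max_total_tokens, num_heads, head_dim)
        self.free_slots = list(range(max_total_tokens))

    def alloc(self, num_tokens):
        # Hand out arbitrary free slot indices; requests never need contiguous space.
        assert len(self.free_slots) >= num_tokens, "out of KV cache slots"
        slots, self.free_slots = self.free_slots[:num_tokens], self.free_slots[num_tokens:]
        return slots

    def free(self, slots):
        # Return a finished request's token slots to the pool immediately.
        self.free_slots.extend(slots)

pool = TokenKVPool(max_total_tokens=1024, num_layers=2, num_heads=4, head_dim=8)
req_slots = pool.alloc(17)   # a request currently holding 17 tokens of KV cache
pool.free(req_slots)         # reclaimed as soon as the request completes
~~~

In LightLLM, the high-performance router plays the companion role described above, managing this per-token GPU memory to keep throughput high.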

## Supported Model List

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [LLaMA](https://github.com/facebookresearch/llama)
- [LLaMA V2](https://huggingface.co/meta-llama)
- [StarCoder](https://github.com/bigcode-project/starcoder)
- [Qwen-7b](https://github.com/QwenLM/Qwen-7B)
- [ChatGLM2-6b](https://github.com/THUDM/ChatGLM2-6B)
- [InternLM-7b](https://github.com/InternLM/InternLM)
- [InternVL-Chat](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)
- [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
- [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
- [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- [Llava-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b)
- [Llava-13b](https://huggingface.co/liuhaotian/llava-v1.5-13b)
- [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Stablelm](https://huggingface.co/stabilityai/stablelm-2-1_6b)
- [MiniCPM](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)
- [Phi-3](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3)
- [CohereForAI](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
- [DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)
- [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2)

> When you start Qwen-7b, you need to set the parameters '--eos_id 151643 --trust_remote_code'.

> ChatGLM2 needs to set the parameter '--trust_remote_code'.

> InternLM needs to set the parameter '--trust_remote_code'.

> InternVL-Chat(Phi3) needs to set the parameters '--eos_id 32007 --trust_remote_code'.

> InternVL-Chat(InternLM2) needs to set the parameters '--eos_id 92542 --trust_remote_code'.

> Qwen2-VL-7b needs to set the parameters '--eos_id 151645 --trust_remote_code', and use 'pip install git+https://github.com/huggingface/transformers' to upgrade to the latest version.

> Stablelm needs to set the parameter '--trust_remote_code'.

> Phi-3 only supports Mini and Small.

> DeepSeek-V2-Lite and DeepSeek-V2 need to set the parameter '--data_type bfloat16'.

## Get started

### Requirements

The code has been tested with PyTorch >= 1.3, CUDA 12.4, and Python 3.9. To install the necessary dependencies, please refer to the provided **requirements.txt** and follow the instructions as

~~~shell
# for cuda 12.4
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
~~~

NOTE: If you are using torch with cuda 11.x instead, run `pip install nvidia-nccl-cu12==2.20.5` to support torch cuda graph.

### Container

You can use the official Docker container to run the model more easily. To do this, follow these steps:

- Pull the container from the GitHub Container Registry:

    ```shell
    docker pull ghcr.io/modeltc/lightllm:main
    ```

- Run the container with GPU support and port mapping (a readiness-check sketch follows at the end of this section):

    ```shell
    docker run -it --gpus all -p 8080:8080 \
        --shm-size 1g -v your_local_path:/data/ \
        ghcr.io/modeltc/lightllm:main /bin/bash
    ```

- Alternatively, you can build the container yourself (replace `<image_name>` with a tag of your choice):

    ```shell
    docker build -t <image_name> .
    docker run -it --gpus all -p 8080:8080 \
        --shm-size 1g -v your_local_path:/data/ \
        <image_name> /bin/bash
    ```

- You can also use a helper script to launch both the container and the server:

    ```shell
    python tools/quick_launch_docker.py --help
    ```

- Note: If you use multiple GPUs, you may need to increase the shared memory size by adding `--shm-size` to the `docker run` command.
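
Once the server inside the container has been launched (see the RUN LLaMA section below), a quick way to confirm that the port mapping works is to poll the `/generate` endpoint until it answers. The helper below is only a convenience sketch; the function name, payload, and timeouts are our own choices, not part of LightLLM:

~~~python
# Minimal readiness check for a LightLLM server exposed on localhost:8080.
import time
import requests

def wait_until_ready(url="http://127.0.0.1:8080/generate", timeout_s=300):
    payload = {"inputs": "ping", "parameters": {"max_new_tokens": 1}}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # A successful generation means the model is loaded and serving.
            if requests.post(url, json=payload, timeout=5).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server not up yet; keep polling
        time.sleep(2)
    return False

print("server ready:", wait_until_ready())
~~~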

### Installation

- Install from the source code by

~~~shell
python setup.py install
~~~

- Install the Triton package

The code has been tested on a range of GPUs including V100, A100, A800, 4090, and H800. If you are running the code on A100, A800, etc., we recommend using triton==3.0.0.

~~~shell
pip install triton==3.0.0 --no-deps
~~~

If you are running the code on H800 or V100, you can try triton-nightly to get better performance.

~~~shell
pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps
~~~

### RUN LLaMA

With its efficient router and TokenAttention, LightLLM can be deployed as a service and achieve state-of-the-art throughput.

Launch the server:

~~~shell
python -m lightllm.server.api_server --model_dir /path/llama-7B \
    --host 0.0.0.0 \
    --port 8080 \
    --tp 1 \
    --max_total_token_num 120000
~~~

The parameter `max_total_token_num` is influenced by the GPU memory of the deployment environment. You can also specify `--mem_fraction` to have it calculated automatically.

~~~shell
python -m lightllm.server.api_server --model_dir /path/llama-7B \
    --host 0.0.0.0 \
    --port 8080 \
    --tp 1 \
    --mem_fraction 0.9
~~~

To initiate a query in the shell:

~~~shell
curl http://127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
    -H 'Content-Type: application/json'
~~~

To query from Python:

~~~python
import time
import requests
import json

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    "parameters": {
        'do_sample': False,
        'ignore_eos': False,
        'max_new_tokens': 1024,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
~~~

### RUN Multimodal Models

##### Run QWen-VL

~~~shell
python -m lightllm.server.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --tp 1 \
    --max_total_token_num 12000 \
    --trust_remote_code \
    --enable_multimodal \
    --cache_capacity 1000 \
    --model_dir /path/of/Qwen-VL or /path/of/Qwen-VL-Chat
~~~

##### Run Llava

~~~shell
python -m lightllm.server.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --tp 1 \
    --max_total_token_num 12000 \
    --trust_remote_code \
    --enable_multimodal \
    --cache_capacity 1000 \
    --model_dir /path/of/llava-v1.5-7b or /path/of/llava-v1.5-13b
~~~

##### Query From QWen-VL

~~~python
import time
import requests
import json
import base64

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}

uri = "/local/path/of/image"  # or "/http/path/of/image"
if uri.startswith("http"):
    images = [{"type": "url", "data": uri}]
else:
    with open(uri, 'rb') as fin:
        b64 = base64.b64encode(fin.read()).decode("utf-8")
    images = [{'type': "base64", "data": b64}]

data = {
    # One <img></img> tag matches the single image passed below.
    "inputs": "<img></img>Generate the caption in English with grounding:",
    "parameters": {
        "max_new_tokens": 200,
        # The space before <|endoftext|> is important: the server removes the first bos_token_id,
        # but the QWen tokenizer does not have a bos_token_id.
        "stop_sequences": [" <|endoftext|>"],
    },
    "multimodal_params": {
        "images": images,
    }
}

response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
~~~
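
For repeated calls, the request pattern used in the examples above can be wrapped in a small helper. This is only a convenience sketch over the `/generate` endpoint shown above; the function name and defaults are ours, not part of LightLLM:

~~~python
# Thin convenience wrapper over the /generate endpoint used in the examples above.
import requests

def generate(prompt, images=None, server="http://127.0.0.1:8080", **params):
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 200, **params}}
    if images:
        # images uses the same format as multimodal_params["images"] above.
        payload["multimodal_params"] = {"images": images}
    resp = requests.post(f"{server}/generate", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

print(generate("What is AI?", max_new_tokens=17))
~~~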

##### Query From QWen-VL-Chat

~~~python
import json
import requests
import base64

def run_once(query, uris):
    images = []
    for uri in uris:
        if uri.startswith("http"):
            images.append({"type": "url", "data": uri})
        else:
            with open(uri, 'rb') as fin:
                b64 = base64.b64encode(fin.read()).decode("utf-8")
            images.append({'type': "base64", "data": b64})

    data = {
        "inputs": query,
        "parameters": {
            "max_new_tokens": 200,
            # The space before <|endoftext|> is important: the server removes the first bos_token_id,
            # but the QWen tokenizer does not have a bos_token_id.
            "stop_sequences": [" <|endoftext|>", " <|im_start|>", " <|im_end|>"],
        },
        "multimodal_params": {
            "images": images,
        }
    }

    # url = "http://127.0.0.1:8080/generate_stream"
    url = "http://127.0.0.1:8080/generate"
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        print(" + result: ({})".format(response.json()))
    else:
        print(' + error: {}, {}'.format(response.status_code, response.text))

"""
multi-img, multi-round:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<img></img>
<img></img>
上面两张图片分别是哪两个城市?请对它们进行对比。<|im_end|>
<|im_start|>assistant
根据提供的信息,两张图片分别是重庆和北京。<|im_end|>
<|im_start|>user
这两座城市分别在什么地方?<|im_end|>
<|im_start|>assistant
"""
# The prompt below asks (in Chinese) which two cities the images show and, in the
# second round, where those cities are located.
run_once(
    uris = [
        "assets/mm_tutorial/Chongqing.jpeg",
        "assets/mm_tutorial/Beijing.jpeg",
    ],
    query = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<img></img>\n<img></img>\n上面两张图片分别是哪两个城市?请对它们进行对比。<|im_end|>\n<|im_start|>assistant\n根据提供的信息,两张图片分别是重庆和北京。<|im_end|>\n<|im_start|>user\n这两座城市分别在什么地方?<|im_end|>\n<|im_start|>assistant\n"
)
~~~

##### Query From Llava

~~~python
import time
import requests
import json
import base64

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}

uri = "/local/path/of/image"  # or "/http/path/of/image"
if uri.startswith("http"):
    images = [{"type": "url", "data": uri}]
else:
    with open(uri, 'rb') as fin:
        b64 = base64.b64encode(fin.read()).decode("utf-8")
    images = [{'type': "base64", "data": b64}]

data = {
    "inputs": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nPlease explain the picture. ASSISTANT:",
    "parameters": {
        "max_new_tokens": 200,
    },
    "multimodal_params": {
        "images": images,
    }
}

response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
~~~

> Additional launch parameters: `--enable_multimodal`, `--cache_capacity`; a larger `--cache_capacity` requires a larger `--shm-size`.

> `--tp > 1` is supported; when `tp > 1`, the visual model runs on GPU 0.

> The special image tag for Qwen-VL is `<img></img>` (`<image>` for Llava). The length of `data["multimodal_params"]["images"]` should be the same as the number of image tags in the prompt; it can be 0, 1, 2, ... (see the sketch below).

> Input images format: a list of dicts like `{'type': 'url'/'base64', 'data': xxx}`.
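
The tag-count rule above is easy to get wrong when mixing local files and URLs, so a small sanity check can help. The helpers below are an illustrative sketch (the function names are ours, not LightLLM's); they build the images list in the format described above and verify that the prompt contains one image tag per image:

~~~python
# Illustrative helpers: build multimodal_params["images"] and check the tag count.
import base64

def to_image_param(uri):
    # URLs are passed through; local files are inlined as base64.
    if uri.startswith("http"):
        return {"type": "url", "data": uri}
    with open(uri, "rb") as fin:
        return {"type": "base64", "data": base64.b64encode(fin.read()).decode("utf-8")}

def build_request(prompt, uris, image_tag="<img></img>", max_new_tokens=200):
    images = [to_image_param(u) for u in uris]
    # The number of image tags in the prompt must equal the number of images.
    assert prompt.count(image_tag) == len(images), "image tag count != number of images"
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
        "multimodal_params": {"images": images},
    }

req = build_request("<img></img><img></img>Compare the two cities in these pictures.",
                    ["https://example.com/a.jpg", "https://example.com/b.jpg"])
~~~

For Llava, pass `image_tag="<image>"` instead.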

## Performance

### Service Performance

We compared the service performance of LightLLM and vLLM==0.1.2 on LLaMA-7B using an A800 with 80 GB of GPU memory.

To begin, prepare the data as follows:

~~~shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
~~~

Launch the service:

~~~shell
python -m lightllm.server.api_server --model_dir /path/llama-7b --tp 1 --max_total_token_num 121060 --tokenizer_mode auto
~~~

Evaluation:

~~~shell
cd test
python benchmark_serving.py --tokenizer /path/llama-7b --dataset /path/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200
~~~

The performance comparison results are presented below:

| vLLM                                                | LightLLM                                             |
| --------------------------------------------------- | ---------------------------------------------------- |
| Total time: 361.79 s<br>Throughput: 5.53 requests/s | Total time: 188.85 s<br>Throughput: 10.59 requests/s |

### Static inference performance

For debugging, we offer static performance testing scripts for various models. For instance, you can evaluate the inference performance of the LLaMA model by

~~~shell
cd test/model
python test_llama.py
~~~

### FAQ

- The LLaMA tokenizer fails to load.
    - Consider resolving this by running the command `pip install protobuf==3.20.0`.
- `error : PTX .version 7.4 does not support .target sm_89`
    - Launch with `bash tools/resolve_ptx_version python -m lightllm.server.api_server ...`
## Projects using lightllm

If you have a project that should be incorporated, please contact us via email or create a pull request.

1. LazyLLM: the easiest and laziest way to build multi-agent LLM applications.

   Once you have installed `lightllm` and `lazyllm`, you can use the following code to build your own chatbot:

   ~~~python
   from lazyllm import TrainableModule, deploy, WebModule
   # The model will be downloaded automatically if you have an internet connection.
   m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
   WebModule(m).start().wait()
   ~~~

   Documents: https://lazyllm.readthedocs.io/

## Community

For further information and discussion, [join our discord server](https://discord.gg/WzzfwVSguU).

## License

This repository is released under the [Apache-2.0](LICENSE) license.

## Acknowledgement

We learned a lot from the following projects when developing LightLLM.

- [Faster Transformer](https://github.com/NVIDIA/FasterTransformer)
- [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
- [vLLM](https://github.com/vllm-project/vllm)
- [Flash Attention 1&2](https://github.com/Dao-AILab/flash-attention)
- [OpenAI Triton](https://github.com/openai/triton)