REServe: Reliable and Efficient Large Language Models Serving System
Repositories
Showing 8 of 8 repositories
- vllm (Public, forked from vllm-project/vllm): A high-throughput and memory-efficient inference and serving engine for LLMs.
- vllm-pdd (Public, forked from KuntaiDu/vllm): A high-throughput and memory-efficient inference and serving engine for LLMs.
- ServerlessLLM (Public, forked from ServerlessLLM/ServerlessLLM): Cost-efficient and fast multi-LLM serving.
- tensorrtllm_backend (Public, forked from triton-inference-server/tensorrtllm_backend): The Triton TensorRT-LLM backend.
- TensorRT-LLM (Public, forked from NVIDIA/TensorRT-LLM): TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.