A high-throughput and memory-efficient inference and serving engine for LLMs
Topics: amd, cuda, inference, pytorch, transformer, llama, gpt, rocm, model-serving, tpu, mlops, xpu, llm, inferentia, llmops, llm-serving, trainium
Updated Sep 20, 2024 - Python