[INFO] How to scale? #436
Replies: 3 comments 5 replies
-
@tsensei One replica should fully utilize one GPU, or the CPU of one host. Assuming you have 1 GPU on your host, there is nothing more you can do with a single replica - you need to SCALE horizontally (Kubernetes-style horizontal pod autoscaler).
Just use those solutions if you don't have a Kubernetes cluster.
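From the client side, horizontal scaling just means spreading requests over several identical replicas. A minimal round-robin sketch, assuming multiple Infinity replicas are already running (the replica hostnames and port below are hypothetical; in Kubernetes, a Service plus HPA handles both the routing and the scaling for you):

```python
from itertools import cycle

# Hypothetical replica endpoints -- e.g. several identical Infinity
# containers, each bound to its own GPU or host.
REPLICAS = cycle([
    "http://embedding-model-1:7997",
    "http://embedding-model-2:7997",
    "http://embedding-model-3:7997",
])

def next_replica() -> str:
    """Round-robin over the replicas: each call returns the next
    endpoint, wrapping around after the last one."""
    return next(REPLICAS)

# Four successive requests land on replicas 1, 2, 3, then 1 again:
targets = [next_replica() for _ in range(4)]
```

In practice you would put a real load balancer (Kubernetes Service, nginx, an ALB) in front instead of doing this in the client, but the dispatch logic is the same.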
-
@michaelfeil So, there's no point in
Is that right? Thanks
-
I recently discovered that SageMaker offers a variety of options for easy scaling. For example, with TGI (Text Generation Inference) from Hugging Face, you can seamlessly deploy on SageMaker and set up auto-scaling, which has proven highly effective for us. Additionally, SageMaker supports multi-replica deployments. Take, for example, a single instance equipped with 4 GPUs: you can allocate one model replica to each GPU using DataParallel (DP) (philschmid.de/sagemaker-multi-replica). Therefore, it should be feasible to run Infinity on SageMaker and leverage these capabilities.
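The one-replica-per-GPU pattern is not SageMaker-specific; you can do it by hand on any multi-GPU box by pinning each server process to one device. A hedged sketch, assuming the `infinity_emb v2` CLI is on PATH (the model ID and base port are illustrative):

```python
import os

NUM_GPUS = 4
BASE_PORT = 7997

def replica_commands(num_gpus: int = NUM_GPUS):
    """Build one launch command per GPU: each replica is pinned to a
    single device via CUDA_VISIBLE_DEVICES and gets its own port."""
    commands = []
    for gpu in range(num_gpus):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
        cmd = [
            "infinity_emb", "v2",
            "--model-id", "BAAI/bge-small-en-v1.5",  # example model
            "--port", str(BASE_PORT + gpu),
        ]
        commands.append((cmd, env))
    return commands

# To actually launch the replicas (not run here):
# import subprocess
# procs = [subprocess.Popen(cmd, env=env) for cmd, env in replica_commands()]
```

A load balancer in front of ports 7997-8000 then gives you the same multi-replica throughput that the SageMaker setup provides.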
-
Title should say it. But how do we actually scale for more throughput? For example, let's say I use just one model for embedding. It works fine, but I need more throughput - do I deploy the same model multiple times, like embedding-model-1, embedding-model-2, and so on? Because from my understanding, there is an internal queue for incoming requests, and requests are continuously batched and sent to the model.
Thanks
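That mental model of a queue feeding batches to the model is broadly right. A toy sketch of the idea, assuming a fixed batch size (real servers like Infinity also use a timeout and their exact batching policy is internal, so treat this as illustrative only):

```python
from collections import deque

MAX_BATCH = 3  # illustrative; real servers also flush on a timeout

def drain_in_batches(queue: deque, max_batch: int = MAX_BATCH):
    """Pop up to max_batch queued requests at a time, the way a
    continuous-batching loop would before each model forward pass."""
    batches = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        batches.append(batch)
    return batches

queue = deque(f"req-{i}" for i in range(7))
batches = drain_in_batches(queue)
# 7 queued requests -> forward passes over batches of 3, 3, 1
```

The key point for scaling: this queue only keeps one replica (one GPU) busy. Once the GPU is saturated, more throughput means more replicas behind a load balancer, exactly as suggested above.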