[INFO] How to scale? #436
Replies: 3 comments 5 replies
-
@tsensei One replica should fully utilize one GPU, or the CPU of one host. Assuming you have 1 GPU on your host, there is nothing more you can do with a single replica - you need to SCALE horizontally (Kubernetes-style horizontal pod autoscaler).
Just use those solutions if you don't have a Kubernetes cluster.
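From the client side, horizontal scaling just means spreading requests over several identical replicas. A minimal round-robin sketch, assuming multiple Infinity replicas are already running (the replica hostnames and port below are hypothetical; in Kubernetes, a Service plus HPA handles both the routing and the scaling for you):

```python
from itertools import cycle

# Hypothetical replica endpoints -- e.g. several identical Infinity
# containers, each bound to its own GPU or host.
REPLICAS = cycle([
    "http://embedding-model-1:7997",
    "http://embedding-model-2:7997",
    "http://embedding-model-3:7997",
])

def next_replica() -> str:
    """Round-robin over the replicas: each call returns the next
    endpoint, wrapping around after the last one."""
    return next(REPLICAS)

# Four successive requests land on replicas 1, 2, 3, then 1 again:
targets = [next_replica() for _ in range(4)]
```

In practice you would put a real load balancer (Kubernetes Service, nginx, an ALB) in front instead of doing this in the client, but the dispatch logic is the same.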
-
@michaelfeil So, there's no point in
Is that right? Thanks
-
I recently discovered that SageMaker offers a variety of options for easy scaling. For example, with TGI (Text Generation Inference) from Hugging Face, you can seamlessly deploy on SageMaker and set up auto-scaling, which has proven highly effective for us. Additionally, SageMaker supports multi-replica deployments. Take, for example, a single instance equipped with 4 GPUs: you can allocate one model replica to each GPU using DataParallel (DP) (philschmid.de/sagemaker-multi-replica). Therefore, it should be feasible to run Infinity on SageMaker and leverage these capabilities.
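The one-replica-per-GPU pattern is not SageMaker-specific; you can do it by hand on any multi-GPU box by pinning each server process to one device. A hedged sketch, assuming the `infinity_emb v2` CLI is on PATH (the model ID and base port are illustrative):

```python
import os

NUM_GPUS = 4
BASE_PORT = 7997

def replica_commands(num_gpus: int = NUM_GPUS):
    """Build one launch command per GPU: each replica is pinned to a
    single device via CUDA_VISIBLE_DEVICES and gets its own port."""
    commands = []
    for gpu in range(num_gpus):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
        cmd = [
            "infinity_emb", "v2",
            "--model-id", "BAAI/bge-small-en-v1.5",  # example model
            "--port", str(BASE_PORT + gpu),
        ]
        commands.append((cmd, env))
    return commands

# To actually launch the replicas (not run here):
# import subprocess
# procs = [subprocess.Popen(cmd, env=env) for cmd, env in replica_commands()]
```

A load balancer in front of ports 7997-8000 then gives you the same multi-replica throughput that the SageMaker setup provides.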
-
Title should say it. But how do we actually scale for more throughput? For example, let's say I use just one model for embedding. It works fine, but I need more throughput - do I deploy the same model multiple times, like embedding-model-1, embedding-model-2, and so on? Because from my understanding, there is an internal queue for incoming requests, and requests are continuously batched and sent to the model.
Thanks
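That mental model of a queue feeding batches to the model is broadly right. A toy sketch of the idea, assuming a fixed batch size (real servers like Infinity also use a timeout and their exact batching policy is internal, so treat this as illustrative only):

```python
from collections import deque

MAX_BATCH = 3  # illustrative; real servers also flush on a timeout

def drain_in_batches(queue: deque, max_batch: int = MAX_BATCH):
    """Pop up to max_batch queued requests at a time, the way a
    continuous-batching loop would before each model forward pass."""
    batches = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        batches.append(batch)
    return batches

queue = deque(f"req-{i}" for i in range(7))
batches = drain_in_batches(queue)
# 7 queued requests -> forward passes over batches of 3, 3, 1
```

The key point for scaling: this queue only keeps one replica (one GPU) busy. Once the GPU is saturated, more throughput means more replicas behind a load balancer, exactly as suggested above.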