-
Theoretically, there is an optimization opportunity to avoid loading the model into CPU RAM first. Unless you stream the weights from file through CPU RAM into the GPU, this is a challenge. Very few libraries have started doing this kind of lazy loading; Accelerate is one of them: https://huggingface.co/docs/accelerate/v0.13.2/en/usage_guides/big_modeling. Generally speaking, it's unusual to see machines with RAM < VRAM, and that setup often leads to the errors you describe.
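Under the hood, lazy loading of the kind Accelerate does relies on PyTorch's "meta" device: the module tree is built with shapes and dtypes only, and no parameter memory is committed until real weights are streamed in. A minimal sketch of the idea (the layer size is just an example chosen to make the logical footprint obvious):

```python
import torch

# Allocate a large layer on the "meta" device: only shapes and dtypes
# exist; no memory is committed in CPU RAM or VRAM.
layer = torch.nn.Linear(32768, 32768, device="meta")

# Logical size of the weight matrix in fp32 (~4 GiB), none of it allocated.
logical_bytes = layer.weight.nelement() * layer.weight.element_size()
print(layer.weight.device)  # meta
print(logical_bytes)        # 4294967296
```

Accelerate's `init_empty_weights` context manager wraps exactly this trick, and `load_checkpoint_and_dispatch` then copies checkpoint shards from disk onto the target device without ever holding the full model in CPU RAM.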
-
Hi
I expected Infinity to use mostly GPU VRAM, but I'm noticing very high server memory usage when initially loading the models, which then drops (though it remains higher than I would expect).
Example: jinaai/jina-embeddings-v2-base-en uses about 10 GB of RAM, which then drops to 1-2 GB.
nvidia/NV-Embed-v1 fails to load on a budget of 13 GB of server RAM (24 GB of GPU VRAM).
Am I missing some configuration detail?
I'll try to load NV-Embed on a 32gb RAM server just to see if it works and update later.
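For reference, the "spikes during load, then drops" behavior described above can be measured directly by checking the process's peak resident set size around the load. A minimal sketch (the `load_model` call is a placeholder for however you load the model; `resource` is Unix-only, and `ru_maxrss` is reported in KiB on Linux):

```python
import resource

def peak_rss_gib() -> float:
    # ru_maxrss is KiB on Linux (bytes on macOS); this assumes Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)

before = peak_rss_gib()
# model = load_model("jinaai/jina-embeddings-v2-base-en")  # placeholder
after = peak_rss_gib()
print(f"peak RSS grew by {after - before:.2f} GiB during load")
```

Because `ru_maxrss` is a high-water mark, it captures the transient spike even after memory is released back, which current RSS alone would miss.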