-
Theoretically, there is an optimization opportunity to avoid loading the model into CPU RAM first. Unless you stream the weights from file through CPU RAM into the GPU, this is a challenge. Very few libraries have started doing this kind of lazy loading; Accelerate is one of them: https://huggingface.co/docs/accelerate/v0.13.2/en/usage_guides/big_modeling. Generally speaking, it's unusual to see machines with RAM < VRAM, and that setup often leads to the errors you describe.
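Under the hood, lazy loading of the kind Accelerate does relies on PyTorch's "meta" device: the module tree is built with shapes and dtypes only, and no parameter memory is committed until real weights are streamed in. A minimal sketch of the idea (the layer size is just an example chosen to make the logical footprint obvious):

```python
import torch

# Allocate a large layer on the "meta" device: only shapes and dtypes
# exist; no memory is committed in CPU RAM or VRAM.
layer = torch.nn.Linear(32768, 32768, device="meta")

# Logical size of the weight matrix in fp32 (~4 GiB), none of it allocated.
logical_bytes = layer.weight.nelement() * layer.weight.element_size()
print(layer.weight.device)  # meta
print(logical_bytes)        # 4294967296
```

Accelerate's `init_empty_weights` context manager wraps exactly this trick, and `load_checkpoint_and_dispatch` then copies checkpoint shards from disk onto the target device without ever holding the full model in CPU RAM.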
-
Hi
I expected Infinity to use mostly GPU VRAM, but I'm noticing very high server memory usage when initially loading the models, which then drops (though it remains higher than I would expect).
Example: jinaai/jina-embeddings-v2-base-en uses about 10 GB of RAM, which then drops to 1-2 GB.
nvidia/NV-Embed-v1 fails to load on a budget of 13 GB of server RAM (24 GB of GPU VRAM).
Am I missing some configuration detail?
I'll try to load NV-Embed on a 32gb RAM server just to see if it works and update later.
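For reference, the "spikes during load, then drops" behavior described above can be measured directly by checking the process's peak resident set size around the load. A minimal sketch (the `load_model` call is a placeholder for however you load the model; `resource` is Unix-only, and `ru_maxrss` is reported in KiB on Linux):

```python
import resource

def peak_rss_gib() -> float:
    # ru_maxrss is KiB on Linux (bytes on macOS); this assumes Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)

before = peak_rss_gib()
# model = load_model("jinaai/jina-embeddings-v2-base-en")  # placeholder
after = peak_rss_gib()
print(f"peak RSS grew by {after - before:.2f} GiB during load")
```

Because `ru_maxrss` is a high-water mark, it captures the transient spike even after memory is released back, which current RSS alone would miss.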