How to implement on-demand loading when memory is insufficient #9506
-
How can I implement on-demand loading of the model when memory is insufficient?
Replies: 3 comments
-
I don't understand the question. If you're switching between models, you may want to use ollama; it does so automagically. But if it's a single model that does not fit in RAM, you're toast: buy more memory, a bigger GPU, etc., use a lower quantization, or forget about that model. Nothing else can help you.
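To put rough numbers on the "lower quantization" option, here's a quick back-of-envelope sketch. The 70B parameter count and the bits-per-weight values are just assumed ballpark figures, not exact GGUF sizes:

```python
# Back-of-envelope sketch: approximate weight memory for an assumed
# 70B-parameter dense model at a few quantization levels. Bits-per-weight
# values are rough averages and real GGUF files add some overhead, so
# treat the output as ballpark figures only.
PARAMS = 70e9  # assumed parameter count

BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
    "Q2_K": 2.6,
}

for quant, bits in BITS_PER_WEIGHT.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{quant:>7}: ~{gib:.0f} GiB of weights")
```

Dropping from F16 to a 4-bit quant cuts the footprint by roughly 3-4x, which is often the difference between "does not fit" and "fits".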
-
I think he meant in the GPU.
-
IIRC it's been discussed before. As far as I know there's no such possibility, and no point (hence no incentive) to work on it: most probably the overhead of shuffling the model back and forth between main memory and the GPU over the PCIe link would kill the performance, to the point where doing the required part of the calculations on the CPU would be faster. And if the model does not fit into RAM and needs to be loaded from disk, even a fast NVMe gets you maybe 0.03 t/s for a 70 GB model (a rough and optimistic estimate).
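For reference, a minimal sketch of where that figure comes from, assuming roughly 2 GB/s of sustained NVMe reads and that a dense model has to stream essentially all of its weights for every generated token:

```python
# Rough reproduction of the ~0.03 t/s estimate above. Assumptions: the model
# is dense, so each generated token needs essentially all 70 GB of weights,
# and the NVMe drive sustains about 2 GB/s of reads while streaming them.
MODEL_GB = 70.0        # model size quoted in the reply
NVME_GB_PER_S = 2.0    # assumed sustained read bandwidth

seconds_per_token = MODEL_GB / NVME_GB_PER_S
tokens_per_second = 1.0 / seconds_per_token
print(f"~{seconds_per_token:.0f} s per token, ~{tokens_per_second:.3f} t/s")
# -> ~35 s per token, ~0.029 t/s, i.e. roughly the 0.03 t/s quoted above
```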