
fix: skip cuda graphs that will oom and improve free memory logging #2450

Closed
wants to merge 2 commits

Conversation

@drbh (Collaborator) commented Aug 22, 2024

This PR adds logs that show the amount of free memory before warmup, after the cache is allocated, and after the total CUDA graph memory has been consumed.

Additionally, this PR skips CUDA graphs (and logs a warning) for graphs that would likely OOM. This happens when the model plus cache allocations are so large that there is not enough memory left for all of the CUDA graph batch sizes. Long term it would be better to accurately estimate each CUDA graph's size up front and warn users if the combination will OOM; until we can estimate that, we optimistically record CUDA graphs only while enough memory remains available.

Note: this PR should help avoid the OOM issues with low max token amounts related to https://github.com/huggingface/hf-endpoints/pull/1410

Example log output:

Server started at unix:///tmp/text-generation-server-0
Free memory before the warmup: 6939.06 MB
Free memory after allocating the cache: 393.06 MB
Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
Total memory used for CUDA graphs: 222.00 MB
Total memory available: 171.06 MB

@Narsil (Collaborator) commented Oct 14, 2024

Closing this; we want to approach the problem differently. Silently skipping work is not OK: even with a warning, we are still acting on the user's behalf, and we can never do that when CUDA graphs were explicitly specified by the user.

We think we have better ways to handle this.

@Narsil closed this Oct 14, 2024