fix: skip cuda graphs that will oom and improve free memory logging #2450
This PR adds logs that show the amount of free memory before warmup, after cache allocation, and after total cuda graph memory usage.
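A minimal sketch of what this logging could look like (the `log_free_memory` helper and the stage labels are illustrative assumptions, not the PR's actual code), using `torch.cuda.mem_get_info` to read free device memory:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def log_free_memory(stage: str) -> None:
    # torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the current device
    free, total = torch.cuda.mem_get_info()
    logger.info(
        "Free memory %s: %.2f GiB / %.2f GiB",
        stage,
        free / 1024**3,
        total / 1024**3,
    )


# Called at the three points described above, e.g.:
# log_free_memory("before warmup")
# log_free_memory("after cache allocation")
# log_free_memory("after cuda graph warmup")
```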
Additionally, this PR skips cuda graphs (with a warning) that would likely OOM. This happens when the model + cache allocations are too large to leave room for all of the cuda graph batch sizes. Long term it would be better to accurately estimate the cuda graph size and warn users if the combination will OOM; until we can estimate it better, we optimistically capture each cuda graph only if there is enough available memory, as sketched below.
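A rough sketch of the skip logic under the stated assumptions (the function and parameter names are hypothetical, and the per-batch-size byte estimator stands in for whatever coarse heuristic is used, since the PR notes that graph size is not yet estimated accurately):

```python
import logging
from typing import Callable, List

import torch

logger = logging.getLogger(__name__)


def warmup_cuda_graphs(
    batch_sizes: List[int],
    estimate_graph_bytes: Callable[[int], int],  # hypothetical coarse estimator
    capture_graph: Callable[[int], None],  # hypothetical capture helper
) -> None:
    # Capture largest-first; skipping a large graph may still leave room for smaller ones.
    for bs in sorted(batch_sizes, reverse=True):
        free, _ = torch.cuda.mem_get_info()
        needed = estimate_graph_bytes(bs)
        if needed > free:
            # Skip and warn instead of letting graph capture OOM the whole warmup.
            logger.warning(
                "Skipping cuda graph for batch size %d: ~%.2f GiB needed, %.2f GiB free",
                bs,
                needed / 1024**3,
                free / 1024**3,
            )
            continue
        capture_graph(bs)
```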
Note: this PR should help avoid the OOM issues seen with low max token amounts, related to https://github.com/huggingface/hf-endpoints/pull/1410
Example log output: