[release-test] GPU memory numbers are always 0 on A100 runner #2574
Comments
Running on the GCP H100 machine or devGPU, the GPU memory data is generated successfully:
Debug reproduction: on the CI machine, the output is: https://github.com/pytorch/benchmark/actions/runs/12982872402/job/36203285897
I can't yet pinpoint why you get those specific responses. That said, the command:
works, so maybe you can rely on it?
The command
nvidia-smi pmon -s m -c 1 -o T
does not report GPU memory numbers correctly. See: https://github.com/pytorch/benchmark/blob/main/userbenchmark/release-test/monitor_proc.sh#L24
Error workflow:
https://github.com/pytorch/benchmark/actions/runs/12878326305/job/35904096937
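To help tell "the process really used 0 MB" apart from "the value was not reported at all", here is a minimal Python sketch that parses the `fb` (framebuffer memory, MB) column from the output of the `nvidia-smi pmon` command above. It is not the logic in `monitor_proc.sh`. The function names `parse_pmon_fb` and `read_gpu_memory`, the `SAMPLE` text, and the assumed column layout (a first `#` header line naming the columns, with `-` printed for unavailable values) are all assumptions for illustration; the real layout may differ between driver versions.

```python
import subprocess

# The command quoted in this issue.
PMON_CMD = ["nvidia-smi", "pmon", "-s", "m", "-c", "1", "-o", "T"]


def parse_pmon_fb(text):
    """Return (pid, fb_mb) pairs; fb_mb is None when the field is not a number.

    Assumes the first '#' line names the columns and includes 'pid' and 'fb'.
    """
    columns = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # First header line names the columns; later ones carry units.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None or "pid" not in columns or "fb" not in columns:
            continue
        fields = line.split()
        if len(fields) < len(columns):
            continue  # skip rows that do not match the header layout
        pid = fields[columns.index("pid")]
        fb = fields[columns.index("fb")]
        rows.append((pid, int(fb) if fb.isdigit() else None))
    return rows


def read_gpu_memory():
    """Run the pmon command once and parse its output (needs an NVIDIA driver)."""
    result = subprocess.run(PMON_CMD, capture_output=True, text=True, check=True)
    return parse_pmon_fb(result.stdout)


# Hypothetical sample output, NOT taken from the CI logs linked above.
SAMPLE = """\
#Time        gpu        pid  type    fb   command
#HH:MM:SS    Idx          #   C/G    MB   name
 12:00:00      0       4242     C  1536   python
 12:00:00      1          -     -     -   -
"""

if __name__ == "__main__":
    for pid, fb_mb in parse_pmon_fb(SAMPLE):
        print(pid, "unavailable" if fb_mb is None else f"{fb_mb} MB")
```

If every row on the A100 runner comes back with `None` rather than a number, that would suggest the tool cannot see the process's memory on that machine, as opposed to the benchmark actually using none.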