[release-test] GPU memory numbers are always 0 on A100 runner #2574

Open
xuzhao9 opened this issue Jan 22, 2025 · 3 comments

Comments

@xuzhao9
Contributor

xuzhao9 commented Jan 22, 2025

The command nvidia-smi pmon -s m -c 1 -o T, used in monitor_proc.sh, does not report GPU memory correctly on the A100 runner: the numbers are always 0.

https://github.com/pytorch/benchmark/blob/main/userbenchmark/release-test/monitor_proc.sh#L24

Error workflow:
https://github.com/pytorch/benchmark/actions/runs/12878326305/job/35904096937

@xuzhao9
Contributor Author

xuzhao9 commented Jan 27, 2025

On the GCP H100 machine or a devGPU, the GPU memory data is generated successfully:

$ bash ~/benchmark/userbenchmark/release-test/monitor_proc.sh python main.py --epochs 4
Running  python main.py --epochs 4 > "/tmp/tmp.fVD4eQHgYE_monitor_proc" 2>&1 &
Watching PID: 2582
Searching for spawned processes with
Tail log file with: ps --forest -o pid --ppid 2582 --pid 2582 | awk 'NR>1 {print}'
    tail -f /tmp/tmp.fVD4eQHgYE_monitor_proc

Max GPU Mem.   Max RSS Mem.   Max PSS Mem.
904            1149.47        1438.21
1438           1209.12        1438.21
1438           1209.15        1438.21
1438           1209.59        1845.58

@xuzhao9
Contributor Author

xuzhao9 commented Jan 27, 2025

Debug reproduction on the CI machine: https://github.com/pytorch/benchmark/actions/runs/12982872402/job/36203285897 The nvidia-smi pmon output there is:

#Time         gpu         pid   type     fb   ccpm    command 
#HH:MM:SS     Idx           #    C/G     MB     MB    name 
 05:53:45       0          -     -      -      -    -              

#Time         gpu         pid   type     fb   ccpm    command 
#HH:MM:SS     Idx           #    C/G     MB     MB    name 
 05:53:47       0        698     C      0      0    python  
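
For anyone reproducing this, the following is a minimal sketch of how the fb value for one process could be pulled out of pmon output like the above. The function name pmon_fb_for_pid is hypothetical and is not taken from monitor_proc.sh; the column positions are looked up from the "#Time ..." header line rather than hard-coded, on the assumption that the header layout matches the output shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not the code in monitor_proc.sh):
# read `nvidia-smi pmon -s m` text on stdin and print the fb value (MB)
# reported for the given PID. Prints nothing if the PID has no row.
pmon_fb_for_pid() {
    awk -v pid="$1" '
        # First header line names the columns; record where pid and fb sit.
        /^#/ && !fb_col {
            for (i = 1; i <= NF; i++) {
                if ($i == "pid") pid_col = i
                if ($i == "fb")  fb_col  = i
            }
            next
        }
        /^#/ { next }   # skip the units line and any repeated headers
        # Data rows: rows with "-" in the pid column never match a real PID.
        fb_col && pid_col && $pid_col == pid { print $fb_col }
    '
}
```

Usage would look like: nvidia-smi pmon -s m -c 1 -o T | pmon_fb_for_pid 698. Fed the CI output above, this prints 0 for PID 698, which is the symptom reported in this issue.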

cc @atalman @jeanschmidt

xuzhao9 changed the title from "[release-test] GPU memory numbers are always 0" to "[release-test] GPU memory numbers are always 0 on A100 runner" on Jan 27, 2025
@jeanschmidt
Contributor

I can't yet pinpoint why you get those specific responses. That said, the command:

$ nvidia-smi dmon -s m -c 1 -o T -i 0
#Time         gpu     fb   bar1   ccpm
#HH:MM:SS     Idx     MB     MB     MB
 15:54:14       0    429      4      0

works, so maybe you can rely on it?
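
If the dmon route is taken, a rough sketch of extracting the fb column might look like the following. The function name parse_dmon_fb is made up for illustration and nothing here is taken from monitor_proc.sh; it assumes the header layout shown in the dmon output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): read `nvidia-smi dmon -s m` text
# on stdin and print the framebuffer (fb) memory in MB for each data row.
parse_dmon_fb() {
    awk '
        # First header line names the columns; find where "fb" sits.
        /^#/ && !col {
            for (i = 1; i <= NF; i++) if ($i == "fb") col = i
            next
        }
        /^#/ { next }                     # skip the units header line
        col && NF >= col { print $col }   # data rows: emit the fb value
    '
}
```

Usage would look like: nvidia-smi dmon -s m -c 1 -o T -i 0 | parse_dmon_fb. One caveat to keep in mind: dmon reports memory per device rather than per process, so on a GPU shared with other jobs the number may include their usage as well.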
