Counter values (potentially only those dependent on DIMENSION_INSTANCE[1:31]?) are consistently zero in rocprofv3 runs. #47
Comments
Hi @Sunlost. Internal ticket has been created to assist with your issue. Thanks!
Hi @Sunlost
@Sunlost I can see you state you've tried this:
but I suggest double-checking that you've set this for the correct device N (as shown by
Hi @schung-amd -- thanks for working on reproducing the issue. I appreciate it. I've been setting the power state to For reference, the experimental workload I'm running is launching an instance of GPT2 on the device and running some basic inference. I'm running I've also tried running the same experiment as root with no change in results. The output from the sample program counter collection test still shows the same weird errors regarding Let me know if I can provide any additional information. Thanks!
Thanks for double-checking! I'll see if I can repro with two GPUs. Regarding the
Hi, I've been attempting to use `rocprofv3` for counter collection for a basic PyTorch workload, but I'm running into odd issues where various counters (both basic and derived) always output zero (so far I've tried `GPU_UTIL`, `FETCH_SIZE`, and `OccupancyPercent`). Some counters still work though, like `SQ_WAVES`. I tried running my test programs both in the native Ubuntu 24.10 OS and the provided Ubuntu 24.04.1 Docker container for PyTorch development (`rocm/pytorch:rocm6.3.2_ubuntu24.04_py3.12_pytorch_release_2.4.0`, image ID: `c99bb965c26e`) and the counter values were consistently zero in both environments. I've run `amd-smi process` while my Python scripts are running and can see that my script is running on the GPU, so at the very least `GPU_UTIL` should be non-zero at some point in the output?

To start diagnosing the problem, I ran the set of sample programs in `samples/` and they all passed, but the verbose output for test 7 (`counter-collection-print-functional-counters`) showed a number of errors regarding missing values for any counter that reads from `DIMENSION_INSTANCE[1:31]`. Counters that read exclusively from `DIMENSION_INSTANCE[0]` like `GRBM_COUNT` show no missing values. Here's the full output from my latest run of the test suite in the provided Docker container. Test 7 begins on line 149. The first error due to a "missing" value is on line 417.

This leads me to my primary questions:
- `rocprofv3` profiler runs?
- `DIMENSION_INSTANCE[1:31]` in test 7 intended behavior, or is this potentially why the zero counter values are occurring?

Additional things I've tried, including various recommended setup steps:
- `libdw-dev` in the Docker container via `apt install` to be able to build and run the `samples/` program suite.
- `video` and `render` groups.
- `render` group, so I have not been able to add it to that. I'm not sure if this matters, since my base user account is in both groups and is experiencing the same issue with counter values anyway. Regardless, this is only relevant on Ubuntu 20.04 (as stated here in step 1), right?
- `sudo amd-smi set -g <N> -l stable_std` as suggested in this repo's README.md.
- `PERFMON` capabilities via an additional `--cap-add=PERFMON` flag in the Docker container's startup command.
- `samples/` tests (I believe 8 or 9?) flagged that the Docker container didn't have this capability, so I added it in the hopes that it would fix this issue. The only noticeable outcome is that the test no longer flags the missing capability.
- `--device=/dev/dri/renderD128` or `--device=/dev/dri/renderD129`.
- `LD_LIBRARY_PATH` = `/opt/rocm-6.3.2/lib`
- `ROCM_PATH` = `/opt/rocm`
- `/etc/ld.so.conf.d/rocm.conf` with the contents `/opt/rocm/lib` and `/opt/rocm/lib64` and running `sudo ldconfig`.

Machine specs:
- `uname -srmv`: `Linux 6.11.0-14-generic #15-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 10 23:48:25 UTC 2025 x86_64`
- `rocminfo --support` output run on bare metal here.
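
For anyone trying to reproduce this, a couple of the setup steps listed above (membership in the `video` and `render` groups, and the `LD_LIBRARY_PATH` and `ROCM_PATH` variables) can be re-checked from inside the container with a short script. This is only a rough sketch using the Python standard library. The helper names `missing_groups`, `unset_variables`, and `current_group_names` are hypothetical and not part of ROCm or rocprofv3, and it only checks that the variables are set, not that their values are correct.

```python
import grp
import os


def missing_groups(user_groups, required=("video", "render")):
    """Return the required group names that are absent from user_groups."""
    present = set(user_groups)
    return [name for name in required if name not in present]


def unset_variables(env, names=("LD_LIBRARY_PATH", "ROCM_PATH")):
    """Return the variable names that are missing or empty in env."""
    return [name for name in names if not env.get(name)]


def current_group_names():
    """Names of the groups reported by os.getgroups() for this process (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid with no entry in the group database (can happen in containers).
            names.append(str(gid))
    return names


if __name__ == "__main__":
    print("missing groups:", missing_groups(current_group_names()))
    print("unset variables:", unset_variables(os.environ))
```

Two empty lists mean those particular steps are in place for the user running the script. It says nothing about the power state, the `PERFMON` capability, or the device mapping, so those still have to be checked separately.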