Reading TorchProfiler after run #2323

Open
fabiogeraci opened this issue Jan 31, 2025 · 6 comments

@fabiogeraci

Under profiling_outputs/iteration_10/ I have the files below. How do I read them in and display them in a UI?

-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank9_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank9_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank8_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 62K Jan 30 20:26 rank8_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank7_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank7_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank6_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank6_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank5_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank5_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank4_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank4_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank3_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank3_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank2_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank2_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank1_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank1_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank15_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank15_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank14_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank14_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank13_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank13_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 15M Jan 30 20:26 rank12_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 62K Jan 30 20:26 rank12_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank11_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank11_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank10_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 62K Jan 30 20:26 rank10_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 16M Jan 30 20:26 rank0_stacks.txt
-rw-r--r-- 1 fg12 ssg-isg 184K Jan 30 20:26 rank0_memory-timeline.html
-rw-r--r-- 1 fg12 ssg-isg 5.7M Jan 30 20:26 rank0_memory_snapshot.pickle
-rw-r--r-- 1 fg12 ssg-isg 63K Jan 30 20:26 rank0_key_averages.txt
-rw-r--r-- 1 fg12 ssg-isg 29M Jan 30 20:26 r0-2025-1-30-20-25.1738268754161971946.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26M Jan 30 20:26 r0-2025-1-30-20-25.1738268752786787917.pt.trace.json.gz

@SalmanMohammadi
Collaborator

SalmanMohammadi commented Jan 31, 2025

Hey @fabiogeraci! Try loading your trace.json file into a profile viewer such as the Chrome trace viewer (chrome://tracing) to view the trace. You can also view the memory snapshot (memory_snapshot.pickle) at https://pytorch.org/memory_viz. For more on using the snapshot visualizer, check out this tutorial: https://pytorch.org/docs/stable/torch_cuda_memory.html#using-the-visualizer :)
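
For anyone who wants a quick look at the snapshot outside the browser, here is a minimal sketch that unpickles the file and reports its top-level layout. It assumes the snapshot is an ordinary pickle of built-in containers that loads without extra dependencies; `summarize_snapshot` is a hypothetical helper name, and the keys you see depend on the PyTorch version that wrote the file.

```python
import pickle
from pathlib import Path


def summarize_snapshot(path):
    """Load a pickled memory snapshot and describe its top-level layout."""
    with Path(path).open("rb") as f:
        # Only unpickle files you produced yourself; pickle can run code.
        snapshot = pickle.load(f)
    summary = {}
    if isinstance(snapshot, dict):
        for key, value in snapshot.items():
            # Record the container type and its length when it has one.
            size = len(value) if hasattr(value, "__len__") else None
            summary[key] = (type(value).__name__, size)
    return summary


if __name__ == "__main__":
    # File name taken from the listing in this issue; adjust the directory.
    target = Path("profiling_outputs/iteration_10/rank0_memory_snapshot.pickle")
    if target.exists():
        for key, (kind, size) in summarize_snapshot(target).items():
            print(f"{key}: {kind}, {size} entries")
```

This only shows what the file contains; the visualizer at https://pytorch.org/memory_viz is still the intended way to explore it.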

@fabiogeraci
Author

https://pytorch.org/memory_viz worked very well. Is there a practical guide to figuring out what on earth all of that means?

@fabiogeraci
Author

fabiogeraci commented Feb 3, 2025

@SalmanMohammadi If I am not mistaken, the trace files are written in the format shown below. Do I have to change anything in config.yaml to get a single file?

Does tensorboard need to be installed in the environment for trace.json to be saved correctly?

-rw-r--r-- 1 fg12 ssg-isg 26142080 Jan 9 12:11 r0-2025-1-9-12-11.1736424669783026023.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26855130 Jan 9 12:11 r0-2025-1-9-12-11.1736424670392946912.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26754858 Jan 9 12:11 r0-2025-1-9-12-11.1736424670363106605.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26847942 Jan 9 12:11 r0-2025-1-9-12-11.1736424670354319226.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26867271 Jan 9 12:11 r0-2025-1-9-12-11.1736424670332115537.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26859072 Jan 9 12:11 r0-2025-1-9-12-11.1736424670329751024.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26856984 Jan 9 12:11 r0-2025-1-9-12-11.1736424670305485876.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26763247 Jan 9 12:11 r0-2025-1-9-12-11.1736424670295915005.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26758040 Jan 9 12:11 r0-2025-1-9-12-11.1736424670266124533.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26676209 Jan 9 12:11 r0-2025-1-9-12-11.1736424670234061948.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26740513 Jan 9 12:11 r0-2025-1-9-12-11.1736424670197487312.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26643692 Jan 9 12:11 r0-2025-1-9-12-11.1736424670124682346.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26840276 Jan 9 12:11 r0-2025-1-9-12-11.1736424670417433945.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26848380 Jan 9 12:11 r0-2025-1-9-12-11.1736424670397910098.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 26834902 Jan 9 12:11 r0-2025-1-9-12-11.1736424670373898850.pt.trace.json.gz
-rw-r--r-- 1 fg12 ssg-isg 28228080 Jan 9 12:11 r0-2025-1-9-12-11.1736424670918257879.pt.trace.json.gz
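
All sixteen files in this listing share the prefix r0-2025-1-9-12-11 and differ only in the long numeric suffix. As a rough sketch, the snippet below groups trace files by that prefix and orders them by the suffix, so a single file can be picked for a viewer. The name pattern is inferred from the listing alone, reading the suffix as a timestamp is an assumption, and `group_traces` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Pattern inferred from the listing: <worker>.<numeric suffix>.pt.trace.json.gz
TRACE_NAME = re.compile(r"^(?P<worker>.+)\.(?P<stamp>\d+)\.pt\.trace\.json(\.gz)?$")


def group_traces(filenames):
    """Map each worker prefix to its trace files, ordered by numeric suffix."""
    groups = defaultdict(list)
    for name in filenames:
        match = TRACE_NAME.match(name)
        if match:  # silently skip files that are not traces
            groups[match["worker"]].append((int(match["stamp"]), name))
    return {w: [n for _, n in sorted(items)] for w, items in groups.items()}
```

Calling `group_traces(os.listdir("profiling_outputs/iteration_10"))` would return one entry per prefix. This does not merge anything; it only makes it easier to see which files belong together.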

@fabiogeraci fabiogeraci reopened this Feb 3, 2025
@fabiogeraci
Author

I got the TensorBoard UI up, but I do not see anything under Trace. This is the command I ran and its output:

tensorboard --logdir iteration_10 --load_fast=false --bind_all --port=29701

TensorFlow installation not found - running with reduced feature set.
I0204 12:45:57.793680 140206169708096 plugin.py:429] Monitor runs begin
I0204 12:45:57.794994 140206169708096 plugin.py:444] Find run directory /lustre/scratch127/admin/isg/fg12/torchtune/7B_model_profiled/2x8_tcp_jobid/209665/profiling_outputs/iteration_10
I0204 12:45:57.795410 140206152906304 plugin.py:493] Load run iteration_10
I0204 12:45:57.816983 140206152906304 loader.py:57] started all processing
TensorBoard 2.18.0 at http://farm22-head2.internal.sanger.ac.uk:29701/ (Press CTRL+C to quit)
WARNING: Logging before flag parsing goes to stderr.
E0204 12:46:15.595593 139662662558464 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549358072.087,3028550273219.7217) Child:(ProfilerStep#8,3028549410793.485,3028550273654.7188)
WARNING: Logging before flag parsing goes to stderr.
E0204 12:46:16.360068 139663381971712 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549358145.757,3028550273038.821) Child:(ProfilerStep#8,3028549410950.226,3028550273690.892)
WARNING: Logging before flag parsing goes to stderr.
WARNING: Logging before flag parsing goes to stderr.
E0204 12:46:16.519148 140529890038528 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549360424.55,3028550273751.8257) Child:(ProfilerStep#8,3028549414443.456,3028550274187.833)
E0204 12:46:16.519102 140238289441536 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549363417.05,3028550273229.5947) Child:(ProfilerStep#8,3028549411205.964,3028550273654.21)
WARNING: Logging before flag parsing goes to stderr.
E0204 12:46:16.956684 139692676543232 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549357883.68,3028550273445.2563) Child:(ProfilerStep#8,3028549409392.773,3028550273857.044)
WARNING: Logging before flag parsing goes to stderr.
E0204 12:46:17.108336 140146848191232 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549359829.513,3028550273636.999) Child:(ProfilerStep#8,3028549410035.08,3028550274062.557)
WARNING: Logging before flag parsing goes to stderr.
E0204 12:46:17.170794 140240845267712 op_tree.py:134] Error in input data: ranges on the same thread should not intersect!Father:(aten::_local_scalar_dense,3028549360603.504,3028550273535.1157) Child:(ProfilerStep#8,3028549411249.045,3028550273964.88)
W0204 12:46:35.831752 140206143448640 security_validator.py:60] In 3.0, this warning will become an error:
Requires default-src for Content-Security-Policy
W0204 12:46:42.873453 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.874097 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.875085 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.876468 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.876963 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.877645 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.878149 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.878630 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.879178 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.879693 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.880201 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.880809 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.881299 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.881775 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.882272 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
W0204 12:46:42.882727 140206152906304 run_generator.py:458] cannot parse node name from worker name r0-2025-2-4-9-22
I0204 12:46:42.883775 140206152906304 plugin.py:497] Run iteration_10 loaded
I0204 12:46:42.883978 140206161315392 plugin.py:467] Add run iteration_10
W0204 12:48:37.383477 140206143448640 security_validator.py:60] In 3.0, this warning will become an error:
Requires default-src for Content-Security-Policy
W0204 12:48:37.592616 140206143448640 security_validator.py:60] In 3.0, this warning will become an error:
Requires default-src for Content-Security-Policy
W0204 13:04:41.494575 140206143448640 security_validator.py:60] In 3.0, this warning will become an error:
Requires default-src for Content-Security-Policy
W0204 13:04:41.697061 140206143448640 security_validator.py:60] In 3.0, this warning will become an error:
Requires default-src for Content-Security-Policy

@fabiogeraci
Author

Is there a way to set up execution_trace_observer?

@fabiogeraci
Author

I ran gunzip -c r0-2025-2-4-9-56.1738663014233131587.pt.trace.json.gz > pt.trace.json and imported the result into chrome://tracing, but nothing is displayed.
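
One way to tell whether the problem is the file or the viewer is to count the events in the trace directly. The sketch below assumes the file follows the Chrome trace event format, with events either under a top-level "traceEvents" key or as a bare JSON list; `count_trace_events` is a hypothetical helper. It reads the .gz file directly, so no separate gunzip step is needed.

```python
import gzip
import json
from collections import Counter


def count_trace_events(path):
    """Return (total events, Counter of event categories) for a trace file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Chrome trace format: either {"traceEvents": [...]} or a bare list.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    categories = Counter(
        e.get("cat", "<none>") for e in events if isinstance(e, dict)
    )
    return len(events), categories
```

If the total is zero, the trace itself is empty; if it is large, the events are there and the issue is more likely in how the viewer loads or renders the file.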
