Tensorboard unable to capture profile for jax example #6974

Open
cfRod opened this issue Feb 17, 2025 · 1 comment
Comments

cfRod commented Feb 17, 2025

To report a problem with TensorBoard itself, please fill out the
remainder of this template.

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same
environment from which you normally run TensorFlow/TensorBoard, and
paste the output here:

/JAX/xla/xla/service/cpu/benchmarks/e2e/gemma2/keras$ python diagnose_tensorboard.py

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version c6ca9f1d004e2a1bc7c160abc43be229b82cad7e

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='ip-10-252-30-225', release='6.8.0-1021-aws', version='#23~22.04.1-Ubuntu SMP Tue Dec 10 16:50:46 UTC 2024', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '/home/../venv/gemma2-keras'

--- check: installed_packages
INFO: installed: tensorboard==2.18.0
INFO: installed: tensorflow==2.18.0
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.18.0'

--- check: tensorflow_python_version
2025-02-17 17:43:29.606821: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-17 17:43:29.616812: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1739814209.629483    7716 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739814209.632905    7716 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-17 17:43:29.644947: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: tensorflow.__version__: '2.18.0'
INFO: tensorflow.__git_version__: 'v2.18.0-rc2-4-g6550e4bd802'

--- check: tensorboard_data_server_version
INFO: data server binary: '/home/.../venv/gemma2-keras/lib/python3.10/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/../venv/gemma2-keras/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'ip-10-252-30-225.eu-west-1.compute.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=8278350, st_dev=66305, st_nlink=2, st_uid=1007, st_gid=1008, st_size=4096, st_atime=1739813704, st_mtime=1739814201, st_ctime=1739814201)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/.../venv/gemma2-keras/lib/python3.10/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.1.0
astunparse==1.6.3
certifi==2024.12.14
charset-normalizer==3.4.1
etils==1.12.0
filelock==3.16.1
flatbuffers==24.12.23
fsspec==2024.12.0
gast==0.6.0
google-pasta==0.2.0
grpcio==1.69.0
gviz-api==1.10.0
h5py==3.12.1
idna==3.10
importlib_resources==6.5.2
jax==0.4.38
jaxlib==0.4.38
Jinja2==3.1.5
kagglehub==0.3.6
keras==3.8.0
keras-hub==0.18.1
keras-nlp==0.18.1
libclang==18.1.1
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
ml-dtypes==0.4.1
mpmath==1.3.0
namex==0.0.8
networkx==3.4.2
numpy==2.0.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
opt_einsum==3.4.0
optree==0.13.1
packaging==24.2
pip==22.0.2
protobuf==4.25.6
Pygments==2.19.1
regex==2024.11.6
requests==2.32.3
rich==13.9.4
scipy==1.15.0
setuptools==59.6.0
six==1.17.0
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
tensorboard-plugin-profile==2.19.0
tensorflow==2.18.0
tensorflow-io-gcs-filesystem==0.37.1
tensorflow-text==2.18.1
termcolor==2.5.0
torch==2.5.1
tqdm==4.67.1
triton==3.1.0
typing_extensions==4.12.2
urllib3==2.3.0
Werkzeug==3.1.3
wheel==0.45.1
wrapt==1.17.0
zipp==3.21.0

Next steps

No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.

Issue description

I am running the example from the JAX profiling docs (https://docs.jax.dev/en/latest/profiling.html) on the CPU:

import jax

jax.profiler.start_trace("/tmp/tensorboard")

# Run the operations to be profiled
key = jax.random.key(0)
x = jax.random.normal(key, (5000, 5000))
y = x @ x
y.block_until_ready()

jax.profiler.stop_trace()

However, TensorBoard shows no captured trace for this default example:

Image
@penpornk (Member)

Hi @cfRod,

Unfortunately, only a few tools in TensorBoard support XLA:CPU profiling right now: the trace viewer and the graph viewer.

To see the results, select the trace viewer tool from the drop-down.

Image

You can go to the graph viewer from the timeline by clicking on the HLO op you are interested in. A link to the graph appears in the bottom right panel of the page.

Image

Example graph viewer screen:

Image

The framework op stats tool sometimes works, but it didn't in this case. We hope to fix this in the future, though I don't have a timeline yet.
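
For anyone else hitting this, it can help to first confirm that the profiler wrote something to the log directory, so that an empty capture can be told apart from a tool that simply does not support XLA:CPU. Below is a minimal sketch of such a check. It assumes the output lands under `<log_dir>/plugins/profile/<run>/`, which is my assumption about the layout and is not stated in this thread, and `find_profile_files` is a hypothetical helper name, not part of JAX or TensorBoard.

```python
from pathlib import Path


def find_profile_files(log_dir):
    """Return {run_name: [file names]} for profile runs found under log_dir.

    Assumption: captured profiles are stored as
    <log_dir>/plugins/profile/<run>/<files>. Adjust if your layout differs.
    """
    profile_root = Path(log_dir) / "plugins" / "profile"
    runs = {}
    if not profile_root.is_dir():
        # Nothing was written under the assumed location.
        return runs
    for run_dir in sorted(p for p in profile_root.iterdir() if p.is_dir()):
        runs[run_dir.name] = sorted(
            f.name for f in run_dir.iterdir() if f.is_file()
        )
    return runs


if __name__ == "__main__":
    # Same log directory as the start_trace() call in the issue description.
    runs = find_profile_files("/tmp/tensorboard")
    if not runs:
        print("No profile runs found; the trace may not have been written.")
    for run, files in runs.items():
        print(run, files)
```

If this lists one or more runs with files in them, the capture itself probably worked, and the next thing to try is running `tensorboard --logdir /tmp/tensorboard` and selecting the trace viewer as described above.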
