
Test flake on GPU: CUDA_ERROR_ILLEGAL_ADDRESS #8796

Open
tengyifei opened this issue Mar 5, 2025 · 2 comments
Labels: CI, testing, xla:gpu

tengyifei commented Mar 5, 2025

Running in PjRt runtime: /__w/xla/xla/pytorch/xla/test/test_profiler.py
++ command -v nvidia-smi
+ '[' -x /usr/bin/nvidia-smi ']'
+ '[' '' '!=' 0 ']'
+ PJRT_DEVICE=CUDA
+ run_coverage /__w/xla/xla/pytorch/xla/test/test_profiler.py
+ '[' 0 '!=' 0 ']'
+ python3 /__w/xla/xla/pytorch/xla/test/test_profiler.py
Epoch 1 train begin 10:10:55
| Training Device=xla:0/0 Step=0 Loss=nan Rate=36.93 GlobalRate=36.93 Time=10:10:55
Starting to trace for 5000 ms. Remaining attempt(s): 4
| Training Device=xla:0/0 Step=20 Loss=1.92684 Rate=23.05 GlobalRate=14.22 Time=10:11:18
| Training Device=xla:0/0 Step=40 Loss=1.52069 Rate=342.06 GlobalRate=27.10 Time=10:11:19
| Training Device=xla:0/0 Step=60 Loss=1.04478 Rate=469.75 GlobalRate=39.39 Time=10:11:19
| Training Device=xla:0/0 Step=80 Loss=0.40812 Rate=521.54 GlobalRate=51.11 Time=10:11:20
| Training Device=xla:0/0 Step=100 Loss=0.09717 Rate=542.68 GlobalRate=62.32 Time=10:11:21
| Training Device=xla:0/0 Step=120 Loss=0.03902 Rate=550.82 GlobalRate=73.04 Time=10:11:21
| Training Device=xla:0/0 Step=140 Loss=0.02194 Rate=555.45 GlobalRate=83.31 Time=10:11:22
| Training Device=xla:0/0 Step=160 Loss=0.01457 Rate=557.48 GlobalRate=93.16 Time=10:11:22
E0305 10:11:23.529679   29891 pjrt_stream_executor_client.cc:3050] Execution of replica 0 failed: INTERNAL: Failed to complete all kernels launched on stream 0x55890eb6ea00: CUDA error: Could not synchronize CUDA stream: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/__w/xla/xla/pytorch/xla/test/test_profiler.py", line 35, in train_worker
    test_profile_mp_mnist.train_mnist(
  File "/__w/xla/xla/pytorch/xla/test/test_profile_mp_mnist.py", line 213, in train_mnist
    train_loop_fn(train_device_loader, epoch)
  File "/__w/xla/xla/pytorch/xla/test/test_profile_mp_mnist.py", line 187, in train_loop_fn
    loss_i = loss.item()
RuntimeError: Bad StatusOr access: INTERNAL: Failed to complete all kernels launched on stream 0x55890eb6ea00: CUDA error: Could not synchronize CUDA stream: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0x600ed86)[0x7f0820d5bd86]
/lib/x86_64-linux-gnu/libc.so.6(+0x38dd0)[0x7f0a5a88bdd0]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(_ZN9torch_xla7runtime21PjRtComputationClient18ExecuteComputationERKNS0_17ComputationClient11ComputationEN4absl12lts_202308024SpanIKSt10shared_ptrINS2_4DataEEEERKSsRKNS2_25ExecuteComputationOptionsE+0x251)[0x7f0820d49b21]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0x5d3e269)[0x7f0820a8b269]
/usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so(_ZN5torch4lazy9MultiWait8CompleteERKSt8functionIFvvEE+0x1a)[0x7f09d11d485a]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(_ZN5Eigen15ThreadPoolTemplIN3tsl6thread16EigenEnvironmentEE10WorkerLoopEi+0xbe)[0x7f082a44c6be]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(_ZN4absl12lts_2023080222internal_any_invocable13RemoteInvokerILb0EvRZN3tsl6thread16EigenEnvironment12CreateThreadESt8functionIFvvEEEUlvE_JEEET0_PNS1_15TypeErasedStateEDpNS1_18ForwardedParameterIT2_E4typeE+0x48)[0x7f082a449158]
/usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so(+0xf6e8a95)[0x7f082a435a95]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7)[0x7f0a5a838ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f0a5a94eacf]
*** END MANGLED STACK TRACE ***
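Side note: the frames above demangle with binutils' c++filt. A quick sketch (assumes c++filt is installed and on PATH):

import subprocess

# Demangle a C++ symbol from the mangled stack trace above
# using binutils' c++filt.
def demangle(symbol: str) -> str:
    return subprocess.run(
        ["c++filt", symbol],
        capture_output=True, text=True, check=True).stdout.strip()

print(demangle("_ZN5torch4lazy9MultiWait8CompleteERKSt8functionIFvvEE"))
# torch::lazy::MultiWait::Complete(std::function<void ()> const&)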

Seen on https://github.com/pytorch/xla/actions/runs/13672628318/job/38228051603?pr=8785

Seen again in https://github.com/pytorch/xla/actions/runs/13688297625/job/38277579644

Here's a run on Mar 05 that got lucky and passed: https://github.com/pytorch/xla/actions/runs/13684181906/job/38265392956?pr=8788
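For anyone trying to reproduce this locally, a minimal sketch that reruns the test under PJRT_DEVICE=CUDA until it crashes (run from a pytorch/xla checkout; the retry budget is an arbitrary placeholder):

import os
import subprocess
import sys

# Hypothetical repro loop: rerun the flaky test until it exits nonzero.
env = dict(os.environ, PJRT_DEVICE="CUDA")
for attempt in range(50):
    result = subprocess.run(
        [sys.executable, "test/test_profiler.py"], env=env)
    if result.returncode != 0:
        print(f"crashed on attempt {attempt}, exit code {result.returncode}")
        break
else:
    print("no failure reproduced in 50 attempts")

If it does reproduce, rerunning the failing command under NVIDIA's compute-sanitizer should help localize which kernel makes the illegal access.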

tengyifei commented:

@ysiraichi could you help take a look?

tengyifei changed the title from "Test flake on GPU" to "Test flake on GPU: CUDA_ERROR_ILLEGAL_ADDRESS" on Mar 6, 2025
ysiraichi added the xla:gpu, testing, and CI labels on Mar 6, 2025
ysiraichi commented:

This is odd. I can't reproduce the CUDA_ERROR_ILLEGAL_ADDRESS crash on my 1-GPU machine; the only failure I get is the assertion below. Any thoughts?

======================================================================
FAIL: test_trace_and_metrics (__main__.ProfilerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "xla/test/test_profiler.py", line 100, in test_trace_and_metrics
    self._check_trace_namespace_exists(path)
  File "xla/test/test_profiler.py", line 73, in _check_trace_namespace_exists
    self.assertTrue('train_mnist' in proto_str,
AssertionError: False is not true : Expected "train_mnist" trace in: /tmp/tmp753hsfdx/plugins/profile/2025_03_06_19_59_13/localhost_52827.xplane.pb

----------------------------------------------------------------------
Ran 1 test in 66.076s
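For context, the assertion is just a substring check over the serialized profile: the xplane.pb written to the path above should contain the train_mnist trace namespace. A standalone sketch of that check (the exact decoding in the test may differ; the path is from this run and will differ locally):

# Sketch of the check in _check_trace_namespace_exists: look for the
# trace namespace string in the serialized xplane proto.
path = ("/tmp/tmp753hsfdx/plugins/profile/"
        "2025_03_06_19_59_13/localhost_52827.xplane.pb")
with open(path, "rb") as f:
    data = f.read()
print(b"train_mnist" in data)  # the test expects True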
