Bug in multi-GPU training: with ng=1 a single GPU trains fine, but the following error occurs on multiple GPUs #60
Comments
I'm hitting the same issue, looking forward to any suggestions.
Maybe you could provide your environment details, such as the PyTorch version, CUDA version, etc. @Nina-yang @Lick0920
torch version: 1.8.1+cu111, CUDA version 11.1. I have a machine with 8 Tesla V100-SXM2 32G GPUs, but because of this error I'm only using 1 GPU now, and training takes too long to finish.
Maybe you can try Python 3.6.9 (our version), and we will try to reproduce your issue on our machines. @Nina-yang
Thank you, I will try to re-install PyTorch.
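Before reinstalling, it may be worth a quick sanity check of the existing install. A minimal sketch (nothing project-specific assumed) that prints the PyTorch version, the CUDA build it was compiled against, and whether CUDA initializes and sees all GPUs:

```python
# Quick environment check for the PyTorch / CUDA setup.
import torch

print(torch.__version__)          # e.g. 1.8.1+cu111
print(torch.version.cuda)         # CUDA version PyTorch was built with, e.g. 11.1
print(torch.cuda.is_available())  # should be True on a working install
print(torch.cuda.device_count())  # should report all 8 V100s
```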
2023-01-30 06:57:42,675-rk0-launch.py#86:Rank 0 initialization finished.
2023-01-30 06:57:42,678-rk0-launch.py#86:Rank 1 initialization finished.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/changkang.li/EOD/up/main.py", line 28, in
main()
File "/home/changkang.li/EOD/up/main.py", line 22, in main
args.run(args)
File "/home/changkang.li/EOD/up/commands/train.py", line 163, in _main
launch(main, args.num_gpus_per_machine, args.num_machines, args=args, start_method=args.fork_method)
File "/home/changkang.li/EOD/up/utils/env/launch.py", line 52, in launch
mp.start_processes(
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/EOD/up/utils/env/launch.py", line 113, in _distributed_worker
dist_helper.barrier()
File "/home//EOD/up/utils/env/dist_helper.py", line 139, in barrier
dist_barrier(*args, **kwargs)
File "/home/EOD/up/utils/env/dist_helper.py", line 124, in dist_barrier
dist.barrier(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
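For reference, a "CUDA error: initialization error" raised inside a worker process often means CUDA was already initialized in the parent process before the workers were forked; a CUDA context does not survive fork(). The traceback above shows the launcher passing `start_method=args.fork_method` into `mp.start_processes`, so one thing to check is whether the run is using "fork". A minimal sketch of the usual workaround (hypothetical standalone launcher, not the project's actual launch.py) is to start workers with the "spawn" start method and only touch CUDA inside each worker:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size):
    # Bind this worker to its own GPU; the first CUDA call happens in the child,
    # not in the parent process.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",
        rank=rank,
        world_size=world_size,
    )
    dist.barrier()  # should no longer raise an initialization error
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    # "spawn" creates fresh interpreters, so no CUDA state is inherited from the parent.
    mp.start_processes(_worker, args=(world_size,), nprocs=world_size, start_method="spawn")
```

If the project's config exposes the start method (as `args.fork_method` suggests), switching it to "spawn" may be enough without code changes; otherwise make sure nothing in the parent process calls into CUDA before the workers are created.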