Bug in multi-GPU training: with ng=1 a single GPU trains fine, but the following error occurs on multiple GPUs #60
Comments
I'm hitting the same issue, looking forward to any suggestions.
Maybe you could provide your environment details, such as the PyTorch version, CUDA version, etc. @Nina-yang @Lick0920
torch version: 1.8.1+cu111, CUDA version 11.1. I have a machine with 8 Tesla V100-SXM2 32G GPUs, but because of this error I'm only using 1 GPU now, and training takes too long to finish.
Maybe you can try Python 3.6.9 (our version), and we will try to reproduce your issue on our machines. @Nina-yang
Thank you, I will try to re-install PyTorch.
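Before reinstalling, it may be worth a quick sanity check of the existing install. A minimal sketch (nothing project-specific assumed) that prints the PyTorch version, the CUDA build it was compiled against, and whether CUDA initializes and sees all GPUs:

```python
# Quick environment check for the PyTorch / CUDA setup.
import torch

print(torch.__version__)          # e.g. 1.8.1+cu111
print(torch.version.cuda)         # CUDA version PyTorch was built with, e.g. 11.1
print(torch.cuda.is_available())  # should be True on a working install
print(torch.cuda.device_count())  # should report all 8 V100s
```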
2023-01-30 06:57:42,675-rk0-launch.py#86:Rank 0 initialization finished.
2023-01-30 06:57:42,678-rk0-launch.py#86:Rank 1 initialization finished.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/changkang.li/EOD/up/main.py", line 28, in
main()
File "/home/changkang.li/EOD/up/main.py", line 22, in main
args.run(args)
File "/home/changkang.li/EOD/up/commands/train.py", line 163, in _main
launch(main, args.num_gpus_per_machine, args.num_machines, args=args, start_method=args.fork_method)
File "/home/changkang.li/EOD/up/utils/env/launch.py", line 52, in launch
mp.start_processes(
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/EOD/up/utils/env/launch.py", line 113, in _distributed_worker
dist_helper.barrier()
File "/home//EOD/up/utils/env/dist_helper.py", line 139, in barrier
dist_barrier(*args, **kwargs)
File "/home/EOD/up/utils/env/dist_helper.py", line 124, in dist_barrier
dist.barrier(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
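For reference, a "CUDA error: initialization error" raised inside a worker process often means CUDA was already initialized in the parent process before the workers were forked; a CUDA context does not survive fork(). The traceback above shows the launcher passing `start_method=args.fork_method` into `mp.start_processes`, so one thing to check is whether the run is using "fork". A minimal sketch of the usual workaround (hypothetical standalone launcher, not the project's actual launch.py) is to start workers with the "spawn" start method and only touch CUDA inside each worker:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size):
    # Bind this worker to its own GPU; the first CUDA call happens in the child,
    # not in the parent process.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",
        rank=rank,
        world_size=world_size,
    )
    dist.barrier()  # should no longer raise an initialization error
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    # "spawn" creates fresh interpreters, so no CUDA state is inherited from the parent.
    mp.start_processes(_worker, args=(world_size,), nprocs=world_size, start_method="spawn")
```

If the project's config exposes the start method (as `args.fork_method` suggests), switching it to "spawn" may be enough without code changes; otherwise make sure nothing in the parent process calls into CUDA before the workers are created.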