Bug in multi-GPU training: single-GPU training (ng=1) works, but multi-GPU training hits the following problem #60

Open
Leeon-K opened this issue Jan 30, 2023 · 9 comments


@Leeon-K

Leeon-K commented Jan 30, 2023

2023-01-30 06:57:42,675-rk0-launch.py#86:Rank 0 initialization finished.
2023-01-30 06:57:42,678-rk0-launch.py#86:Rank 1 initialization finished.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/changkang.li/EOD/up/main.py", line 28, in <module>
    main()
  File "/home/changkang.li/EOD/up/main.py", line 22, in main
    args.run(args)
  File "/home/changkang.li/EOD/up/commands/train.py", line 163, in _main
    launch(main, args.num_gpus_per_machine, args.num_machines, args=args, start_method=args.fork_method)
  File "/home/changkang.li/EOD/up/utils/env/launch.py", line 52, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/EOD/up/utils/env/launch.py", line 113, in _distributed_worker
    dist_helper.barrier()
  File "/home//EOD/up/utils/env/dist_helper.py", line 139, in barrier
    dist_barrier(*args, **kwargs)
  File "/home/EOD/up/utils/env/dist_helper.py", line 124, in dist_barrier
    dist.barrier(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
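
One commonly reported cause of `CUDA error: initialization error` inside a worker process is that the workers were created with the `fork` start method after the parent had already initialized the GPU runtime; PyTorch's multiprocessing notes state that using CUDA in subprocesses requires `spawn` or `forkserver`. Nothing in this thread confirms that this is the cause here (the traceback only shows that `launch` receives `start_method=args.fork_method`), so the following is only a sketch of the check one might try. The helper name `cuda_safe_start_method` is hypothetical and not part of the repository.

```python
import multiprocessing as mp


def cuda_safe_start_method(requested: str, uses_cuda: bool = True) -> str:
    """Hypothetical helper: pick a start method that is safe for GPU workers."""
    if uses_cuda and requested == "fork":
        # A forked child inherits the parent's GPU runtime state, which the
        # CUDA runtime does not support. "spawn" starts a fresh interpreter
        # and is available on every platform.
        return "spawn"
    if requested not in mp.get_all_start_methods():
        raise ValueError(f"unsupported start method: {requested!r}")
    return requested


if __name__ == "__main__":
    method = cuda_safe_start_method("fork")
    # A context bound to the chosen method; workers would be created from it.
    ctx = mp.get_context(method)
    print("start method:", ctx.get_start_method())
```

If the launcher's start method is configurable, passing the returned value as `start_method` to `torch.multiprocessing.start_processes` would be the natural place to try it.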

@Nina-yang

I ran into the same issue and look forward to any suggestions.

@yqyao

yqyao commented Jan 31, 2023

Could you share your environment details, such as the PyTorch version, CUDA version, etc.? @Nina-yang @Lick0920
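
A minimal sketch of how the requested details could be gathered in one go. The function name `collect_env` and the dictionary keys are illustrative choices, not something the repository provides; the script degrades gracefully when `torch` or `torchvision` is not installed.

```python
import importlib
import platform


def collect_env() -> dict:
    """Gather interpreter and PyTorch details for a bug report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in ("torch", "torchvision"):
        try:
            module = importlib.import_module(name)
            info[name] = getattr(module, "__version__", "unknown")
        except (ImportError, OSError):
            # OSError covers installs whose shared libraries fail to load.
            info[name] = "not installed"
    if info["torch"] != "not installed":
        import torch

        # torch.version.cuda is None on CPU-only and ROCm builds.
        info["cuda"] = torch.version.cuda
        # The hip attribute may be missing on older builds, hence getattr.
        info["hip"] = getattr(torch.version, "hip", None)
        info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own report script, which can be run with `python -m torch.utils.collect_env`.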

@Nina-yang

> Could you share your environment details, such as the PyTorch version, CUDA version, etc.? @Nina-yang @Lick0920

torch version: 1.8.1+cu111, CUDA version: 11.1. My machine has 8 Tesla V100-SXM2 32G GPUs, but because of this error I can only use 1 GPU for now, and training takes too long.

@yqyao

yqyao commented Jan 31, 2023

Could you try Python 3.6.9 (the version we use)? Meanwhile, we will try to reproduce your issue on our machines. @Nina-yang

@Nina-yang

> Could you try Python 3.6.9 (the version we use)? Meanwhile, we will try to reproduce your issue on our machines. @Nina-yang

Thanks. For reference, my Python version is 3.7.3.

@Leeon-K
Author

Leeon-K commented Jan 31, 2023

> Could you try Python 3.6.9 (the version we use)? Meanwhile, we will try to reproduce your issue on our machines. @Nina-yang

My environment is as follows: Python 3.8.10, torch 1.10.1+rocm4.1, torchvision 0.11.2+rocm4.1.
I tried Python 3.6.9, but a new error occurred:

    from __future__ import annotations
    ^
    SyntaxError: future feature annotations is not defined

I searched for a solution to this problem; the suggested fix is to upgrade Python to 3.7 or later...
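
For context, `from __future__ import annotations` (PEP 563) was first accepted in Python 3.7, so any package that uses it fails to import on 3.6 with exactly this `SyntaxError`. A small sketch of a check that could be run in the target environment before installing dependencies; both function names are made up for illustration.

```python
import sys

# First Python release that accepts `from __future__ import annotations`.
MIN_VERSION = (3, 7)


def supports_future_annotations(version_info=None) -> bool:
    """Return True if the given (or running) Python accepts the import."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


def can_compile_future_annotations() -> bool:
    """Ask the running interpreter directly by compiling the statement."""
    try:
        compile("from __future__ import annotations\n", "<check>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print("version check:", supports_future_annotations())
    print("compile check:", can_compile_future_annotations())
```

On an interpreter older than 3.7, the alternative to upgrading is to install a release of the offending package that still supports that Python version; which release that is should be checked against the package's own release notes.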

@yqyao

yqyao commented Jan 31, 2023

Where is that code in our repo? I think you need to re-install PyTorch. @Lick0920

@Leeon-K
Author

Leeon-K commented Jan 31, 2023

Thank you, I will try to re-install pytorch.
Traceback (most recent call last):
  File "/home/anaconda3/envs/up_py3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/anaconda3/envs/up_py3.6/lib/python3.6/runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/home/anaconda3/envs/up_py3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/home/EOD/up/__init__.py", line 21, in <module>
    from .commands import *
  File "/home/EOD/up/commands/__init__.py", line 5, in <module>
    from .flops import Flops # noqa
  File "/home/EOD/up/commands/flops.py", line 5, in <module>
    from prettytable import PrettyTable
  File "/home/.local/lib/python3.6/site-packages/prettytable-3.6.0-py3.6.egg/prettytable/__init__.py", line 1
    from __future__ import annotations

@hxy0307

hxy0307 commented Apr 17, 2023

> > Could you try Python 3.6.9 (the version we use)? Meanwhile, we will try to reproduce your issue on our machines. @Nina-yang
>
> My environment is as follows: Python 3.8.10, torch 1.10.1+rocm4.1, torchvision 0.11.2+rocm4.1.
> I tried Python 3.6.9, but a new error occurred:
>
>     from __future__ import annotations
>     ^
>     SyntaxError: future feature annotations is not defined
>
> I searched for a solution to this problem; the suggested fix is to upgrade Python to 3.7 or later...

Hello, I am also using an AMD GPU, and I found that numba cannot run on it: running the code raises a CUDA_ERROR_NOT_INITIALIZED error. Have you run into the same error?
