System Info
transformers version: 4.48.0

Who can help?
@ArthurZucker

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Hi all!
I failed to run the RLHF example script from the vLLM project. The code is exactly the same as on the vLLM docs page: https://docs.vllm.ai/en/latest/getting_started/examples/rlhf.html
The error messages are:
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] Traceback (most recent call last):
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 566, in execute_method
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] return run_method(target, method, args, kwargs)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] return func(*args, **kwargs)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in init_device
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] torch.cuda.set_device(self.device)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] torch._C._cuda_setDevice(device)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] torch._C._cuda_init()
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] RuntimeError: No CUDA GPUs are available
(MyLLM pid=70946) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MyLLM.__init__() (pid=70946, ip=11.163.37.230, actor_id=202b48118215566c51057a0101000000, repr=<test_ray_vllm_rlhf.MyLLM object at 0x7fb7453669b0>)
(MyLLM pid=70946) File "/data/cfs/workspace/test_ray_vllm_rlhf.py", line 96, in __init__
(MyLLM pid=70946) super().__init__(*args, **kwargs)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 1051, in inner
(MyLLM pid=70946) return fn(*args, **kwargs)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 242, in __init__
(MyLLM pid=70946) self.llm_engine = self.engine_class.from_engine_args(
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 484, in from_engine_args
(MyLLM pid=70946) engine = cls(
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
(MyLLM pid=70946) self.model_executor = executor_class(vllm_config=vllm_config, )
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 262, in __init__
(MyLLM pid=70946) super().__init__(*args, **kwargs)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 51, in __init__
(MyLLM pid=70946) self._init_executor()
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 90, in _init_executor
(MyLLM pid=70946) self._init_workers_ray(placement_group)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 355, in _init_workers_ray
(MyLLM pid=70946) self._run_workers("init_device")
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 476, in _run_workers
(MyLLM pid=70946) self.driver_worker.execute_method(sent_method, *args, **kwargs)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 575, in execute_method
(MyLLM pid=70946) raise e
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 566, in execute_method
(MyLLM pid=70946) return run_method(target, method, args, kwargs)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(MyLLM pid=70946) return func(*args, **kwargs)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in init_device
(MyLLM pid=70946) torch.cuda.set_device(self.device)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(MyLLM pid=70946) torch._C._cuda_setDevice(device)
(MyLLM pid=70946) File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
(MyLLM pid=70946) torch._C._cuda_init()
(MyLLM pid=70946) RuntimeError: No CUDA GPUs are available
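For context, the relevant part of the example script looks roughly like this. This is a condensed sketch paraphrased from the linked docs page; only the MyLLM wrapper and the file name test_ray_vllm_rlhf.py appear in the traceback above, so the model name and resource numbers should be treated as placeholders:

```python
import os

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from vllm import LLM


class MyLLM(LLM):
    def __init__(self, *args, **kwargs):
        # Ray sets CUDA_VISIBLE_DEVICES for the actor process; the example
        # removes it so vLLM can use the GPUs reserved by the placement group.
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        super().__init__(*args, **kwargs)


ray.init()
pg = placement_group([{"GPU": 1, "CPU": 0}] * 2)
ray.get(pg.ready())

llm = ray.remote(
    num_cpus=0,
    num_gpus=0,  # GPUs come from the placement group, not the actor itself
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
        placement_group_capture_child_tasks=True,
    ),
)(MyLLM).remote(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)
```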
I found that with transformers==4.47.1 the script runs normally. However, with transformers==4.48.0, 4.48.1, and 4.49.0 I get the error messages above. I then checked the pip environments with pip list and found that only the transformers version differs.
I've also tried changing the vllm version between 0.7.0 and 0.7.2; the behavior is the same.
Related Ray issues:

Expected behavior
The script runs normally.
Hi @ArthurinRUC, the issue clearly seems to be occurring in torch, not in transformers! For some reason, Torch is unable to detect your GPU, possibly because of a mismatch of versions between torch and CUDA? I suspect this is an environment issue that we can't really debug for you!
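A quick way to check that torch/CUDA pairing directly, outside of Ray, is a minimal diagnostic like the following (a generic sketch, not output from this report):

```python
import torch

# Which CUDA version this torch build targets, and whether it can see a GPU.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```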
In my experiments the torch version stays at 2.5.1 and the CUDA toolkit/user-mode driver version stays at 12.4; only the transformers version changes. Besides, I've used this environment to run many SFT/pretrain/Ray tasks and they were fine. So maybe some new code introduced in transformers>=4.48 triggers the problem I'm encountering?
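To narrow it down further, it can help to print what torch sees from inside a Ray worker that requests no GPUs itself, which is how the example sketch above creates its LLM actor. A minimal sketch, with the num_gpus=0 setting assumed from that example and everything else generic Ray/torch:

```python
import os

import ray
import torch


@ray.remote(num_gpus=0)  # mirrors the example's LLM actor, which reserves no GPU itself
def cuda_report():
    # Ray manages CUDA_VISIBLE_DEVICES for its workers; if it ends up empty,
    # torch.cuda reports no devices regardless of the transformers version.
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


ray.init()
print(ray.get(cuda_report.remote()))
```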