
[Bugs] RuntimeError: No CUDA GPUs are available in transformers v4.48.0 or above when running Ray RLHF example #36295

ArthurinRUC opened this issue Feb 20, 2025 · 2 comments

ArthurinRUC commented Feb 20, 2025

System Info

  • transformers version: 4.48.0
  • Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script?: Yes
  • GPU type: NVIDIA A800-SXM4-80GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi all!

I failed to run the vLLM project's RLHF example script. The code is exactly the same as on the vLLM docs page: https://docs.vllm.ai/en/latest/getting_started/examples/rlhf.html

The error messages are:

(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] Traceback (most recent call last):
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 566, in execute_method
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     return run_method(target, method, args, kwargs)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     return func(*args, **kwargs)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in init_device
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     torch.cuda.set_device(self.device)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     torch._C._cuda_setDevice(device)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     torch._C._cuda_init()
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] RuntimeError: No CUDA GPUs are available
(MyLLM pid=70946) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MyLLM.__init__() (pid=70946, ip=11.163.37.230, actor_id=202b48118215566c51057a0101000000, repr=<test_ray_vllm_rlhf.MyLLM object at 0x7fb7453669b0>)
(MyLLM pid=70946)   File "/data/cfs/workspace/test_ray_vllm_rlhf.py", line 96, in __init__
(MyLLM pid=70946)     super().__init__(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 1051, in inner
(MyLLM pid=70946)     return fn(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 242, in __init__
(MyLLM pid=70946)     self.llm_engine = self.engine_class.from_engine_args(
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 484, in from_engine_args
(MyLLM pid=70946)     engine = cls(
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
(MyLLM pid=70946)     self.model_executor = executor_class(vllm_config=vllm_config, )
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 262, in __init__
(MyLLM pid=70946)     super().__init__(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 51, in __init__
(MyLLM pid=70946)     self._init_executor()
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 90, in _init_executor
(MyLLM pid=70946)     self._init_workers_ray(placement_group)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 355, in _init_workers_ray
(MyLLM pid=70946)     self._run_workers("init_device")
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 476, in _run_workers
(MyLLM pid=70946)     self.driver_worker.execute_method(sent_method, *args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 575, in execute_method
(MyLLM pid=70946)     raise e
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 566, in execute_method
(MyLLM pid=70946)     return run_method(target, method, args, kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(MyLLM pid=70946)     return func(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in init_device
(MyLLM pid=70946)     torch.cuda.set_device(self.device)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(MyLLM pid=70946)     torch._C._cuda_setDevice(device)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
(MyLLM pid=70946)     torch._C._cuda_init()
(MyLLM pid=70946) RuntimeError: No CUDA GPUs are available

I found that with transformers==4.47.1 the script runs normally. However, with transformers==4.48.0, 4.48.1, and 4.49.0 I get the error messages above. Comparing the environments with pip list shows that only the transformers version differs.

I've also tried changing the vllm version between 0.7.0 and 0.7.2; the behavior is the same.

Related Ray issues:

Expected behavior

The script runs normally.

@Rocketknight1 (Member)

Hi @ArthurinRUC, the issue clearly seems to be occurring in torch, not in transformers! For some reason, torch is unable to detect your GPU, possibly because of a version mismatch between torch and CUDA? I suspect this is an environment issue that we can't really debug for you!


ArthurinRUC commented Feb 21, 2025

In my experiments the torch version stays at 2.5.1 and the CUDA toolkit/user-mode driver version stays at 12.4; only the transformers version changes. Besides, I've used this environment to run many SFT/pretrain/Ray tasks without problems. So perhaps some new code introduced in transformers>=4.48 triggers the problem I'm encountering?
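One CPU-safe way to probe this hypothesis (a sketch, not a confirmed diagnosis): Ray actors start with CUDA_VISIBLE_DEVICES empty and only get it exported after scheduling, and torch may cache device state the first time the CUDA runtime is queried. If an import (for example something pulled in by a newer transformers) queries the device count before Ray exports the variable, the actor could be pinned at zero visible GPUs. The snippet below uses only torch and mimics that ordering; the specific transformers code path responsible is an assumption, not established here.

```python
import os

# Mimic a freshly started Ray actor: no GPU assigned yet at import time.
# (Ray exports CUDA_VISIBLE_DEVICES for the actor only after scheduling.)
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

# Any code that queries the CUDA runtime now sees zero devices. If torch
# caches this result, updating CUDA_VISIBLE_DEVICES later inside the actor
# has no effect and torch.cuda.set_device() fails with
# "RuntimeError: No CUDA GPUs are available".
count_before = torch.cuda.device_count()

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # what Ray does after scheduling
count_after = torch.cuda.device_count()

print(f"device_count before={count_before}, after={count_after}")
# On a GPU machine, if both counts stay 0, something initialized or cached
# CUDA state too early. To point the finger at transformers, repeat the run
# with `import transformers` placed between the two queries, once with
# 4.47.1 and once with 4.48.0, and compare.
```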
