
[Bugs] RuntimeError: No CUDA GPUs are available in transformers v4.48.0 or above when running Ray RLHF example #36295

ArthurinRUC opened this issue Feb 20, 2025 · 2 comments

ArthurinRUC commented Feb 20, 2025

System Info

  • transformers version: 4.48.0
  • Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script?: Yes
  • GPU type: NVIDIA A800-SXM4-80GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi all!

I failed to run the vLLM project's RLHF example script. The code is exactly the same as on the vLLM docs page: https://docs.vllm.ai/en/latest/getting_started/examples/rlhf.html

The error messages are:

(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] Traceback (most recent call last):
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 566, in execute_method
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     return run_method(target, method, args, kwargs)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     return func(*args, **kwargs)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in init_device
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     torch.cuda.set_device(self.device)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     torch._C._cuda_setDevice(device)
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574]     torch._C._cuda_init()
(MyLLM pid=70946) ERROR 02-20 15:38:34 worker_base.py:574] RuntimeError: No CUDA GPUs are available
(MyLLM pid=70946) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MyLLM.__init__() (pid=70946, ip=11.163.37.230, actor_id=202b48118215566c51057a0101000000, repr=<test_ray_vllm_rlhf.MyLLM object at 0x7fb7453669b0>)
(MyLLM pid=70946)   File "/data/cfs/workspace/test_ray_vllm_rlhf.py", line 96, in __init__
(MyLLM pid=70946)     super().__init__(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 1051, in inner
(MyLLM pid=70946)     return fn(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 242, in __init__
(MyLLM pid=70946)     self.llm_engine = self.engine_class.from_engine_args(
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 484, in from_engine_args
(MyLLM pid=70946)     engine = cls(
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
(MyLLM pid=70946)     self.model_executor = executor_class(vllm_config=vllm_config, )
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 262, in __init__
(MyLLM pid=70946)     super().__init__(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 51, in __init__
(MyLLM pid=70946)     self._init_executor()
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 90, in _init_executor
(MyLLM pid=70946)     self._init_workers_ray(placement_group)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 355, in _init_workers_ray
(MyLLM pid=70946)     self._run_workers("init_device")
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 476, in _run_workers
(MyLLM pid=70946)     self.driver_worker.execute_method(sent_method, *args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 575, in execute_method
(MyLLM pid=70946)     raise e
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 566, in execute_method
(MyLLM pid=70946)     return run_method(target, method, args, kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
(MyLLM pid=70946)     return func(*args, **kwargs)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in init_device
(MyLLM pid=70946)     torch.cuda.set_device(self.device)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(MyLLM pid=70946)     torch._C._cuda_setDevice(device)
(MyLLM pid=70946)   File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
(MyLLM pid=70946)     torch._C._cuda_init()
(MyLLM pid=70946) RuntimeError: No CUDA GPUs are available

I found that with transformers==4.47.1 the script runs normally. However, with transformers==4.48.0, 4.48.1, and 4.49.0 I get the error messages above. Comparing the environments with pip list shows that only the transformers version differs.

I've also tried changing the vllm version between 0.7.0 and 0.7.2; the behavior is the same.

Related Ray issues:

Expected behavior

The script runs normally.

@Rocketknight1 (Member)

Hi @ArthurinRUC, the issue clearly seems to be occurring in torch, not in transformers! For some reason, torch is unable to detect your GPU, possibly because of a version mismatch between torch and CUDA? I suspect this is an environment issue that we can't really debug for you!


ArthurinRUC commented Feb 21, 2025

In my experiments the torch version stays at 2.5.1 and the CUDA toolkit/user-mode driver version stays at 12.4; only the transformers version changes. Besides, I've used this environment to run many SFT/pretrain/Ray tasks without problems. So perhaps some new code introduced in transformers>=4.48 triggers the problem I'm encountering?
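One CPU-safe way to probe this hypothesis (a sketch, not a confirmed diagnosis): Ray actors start with CUDA_VISIBLE_DEVICES empty and only get it exported after scheduling, and torch may cache device state the first time the CUDA runtime is queried. If an import (for example something pulled in by a newer transformers) queries the device count before Ray exports the variable, the actor could be pinned at zero visible GPUs. The snippet below uses only torch and mimics that ordering; the specific transformers code path responsible is an assumption, not established here.

```python
import os

# Mimic a freshly started Ray actor: no GPU assigned yet at import time.
# (Ray exports CUDA_VISIBLE_DEVICES for the actor only after scheduling.)
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

# Any code that queries the CUDA runtime now sees zero devices. If torch
# caches this result, updating CUDA_VISIBLE_DEVICES later inside the actor
# has no effect and torch.cuda.set_device() fails with
# "RuntimeError: No CUDA GPUs are available".
count_before = torch.cuda.device_count()

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # what Ray does after scheduling
count_after = torch.cuda.device_count()

print(f"device_count before={count_before}, after={count_after}")
# On a GPU machine, if both counts stay 0, something initialized or cached
# CUDA state too early. To point the finger at transformers, repeat the run
# with `import transformers` placed between the two queries, once with
# 4.47.1 and once with 4.48.0, and compare.
```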
