[GPU] Driver installation not working and Dataproc 2.2 cluster creation is failing #1239
Comments
This is due to https://github.com/cjac/initialization-actions/blob/rapids-20240806/gpu/install_gpu_driver.sh#L1077. Santosh, did you say you've tried this workaround and that it's unblocked you?
Please review and test #1240.
@cjac Yes, I tried the workaround script you mentioned, but it still breaks with a similar error on Dataproc 2.2:
-----END PGP PUBLIC KEY BLOCK-----'
sed -i -e 's:deb https:deb [signed-by=/usr/share/keyrings/mysql.gpg] https:g' /etc/apt/sources.list.d/mysql.list
@cjac I have disabled secure boot in Dataproc. Is that okay, or should we enable it for this workaround?
To use secure boot, you'll need to build a custom image. Instructions are here: https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot You do not need secure boot enabled for the workaround to function. I think you may just be missing an apt-get update after the sources.list files are cleaned up and the trust keys are written to /usr/share/keyrings.
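For readers following along, here is a minimal sketch of the sequence described above: give an apt source a `signed-by` option that points at a keyring under /usr/share/keyrings, then refresh the package index. The helper name `add_signed_by` and the sample repository line are assumptions for illustration and are not taken from install_gpu_driver.sh. The pattern is anchored at the start of the line, unlike the unanchored `sed` quoted earlier in the thread, so re-running it does not add the option twice. It assumes GNU sed, and it assumes the keyring file has already been written.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and the sample repo line are hypothetical.
set -euo pipefail

# add_signed_by LIST_FILE KEYRING
# Adds a [signed-by=KEYRING] option to "deb https" entries that lack one.
add_signed_by() {
  local list_file="$1" keyring="$2"
  sed -i -e "s:^deb https:deb [signed-by=${keyring}] https:" "${list_file}"
}

# Demonstrate on a scratch file so nothing under /etc/apt is modified.
demo_list="$(mktemp)"
echo 'deb https://repo.mysql.com/apt/debian/ bookworm mysql-8.0' > "${demo_list}"
add_signed_by "${demo_list}" /usr/share/keyrings/mysql.gpg
cat "${demo_list}"

# On a real node (as root), refresh the index after editing the sources:
#   apt-get update
```

Without the final `apt-get update`, apt keeps using the index it fetched before the sources and keys changed, which is the step suspected to be missing above.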
@cjac I tried that, but it still breaks with the same error.
I forgot that I'm pinned to 2.2.20-debian12. I'll try to make it work with the latest from the 2.2 line.
Okay, thank you. I am getting the error in 2.2.32-debian12.
this might do it:
Yes, that last iteration does seem to get the installer working for me on 2.2 latest.
@cjac Thank you. I tried the above changes, but cluster creation still failed. The previous package installation error is gone and the init script logs look good. The last few lines of the install_gpu_driver.sh output are below:
update-alternatives: using /usr/lib/mesa-diverted to provide /usr/lib/glx (glx) in auto mode
I am seeing the following error in the Dataproc logs:
DEFAULT 2024-09-27T02:58:49.624652770Z Setting up xserver-xorg-video-nvidia (560.35.03-1) ...
I think this error caused the cluster creation failure.
@cjac We have been unable to create a Dataproc GPU cluster since the Dataproc 2.1/2.2 upgrade. Please let me know if there is any workaround that would let us proceed with cluster creation.
I did publish another version since we last spoke. Can you please review the code at https://github.com/GoogleCloudDataproc/initialization-actions/pull/1240/files? The tests passed on the last commit but took 2 hours and one minute to complete. This latest update should reduce the runtime significantly.
I received those messages as well, but they should just be warnings. Does the new change get things working?
@cjac I tried the latest script, but the Dataproc initialization action fails with a timeout error and the cluster does not start:
name: "gs://syn-development-kub/syn-cluster-config/install_gpu_driver.sh"
I couldn't find any error details in the init script output, which I am attaching for your reference.
Can you increase your timeout by 5-10 minutes? I do have a fix that's in the works for the base image, and once it gets published, we should be able to skip the full upgrade in the init action.
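For anyone unsure where that timeout is set, here is a rough sketch of a cluster creation command with a longer initialization action timeout. The cluster name, region, bucket path, and the 20m value are placeholders chosen for illustration, not values from this thread. The script only prints the command so it can be reviewed before anything is created.

```shell
#!/usr/bin/env bash
# Sketch only: every value below is a placeholder, substitute your own.
set -euo pipefail

CLUSTER_NAME="gpu-test-cluster"                      # hypothetical
REGION="us-west4"                                    # hypothetical
INIT_ACTION="gs://my-bucket/install_gpu_driver.sh"   # hypothetical path
INIT_TIMEOUT="20m"                                   # assumed value with headroom

cmd=(gcloud dataproc clusters create "${CLUSTER_NAME}"
  --region "${REGION}"
  --image-version 2.2-debian12
  --master-accelerator type=nvidia-tesla-t4,count=1
  --worker-accelerator type=nvidia-tesla-t4,count=1
  --initialization-actions "${INIT_ACTION}"
  --initialization-action-timeout "${INIT_TIMEOUT}")

# Print the command for review; run it with "${cmd[@]}" once it looks right.
echo "${cmd[*]}"
```

The timeout covers the whole init action, so it needs to exceed the driver build and install time on the slowest node.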
Here is a recent cluster build I did in my repro lab. It took 14m47.946s:
I see that I hard-coded a regional bucket path into the code. This will slow things down when running outside of us-west4; I'll fix that next.
@cjac Increasing the timeout fixed the error and the cluster was created. We are able to run GPU workloads in the cluster. Thank you so much for the support!
Glad I could help! I'll work on getting these changes integrated into the base image.
@cjac I have created GPU clusters using nvidia-tesla-t4 multiple times and it worked fine. But cluster creation takes too long and fails with the following error when we try to use the nvidia-tesla-p4 GPU type. Do you know if Dataproc has any issue with this GPU type?
Thanks for writing, @santhoshvly. That is correct: the P4[1] GPUs are no longer supported[4], since the kernel requires GPL licensing and the older drivers were not released under that license. Wish I could help. Please try T4[2] or L4[3] for similar cost for performance. I run my tests on n1-standard-4 with a single T4 for each master and worker node, and I burst up to H100 for some tests.
[1] NVIDIA/open-gpu-kernel-modules#19
It may be possible to build kernel drivers from an image released before 2023, but it's not a great long-term solution, and I have not confirmed that it would work. Can you move off of the P4 hardware?
I removed the full upgrade from the init action. We need to unhold systemd-related packages for installation of pciutils to succeed. I included the patch to move off of /etc/apt/trusted.gpg to files under /usr/share/keyrings/ referenced directly inline from the files in /etc/apt/sources.list.d/*.list.
The code to clean up apt gpg trust databases and unhold systemd went into #1240 and #1242. I spoke with engineering and they do not feel comfortable unholding the package while their builder is executing. They recommended that we unhold in any init action which would fail with the hold in place. I could place the hold again after the package installation, but that's not how it is currently implemented.
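To illustrate the unhold, install, then hold again idea discussed above, here is a sketch. It is not the code from #1240 or #1242. The function name, the `DRY_RUN` switch, and the sample list of held packages are assumptions made for the example, and the thread does not say which systemd-related packages are actually held. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch only: not the init action's implementation. Names are hypothetical.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [[ "${DRY_RUN}" == "1" ]]; then echo "+ $*"; else "$@"; fi
}

# install_with_temporary_unhold HELD PKG...
# HELD is a newline-separated list of held packages, e.g. from
# `apt-mark showhold`. Only names containing "systemd" are unheld.
install_with_temporary_unhold() {
  local held="$1"; shift
  local -a held_pkgs=()
  mapfile -t held_pkgs < <(printf '%s\n' "${held}" | grep -E 'systemd' || true)
  if (( ${#held_pkgs[@]} > 0 )); then run apt-mark unhold "${held_pkgs[@]}"; fi
  run apt-get install -y -qq "$@"
  # Restore the hold afterwards (the step described above as not implemented).
  if (( ${#held_pkgs[@]} > 0 )); then run apt-mark hold "${held_pkgs[@]}"; fi
}

# Sample input; on a real node use: held="$(apt-mark showhold)"
held=$'libsystemd0\nsystemd\nsome-other-package'
install_with_temporary_unhold "${held}" pciutils linux-headers-6.1.0-25-cloud-amd64
```

One caveat: with `set -e`, a failed install exits before the hold is restored, so a real implementation would probably want a `trap` or explicit error handling around that step.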
@cjac Thank you so much for sharing the details. Yes, we can move off of the P4 hardware. But the documentation has not been updated and is confusing users: https://cloud.google.com/compute/docs/gpus#p4-gpus
Thanks for the information. There may be other GCE use cases where drivers are pre-built. In this case, P4 may still be supported. But I've opened an internal issue to track an update to the documentation.
Hi,
I am trying to attach GPUs to a Dataproc 2.2 cluster, but cluster creation is failing. Secure boot is disabled and I am using the latest install_gpu_driver.sh from this repository. I am now getting the following error during cluster initialization:
++ tr '[:upper:]' '[:lower:]'
++ lsb_release -is
++ . /etc/os-release
+++ PRETTY_NAME='Debian GNU/Linux 12 (bookworm)'
+++ NAME='Debian GNU/Linux'
+++ VERSION_ID=12
+++ VERSION='12 (bookworm)'
+++ VERSION_CODENAME=bookworm
+++ ID=debian
+++ HOME_URL=https://www.debian.org/
+++ SUPPORT_URL=https://www.debian.org/support
+++ BUG_REPORT_URL=https://bugs.debian.org/
++ echo debian12
++ get_metadata_attribute dataproc-role
++ local -r attribute_name=dataproc-role
++ local -r default_value=
++ /usr/share/google/get_metadata_value attributes/dataproc-role
++ get_metadata_attribute rapids-runtime SPARK
++ local -r attribute_name=rapids-runtime
++ local -r default_value=SPARK
++ /usr/share/google/get_metadata_value attributes/rapids-runtime
++ echo -n SPARK
++ get_metadata_attribute cuda-version 12.4
++ local -r attribute_name=cuda-version
++ local -r default_value=12.4
++ /usr/share/google/get_metadata_value attributes/cuda-version
++ echo -n 12.4
++ get_metadata_attribute gpu-driver-version 550.54.14
++ local -r attribute_name=gpu-driver-version
++ local -r default_value=550.54.14
++ /usr/share/google/get_metadata_value attributes/gpu-driver-version
++ echo -n 550.54.14
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ os_version
++ xargs
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ get_metadata_attribute cudnn-version 9.1.0.70
++ local -r attribute_name=cudnn-version
++ local -r default_value=9.1.0.70
++ /usr/share/google/get_metadata_value attributes/cudnn-version
++ echo -n 9.1.0.70
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ cut -d= -f2
++ xargs
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_version
++ grep '^VERSION_ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
++ os_version
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
++ xargs
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
++ os_version
++ grep '^VERSION_ID=' /etc/os-release
++ cut -d= -f2
++ xargs
++ get_metadata_attribute nccl-version 2.21.5
++ local -r attribute_name=nccl-version
++ local -r default_value=2.21.5
++ /usr/share/google/get_metadata_value attributes/nccl-version
++ echo -n 2.21.5
++ get_metadata_attribute gpu-driver-url https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
++ local -r attribute_name=gpu-driver-url
++ local -r default_value=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
++ /usr/share/google/get_metadata_value attributes/gpu-driver-url
++ echo -n https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_vercat
++ is_ubuntu
+++ os_id
+++ grep '^ID=' /etc/os-release
+++ xargs
+++ cut -d= -f2
++ [[ debian == \u\b\u\n\t\u ]]
++ is_rocky
+++ os_id
+++ xargs
+++ cut -d= -f2
+++ grep '^ID=' /etc/os-release
++ [[ debian == \r\o\c\k\y ]]
++ os_version
++ xargs
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
++ get_metadata_attribute nccl-repo-url https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
++ local -r attribute_name=nccl-repo-url
++ local -r default_value=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
++ /usr/share/google/get_metadata_value attributes/nccl-repo-url
++ echo -n https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
++ get_metadata_attribute cuda-url https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
++ local -r attribute_name=cuda-url
++ local -r default_value=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
++ /usr/share/google/get_metadata_value attributes/cuda-url
++ echo -n https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
++ echo -e '8.3.1.22\n9.1.0.70'
++ head -n1
++ sort -V
++ echo -e '9.1.0.70\n8.4.1.50'
++ head -n1
++ sort -V
++ echo -e '12.0\n12.4'
++ head -n1
++ sort -V
++ get_metadata_attribute gpu-driver-provider NVIDIA
++ local -r attribute_name=gpu-driver-provider
++ local -r default_value=NVIDIA
++ /usr/share/google/get_metadata_value attributes/gpu-driver-provider
++ echo -n NVIDIA
++ get_metadata_attribute install-gpu-agent false
++ local -r attribute_name=install-gpu-agent
++ local -r default_value=false
++ /usr/share/google/get_metadata_value attributes/install-gpu-agent
++ echo -n false
++ mktemp -u -d -p /run/tmp -t ca_dir-XXXX
++ get_metadata_attribute private_secret_name
++ local -r attribute_name=private_secret_name
++ local -r default_value=
++ /usr/share/google/get_metadata_value attributes/private_secret_name
++ echo -n ''
++ uname -r
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_version
++ grep '^VERSION_ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
Please let me know if I am missing anything, or if there is any workaround to proceed further.