Docker image doesn't work #7

Skyy93 · 2023-07-14T12:56:27Z

Hello, thank you for your amazing work! I wanted to try it and followed the Docker instructions you provided here:
https://github.com/nianticlabs/nerf-object-removal/blob/main/docker/README.md

The image builds correctly and runs, but when I try your example command I get the following message in the logs:

[2023-07-14 13:53:18,408][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init done
[2023-07-14 13:53:18,627][__main__][CRITICAL] - Prediction failed due to Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination:
Traceback (most recent call last):
  File "bin/predict.py", line 59, in main
    model.to(device)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/pytorch_lightning/core/decorators.py", line 89, in inner_fn
    module = fn(self, *args, **kwargs)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 120, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination

Because of this, the following error occurs:

FileNotFoundError: [Errno 2] No such file or directory: '/app/object-removal/experiments/real/001/data/../lama_depth_output_real/000_mask001.png'

and JAX also fails to find a GPU:

W0714 13:53:32.249409 140354252236608 xla_bridge.py:363] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
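The `FileNotFoundError` above appears to be a downstream effect: the inpainting step crashed on the CUDA error, so the mask image a later step expects was never written. A minimal sketch of a fail-fast check that could run between the two steps is shown here. The function names and the second file name in the usage note are hypothetical and not part of the repository.

```python
from pathlib import Path


def find_missing_outputs(output_dir, expected_names):
    """Return the expected file names that do not exist under output_dir."""
    output_dir = Path(output_dir)
    return [name for name in expected_names if not (output_dir / name).is_file()]


def require_outputs(output_dir, expected_names):
    """Raise a descriptive error if an earlier step did not write its outputs."""
    missing = find_missing_outputs(output_dir, expected_names)
    if missing:
        raise RuntimeError(
            f"{len(missing)} expected file(s) missing in {output_dir}: {missing}. "
            "Check the logs of the previous step for the original failure."
        )
```

For example, `require_outputs("lama_depth_output_real", ["000_mask001.png", "001_mask001.png"])` would stop with a message pointing back at the earlier step instead of failing later with a bare missing-file error.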

I have an RTX 4090 with this driver and CUDA version in the Docker container: Driver Version: 535.54.03, CUDA Version: 11.8

Could you please look into it? I tried using a CUDA 12.0 container as the base image instead; that resolves the PyTorch error but not the JAX one, so JAX still does not find the GPU.

Thank you

520xyxyzq · 2023-10-30T02:46:40Z

Hi Skyy93, did you solve the problem? I encountered the exact same problem.

sbhavani · 2023-11-09T18:59:42Z

NVIDIA's nightly JAX containers are available at https://github.com/NVIDIA/JAX-Toolbox with open Dockerfiles. I'd recommend starting from one of those base images and adding PyTorch and the other libs.
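Since the thread reports PyTorch and JAX failing independently, it may help to probe both from inside the container after changing the base image. The following is a minimal sketch, assuming both packages are importable in the container's Python environment. The function name `probe_gpu_visibility` is made up for illustration, and the set of platform strings treated as "GPU" is an assumption that may differ between JAX versions.

```python
import importlib


def probe_gpu_visibility():
    """Report, per framework, whether a GPU is visible in this environment."""
    report = {}

    # PyTorch: version.cuda is the CUDA version the wheel was built against.
    try:
        torch = importlib.import_module("torch")
        report["torch"] = {
            "version": torch.__version__,
            "built_for_cuda": torch.version.cuda,
            "gpu_visible": bool(torch.cuda.is_available()),
        }
    except Exception as exc:  # missing package or broken driver setup
        report["torch"] = {"error": f"{type(exc).__name__}: {exc}"}

    # JAX: jax.devices() lists the devices of the default backend.
    try:
        jax = importlib.import_module("jax")
        devices = jax.devices()
        report["jax"] = {
            "version": jax.__version__,
            "devices": [str(d) for d in devices],
            # Assumption: GPU backends report one of these platform names.
            "gpu_visible": any(d.platform in ("gpu", "cuda", "rocm") for d in devices),
        }
    except Exception as exc:
        report["jax"] = {"error": f"{type(exc).__name__}: {exc}"}

    return report


if __name__ == "__main__":
    for name, info in probe_gpu_visibility().items():
        print(name, info)
```

If `torch` reports a visible GPU while `jax` lists only CPU devices, that matches the situation described above for the CUDA 12.0 base image. It would suggest the installed JAX build does not match the container's CUDA libraries, rather than a problem with the host driver.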
