Error while running EAS #218

ujjwaldasari10 · 2024-09-17T06:40:28Z

Describe the bug

I am not able to train the AM model with EAS by following the example given here: https://rl4.co/examples/modeling/2-transductive-methods/#perform-search.

To Reproduce

import torch

# Move the model to the device before initializing the trainer
policy = policy.to(device)

trainer = RL4COTrainer(
    max_epochs=1,
    gradient_clip_val=None,
    strategy='ddp_notebook'
)
trainer.fit(eas_model)

RuntimeError Traceback (most recent call last)
Cell In[9], line 11
4 policy = policy.to(device)
6 trainer = RL4COTrainer(
7 max_epochs=1,
8 gradient_clip_val=None,
9 strategy='ddp_notebook'
10 )
---> 11 trainer.fit(eas_model)

File ~/miniconda3/envs/rl4co/lib/python3.10/site-packages/rl4co/utils/trainer.py:146, in RL4COTrainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
141 log.warning(
142 "Overriding gradient_clip_val to None for 'automatic_optimization=False' models"
143 )
144 self.gradient_clip_val = None
--> 146 super().fit(
147 model=model,
148 train_dataloaders=train_dataloaders,
149 val_dataloaders=val_dataloaders,
150 datamodule=datamodule,
151 ckpt_path=ckpt_path,
152 )

File ~/miniconda3/envs/rl4co/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
...
206 if _IS_INTERACTIVE:
207 message += " You will have to restart the Python kernel."
--> 208 raise RuntimeError(message)

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

Can you please help me figure out the issue?


fedebotu · 2024-09-17T08:21:44Z

Hi @ujjwaldasari10, that can happen if the model (or part of it) was already moved to the device before the trainer was created.
Here are some ideas:

1. Can you first try removing the call to policy.to(device)?
2. I recommend training the model and collecting the checkpoint from a separate notebook / script, and loading it as done here
3. There might be an issue with ddp_notebook. I recommend trying without it and setting devices=1 instead
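
A minimal sketch of suggestions 1 and 3 combined, for illustration only. The helper names (`cuda_already_initialized`, `build_trainer_kwargs`, `fit_on_single_device`) are hypothetical and not part of RL4CO; the import path `rl4co.utils.trainer` is taken from the traceback above, and `eas_model` is assumed to be built as in the original notebook, without calling `policy.to(device)` beforehand.

```python
def cuda_already_initialized() -> bool:
    """Return True if CUDA was initialized in this process (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_initialized()


def build_trainer_kwargs(max_epochs: int = 1) -> dict:
    """Trainer arguments for a single device, with no 'ddp_notebook' strategy."""
    return {
        "max_epochs": max_epochs,
        "gradient_clip_val": None,
        "devices": 1,  # suggestion 3: one device instead of strategy='ddp_notebook'
    }


def fit_on_single_device(eas_model, max_epochs: int = 1):
    """Fit `eas_model` on one device; the trainer handles device placement."""
    # Imported lazily so the helpers above can be used without rl4co installed.
    from rl4co.utils.trainer import RL4COTrainer

    trainer = RL4COTrainer(**build_trainer_kwargs(max_epochs))
    trainer.fit(eas_model)
    return trainer
```

In a notebook this would be called as `fit_on_single_device(eas_model)`. With a single device no extra processes need to be spawned, so the check in `cuda_already_initialized()` mainly matters if a multi-process strategy such as `ddp_notebook` is kept.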

ujjwaldasari10 · 2024-09-25T09:16:50Z

Applying suggestion 3 was sufficient to fix the problem. Thanks!

ujjwaldasari10 added the bug label (Something isn't working) Sep 17, 2024

ujjwaldasari10 assigned cbhua and fedebotu Sep 17, 2024

fedebotu added a commit that referenced this issue Sep 17, 2024

[Minor] fix Colab link #218

2812925

ujjwaldasari10 closed this as completed Sep 25, 2024
