Error while running EAS #218

Closed · ujjwaldasari10 opened this issue Sep 17, 2024 · 2 comments
Labels: bug (Something isn't working)

@ujjwaldasari10
Describe the bug

I am not able to train the AM model with EAS following the tutorial here: https://rl4.co/examples/modeling/2-transductive-methods/#perform-search.

To Reproduce

Steps to reproduce the behavior.

Please try to provide a minimal example to reproduce the bug. Error messages and stack traces are also helpful.

Please use the markdown code blocks for both code and stack traces.

```python
import torch
from rl4co.utils.trainer import RL4COTrainer

# `policy` and `eas_model` are defined in earlier cells of the tutorial notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device before initializing the trainer
policy = policy.to(device)

trainer = RL4COTrainer(
    max_epochs=1,
    gradient_clip_val=None,
    strategy='ddp_notebook'
)
trainer.fit(eas_model)
```

```
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 11
      4 policy = policy.to(device)
      6 trainer = RL4COTrainer(
      7     max_epochs=1,
      8     gradient_clip_val=None,
      9     strategy='ddp_notebook'
     10 )
---> 11 trainer.fit(eas_model)

File ~/miniconda3/envs/rl4co/lib/python3.10/site-packages/rl4co/utils/trainer.py:146, in RL4COTrainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    141 log.warning(
    142     "Overriding gradient_clip_val to None for 'automatic_optimization=False' models"
    143 )
    144 self.gradient_clip_val = None
--> 146 super().fit(
    147     model=model,
    148     train_dataloaders=train_dataloaders,
    149     val_dataloaders=val_dataloaders,
    150     datamodule=datamodule,
    151     ckpt_path=ckpt_path,
    152 )

File ~/miniconda3/envs/rl4co/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
...
    206 if _IS_INTERACTIVE:
    207     message += " You will have to restart the Python kernel."
--> 208 raise RuntimeError(message)

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
```

Could you please help me figure out the issue?

ujjwaldasari10 added the bug label on Sep 17, 2024
fedebotu added a commit that referenced this issue Sep 17, 2024
@fedebotu (Member)

Hi @ujjwaldasari10, that can happen if the model (or part of it) was already cast to the device before the trainer was created.
Here are some ideas:

  1. Try removing the call to `policy.to(device)` first.
  2. I recommend training the model and collecting the checkpoint from a separate notebook / script, then loading it as done here.
  3. There might be an issue with `ddp_notebook`. I recommend trying without it and setting `devices=1` instead; see the sketch after this list.
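A minimal sketch of suggestion 3, assuming `eas_model` is the EAS module built earlier in the tutorial notebook; `accelerator` and `devices` are standard Lightning `Trainer` keyword arguments, which `RL4COTrainer` (a `Trainer` subclass, per the traceback) passes through:

```python
from rl4co.utils.trainer import RL4COTrainer

# Single-device training: without the ddp_notebook strategy, Lightning
# does not need to spawn new processes, so it no longer matters that
# CUDA was already initialized earlier in the notebook.
trainer = RL4COTrainer(
    max_epochs=1,
    gradient_clip_val=None,
    accelerator="auto",  # or "gpu" if a GPU is available
    devices=1,
)
trainer.fit(eas_model)  # eas_model as constructed in the tutorial
```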

@ujjwaldasari10 (Author)

Applying suggestion 3 was sufficient to fix the problem. Thanks!
