Common Imports Fix and Readme Update to fix RuntimeError in trainer.fit() #216

Open
wants to merge 6 commits into
base: master

Conversation

scorixear

Running the example code with the current package produces the following errors:

  • cannot import name 'DeepSpeedPlugin' from 'pytorch_lightning.plugins' (aitextgen.py, line 14)
  • cannot import name 'ProgressBarBase' from 'pytorch_lightning.callbacks.progress' (train.py, line 13)
  • cannot import name '_TPU_AVAILABLE' from 'pytorch_lightning.utilities' (train.py, line 14); fixed in "update pytorch-lightning requirement to >= 1.8.0" #202
  • RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. (aitextgen.py, line 752)
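
For illustration only, one way to tolerate renamed imports across pytorch-lightning versions is to try several locations in order. This is a minimal sketch, not what the commits in this PR do; the helper name `import_first` is hypothetical, and the candidate paths in the usage comment are assumptions based on the names mentioned in this thread.

```python
import importlib


def import_first(candidates):
    """Return the first attribute that can be imported from (module, name) pairs."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError):
            # This location does not exist in the installed version; try the next.
            continue
    raise ImportError(f"none of the candidates could be imported: {candidates}")


# Hypothetical usage (assumed module paths, not executed here):
# ProgressBar = import_first([
#     ("pytorch_lightning.callbacks.progress", "ProgressBarBase"),  # older releases
#     ("pytorch_lightning.callbacks", "ProgressBar"),               # newer releases
# ])
```

The trade-off is that the rest of the code still has to work with whichever class is returned, so this only helps when the two classes expose the same hooks.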

The RuntimeError suggests wrapping the user code in a main function, as hinted here: https://discuss.pytorch.org/t/runtimeerror-an-attempt-has-been-made-to-start-a-new-process-before-the-current-process-has-finished-its-bootstrapping-phase/145462

But I cannot confirm whether this fixes the issue, as the current code does not progress at all (this might also be because ProgressBar is not the correct replacement for ProgressBarBase).
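
For reference, the main-guard pattern that the error message points to looks roughly like this. It is a minimal standard-library sketch; `square` and `main` are stand-ins for the user's training script, not aitextgen code.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def square(x):
    """Stand-in for per-worker work such as loading a batch."""
    return x * x


def main():
    # With the "spawn" start method each worker re-imports this file, which is
    # the situation that raises the bootstrapping RuntimeError without a guard.
    context = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=context) as pool:
        return list(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    # Only the original process runs this; re-imported workers skip it.
    print(main())
```

In a user script, the call to `ai.train(...)` would sit inside `main()` in the same way.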

Would love to have your input on whether these changes actually work!

@scorixear
Author

After around one hour of training, the program finished correctly, although the progress bar seems to be broken.
[screenshot]

@vjarora1978

Getting this error while executing the example
[screenshot]

@scorixear
Author

> Getting this error while executing the example [screenshot]

Yes, I get the same error. I will investigate what's up.

@scorixear
Author

> Getting this error while executing the example [screenshot]

It seems ProgressBarBase contained the "loss" tensor in version 1.8.6, but it was removed from ProgressBar in version 2.0.0 (the latest release of pytorch lightning).

I replaced the metrics with the outputs' loss value. This doesn't affect the training code at all; it only changes how the progress bar shows the current and average loss.
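
Roughly, the idea is to read the loss from what the training step returned rather than from the progress bar's metrics. The sketch below is framework-free and only illustrative: `extract_loss`, `LossTracker`, and the smoothing factor are hypothetical names and values, and it assumes the outputs are either a dict with a "loss" key or the loss itself.

```python
def extract_loss(outputs):
    """Return the loss as a float from a training step's outputs."""
    loss = outputs["loss"] if isinstance(outputs, dict) else outputs
    # Tensors expose .item(); plain Python numbers do not.
    return float(loss.item()) if hasattr(loss, "item") else float(loss)


class LossTracker:
    """Keeps the current loss and an exponentially smoothed average."""

    def __init__(self, smoothing=0.9):
        self.smoothing = smoothing  # assumed value, purely for illustration
        self.current = None
        self.average = None

    def update(self, outputs):
        self.current = extract_loss(outputs)
        if self.average is None:
            # First batch: the average starts at the first observed loss.
            self.average = self.current
        else:
            self.average = (
                self.smoothing * self.average
                + (1 - self.smoothing) * self.current
            )
        return self.current, self.average
```

A progress bar callback could call `update(outputs)` once per finished batch and display both returned values.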

@fictionFanKazuki

fictionFanKazuki commented May 2, 2023

This is a really helpful pull request, thanks a lot! However, I still get an error about the kwarg "gpus" being unknown in pytorch_lightning's argparse.py. "gpus" seems to be part of the Trainer object in train.py. Could you help?

TypeError                                 Traceback (most recent call last)
<ipython-input-11-341925ca7a1c> in <cell line: 1>()
----> 1 ai.train(file_name,
      2          line_by_line=False,
      3          from_cache=False,
      4          num_steps=3000,
      5          generate_every=300,

1 frames
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/argparse.py in insert_env_defaults(self, *args, **kwargs)
     67 
     68         # all args were already moved to kwargs
---> 69         return fn(self, **kwargs)
     70 
     71     return cast(_T, insert_env_defaults)

TypeError: Trainer.__init__() got an unexpected keyword argument 'gpus'

@scorixear
Author

@fictionFanKazuki

Hm, not sure how to reproduce this.
I have changed the "gpus" argument to "num_nodes" in my latest commit. Maybe you haven't used the latest one there?

Otherwise, there is probably a newer version of pytorch_lightning with more breaking changes. But I would need to know which version you have installed and/or the full stack trace, as I cannot decipher where the utilities function was called from.
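
A quick way to report the installed versions is the standard-library `importlib.metadata`. This is a small sketch; the helper names are hypothetical and the package names listed are assumed to be the distribution names used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Return the leading major number of a version string such as '2.0.4'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    for name in ("pytorch-lightning", "torch", "aitextgen"):
        print(name, installed_version(name) or "not installed")
```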

On my machine with my version of pytorch_lightning (2.0.0), it works. I will push a restricted requirements.txt shortly.
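
Regarding the traceback itself: as far as I know, pytorch_lightning 2.0 no longer accepts `gpus` and expects `accelerator` and `devices` instead. The sketch below shows one possible way to map old keyword arguments before they reach the Trainer. It is an alternative illustration, not what the commit in this PR does, and `translate_legacy_gpus` is a hypothetical helper.

```python
def translate_legacy_gpus(trainer_kwargs):
    """Return a copy of the kwargs with `gpus` mapped to accelerator/devices."""
    kwargs = dict(trainer_kwargs)
    if "gpus" not in kwargs:
        return kwargs

    gpus = kwargs.pop("gpus")
    if gpus in (None, 0):
        # No GPUs were requested, so fall back to CPU training.
        kwargs.setdefault("accelerator", "cpu")
    else:
        # Keep the original request (a count or a list of device ids) as `devices`.
        kwargs.setdefault("accelerator", "gpu")
        kwargs.setdefault("devices", gpus)
    return kwargs
```

The result would then be passed on as `Trainer(**translate_legacy_gpus(kwargs))`, leaving all other arguments untouched.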

@Vectorrent
Contributor

Thanks for this! I merged these fixes into my custom fork of AITextGen, and it allowed me to upgrade to PL v2.0.4 successfully!


5 participants