Bug❓: --cache_latents prevents regular checkpoint saving in train_network.py 😥 #1937

Open
Jackiiiii opened this issue Feb 15, 2025 · 2 comments

@Jackiiiii

When the --cache_latents flag is enabled, training benefits from a large speed boost (the latents are computed once and then reused, I think). However, with this flag active, the checkpoint saving mechanism (e.g. via --save_every_n_steps 10) is completely suppressed: no intermediate checkpoints are saved, and no related log messages appear.
When --cache_latents is disabled, checkpoints are saved normally, but training speed drops drastically. In my case, on an RTX 3090 Ti, I have to reduce the settings to --train_batch_size 16 and --gradient_accumulation_steps 1, even though I would expect to be able to use much higher values (for example, a batch size of 64 together with --cache_latents). This makes training impractically slow.

Expected Behavior:

  • Training should both benefit from the speed improvements of latent caching (allowing high batch sizes and fast processing) and save checkpoints at regular intervals (e.g. every 10 steps; this low value is only for testing whether saving works at all).

Actual Behavior:

With --cache_latents:

  • Fast training due to cached latents.
  • No checkpoints are saved (neither intermediate nor at the end).
  • The CPU also appears to be under load.

Without --cache_latents:

  • Checkpoints are saved, but training is extremely slow. I have to reduce the settings to --train_batch_size 16 and --gradient_accumulation_steps 1.

System Information:

  • Operating System: Windows 11
  • GPU: NVIDIA RTX 3090 Ti
  • CPU: AMD Ryzen Threadripper 3960X 24-Core Processor
  • Python Version: 3.10.9
  • Additional Flags: The training is run with flags such as --xformers, --mixed_precision=fp16, and --gradient_checkpointing (a sketch of the full invocation is shown below).
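
For reference, a minimal sketch of the kind of command I run (the model, dataset, and output paths are placeholders, and some values are illustrative rather than my exact settings; only the flags named in this report are taken from it):

    accelerate launch train_network.py ^
      --pretrained_model_name_or_path "path\to\base_model.safetensors" ^
      --train_data_dir "path\to\dataset" ^
      --output_dir "path\to\output" ^
      --network_module networks.lora ^
      --train_batch_size 64 ^
      --gradient_accumulation_steps 1 ^
      --save_every_n_steps 10 ^
      --cache_latents ^
      --xformers ^
      --mixed_precision=fp16 ^
      --gradient_checkpointing

Removing --cache_latents from this command restores checkpoint saving, but then I have to fall back to the much smaller --train_batch_size 16 described above.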

Any help or fixes would be greatly appreciated!

@rockerBOO
Contributor

Maybe try --cache_latents_to_disk, but it might be a bug either way.

@Jackiiiii
Author

I already tried --cache_latents_to_disk, but it doesn't make anything faster; if anything it's worse. With --cache_latents, RAM and CPU are also being used, but no checkpoints are saved.
