Flux: big jumps in key size seem to be due to min_snr_gamma not being hooked up #1980
I don't know the exact details, but I'm fairly certain you shouldn't be using min-snr-gamma on Flux anyway. Flux has its own system of weighted timestep sampling which should be used instead.
@drhead, are you maybe talking about
I have not trained anything on Flux, but larger loss values for some timesteps sound normal for Flux's timestep sampling.

Min-snr-gamma is a timestep weighting strategy. It is simply a multiplier on loss values based on the timestep's noise level, and it proportionally decreases the impact of the gradients coming from that image (it reduces how much influence those have on the model). It does so with the objective of making the influence of each timestep more equal. As a side note, you shouldn't take the lowered loss you get from min-snr-gamma too seriously: you could simply multiply the loss values for all timesteps by 0.01, and that wouldn't mean your model is any better than before.

Flux instead uses a timestep sampling strategy. Instead of multiplying the loss values, it picks the timesteps that are expected to produce higher loss values less often, with the same intention of equalizing the overall contributions. But it also means those timesteps will produce more impactful gradients, which might make things a bit less stable if you're using a low batch size.

If you're worried about training instability, you should use gradient accumulation to get a larger effective batch size. If your batch size is something like 1 or 2, then it's honestly not surprising that this would happen. But having a high-loss timestep's gradient diluted among the gradients of about 16 or 32 different steps shouldn't create such a huge shock to the model.
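For reference, here is a minimal sketch of the kind of weighting min_snr_gamma applies for an epsilon-prediction model. This is illustrative only, assuming a standard DDPM-style scheduler; it is not sd-scripts' exact code (which also handles v-prediction differently):

```python
import torch

def min_snr_gamma_weights(alphas_cumprod: torch.Tensor, timesteps: torch.Tensor, gamma: float) -> torch.Tensor:
    """Per-sample loss multipliers in the spirit of Min-SNR weighting (illustrative sketch).

    SNR(t) = alpha_bar_t / (1 - alpha_bar_t); the weight is min(SNR, gamma) / SNR,
    so low-noise (high-SNR) timesteps are down-weighted while very noisy timesteps are left alone.
    """
    alpha_bar = alphas_cumprod[timesteps]                 # cumulative product of alphas for each sampled timestep
    snr = alpha_bar / (1.0 - alpha_bar)                   # signal-to-noise ratio per sample
    return torch.minimum(snr, torch.full_like(snr, gamma)) / snr

# Usage sketch with an unreduced per-sample loss:
# loss = (per_sample_mse * min_snr_gamma_weights(scheduler.alphas_cumprod, timesteps, gamma=5.0)).mean()
```

As noted above, the weight only rescales the loss, so a lower reported loss after applying it does not by itself mean the model is better.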
While measuring sd-scripts' LoRA key lengths with TensorBoard, I noticed that from time to time there were big jumps.
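(For anyone wanting to reproduce this kind of measurement, one possible way to log per-key LoRA weight norms to TensorBoard is sketched below. The `network` variable and the `"lora"` name filter are assumptions about a generic LoRA module, not sd-scripts' built-in logging.)

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/lora_key_norms")

def log_lora_key_norms(network: torch.nn.Module, global_step: int) -> None:
    # Write the L2 norm of every LoRA parameter so per-key jumps are visible in TensorBoard.
    for name, param in network.named_parameters():
        if "lora" in name:  # crude filter; adjust to however your LoRA keys are named
            writer.add_scalar(f"key_norm/{name}", param.detach().norm().item(), global_step)
```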
The jumps seem to correspond to the noise timestep value for that step being >900. In the example below, the timestep value is 932. (I'm not talking about the training step number, which coincidentally also happens to be around 900 here.) While looking into reducing the loss for high timestep values, I spotted that this discussion seems to have already taken place for Stable Diffusion and SDXL, with a pull request by @AI-Casanova for a feature called --min_snr_gamma, here in March 2023: #308 (comment). (The implementation was later fixed by @drhead.)
Flux does not seem to have this min-snr-gamma feature enabled, as the function that should call it is stubbed out in flux_train_network.py. It would be nice to get an implementation of this working for Flux to stop the overly large jumps in key sizes occurring when the timestep value is high. Flux does actually accept the --min_snr_gamma parameter without complaint, but silently makes no use of the value that is set.
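For illustration only, here is a rough sketch of what a min-SNR-style weight might look like if --min_snr_gamma were hooked up for Flux's rectified-flow formulation. It assumes the noised latent is x_t = (1 - t) * x0 + t * noise with t in (0, 1); this is an assumption about how such a feature could be implemented, not existing sd-scripts code, and as the comments above note, whether this weighting is even appropriate for Flux is debatable:

```python
import torch

def flow_min_snr_weights(t: torch.Tensor, gamma: float) -> torch.Tensor:
    """Hypothetical min-SNR-style weights for a rectified-flow schedule.

    Assumes x_t = (1 - t) * x0 + t * noise, so the signal scale is (1 - t) and the
    noise scale is t, giving SNR(t) = ((1 - t) / t) ** 2. The weight caps the SNR at
    `gamma`, mirroring what min_snr_gamma does for SD/SDXL epsilon-prediction.
    """
    t = t.float().clamp(1e-5, 1.0 - 1e-5)   # if timesteps are in [0, 1000], divide by 1000 first
    snr = ((1.0 - t) / t) ** 2
    return torch.minimum(snr, torch.full_like(snr, gamma)) / snr

# Usage sketch with an unreduced flow-matching loss:
# loss = (per_sample_loss * flow_min_snr_weights(timesteps / 1000.0, gamma=5.0)).mean()
```

Note that this particular form down-weights the low-noise end of the schedule; whether any SNR-based weight is the right fix for gradient spikes at high timestep values is exactly the question raised in the comments above.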