
Flux: big jumps in key size seem to be due to min_snr_gamma not being hooked up #1980

Open
araleza opened this issue Mar 11, 2025 · 3 comments

Comments


araleza commented Mar 11, 2025

While measuring sd-scripts' LoRA key lengths with TensorBoard, I noticed that from time to time there were big jumps.

The jumps seem to correspond to the noise timestep value for that step being >900. In the example below, the timestep value is 932. (I'm not talking about the training step number, which coincidentally also happens to be around 900 here.) While looking into reducing the loss for high timestep values, I found that this discussion has already taken place for Stable Diffusion and SDXL, in a March 2023 pull request by @AI-Casanova for a feature called --min_snr_gamma:

#308 (comment)

(The implementation was later fixed by @drhead)

Flux does not seem to have this min-snr-gamma feature enabled, as the function that should call it is stubbed out (in flux_train_network.py):

    def post_process_loss(self, loss, args, timesteps, noise_scheduler):
        # Overridden to a no-op for Flux: the loss post-processing from the base
        # trainer (including --min_snr_gamma weighting) is silently skipped here.
        return loss

It would be nice to get an implementation of this working for Flux to stop the overly large jumps in key sizes occurring when the timestep value is high. Flux does currently accept the --min_snr_gamma parameter without complaint, but silently makes no use of the value that is set.
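For reference, here's a minimal sketch of what hooking this up might look like. To be clear, this is not sd-scripts code: the helper name apply_min_snr_weight is made up, and the conversion from a flow-matching sigma to an SNR value is my own assumption, not something taken from the repo.

    import torch

    def apply_min_snr_weight(loss, sigmas, gamma=5.0):
        # loss:   per-sample loss, shape (batch,)
        # sigmas: flow-matching noise level in (0, 1] for each sampled timestep
        # Assumed relation (my assumption, not sd-scripts'):
        #   x_t = (1 - sigma) * x_0 + sigma * noise  =>  snr = ((1 - sigma) / sigma) ** 2
        snr = ((1.0 - sigmas) / sigmas) ** 2
        # Min-SNR-gamma (Hang et al., 2023), v-prediction form: cap the per-timestep
        # weight so neither the near-clean nor the near-pure-noise end dominates.
        weight = torch.clamp(snr, max=gamma) / (snr + 1.0)
        return loss * weight

Something along these lines could be called from the post_process_loss stub above. At the very noisy end (sigma near 1, i.e. the >900 timesteps), snr goes to zero and so does the weight, which is the damping I'm after. Whether that SNR definition is even meaningful for Flux's flow-matching objective is a separate question.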

[Screenshot: TensorBoard plot of LoRA key lengths, showing the jump at timestep 932]


drhead commented Mar 11, 2025

I don't know the exact details, but I'm fairly certain you shouldn't be using min-snr-gamma on Flux anyway. Flux has its own system of weighted timestep sampling, which should be used instead.


araleza commented Mar 11, 2025

@drhead, are you maybe talking about --timestep_sampling flux_shift? I only found out about that a few weeks ago, and I have it in place on my command line. But I still see the big loss values for the high timesteps. Unless these big jumps are intentional and expected for Flux?


drhead commented Mar 11, 2025

I have not trained anything on Flux, but larger loss values for some timesteps sound normal for Flux's timestep sampling.

Min-snr-gamma is a timestep weighting strategy: it's simply a multiplier on the loss based on the timestep's noise level, which proportionally decreases the impact of the gradients coming from that image (i.e. how much influence it has on the model). It does this with the objective of making the influence of each timestep more equal. As a side note, you shouldn't take the lowered loss you get from min-snr-gamma too seriously: you could multiply the loss values for all timesteps by 0.01, and that wouldn't mean your model is any better than before.

Flux uses a timestep sampling strategy instead. Rather than multiplying the loss values, it picks the timesteps that are expected to produce higher loss values less often, with the same intention of equalizing the overall contributions. But that also means that when those timesteps do get picked, they produce more impactful gradients, which might make things a bit less stable if you're using a low batch size.
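To make the contrast concrete, here's a toy sketch (illustrative only; the function names and the logit-normal distribution are stand-ins, not sd-scripts' actual flux_shift code):

    import torch

    # (a) Loss weighting (min-snr-gamma style): timesteps are drawn uniformly and
    #     each sample's loss is scaled afterwards by a timestep-dependent weight.
    def weighted_loss(per_sample_loss, timesteps, weight_fn):
        return per_sample_loss * weight_fn(timesteps)

    # (b) Timestep sampling (Flux style): the loss is left untouched, but the
    #     timesteps themselves are drawn from a non-uniform distribution so the
    #     extreme noise levels come up less often.
    def sample_timesteps(batch_size, mu=0.0, sigma=1.0):
        # logit-normal sampling concentrates t around the middle of (0, 1)
        return torch.sigmoid(mu + sigma * torch.randn(batch_size))

Either way the expected contribution of each timestep evens out; the difference is that with (b), when a high-noise timestep does get drawn, it still delivers its full, unscaled gradient.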

If you're worried about training instability, you should use gradient accumulation to get a larger effective batch size. If your batch size is something like 1 or 2, then it's honestly not surprising that this would happen. But having a high-loss timestep's gradient diluted among the gradients of 16 or 32 other steps shouldn't create such a huge shock to the model.
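For completeness, this is the generic gradient accumulation pattern in plain PyTorch, as a toy example with a dummy linear model (sd-scripts exposes the same idea through its own gradient accumulation option):

    import torch

    model = torch.nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    accumulation_steps = 16   # effective batch size = micro-batch size * 16

    optimizer.zero_grad()
    for step in range(64):
        x, y = torch.randn(2, 8), torch.randn(2, 1)         # micro-batch of 2
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accumulation_steps).backward()               # gradients accumulate in .grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                                 # one update per 16 micro-batches
            optimizer.zero_grad()

One optimizer step then averages over 32 samples, so a single >900-timestep draw only contributes a small fraction of the update.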
