
Add LoRA-GGPO for Flux #1974

Draft · rockerBOO wants to merge 1 commit into sd3
Conversation

rockerBOO (Contributor) commented Mar 6, 2025

https://arxiv.org/abs/2502.14538v1

LoRA-GGPO (Gradient-Guided Perturbation Optimization) proposes a way to mitigate double descent in LoRA training.

“The double descent phenomenon is a common non-monotonic behavior in machine learning, where model performance exhibits an ‘increase-decrease-increase’ trend with complexity or training time.”

To do this, they compute the weight norm and gradient norm for each module and use them to scale a random matrix that perturbs the activations.
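A minimal sketch of the idea, not the PR's actual code: the overall shape (a random matrix scaled by the two norms, with σ and β as in the paper's description quoted further down) is from the text, but the exact way the norms combine below is my assumption.

import torch

def ggpo_perturbation(weight: torch.Tensor, grad_norm: float,
                      sigma: float = 0.03, beta: float = 0.01) -> torch.Tensor:
    """Random perturbation scaled by the module's weight and gradient norms.

    The combination below (weight norm plus beta-weighted gradient norm,
    all scaled by sigma) is an assumed reading of the description, not
    necessarily the exact formula used in this PR.
    """
    weight_norm = weight.norm().item()
    scale = sigma * (weight_norm + beta * grad_norm)
    return scale * torch.randn_like(weight)

# Applied during the forward pass, e.g.:
# out = x @ (lora_weight + ggpo_perturbation(lora_weight, cached_grad_norm)).T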

Ablation study on the GLUE benchmark and Llama-2-7B results (screenshots from the paper).

The chart shows LoRA+ and rsLoRA as separate entries, but they are not mutually exclusive and can be used together.

The downside is that it can currently increase training time by a noticeable margin. I was able to get it down to about a 20% slowdown, but better code could make it faster. The paper suggests around 5% overhead, but I was not able to hit that.

To improve speed:

  • Reduce the number of modules trained; each additional module increases training time.
  • Train only some modules (e.g., attention but not feed-forward), which will be a little faster.
  • Skip steps when updating norms. Updating all the norms currently takes about 1.5 s, so not doing it on every step saves time. It is set to update every 5 steps, but we could make that adjustable (see the sketch after this list).
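A sketch of what making that schedule adjustable could look like. Both update_norms (standing in for the ~1.5 s pass that recomputes all module norms) and the norm_update_interval option name are hypothetical, not names from this PR.

def update_norms(network):
    """Hypothetical helper: recompute cached per-module weight/grad norms (~1.5 s total)."""
    for module in network.modules():
        ...  # compute and cache the module's weight norm and grad norm

def train_loop(dataloader, network, optimizer, norm_update_interval: int = 5):
    for step, batch in enumerate(dataloader):
        loss = network(batch)
        loss.backward()
        # Refresh the cached norms only every N steps to amortize their cost;
        # the PR currently hard-codes N = 5.
        if step % norm_update_interval == 0:
            update_norms(network)
        optimizer.step()
        optimizer.zero_grad()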

Usage (command line):

--network_args ggpo_sigma=0.03 ggpo_beta=0.01

Or in a TOML config:

network_args = [
  "ggpo_sigma=0.03",
  "ggpo_beta=0.01"
]
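For reference, network_args values are passed through to the network as strings (as I understand sd-scripts' handling), so they need converting; a minimal sketch of that conversion, with a function name that is mine rather than the PR's:

def parse_ggpo_args(**kwargs):
    """Pull GGPO options out of network_args kwargs; values arrive as strings."""
    sigma = kwargs.get("ggpo_sigma")
    beta = kwargs.get("ggpo_beta")
    return (float(sigma) if sigma is not None else None,
            float(beta) if beta is not None else None)

# parse_ggpo_args(ggpo_sigma="0.03", ggpo_beta="0.01") -> (0.03, 0.01)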

“σ (sigma) > 0 controls the overall strength of the perturbation, with larger σ increasing the perturbation range; β (beta) > 0 balances the contributions of weight norms and gradient norms, with larger β focusing more on gradient sensitivity and smaller β emphasizing weight importance.”
Generally, keeping the values above is a good choice; they are the values presented in the paper.

It is working at this stage and I have been using it. It was originally implemented for the new Lumina LoRA but could be ported to SD1.5 and SDXL LoRAs as well.

I also want to consider merging the norm calculations of scale_weight_norm with the norms computed here, and making the schedule for how often the weight norms are refreshed configurable. Displaying the weight and grad norms would also be nice, as would recording them in the logs.

@feffy380 (Contributor) commented:

Do you have any visual examples?
