
Attempting to Unscale FP16 Gradients Bug #10752

Closed
iszihan opened this issue Feb 10, 2025 · 7 comments · May be fixed by #10783
Labels: bug (Something isn't working), training

Comments

@iszihan

iszihan commented Feb 10, 2025

Describe the bug

Hello, I get the following error when trying to train a LoRA with SDXL:

ValueError: Attempting to unscale FP16 gradients.
Traceback (most recent call last):
  File "/nfs/horai.dgpsrv/year/zling/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
    main(args)
  File "/nfs/horai.dgpsrv/year/zling/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/u8/c/zling/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2396, in clip_grad_norm_
    self.unscale_gradients()
  File "/u8/c/zling/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2340, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
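
For reference, the message comes from torch.amp.GradScaler, which refuses to unscale gradients that are stored in fp16. A minimal standalone snippet (not from the training script; it assumes a CUDA device) that triggers the same error:

import torch

# Toy setup that hits the same check: the parameters (and therefore the
# gradients) live in fp16, while GradScaler expects fp32 master weights.
model = torch.nn.Linear(4, 4).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

loss = model(torch.randn(2, 4, device="cuda", dtype=torch.float16)).sum()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # ValueError: Attempting to unscale FP16 gradients.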

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
# export INSTANCE_DIR="dog"
export INSTANCE_DIR="/scratch/year/zling/progressive-shading/picasso-data/surrealism_images"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch --gpu_ids 0,1 train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME  \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a drawing in sks style" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A drawing of sks style" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub

Logs

System Info

  • 🤗 Diffusers version: 0.33.0.dev0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.28.1
  • Transformers version: 4.48.3
  • Accelerate version: 1.3.0
  • PEFT version: 0.7.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.2
  • xFormers version: not installed
  • Accelerator:
    NVIDIA RTX A6000, 49140 MiB
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

@iszihan iszihan added the bug Something isn't working label Feb 10, 2025
@hlky hlky added the training label Feb 10, 2025
@hlky
Collaborator

hlky commented Feb 10, 2025

@sayakpaul

@sayakpaul
Member

Can you try to apply these fixes?
#9628 (comment)
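
The general idea behind fixes of this kind: when training with --mixed_precision="fp16", the trainable LoRA parameters have to stay in fp32, otherwise GradScaler ends up trying to unscale fp16 gradients. A minimal sketch of that pattern (not the exact diff from the linked comment; it assumes a unet with LoRA adapters already attached, an accelerate Accelerator, and diffusers' cast_training_params helper):

import torch
from diffusers.training_utils import cast_training_params

weight_dtype = torch.float16
unet.to(accelerator.device, dtype=weight_dtype)  # frozen base weights can stay in fp16

if accelerator.mixed_precision == "fp16":
    # Upcast only the trainable (LoRA) parameters back to fp32 so that
    # GradScaler.unscale_() never sees fp16 gradients.
    cast_training_params([unet], dtype=torch.float32)

params_to_optimize = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params_to_optimize, lr=1e-4)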

@iszihan
Author

iszihan commented Feb 10, 2025

Thank you! That seems to get rid of this particular error, but now I have a new one:

File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/_inductor/fx_passes/quantization.py", line 1477, in fn
  scales = match.kwargs["scales"].meta["val"]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AttributeError: 'float' object has no attribute 'meta'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
  import torch._dynamo
  torch._dynamo.config.suppress_errors = True

@sayakpaul
Member

Are you using torch.compile()?

@iszihan
Author

iszihan commented Feb 11, 2025

Yes, it works now after I turned it off with accelerate config. Thank you!!
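
For reference, torch.compile/TorchDynamo can also be turned off for a single run via the launcher flag instead of re-running the interactive config, e.g.:

# Disable the dynamo backend for this run only, overriding whatever was
# chosen during `accelerate config`; the rest of the command is unchanged.
accelerate launch --dynamo_backend no --gpu_ids 0,1 train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --mixed_precision="fp16" \
  ...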

@iszihan iszihan closed this as completed Feb 11, 2025
@sayakpaul
Member

Thanks! Would you maybe like to open a PR with your fixes? This way your contributions would go directly to the library :)

@iszihan
Author

iszihan commented Feb 13, 2025

Just did!
