
Attempting to Unscale FP16 Gradients Bug #10752

Closed
iszihan opened this issue Feb 10, 2025 · 7 comments · May be fixed by #10783
Labels: bug (Something isn't working), training

Comments

@iszihan

iszihan commented Feb 10, 2025

Describe the bug

Hello, I get the following error when trying to train a LoRA with SDXL:

ValueError: Attempting to unscale FP16 gradients.
Traceback (most recent call last):
  File "/nfs/horai.dgpsrv/year/zling/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
    main(args)
  File "/nfs/horai.dgpsrv/year/zling/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/u8/c/zling/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2396, in clip_grad_norm_
    self.unscale_gradients()
  File "/u8/c/zling/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2340, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
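
For reference, the message comes from torch.amp.GradScaler, which refuses to unscale gradients that are stored in fp16. A minimal standalone snippet (not from the training script; it assumes a CUDA device) that triggers the same error:

import torch

# Toy setup that hits the same check: the parameters (and therefore the
# gradients) live in fp16, while GradScaler expects fp32 master weights.
model = torch.nn.Linear(4, 4).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

loss = model(torch.randn(2, 4, device="cuda", dtype=torch.float16)).sum()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # ValueError: Attempting to unscale FP16 gradients.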

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
# export INSTANCE_DIR="dog"
export INSTANCE_DIR="/scratch/year/zling/progressive-shading/picasso-data/surrealism_images"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch --gpu_ids 0,1 train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME  \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a drawing in sks style" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A drawing of sks style" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub

Logs

System Info

  • 🤗 Diffusers version: 0.33.0.dev0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.28.1
  • Transformers version: 4.48.3
  • Accelerate version: 1.3.0
  • PEFT version: 0.7.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.2
  • xFormers version: not installed
  • Accelerator:
    NVIDIA RTX A6000, 49140 MiB
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

@iszihan iszihan added the bug Something isn't working label Feb 10, 2025
@hlky hlky added the training label Feb 10, 2025
@hlky
Collaborator

hlky commented Feb 10, 2025

@sayakpaul

@sayakpaul
Member

Can you try to apply these fixes?
#9628 (comment)
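
The general idea behind fixes of this kind: when training with --mixed_precision="fp16", the trainable LoRA parameters have to stay in fp32, otherwise GradScaler ends up trying to unscale fp16 gradients. A minimal sketch of that pattern (not the exact diff from the linked comment; it assumes a unet with LoRA adapters already attached, an accelerate Accelerator, and diffusers' cast_training_params helper):

import torch
from diffusers.training_utils import cast_training_params

weight_dtype = torch.float16
unet.to(accelerator.device, dtype=weight_dtype)  # frozen base weights can stay in fp16

if accelerator.mixed_precision == "fp16":
    # Upcast only the trainable (LoRA) parameters back to fp32 so that
    # GradScaler.unscale_() never sees fp16 gradients.
    cast_training_params([unet], dtype=torch.float32)

params_to_optimize = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params_to_optimize, lr=1e-4)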

@iszihan
Author

iszihan commented Feb 10, 2025

Thank you! That seems to get rid of this particular error, but now I have a new one:

File "/u8/c/zling/.local/lib/python3.10/site-packages/torch/_inductor/fx_passes/quantization.py", line 1477, in fn
  scales = match.kwargs["scales"].meta["val"]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AttributeError: 'float' object has no attribute 'meta'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
  import torch._dynamo
  torch._dynamo.config.suppress_errors = True

@sayakpaul
Member

Are you using torch.compile()?

@iszihan
Author

iszihan commented Feb 11, 2025

Yes, it works now after I turned it off with accelerate config. Thank you!!
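
For reference, torch.compile/TorchDynamo can also be turned off for a single run via the launcher flag instead of re-running the interactive config, e.g.:

# Disable the dynamo backend for this run only, overriding whatever was
# chosen during `accelerate config`; the rest of the command is unchanged.
accelerate launch --dynamo_backend no --gpu_ids 0,1 train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --mixed_precision="fp16" \
  ...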

@iszihan iszihan closed this as completed Feb 11, 2025
@sayakpaul
Member

Thanks! Would you maybe like to open a PR with your fixes? This way your contributions would go directly to the library :)

@iszihan
Author

iszihan commented Feb 13, 2025

Just did!
