
Loss Explodes to NaN in large batch-size #8778

Open
IsNoobgrammer opened this issue Mar 3, 2025 · 0 comments
Labels: bug (Something isn't working), xla:tpu (TPU specific issues and PRs)

🐛 Bug

When using a larger batch size on v4-8, the loss becomes NaN after 2 optimizer steps.
This issue is primarily due to newer torch-xla and libtpu versions.

To Reproduce

Here is the Notebook you can use for a quick repro.
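For reference, here is a minimal, hypothetical sketch of the kind of training loop that hits the problem; the model, optimizer, batch size, and data here are placeholders, and the actual notebook may differ.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Toy model and optimizer stand-ins for whatever the notebook actually trains.
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

batch_size = 4096  # assumed large per-device batch; smaller batches reportedly do not trigger the NaN

for step in range(10):
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randn(batch_size, 1024, device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer)  # reduce gradients (if replicated) and apply the update
    xm.mark_step()                # cut and execute the lazy XLA graph for this step

    print(f"step {step}: loss = {loss.item()}")  # reported to become NaN after ~2 steps
```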

Environment

  • Reproducible on XLA backend: TPU (v4-8)
  • torch-xla 2.7.0+gitc76e949 (can occur on any torch-xla version after 2.5.0)
  • libtpu 0.0.10.dev20250210+nightly

Additional context

This was mentioned earlier in #8591 and #8683 (comment).

However, even after trying the recommended options, we were not able to mitigate it.
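
One way to narrow down where the NaN first appears (a generic debugging sketch, not something from the original notebook) is to log the global gradient norm each step around the point where the loss blows up:

```python
import torch

def global_grad_norm(model):
    # L2 norm over all parameter gradients that exist.
    norms = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
    return torch.stack(norms).norm() if norms else torch.tensor(0.0)

# Inside the training loop, after loss.backward() and before the optimizer step:
#   print(f"step {step}: loss = {loss.item()}, grad_norm = {global_grad_norm(model).item()}")
#
# If the gradient norm explodes one step before the loss does, the problem is likely in
# the optimizer update path; if both go NaN at the same step, suspect the forward/backward.
```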

Tagging @tengyifei @zpcore !! Thanks for your efforts

@ysiraichi added the bug and xla:tpu labels on Mar 5, 2025