# Most NaN Losses Aren't Gradient Explosions
Here's a hot take that might save you hours: when your transformer training hits NaN, your first instinct—lowering the learning rate—is usually wrong.
I've watched countless engineers immediately slash their learning rate from 3e-4 to 1e-5 when they see NaN. The training limps along for longer, sure, but it still diverges eventually. The real culprits are almost always elsewhere: mixed precision underflow, attention score overflow, or that one layer norm you forgot to initialize properly.
Let me walk you through the actual debugging process.
Before you can fix NaN, you need to know exactly where it appears. PyTorch's default behavior is annoyingly silent—your loss goes NaN, and you're left guessing which of your 124 million parameters exploded.
The `torch.autograd.set_detect_anomaly(True)` flag exists for this reason, but there's a catch: it slows training by roughly 2-3x and doesn't always pinpoint the exact operation. Here's what actually works:
```python
import torch
```
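As a minimal sketch of this kind of check (assuming a plain `nn.Module` model; `attach_nan_hooks` is an illustrative helper name, not a built-in PyTorch API), you can register a forward hook on every submodule and raise the moment any output stops being finite:

```python
import torch
import torch.nn as nn

def attach_nan_hooks(model: nn.Module):
    """Attach a forward hook to every submodule that raises on the first
    non-finite output, naming the module that produced it.
    (Illustrative helper, not part of PyTorch itself.)"""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # A module may return a tensor or a tuple/list of tensors.
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite values first appeared in: {name}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles


# Usage: attach the hooks, run a forward pass, and the first offending
# module raises with its name instead of silently propagating NaN.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
hooks = attach_nan_hooks(model)
loss = model(torch.randn(4, 16)).sum()
for h in hooks:
    h.remove()  # detach once you've located the bad layer
```

The hooks cost one finiteness check per module per forward pass, so detach them once you've located the offending layer.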
---
*Continue reading the full article on [TildAlice](https://tildalice.io/transformer-nan-loss-debugging/)*