# Most NaN Losses Aren't Gradient Explosions
Here's a hot take that might save you hours: when your transformer training hits NaN, your first instinct—lowering the learning rate—is usually wrong.
I've watched countless engineers immediately slash their learning rate from 3e-4 to 1e-5 when they see NaN. The training limps along for longer, sure, but it still diverges eventually. The real culprits are almost always elsewhere: mixed precision underflow, attention score overflow, or that one layer norm you forgot to initialize properly.
Let me walk you through the actual debugging process.
Before you can fix NaN, you need to know exactly where it appears. PyTorch's default behavior is annoyingly silent—your loss goes NaN, and you're left guessing which of your 124 million parameters exploded.
The `torch.autograd.set_detect_anomaly(True)` flag exists for this reason, but there's a catch: it slows training by roughly 2-3x and doesn't always pinpoint the exact operation. Here's what actually works:
```python
import torch
```
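As a minimal sketch of this kind of check (assuming a plain `nn.Module` model; `attach_nan_hooks` is an illustrative helper name, not a built-in PyTorch API), you can register a forward hook on every submodule and raise the moment any output stops being finite:

```python
import torch
import torch.nn as nn

def attach_nan_hooks(model: nn.Module):
    """Attach a forward hook to every submodule that raises on the first
    non-finite output, naming the module that produced it.
    (Illustrative helper, not part of PyTorch itself.)"""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # A module may return a tensor or a tuple/list of tensors.
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite values first appeared in: {name}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles


# Usage: attach the hooks, run a forward pass, and the first offending
# module raises with its name instead of silently propagating NaN.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
hooks = attach_nan_hooks(model)
loss = model(torch.randn(4, 16)).sum()
for h in hooks:
    h.remove()  # detach once you've located the bad layer
```

The hooks cost one finiteness check per module per forward pass, so detach them once you've located the offending layer.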
---
*Continue reading the full article on [TildAlice](https://tildalice.io/transformer-nan-loss-debugging/)*