shangkyu shin
Understanding CNNs is not about memorizing layers.
It’s about understanding why this design exists.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/convolutional-layer-lec-en/
Images are structured data.
A fully connected network treats them as flat vectors.
Example:
224×224×3 → 150K inputs
Dense layer → millions of parameters
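To make "millions of parameters" concrete, here is a quick sketch (the hidden size of 1,000 units is an assumption, chosen only for illustration):

```python
# Parameter count when a 224×224×3 image is flattened for a dense layer.
height, width, channels = 224, 224, 3
inputs = height * width * channels   # 150,528 values per image ("150K")
hidden = 1000                        # hypothetical hidden-layer size
weights = inputs * hidden            # one weight per connection
biases = hidden
print(inputs)                        # 150528
print(weights + biases)              # 150529000 — over 150 million parameters
```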
Problems:
→ the spatial structure of the image is thrown away
→ the parameter count explodes
→ the model overfits easily
CNN introduces two key ideas:
→ local connectivity (each unit sees only a small patch)
→ weight sharing (the same filter is reused across the image)
Instead of connecting everything:
→ look locally, reuse globally
Image → Conv → ReLU → Pool → Conv → ... → FC → Softmax
A filter slides across the image.
At each position:
→ element-wise multiply, sum, add bias → one output value
Input: 32×32×3
Filter: 5×5×3
Output: 28×28
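The output size follows from how far the filter can slide. A minimal sketch of the standard formula (stride 1 and no padding assumed, matching the numbers above):

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Number of positions the filter can occupy along one axis."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 32×32×3 input, 5×5×3 filter → 28×28 feature map
side = conv_output_size(32, 5)
print(side)  # 28
```

The filter spans the full depth (3 channels), so sliding happens only in height and width — which is why the output is 2D.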
Feature maps are representations.
They answer:
→ where is this feature?
ReLU: f(x) = max(0, x)
Without it:
→ stacked convolutions collapse into one linear map
With it:
→ the network can model non-linear patterns
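A quick sketch of why the non-linearity matters: two linear maps compose into a single linear map, but inserting ReLU breaks that collapse (the toy weights 2 and 3 are made up for illustration):

```python
def relu(x):
    return max(0.0, x)

# Two "layers" without ReLU: f(g(x)) = 2 * (3 * x) = 6 * x — still linear.
f = lambda x: 2 * x
g = lambda x: 3 * x
print(f(g(5)))           # 30, same as one layer with weight 6

# With ReLU in between, negative activations are clipped — no longer linear.
print(f(relu(g(-5))))    # 0.0
print(f(relu(g(5))))     # 30
```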
2×2 max pooling: 28×28 → 14×14
Benefits:
→ fewer computations downstream
→ some robustness to small shifts
→ a larger effective receptive field
CNNs are not truly translation invariant.
Pooling only makes them more robust to shifts.
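The 2×2 pooling step (28×28 → 14×14) can be sketched in plain Python, assuming max pooling with stride 2 and no padding:

```python
def max_pool_2x2(fmap):
    """Downsample a 2D feature map by taking the max of each 2×2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]]
```

Each output value only records that a feature fired somewhere in its 2×2 block — which is exactly why small shifts of the input often leave the output unchanged.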
Too much pooling:
→ destroys spatial detail
Modern CNNs:
→ reduce pooling
→ use strided convolution
Flatten → combine features → classify
Softmax → probabilities
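A minimal softmax sketch (the three class scores are made up for illustration):

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # highest score → highest probability
print(sum(probs))  # 1.0 (up to floating-point error)
```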
CNNs learn progressively:
| Layer | Learns |
|---|---|
| Early | edges |
| Middle | textures |
| Deep | objects |
Example:
edge → eye → face
CNN:
→ parameters shared across the whole image
Dense:
→ a separate weight for every input
Use:
→ convolutions whenever the input has spatial structure
These help:
→ fewer parameters, better generalization, built-in spatial priors
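To see why weight sharing matters, compare parameter counts for the same number of outputs (the filter count of 16 and the dense layer size are assumptions for illustration):

```python
# Conv layer: parameters depend only on filter size, not image size.
filter_h, filter_w, in_channels, num_filters = 5, 5, 3, 16
conv_params = (filter_h * filter_w * in_channels + 1) * num_filters
print(conv_params)   # 1216 parameters, reused at every position

# Dense layer on the flattened 32×32×3 image, with 16 output units.
dense_params = (32 * 32 * 3 + 1) * 16
print(dense_params)  # 49168 parameters — and it grows with image size
```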
The real breakthrough of CNNs is not just convolution.
It is the combination of:
→ local filters
→ non-linearity
→ downsampling
→ depth
That’s what turns pixels into meaning.
For image tasks today, do you still start with CNNs, or jump straight to Vision Transformers?
Let’s discuss 👇