shangkyu shin

Why do CNNs outperform fully connected neural networks on image tasks? This article explains local connectivity, weight sharing, pooling, and inductive bias in a practical, developer-friendly way.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/introduction-to-cnns-en/
CNNs were not invented just because researchers wanted a better benchmark score.
They were invented because applying a standard Multilayer Perceptron to images is a bad fit.
A fully connected network treats an image as a long flat vector. That already hints at the problem: images are not flat in any meaningful visual sense.
They are spatial.
They have local patterns.
And the same useful feature can appear in different positions.
That mismatch is the whole reason CNNs matter.
Take a 200 × 200 RGB image.
That gives:
200 × 200 × 3 = 120,000 input values
Now connect that input to a hidden layer with 1,000 neurons.
You get about 120 million weights.
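The arithmetic above can be checked in a couple of lines (the layer sizes are the ones from the example):

```python
# Parameter count for a dense layer on a 200 x 200 RGB image,
# using the numbers from the example above.
height, width, channels = 200, 200, 3
inputs = height * width * channels   # flattened input size
hidden = 1_000                       # hidden-layer neurons

weights = inputs * hidden            # one weight per (input, neuron) pair
print(inputs)   # 120000
print(weights)  # 120000000 -> about 120 million weights before biases
```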
That is bad for three reasons:

- The weight matrix alone is enormous, which is expensive in memory and compute.
- A model with that many free parameters needs far more training data to avoid overfitting.
- Every pixel becomes an unrelated feature, so all spatial structure is thrown away.
So the issue is not just "too many parameters."
The deeper issue is that a dense layer starts from the wrong assumption.
For tabular data, treating inputs as a feature vector is often fine.
For images, it is not.
Why?
Because image data has properties that matter directly:

- Locality: nearby pixels are strongly related, while distant pixels usually are not.
- Translation: the same pattern can appear anywhere in the frame.
- Hierarchy: simple features compose into larger, more complex structures.
A cat is still a cat whether it appears slightly left or slightly right.
A model for images should reflect that.
CNNs solve this by injecting a useful inductive bias.
Instead of saying, "learn everything from scratch," CNNs say, "learn small local patterns once, then reuse them across the whole image."
That one architectural choice changes both efficiency and generalization.
In a dense layer, each neuron connects to the full input.
In a convolutional layer, each neuron looks at a small local patch.
That patch is the receptive field.
This makes sense for images because most meaningful low-level features are local:

- edges and corners
- color transitions
- small texture patches
You do not need the whole image to detect a vertical edge in one region.
From an engineering perspective, local connectivity dramatically cuts parameter count.
From a modeling perspective, it aligns the network with the structure of visual data.
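To make the savings concrete, here is a rough comparison of the dense layer above with a single small convolutional layer. The 3 × 3 filter size and 32-channel count are illustrative choices, not figures from the article:

```python
# Rough parameter comparison: dense layer vs. one conv layer on the same input.
in_h, in_w, in_c = 200, 200, 3

# Dense: every input pixel connects to every hidden neuron (as in the text).
dense_params = (in_h * in_w * in_c) * 1_000

# Conv: each filter only sees a k x k x in_c patch, regardless of image size.
k, out_c = 3, 32                              # 3x3 filters, 32 output channels
conv_params = (k * k * in_c) * out_c + out_c  # weights + biases

print(dense_params)  # 120000000
print(conv_params)   # 896
```

Note that the conv layer's parameter count does not depend on the image size at all, only on the filter size and channel counts.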
This is the design principle that makes convolution feel elegant.
A CNN does not learn a separate edge detector for every location in the image.
It learns one filter and applies it across locations.
That means the same detector can fire on the left side, center, or right side of the input.
This gives us two big wins.

First, parameter efficiency: instead of learning duplicated weights for similar patterns at many positions, the model reuses the same filter.

Second, consistency under shifts: if the input shifts, the activation pattern shifts consistently. That is translation equivariance.
A simple intuition: if the pattern moves, the response moves with it, and nothing else about the response changes.
For early visual processing, that is exactly what we want.
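Translation equivariance is easy to see in a toy 1D example with NumPy; the signal and filter values below are made up for illustration:

```python
import numpy as np

# A small bump in a 1D signal, and the same bump shifted one step right.
x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
x_shift = np.roll(x, 1)          # same pattern, one position to the right
k = np.array([1., 0., -1.])      # a simple edge-like filter

y = np.convolve(x, k, mode='valid')
y_shift = np.convolve(x_shift, k, mode='valid')

# Away from the borders, the response to the shifted input is exactly
# the shifted response: shift in, shift out.
print(np.allclose(y_shift[1:], y[:-1]))  # True
```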
Pooling is often introduced as a downsampling step, and that is true, but it is more useful to think of it as controlled compression.
It reduces the size of feature maps while preserving the strongest or most representative signals.
Common examples are max pooling and average pooling.
Why is that useful?

- It shrinks later feature maps, which saves compute and parameters.
- It enlarges the effective receptive field of deeper layers.
- It makes the representation less sensitive to tiny spatial shifts.
A subtle but important point: pooling does not create perfect invariance by itself.
What it really gives is robustness to minor local variation.
That is a better mental model.
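A minimal 2 × 2 max-pooling sketch in NumPy shows both the compression and the robustness; the feature-map values are invented for the demo:

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on an (H, W) feature map (H, W even)."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A toy feature map with one strong activation.
a = np.array([[0., 0., 0., 0.],
              [0., 9., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 0., 0.]])

# Nudge the activation by one pixel, staying inside the same 2x2 window.
b = np.zeros_like(a)
b[0, 0] = 9.

print(max_pool_2x2(a))  # 2x2 output; the strong signal survives
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True: small shift absorbed
```

Note the limit of this robustness: if the activation crossed into a neighboring pooling window, the output would change, which is exactly why pooling gives robustness to minor variation rather than perfect invariance.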
CNNs are not just smaller versions of dense networks.
They are structured models.
That matters because generalization improves when the architecture matches the data domain.
CNNs help by:

- encoding locality through small receptive fields
- encoding translation structure through weight sharing
- building spatial hierarchies through stacked layers and pooling
So when people say CNNs are efficient, they do not just mean "faster."
They mean the model wastes less capacity on unrealistic hypotheses.
One of the nicest ways to understand CNNs is to think in terms of feature maps.
A filter scans the image and produces a map showing where that filter’s learned pattern appears.
Early filters often learn things like:

- edges at different orientations
- color contrasts
- simple repeated textures

Deeper layers then combine those into:

- corners and contours
- object parts
- whole objects
This is hierarchical representation learning.
In practice, CNNs move from "small visual primitives" to "larger semantic concepts."
That is why deep convolutional networks became so effective in computer vision.
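Here is a minimal sketch of that scanning process in NumPy, using a hand-written vertical-edge filter; real CNNs learn their filters from data, and the image values are invented:

```python
import numpy as np

def feature_map(img, filt):
    """Valid cross-correlation of a 2D image with a 2D filter (naive loops)."""
    fh, fw = filt.shape
    h, w = img.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out

# A vertical edge in a tiny image: dark on the left, bright on the right.
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])

# Responds when a dark column sits next to a bright column.
vertical_edge = np.array([[-1., 1.],
                          [-1., 1.]])

fmap = feature_map(img, vertical_edge)
print(fmap)  # strongest responses line up on the dark-to-bright boundary
```

The output map is exactly what the text describes: a grid showing *where* the filter's pattern appears, with the strong activations concentrated on the edge column.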
Here is the cleanest mental contrast:

- An MLP assumes any input feature can relate to any other with equal ease, and position carries no special meaning.
- A CNN assumes nearby pixels relate strongly, the same pattern can recur anywhere, and features compose hierarchically.
So the difference is not just architecture style.
It is a difference in how the model thinks the data is organized.
Once these core ideas were established, later CNN families such as AlexNet, VGG, and ResNet mainly improved optimization, depth, and efficiency.
Different design, same foundation.
All of them rely on the logic introduced above.
The most important thing to learn from CNNs is bigger than CNNs.
Good model design is about matching architecture to data structure.
For images, that means locality, repeated patterns, and spatial hierarchy.
CNNs encode those assumptions directly.
That is why they work.
If you remember one line, remember this:
MLPs treat images like generic vectors. CNNs treat images like images.
That is the real reason convolution changed computer vision.
What part of CNN design do you think mattered most historically: local connectivity, weight sharing, or later innovations like residual connections?