VGGNet: A Deep CNN Approach to Large-Scale Image Recognition
Last time, we talked about how CNNs use convolution to process images. Today, let’s dive deeper into one of the most influential models in the field of image recognition: VGGNet.
This post is based on the book Easy Deep Learning by 혁펜하임, which I received as a sponsored review copy as part of my review activities for the book.
In 2014, a deep learning model called VGGNet took the computer vision world by storm, securing 2nd place in the ImageNet Large-Scale Visual Recognition Challenge. Developed by the Visual Geometry Group (VGG) at Oxford University, VGGNet explored a fundamental question: How does the depth of a Convolutional Neural Network (CNN) affect its performance on large-scale image recognition tasks?
ImageNet is a massive dataset with 1,000 object categories, and models are evaluated by the Top-5 Error Rate: the rate at which the correct label is not among the model's five highest-scoring predictions. Top-5 is used instead of Top-1 because many ImageNet images contain more than one object while carrying only a single ground-truth label, so a prediction is counted as correct if the true label appears anywhere in the top 5.
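To make the metric concrete, here is a minimal sketch of how a Top-5 error computation might look in PyTorch (the function name and the toy tensors are just for illustration):

```python
import torch

def top5_error(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of samples whose true label is NOT among the 5 highest-scoring classes."""
    # Indices of the 5 largest logits per sample: shape (batch, 5)
    top5 = logits.topk(5, dim=1).indices
    # A prediction counts as correct if the true label appears anywhere in the top 5
    correct = (top5 == labels.unsqueeze(1)).any(dim=1)
    return 1.0 - correct.float().mean().item()

# Toy usage: 8 samples, 1000 ImageNet classes
logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(f"Top-5 error: {top5_error(logits, labels):.2%}")
```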
Why VGGNet? – A Deep Dive into Depth
The core goal of VGGNet was to test how network depth affects accuracy. While previous models such as AlexNet relied on larger convolution filters (11x11 and 5x5), VGGNet stacked small 3x3 filters on top of each other, allowing the network to grow deeper.
The key issue with large filters is how quickly they shrink the feature map: a stride-1 convolution without padding reduces each spatial dimension by (filter size - 1), so large filters drive the feature map toward the smallest possible size (1x1) in only a few layers. Once the feature map reaches 1x1, no spatial structure remains and no further convolutional layers can be added. By opting for small 3x3 filters, VGGNet shrank its feature maps slowly, enabling the network to go deeper and capture more complex patterns in the data.
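To see how quickly large filters use up depth, here is a small sketch that counts how many stride-1, unpadded convolutions of a given size fit before a 224x224 feature map collapses to 1x1 (a simplified model of exactly the no-padding scenario described above):

```python
def depth_until_1x1(input_size: int, kernel: int) -> int:
    """How many stride-1, unpadded convolutions fit before the feature map hits 1x1.

    Each such convolution shrinks each spatial dimension by (kernel - 1).
    """
    depth = 0
    size = input_size
    while size - (kernel - 1) >= 1:
        size -= kernel - 1
        depth += 1
    return depth

for k in (3, 5, 7, 11):
    print(f"{k}x{k} filters: up to {depth_until_1x1(224, k)} conv layers from a 224x224 input")
```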
VGGNet's team experimented with six configurations (A, A-LRN, B, C, D, E), each designed to test a different depth and structure. They found that Local Response Normalization (LRN) did not improve performance: configuration A-LRN performed no better than plain A, so LRN was left out of the deeper configurations B, C, D, and E.
The Power of the 3x3 Filter
Stacking two 3x3 filters produces the same receptive field as a single 5x5 filter (and, without padding, the same output size). So you might wonder: why not just go with the seemingly simpler 5x5 filter?
And here's another interesting fact: stacking three 3x3 filters in a row covers the same receptive field as using a 7x7 filter just once. So why not use the 7x7 filter directly?
Here's why the 3x3 filter approach still wins:
More non-linearity: each 3x3 convolution is followed by a ReLU, so a stack of two (or three) of them applies the activation function two (or three) times instead of once, making the decision function more discriminative.
Fewer parameters: with C input and C output channels, two 3x3 layers use 2 x (3 x 3 x C x C) = 18C^2 weights, versus 25C^2 for one 5x5 layer; three 3x3 layers use 27C^2 weights, versus 49C^2 for one 7x7 layer.
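A quick sketch to check the parameter arithmetic above (bias terms are omitted, and the channel count C = 256 is just an example, matching VGGNet's third block):

```python
def conv_params(kernel: int, channels: int, layers: int) -> int:
    """Weights in `layers` stacked kernel x kernel convolutions, channels -> channels (no biases)."""
    return layers * kernel * kernel * channels * channels

C = 256  # example channel count
print(conv_params(3, C, 2), "vs", conv_params(5, C, 1))  # 18*C^2 vs 25*C^2
print(conv_params(3, C, 3), "vs", conv_params(7, C, 1))  # 27*C^2 vs 49*C^2
```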
VGGNet Version D
If you look at the VGGNet configuration diagram, you'll notice that every convolution layer is labeled conv3, indicating the use of 3x3 convolution filters throughout the network, as we discussed earlier. For this article, we’ll focus on Version D of VGGNet. In this version, all convolution layers use padding = 1 and stride = (1, 1). This setup ensures that the spatial dimensions of the feature maps remain the same after each convolution operation. After the convolution layers, max pooling with a 2x2 filter and stride = (2, 2) is applied, reducing the width and height of the feature maps by half.
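Here is a minimal PyTorch sketch confirming both behaviors: a 3x3 convolution with padding = 1 and stride = 1 preserving spatial size, and a 2x2 max pool with stride = 2 halving it:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # a single 224x224 RGB image
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(conv(x).shape)        # torch.Size([1, 64, 224, 224]) -- spatial size preserved
print(pool(conv(x)).shape)  # torch.Size([1, 64, 112, 112]) -- halved by pooling
```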
Let’s break down the structure of VGGNet Version D in more detail:
1. Input: a 224 x 224 RGB image, i.e. an input tensor of 3 x 224 x 224.
2. First Convolution Block: two conv3-64 layers (64 filters each) produce a 64 x 224 x 224 feature map, and 2x2 max pooling then reduces it to 64 x 112 x 112.
3. Second Convolution Block: two conv3-128 layers raise the channel count to 128, and max pooling reduces the feature map to 128 x 56 x 56.
This pattern repeats throughout the architecture, except that from the third block onward VGGNet uses three convolution layers in a row instead of the two used in the first two blocks. As you pass through the blocks, the number of filters doubles at each step, 64 -> 128 -> 256 -> 512, and then stays at 512 in the fifth block.
After the last convolution block, we obtain a feature map of 512 x 7 x 7. This is the result of inputting a 224 x 224 image and applying five max pooling operations, reducing the spatial dimensions by half each time. Mathematically, after applying max pooling five times, the width and height are divided by 2^5 (224 / 2^5 = 7).
4. Fully Connected Layers (MLP) for Classification
After the convolutional layers, VGGNet transitions to fully connected layers (MLP) to perform classification:
FC-4096: The 512 x 7 x 7 feature map is flattened into a vector of 512 x 7 x 7 = 25,088 values, which this fully connected layer maps to 4096 nodes.
Another FC-4096: A second fully connected layer with 4096 nodes.
FC-1000: The final fully connected layer reduces the number of nodes to 1000, which corresponds to the 1000 classes in the ImageNet dataset.
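Putting all of this together, here is a compact sketch of the Version D stack as described above. It follows the architecture table rather than the authors' original code, and the dropout that the paper applies after the first two fully connected layers is omitted for brevity:

```python
import torch
import torch.nn as nn

# Channel progression for Version D; 'M' marks a 2x2 max pool.
CFG_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg_d(num_classes: int = 1000) -> nn.Sequential:
    layers, in_ch = [], 3
    for v in CFG_D:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(
        *layers,
        nn.Flatten(),  # 512 x 7 x 7 -> 25,088
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, num_classes),
    )

model = make_vgg_d()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```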
Conclusion
By using 3x3 filters, VGGNet was able to create a deep network architecture that could effectively capture complex patterns in images while maintaining relatively simple convolution operations. This deep architecture, along with the careful use of max pooling and fully connected layers, contributed to its impressive performance on ImageNet.
VGGNet’s design has had a lasting impact on the field, influencing the development of future deep learning architectures and becoming a popular model for transfer learning in various computer vision tasks.
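For instance, torchvision ships a pretrained VGG-16 (Version D) that can be adapted to a new task by freezing the convolutional features and replacing the final FC-1000 layer. A brief sketch, where the 10-class head is just an example:

```python
import torch.nn as nn
import torchvision.models as models

# Load VGG-16 (Version D) with ImageNet-pretrained weights
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the convolutional feature extractor
for p in model.features.parameters():
    p.requires_grad = False

# Swap the final FC-1000 layer for the new task
model.classifier[6] = nn.Linear(4096, 10)
```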