Scratch to SOTA: Build Famous Classification Nets 4 (GoogLeNet)

Martin Luther King
6 min read · Dec 6, 2020

Introduction

Two articles ago, we dissected the structures of AlexNet and the VGG family. While those networks differ largely in their choices of filter sizes, strides, and depths, they all share a straightforward linear architecture. GoogLeNet (and later the whole Inception family), in contrast, has a more complex structure.

At first glance of the well-known GoogLeNet structure diagram and table (see below), we tend to be overwhelmed by the sophistication of the design and baffled by the specific filter choices. However, once we understand the philosophy behind the “Inception” module, the whole structure of GoogLeNet quickly collapses into a simple, traceable pattern.

Overview

  • Explanation of the Inception module
  • Overall architecture of GoogLeNet
  • PyTorch implementation and discussions

GoogLeNet Basics

With its stylized name paying homage to LeNet, GoogLeNet was the winner of the ImageNet 2014 classification challenge. A single GoogLeNet achieves a top-5 error rate of 10.07%. With a 7-model ensemble and an aggressive cropping policy at test time, the error rate is slashed to 6.67%. In comparison, VGG, the first runner-up in the same competition, has an error rate of 7.32%, while AlexNet in 2012 had an error rate of 16.4% when trained without external data.

Its impressive performance comes down to the novel Inception module. In fact, GoogLeNet is but one instantiation of this design heuristic: in the paper, the authors describe GoogLeNet as a “particular incarnation of the Inception architecture”. So let’s look into the Inception module.

Inception Module

As shown below, the naive version of the Inception module is just a set of filters of different sizes alongside a max pooling block. The outputs of these parallel branches are concatenated along the channel dimension to form the output of the module.

Naive Inception
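
To make the parallel branches concrete, here is a minimal PyTorch sketch of the naive module. The class name, argument names, and per-branch channel counts are mine and purely illustrative; the point is the concatenation along the channel dimension.

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Naive Inception: parallel 1x1, 3x3 and 5x5 convolutions plus a 3x3 max pool,
    concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # Every branch preserves the spatial size, so the outputs differ only in channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```

The output has c1 + c3 + c5 + in_ch channels, since the pooling branch passes all of its input channels straight through.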

The authors argue that on the feature map of a given layer, some statistical relationships are local, while others are formed by “feature pixels” that are more spatially spread out. To cover all of these relationships effectively, we should use filters of different sizes: 1x1 filters capture local statistics, while 3x3 and 5x5 filters cover statistics that are more spread out.

Compare this with the VGG family, which exclusively uses 3x3 filters. In terms of the receptive field on the original image, VGG’s filters within a given layer only collect statistics from regions of one fixed size. In contrast, every Inception module in GoogLeNet collects statistics from regions of different sizes, which gives GoogLeNet more representational power. (This illustration is of course not entirely accurate, since after a certain depth all filters are already looking at the whole image. But you get the idea.)

However, the naive version of the module is impractical. As we go deeper into the network, the number of feature map channels increases. Since the cost of a convolution scales with the product of its input and output channels, the computation blows up quadratically as both grow, and the 3x3 and 5x5 filters become too expensive. Therefore, we should reduce the number of channels fed into the 3x3 and 5x5 filters.

1x1 filters

1x1 filters in GoogLeNet are not only used to capture local statistics, but also to reduce the number of input channels. For example, suppose an input feature map has dimensions 14x14x512; we can reduce its channel count by passing it through 24 1x1 filters, giving a feature map of dimensions 14x14x24. We can then apply the larger filters to this reduced feature map with far less computation. This explains the 1x1 convolutions before the 3x3 and 5x5 convolutions in the final Inception module (figure below). There is also a 1x1 convolution after the 3x3 max pooling block. Its effect is similar: it keeps the output channels in check. Since pooling preserves the channel count, without this 1x1 projection the number of output channels could only increase monotonically from module to module.

Final Inception
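
Here is a sketch of how the final module might look in PyTorch, following the reduce-then-convolve pattern described above. The `conv_relu` helper and the argument names are my own shorthand, not the paper’s.

```python
import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch, **kwargs):
    # Convolution followed by ReLU; the 1x1 reduction filters get a ReLU too (see below).
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, **kwargs), nn.ReLU(inplace=True))

class Inception(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = conv_relu(in_ch, c1, kernel_size=1)
        self.b2 = nn.Sequential(
            conv_relu(in_ch, c3_red, kernel_size=1),           # reduce channels first
            conv_relu(c3_red, c3, kernel_size=3, padding=1))   # then apply 3x3
        self.b3 = nn.Sequential(
            conv_relu(in_ch, c5_red, kernel_size=1),           # reduce channels first
            conv_relu(c5_red, c5, kernel_size=5, padding=2))   # then apply 5x5
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            conv_relu(in_ch, pool_proj, kernel_size=1))        # project pooled channels down

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```

With the values from the paper’s table, inception-4b would be Inception(512, 160, 112, 224, 24, 64, 64), mapping a 14x14x512 input to a 14x14x512 output (160 + 224 + 64 + 64 = 512).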

Let’s do a bit of math to demonstrate the reduction in computation from 1x1 filters. Take the 3rd branch (5x5) of the naive version and of the final Inception module, using inception-4b (see table below) as an example. It has an input of size 14x14x512, and the 3rd branch has 64 5x5 filters with a 24-channel 1x1 reduction. The naive version incurs 14x14x512 x 5x5x64 ≈ 161M floating-point multiplications. With the reduction, this drops to 14x14x512 x 1x1x24 + 14x14x24 x 5x5x64 ≈ 10M. This amounts to a reduction of about 16 times!
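
The same arithmetic, written out as a quick sanity check:

```python
# Multiplications for the 5x5 branch of inception-4b: 14x14x512 input, 64 5x5 filters,
# 24-channel 1x1 reduction. Count H * W * C_in * k * k * C_out multiplies per convolution.
H, W, C_in = 14, 14, 512
naive   = H * W * C_in * 5 * 5 * 64                              # ~160.6M
reduced = H * W * C_in * 1 * 1 * 24 + H * W * 24 * 5 * 5 * 64    # ~9.9M
print(f"{naive:,} vs {reduced:,}: about {naive / reduced:.1f}x fewer")
```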


Dual-purpose. Note that these 1x1 reduction filters are also followed by ReLU activations. Thus, in addition to dimension reduction, they add extra non-linearity to the layer.

GoogLeNet Architecture

When we treat the Inception module as a basic building block, GoogLeNet condenses back into a linear model at inference time (training additionally requires two auxiliary side classifiers; we will get to them soon!).

GoogLeNet Structure Diagram

The table below and the diagram above illustrate its structure very clearly. The “S” and “V” in the diagram refer to “same” padding and “valid” padding, which is how padding is specified in TensorFlow. Here and here are Stack Overflow answers explaining them. We can use the formulae in those answers to compute the corresponding padding values that we need in PyTorch.
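
As a rough sketch (the helper name is mine), the “same” padding along one spatial dimension can be computed like this; “valid” is simply padding=0.

```python
def same_padding_1d(kernel_size, stride, in_size):
    """Padding (split as left/right) that reproduces TensorFlow's 'SAME' output size,
    i.e. ceil(in_size / stride), along one spatial dimension."""
    out_size = -(-in_size // stride)                    # ceil division
    total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    return total // 2, total - total // 2

# The 7x7/2 stem convolution on a 224x224 input needs a total padding of 5 per dimension.
print(same_padding_1d(7, 2, 224))   # (2, 3)
```

Since nn.Conv2d only accepts symmetric padding, an odd total either needs an explicit F.pad beforehand or a small deviation; torchvision’s GoogLeNet, for instance, simply uses padding=3 for the stem convolution.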

Note: in this article, we ignore the Local Response Normalization layers.

GoogLeNet Structure Table
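
Reading the table row by row, the body of the network is just a convolutional stem followed by stacked Inception modules. Below is a partial sketch of the first rows, reusing the `conv_relu` and `Inception` sketches from earlier; the channel counts are copied from the table, and ceil-mode pooling is one way to reproduce the listed feature-map sizes.

```python
import torch.nn as nn

# Assumes the conv_relu and Inception sketches defined earlier in this article.
stem_and_stage3 = nn.Sequential(
    conv_relu(3, 64, kernel_size=7, stride=2, padding=3),   # 224x224x3  -> 112x112x64
    nn.MaxPool2d(3, stride=2, ceil_mode=True),               #            -> 56x56x64
    conv_relu(64, 64, kernel_size=1),                        # 3x3 reduce
    conv_relu(64, 192, kernel_size=3, padding=1),            #            -> 56x56x192
    nn.MaxPool2d(3, stride=2, ceil_mode=True),               #            -> 28x28x192
    Inception(192, 64, 96, 128, 16, 32, 32),                 # 3a         -> 28x28x256
    Inception(256, 128, 128, 192, 32, 96, 64),               # 3b         -> 28x28x480
    nn.MaxPool2d(3, stride=2, ceil_mode=True),               #            -> 14x14x480
)
```

The remaining rows follow the same pattern: inception 4a–4e, another max pool, inception 5a–5b, a 7x7 average pool, dropout, and the final linear classifier.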

Auxiliary Classifiers

GoogLeNet has 22 convolutional/linear layers (if we sum up the depth column in the table), which was considered very deep at the time.

Such depth generally raises concerns about gradient backpropagation, model efficiency, and overfitting. To alleviate these issues, the authors proposed attaching two auxiliary classifiers to intermediate layers.

The auxiliary classifiers stem from two considerations. The first comes from the insight that relatively shallow networks can already perform strongly on image classification tasks. This suggests that the feature maps after a few rounds of convolution should already be discriminative enough, so we attach classifiers to the lower layers to encourage exactly this property. In the process, we also push the parameters of the lower layers to generate more interpretable features, preventing them from just projecting the features into whatever space best fits the training data. This constraint on the parameters has a regularizing effect. (I have to admit that the last two sentences are my own rationalization of how the regularizing effect comes about. I may have committed the mistake of over-interpretation 😐).

The second consideration is gradient flow. Many gradients in the backward pass can be killed by ReLU units or diminished by small weights and intermediate features; the auxiliary classifiers provide more direct gradient paths to the lower layers, enhancing their training.
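
Following the paper’s description (a 5x5 average pooling with stride 3, a 128-filter 1x1 convolution, a 1024-unit fully connected layer, 70% dropout, and the linear classifier), an auxiliary classifier might be sketched as below; the class name is mine.

```python
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary classifier head attached to an intermediate 14x14 feature map."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)    # 14x14 -> 4x4
        self.conv = nn.Sequential(nn.Conv2d(in_ch, 128, kernel_size=1), nn.ReLU(inplace=True))
        self.fc1 = nn.Sequential(nn.Linear(128 * 4 * 4, 1024), nn.ReLU(inplace=True))
        self.dropout = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.conv(self.pool(x))
        x = self.fc1(x.flatten(1))
        return self.fc2(self.dropout(x))
```

In the paper, the two auxiliary heads sit on top of inception-4a and inception-4d; during training their losses are added to the main loss with a weight of 0.3, and at inference they are simply discarded.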
