Convolutional Neural Networks: More Dogs, Cats, Frogs, and Cars
To teach an algorithm how to recognize objects in images, data scientists use a specific type of Artificial Neural Network: a Convolutional Neural Network (CNN). Their name stems from one of the most important operations in the network: convolution.
The example GIF above shows an implementation of a CNN (LeNet) used for recognizing handwritten digits. ConvNets, or Convolutional Neural Networks, are a class of deep neural networks that are biologically-inspired variants of the multi-layer perceptron (MLP). ConvNets are inspired by the functioning of the visual cortex, which I will cover in the next section.
Trivia: The average number of neurons in the adult human primary visual cortex in each hemisphere has been estimated at around 140 million.
The motivation: How is an image processed by organisms?
Taking inspiration from an explanation of the visual system by my 3rd grader, I am putting this section together.
In 1968, David Hubel and Torsten Wiesel explained in their joint research paper, Receptive fields and functional architecture of monkey striate cortex, that the visual cortex (V1) of complex organisms with advanced sensory organs contains an elaborate arrangement of light-sensitive cells, divided into six functionally distinct layers labelled 1 to 6.
Overlapping Receptive Fields
Cells in the V1 layer of the visual cortex are sensitive to small sub-regions of the visual field, called receptive fields. In other words, the receptive field of a single sensory neuron is the specific region of the retina in which a stimulus will affect the firing of that neuron (that is, will activate the neuron). Every sensory neuron cell has a similar receptive field, and these fields overlap. Together, the receptive fields are tiled to cover the entire visual field.
Hubel and Wiesel advanced the theory that,
receptive fields of cells at one level of the visual system are formed from input by cells at a lower level of the visual system.
In this way, small, simple receptive fields could be combined to form large, complex receptive fields. Later theorists elaborated this simple, hierarchical arrangement by allowing cells at one level of the visual system to be influenced by feedback from higher levels.
The primary visual cortex cells act as local filters over the input visual space and are well-suited to exploit the strong spatially local correlation present in natural images.
In their paper, Hubel and Wiesel described two basic types of visual neuron cells in the brain, each acting in a different way: simple cells (S cells) and complex cells (C cells).
- Simple cells respond maximally to specific edge-like patterns within their receptive field. In other words, simple cells activate when they identify basic shapes, such as lines, in a fixed area and at a specific angle.
- Complex cells have larger receptive fields, and their output is not sensitive to the specific position of the stimulus within the field. A complex cell continues to respond to a certain stimulus even when its absolute position on the retina changes.
A few learning models developed by emulating the visual cortex are: NeoCognitron [Fukushima80], HMAX [Serre07], and LeNet-5 [LeCun98].
Next, let us turn to the computing side, starting with how a computer stores an image.
How is an image represented in a computer?
Each image can be represented as a multi-dimensional array. At its simplest, an image is a 2D matrix of pixel values. A colored image can be thought of as three 2D matrices (also referred to as channels) stacked over each other (red, green, blue), with each matrix entry holding a pixel value in the range of 0 to 255. A greyscale image, on the other hand, can be represented with just one channel, i.e. a single 2D matrix.
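Here is a minimal sketch of both representations in code (using NumPy, with made-up pixel values):

```python
import numpy as np

# A 4x4 greyscale image: a single 2D matrix of pixel values (0-255).
grey = np.array([[  0,  50, 100, 150],
                 [ 25,  75, 125, 175],
                 [ 50, 100, 150, 200],
                 [ 75, 125, 175, 255]], dtype=np.uint8)
print(grey.shape)   # (4, 4) -> height x width, one channel

# A colored image of the same size: three such matrices stacked
# along a depth axis, one per channel (R, G, B).
color = np.stack([grey, grey, grey], axis=-1)
print(color.shape)  # (4, 4, 3) -> height x width x channels
```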
What's the difference: Regular ANNs versus CNNs
Regular ANNs share a few general properties:
- Firstly, they transform an input by putting it through a series of hidden layers.
- Secondly, every layer is made up of a set of neurons, and each neuron is fully connected to all neurons in the layer before.
- Finally, there is a last fully-connected layer, the output layer, that represents the predictions (see the sketch below).
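As a minimal sketch of what one such fully-connected layer computes, here is a single layer in plain NumPy with random toy values (the ReLU non-linearity is one common choice, not something prescribed above):

```python
import numpy as np

# One fully-connected layer: every output neuron sees every input,
# so the layer is just a matrix multiply plus a bias.
rng = np.random.default_rng(0)
x = rng.random(4)             # 4 input activations
W = rng.random((3, 4))        # 3 neurons, each with 4 weights
b = rng.random(3)             # one bias per neuron
h = np.maximum(0, W @ x + b)  # ReLU non-linearity
print(h)                      # 3 output activations
```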
On the other hand, ConvNets are a special case of ANNs and are a bit different:
- First of all, the layers are organized in 3 dimensions: width, height and depth.
- Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. (Sparse connections)
- Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension.
Sparse connections: CNNs exploit spatially-local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. In other words, the inputs of hidden units in layer m come from a subset of units in layer m-1, units that have spatially contiguous receptive fields. This functional aspect of sparse connections is analogous to the notion of receptive fields explained earlier.
- The receptive field of neurons in layer m with respect to layer m-1 is 3, but their receptive field with respect to the input is larger (5).
- Each unit is unresponsive to variations outside of its receptive field with respect to the retina.
This architecture pattern thus ensures that the learnt “filters” produce the strongest response to a spatially local input pattern.
- Stacking many such layers leads to (non-linear) “filters” that become increasingly “global” (i.e. responsive to a larger region of pixel space), as the sketch below illustrates.
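A minimal sketch of that growth, assuming stride-1 convolutions with no dilation (the 3x3 kernel size matches the example above):

```python
# The receptive field of a stack of 3x3 convolutions (stride 1,
# no dilation) grows by (kernel_size - 1) with every layer.
def receptive_field(num_layers, kernel_size=3):
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

print(receptive_field(1))  # 3 -> layer m sees 3 units of layer m-1
print(receptive_field(2))  # 5 -> two layers deep, 5 units of the input
```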
Shared weights: Each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. Weights of the same color are shared, i.e. constrained to be identical.
In the above figure, we show 3 hidden units belonging to the same feature map.
- Replicating units in this way allows for features to be detected regardless of their position in the visual field.
- Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters being learnt.
- This weight-sharing constraint on the model enables CNNs to achieve better generalization on vision problems; the sketch below gives a feel for the scale of the reduction.
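A back-of-the-envelope comparison, with hypothetical layer sizes (a 32x32x3 input mapped to a 32x32x16 output):

```python
# Fully connected: every output unit has its own weight to every input.
fc_params = (32 * 32 * 3) * (32 * 32 * 16)  # 50,331,648 weights
# Convolutional with shared weights: one 3x3x3 filter (+ bias) per
# feature map, reused at every spatial position.
conv_params = (3 * 3 * 3 + 1) * 16          # 448 parameters
print(fc_params, conv_params)
```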
CNN Architecture
There are three main types of layers that build ConvNet architectures:
- Convolutional Layer,
- Pooling Layer, and
- Fully-Connected Layer (exactly as seen in regular Neural Networks).
We will stack these layers to form a full ConvNet architecture. Let us start with convolution.
What is Convolution? [CONV]
The Convolution layer is the core building block of a CNN and does most of the computational heavy lifting. The term convolution refers to the mathematical combination of two functions to produce a third function (reminds me of higher-order functions). Basically, it merges two sets of information.
In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to then produce a feature map. In other words, a feature map is obtained by repeated application of a function across sub-regions of the entire image, i.e. by convolution of the input image with a linear filter, adding a bias term and then applying a non-linear function.
Mathematically, a convolution is executed by sliding the filter over the input (sub-region) by x strides, i.e. x pixels, at a time. At every location, an element-wise multiplication between the filter and the covered sub-region is performed, and the results are summed into the feature map.
In the animation, one can see the convolution operation. The filter (the green square) is sliding by one pixel over the input (the blue square) and the sum of the convolution goes into the feature map (the red square.) The area of the filter is also called the receptive field. The size of this filter is 3x3.
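Here is a minimal sketch of that sliding-window operation in plain NumPy (toy values; in a full CNN layer a bias term and a non-linearity would follow, as described above):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image; at each location, take the
    # element-wise product with the covered sub-region and sum it.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(region * kernel)
    return feature_map

image = np.arange(25).reshape(5, 5)  # a toy 5x5 "image"
kernel = np.ones((3, 3))             # a 3x3 filter, as in the animation
print(conv2d(image, kernel))         # a 3x3 feature map
```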
Although I have shown the operation in 2D, in reality convolutions are performed in 3D. This is because each image is represented as a 3D matrix with a dimension for width, height, and depth. Depth is a dimension because of the color channels used in an image (RGB).
Spatial arrangement. Next, let us discuss the number of neurons in the output volume and their arrangement. Four hyper-parameters control the size of the output volume: the kernel size, depth, stride, and zero-padding. Here is a video that I liked and that best explains the 3D aspects and these hyper-parameters involved in convolution.
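For reference, the standard formula tying these together: for an input of width W, kernel size F, stride S, and zero-padding P, the output width is (W - F + 2P)/S + 1 (and likewise for the height). A quick sketch:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard formula: (W - F + 2P) / S + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3))             # 30: a 3x3 kernel shrinks 32 -> 30
print(conv_output_size(32, 3, padding=1))  # 32: padding preserves the size
print(conv_output_size(32, 3, stride=2))   # 15: stride 2 roughly halves it
```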
Here is a view with 2 layers of filters that will help you understand.
Pooling Layer [POOL]
It is common to periodically insert a Pooling layer in-between successive Convolution layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, lowering the number of parameters and the amount of computation in the network, and hence also helping to control overfitting.
The most frequent type of pooling is max pooling, which takes the maximum value in each window. These window sizes need to be specified beforehand.
This decreases the feature map size while at the same time keeping the significant information. The depth dimension remains unchanged. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
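A minimal sketch of 2x2 max pooling with stride 2 on a single depth slice (toy values):

```python
import numpy as np

def max_pool2d(feature_map, window=2, stride=2):
    # Keep only the maximum value in each window, shrinking the
    # spatial size while preserving the strongest activations.
    out_h = (feature_map.shape[0] - window) // stride + 1
    out_w = (feature_map.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i*stride:i*stride+window,
                                    j*stride:j*stride+window].max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]])
print(max_pool2d(fm))  # [[6. 4.]
                       #  [7. 9.]]
```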
Classification or Fully-Connected Layer [FC]
The last layers of a convolutional NN are fully connected layers. Neurons in a fully connected layer have full connections to all the activations in the previous layer. This part is in principle the same as a regular Neural Network; their activations can hence be computed with a matrix multiplication followed by a bias offset. However, these fully connected layers can only accept 1-dimensional data, so we have to first flatten the 3D volume into a 1D vector.
In summary,
CNNs are deep learning networks useful for image classification and recognition. The network itself stacks multiple CONV + POOL stages followed by FC layers, as explained above.
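To tie the pieces together, here is a minimal sketch of such a stack in Keras (hypothetical layer sizes, loosely in the spirit of LeNet rather than the exact original architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # greyscale digit image
    layers.Conv2D(6, kernel_size=3, activation="relu"),   # CONV
    layers.MaxPooling2D(pool_size=2),                     # POOL
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # CONV
    layers.MaxPooling2D(pool_size=2),                     # POOL
    layers.Flatten(),                                     # 3D volume -> 1D vector
    layers.Dense(120, activation="relu"),                 # FC
    layers.Dense(10, activation="softmax"),               # class probabilities
])
model.summary()
```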
Namaste !!!