Basic information of CNN
Convolutional Neural Network (CNN)
- A class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery. - Wiki
- CNNs can also be applied to speech recognition and other fields.
Architecture
- Compared to a traditional DNN, a CNN additionally has a feature extraction section consisting of convolution layers and pooling layers.
- The feature extraction section consists of convolution layers, activation functions, and pooling layers. (The pooling layer is optional.)
- The classification section is organized with affine layers and activation functions, like a traditional DNN. This part is also called a fully connected network.
- In the feature extraction section, object information is extracted from the input image, and this information is classified in the classification section. At the end of the classification section, the machine finally guesses what the input is.
Convolution Layer
- As in the earlier MNIST training with a traditional DNN, geometric information is lost when the image is flattened into a 1 x 784 array. A CNN, however, can be trained while preserving this geometric information. In addition, channel information is handled as the 3rd dimension.
- An input/output of a convolution layer is called a feature map.
- As data moves through the CNN, features become clearer. At the beginning of the CNN, the detected features are small and local, but the features at the end of the feature extraction section are human-distinguishable.
Convolution
- The process of adding each element of the image to its local neighbors, weighted by the kernel (= filter). - Wiki
$$ \begin{bmatrix} 3 & 3 & 2 & 1 & 0 \\ 0 & 0 & 1 & 3 & 1 \\ 3 & 1 & 2 & 2 & 3 \\ 2 & 0 & 0 & 2 & 2 \\ 2 & 0 & 0 & 0 & 1 \end{bmatrix} \circledast \begin{bmatrix} 0 & 1 & 2 \\ 2 & 2 & 0 \\ 0 & 1 & 2\end{bmatrix} $$ $$ = \begin{bmatrix} 12 & 12 & 17 \\ 10 & 17 & 19 \\ 9 & 6 & 14 \end{bmatrix}$$
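The example above can be reproduced with a short NumPy sketch. Note that, as in most deep learning frameworks, the operation implemented here is technically cross-correlation (the kernel is not flipped); the function name `convolve2d` below is just an illustrative choice, not a library API.

```python
import numpy as np

def convolve2d(x, w):
    """Valid-mode convolution as used in CNN layers (no kernel flip)."""
    ih, iw = x.shape
    fh, fw = w.shape
    oh, ow = ih - fh + 1, iw - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the local neighborhood elementwise by the kernel and sum
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * w)
    return out

x = np.array([[3, 3, 2, 1, 0],
              [0, 0, 1, 3, 1],
              [3, 1, 2, 2, 3],
              [2, 0, 0, 2, 2],
              [2, 0, 0, 0, 1]])
w = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]])
print(convolve2d(x, w))  # matches the 3x3 result above
```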
Weights and Biases
- A CNN also has weights and biases, like a traditional DNN. Only the multiplication of X and W is replaced by the convolution of X and W.
- In a CNN, the kernel (also called the filter) plays the role of the weights.
$$ X \circledast W + b $$ $$ = \begin{bmatrix} 1 & 2 & 3 & 0 \\ 0 & 1 & 2 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 3 & 0 & 1 \end{bmatrix} \circledast \begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & 2 \\ 1 & 0 & 2 \end{bmatrix} + 3 $$ $$ = \begin{bmatrix} 15 & 16 \\ 6 & 15 \end{bmatrix} + 3 $$ $$ = \begin{bmatrix} 18 & 19 \\ 9 & 18 \end{bmatrix} $$
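The X ⊛ W + b computation above can be verified the same way. This is a minimal sketch (assuming NumPy); the bias is a single scalar added to every element of the convolved output, mirroring the worked example.

```python
import numpy as np

def convolve2d_bias(x, w, b):
    """Valid-mode convolution of X and W, plus a scalar bias b."""
    ih, iw = x.shape
    fh, fw = w.shape
    oh, ow = ih - fh + 1, iw - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * w)
    return out + b  # bias is broadcast over the whole feature map

X = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
W = np.array([[2, 0, 1],
              [0, 1, 2],
              [1, 0, 2]])
print(convolve2d_bias(X, W, 3))  # [[18 19] [ 9 18]]
```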
Padding
- Padding wraps the input feature map with a specific value to keep it from shrinking.
- In the previous example, the convolution of a 4 x 4 input feature map and a 3 x 3 filter returns a 2 x 2 output feature map. If the input feature map passed through multiple convolution layers without padding, its information would eventually shrink down to a single scalar value.
- There are many kinds of padding, but zero padding is usually used.
- With padding of (F - 1)/2 and a stride of 1, the output feature map has the same size as the input feature map.
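A quick sketch of the size-preserving effect of zero padding, assuming NumPy: padding a 4 x 4 input by P = 1 before a 3 x 3 filter gives a 4 x 4 output again.

```python
import numpy as np

def convolve2d(x, w):
    """Valid-mode convolution (no kernel flip)."""
    ih, iw = x.shape
    fh, fw = w.shape
    oh, ow = ih - fh + 1, iw - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * w)
    return out

x = np.arange(16.0).reshape(4, 4)  # 4x4 input feature map
w = np.ones((3, 3))                # 3x3 filter -> P = (3 - 1) // 2 = 1
xp = np.pad(x, 1)                  # zero padding (np.pad defaults to zeros)
out = convolve2d(xp, w)
print(out.shape)                   # (4, 4): same size as the input
```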
Stride
- The stride is the interval at which the filter moves.
- If the stride is 1 x 1, the filter moves by 1 along both the x and y axes. The stride in Image 3 is 1 x 1.
- If the stride is 2 x 2, the filter moves by 2 along both axes, so the output feature map shrinks to roughly half the input size in each dimension.
- The size of the output feature map changes according to the stride.
$$ OW = \frac{IW + 2P - FW}{SW} + 1 $$ $$ OH = \frac{IH + 2P - FH}{SH} + 1 $$
* IW: Input Width, IH: Input Height
* OW: Output Width, OH: Output Height
* FW: Filter Width, FH: Filter Height
* SW: Stride Width, SH: Stride Height
* P: Padding
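The output-size formula can be sketched as a small helper function (the name `out_size` is just an illustrative choice). It checks the two configurations discussed so far: the 4 x 4 example without padding, and a padded case with stride 2.

```python
def out_size(i, f, p, s):
    """Output size: O = (I + 2P - F) / S + 1."""
    assert (i + 2 * p - f) % s == 0, "filter must fit the input evenly"
    return (i + 2 * p - f) // s + 1

# 4x4 input, 3x3 filter, no padding, stride 1 -> 2x2 output
print(out_size(4, 3, 0, 1))  # 2
# 5x5 input, 3x3 filter, padding 1, stride 2 -> 3x3 output
print(out_size(5, 3, 1, 2))  # 3
```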
3 Dimensional Convolution
- A CNN considers not only width and height but also channels, which usually carry color information.
- Each channel requires its own filter. In the example below, the R, G, and B feature maps have their own filters RF, GF, and BF.
- It is convenient to think of all channels' feature maps together as a single block for further extension.
- In a CNN, multiple filters can be applied to the input feature map. The number of filters becomes the number of channels of the output feature map.
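The channel bookkeeping above can be sketched in NumPy (the function name `conv3d` and the shape convention channels-first are illustrative assumptions, not a library API). Each filter spans all input channels and is summed into one output channel, so N filters produce N output channels.

```python
import numpy as np

def conv3d(x, filters):
    """x: (C, H, W) input block; filters: (N, C, FH, FW).
    Returns an (N, OH, OW) block: one output channel per filter."""
    n, c, fh, fw = filters.shape
    cx, h, w = x.shape
    assert c == cx, "each filter needs one slice per input channel"
    oh, ow = h - fh + 1, w - fw + 1
    out = np.zeros((n, oh, ow))
    for k in range(n):
        for i in range(oh):
            for j in range(ow):
                # Multiply across ALL channels, then sum to a single scalar
                out[k, i, j] = np.sum(x[:, i:i + fh, j:j + fw] * filters[k])
    return out

x = np.random.rand(3, 8, 8)            # RGB input: 3 channels
filters = np.random.rand(5, 3, 3, 3)   # 5 filters, each with 3 channel slices
print(conv3d(x, filters).shape)        # (5, 6, 6): 5 output channels
```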
Pooling Layer
- Pooling is an operation that shrinks the input feature map, so it is also called subsampling.
- Max pooling selects the maximum value in the target area, and average pooling takes the average. Each target area is reduced to a single scalar, so the output feature map is smaller than the input feature map.
- Usually, max pooling is used.
- 3 properties of the pooling layer
  - No weights to train
  - The number of channels is not changed
  - Robust to small variations in the input feature map
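Max pooling can be sketched in a few lines of NumPy. This version assumes the pool window and stride are equal (the common 2 x 2 case) and that the input divides evenly; there are no weights anywhere in the operation.

```python
import numpy as np

def max_pool(x, size=2):
    """2x2 max pooling with stride equal to the pool size."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # drop any ragged edge
    # Group into (row blocks, rows in block, col blocks, cols in block),
    # then take the max over each block
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 2, 5, 0],
              [3, 4, 1, 2],
              [0, 1, 2, 3],
              [7, 0, 1, 4]])
print(max_pool(x))  # [[4 5] [7 4]]
```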
Fully Connected Layer
- A fully connected (FC) layer is the traditional affine layer of a DNN; the final FC layer is typically followed by softmax regression.
- After features are extracted by the feature extraction section, the FC layers infer what the object is, using softmax classification at the end.
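The softmax step at the end of the classification section can be sketched as follows (the 3-class scores here are hypothetical logits, not from any real network). Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn raw FC-layer scores (logits) into class probabilities."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
probs = softmax(scores)
print(probs.argmax())  # index of the predicted class
```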