Handwritten Digit Recognition Course Report

Introduction to Convolutional Neural Networks#

Summary of Convolutional Neural Networks#

Convolutional Neural Networks (CNN) were proposed by Yann LeCun in 1998. A CNN is essentially a kind of multilayer perceptron: a feedforward neural network with a deep structure whose computation includes convolutions, and one of the representative algorithms of deep learning. Like other neural networks, a CNN is trained with the backpropagation algorithm; the difference lies in the network structure.

Basic Features of Convolutional Neural Networks#

Multiple layers with hierarchical network structure#

Convolutional Neural Networks (CNN) are considered the first successful deep learning method, using a robust hierarchical network structure. By exploiting the spatial correlation of data, a CNN reduces the number of trainable parameters in the network, thereby improving the efficiency of the backpropagation algorithm in the feedforward network. Small regions of the image (known as "local receptive fields") serve as input data at the bottom layer of the hierarchy. Information passes forward through each layer of the network, and each layer applies filters to extract salient features of the observed data. Because local receptive fields capture basic features such as edges and corners, this approach provides a degree of invariance to displacement, stretching, and rotation. The close connections and preserved spatial information between layers make CNNs particularly suitable for image processing and understanding, and they can automatically extract rich, relevant features from images.

No complex preprocessing of samples required#

The classification model of a Convolutional Neural Network (CNN) differs from traditional models in that a two-dimensional image can be fed directly into the model, which produces the classification result at its output. The advantage is obvious: no complex preprocessing is needed, and feature extraction and pattern classification happen entirely inside a black box. The required network parameters are obtained through continuous optimization, and the desired classification results are produced at the output layer. Because the parameters of the feature extraction layers are learned from the training data, manual feature engineering is avoided.

Local connectivity with strong generalization ability#

The generalization ability of Convolutional Neural Networks (CNN) is significantly better than that of other methods, and they have been widely applied to pattern classification, object detection, object recognition, and other fields. By combining local receptive fields, weight sharing, and downsampling in space or time, a CNN fully exploits the locality contained in the data itself, optimizes the network structure, and guarantees a degree of invariance to displacement and deformation. CNNs can therefore recognize two-dimensional or three-dimensional images with invariance to displacement, scaling, and other forms of distortion. As a supervised deep learning model, a CNN adapts well, excels at mining local features from data, and can aggregate them into global features for classification. Its weight-sharing structure makes it more similar to a biological neural network, and it has achieved good results in many areas of pattern recognition.

Weight sharing reduces network parameters#

The weight sharing of a Convolutional Neural Network (CNN) reduces the number of parameters that must be learned. In a convolutional layer, the weights connecting each neuron to its data window are fixed, and each neuron attends to a single feature. Convolving the image with multiple filters (convolution kernels) yields multiple feature maps, and neurons within the same feature map share weights, reducing the parameter count. This is a major advantage of convolutional networks over fully connected networks. Weight sharing also reduces the complexity of the network: multidimensional input signals (such as speech and images) can be fed into the network directly, with no data rearrangement needed between feature extraction and classification. The number of parameters in a hidden layer is independent of the number of neurons in that layer; it depends only on the size and number of filters. The number of neurons, in turn, is determined by the size of the input image, the size of the filter, and the filter's sliding stride.
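
To make this concrete, here is a small arithmetic sketch in Python comparing the two cases. The sizes (a 5x5 kernel, 32 filters, a 28x28 grayscale image) are illustrative values chosen for this example, not figures from the experiment:

# Parameter count for a convolutional layer vs. a fully connected layer
# on a 28x28 grayscale image (illustrative sizes).
kernel_size = 5          # 5x5 filter
in_channels = 1          # grayscale input
num_filters = 32         # number of distinct filters (feature maps)

# Convolutional layer: weights are shared across all spatial positions,
# so the count depends only on filter size and filter count.
conv_params = (kernel_size * kernel_size * in_channels + 1) * num_filters
print(conv_params)       # (5*5*1 + 1) * 32 = 832

# Fully connected layer mapping the same image to 32 hidden units:
# every pixel connects to every unit, so the count grows with image size.
fc_params = (28 * 28 + 1) * 32
print(fc_params)         # (784 + 1) * 32 = 25120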

CNN Network Structure and Basic Principles#

A typical CNN consists of three main parts (convolutional layer, pooling layer, fully connected layer), which together give rise to five types of layers:

  1. Input layer
  2. Convolutional layer
  3. Activation layer
  4. Pooling layer
  5. Fully connected layer

In simple terms:

The convolutional layer is responsible for extracting local features from the image; the pooling layer sharply reduces the number of parameters (dimensionality reduction); and the fully connected layer, much like a traditional neural network, outputs the desired result.

A typical CNN is not just the three-part structure mentioned above; these layers are combined to form a multi-layer structure. For example, LeNet-5 is arranged as convolutional layer - pooling layer - convolutional layer - pooling layer - convolutional layer - fully connected layer.
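
As an illustration of such a stack, here is a minimal sketch of a LeNet-5-style model in tensorflow.keras (the library used later in this report). The layer sizes follow the classic LeNet-5 shapes, but the use of ReLU is a modernized assumption rather than the original 1998 configuration:

import tensorflow as tf
from tensorflow.keras import layers, models

# LeNet-5-style stack: conv - pool - conv - pool - fc - fc - output.
model = models.Sequential([
    layers.Conv2D(6, kernel_size=5, activation='relu',
                  padding='same', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation='relu'),
    layers.Dense(84, activation='relu'),
    layers.Dense(10, activation='softmax'),  # one score per digit class
])
model.summary()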

Input layer - Preprocessing#

Like traditional neural networks / machine learning models, a CNN needs its input preprocessed. Common preprocessing methods include mean subtraction, normalization, and PCA/SVD dimensionality reduction.
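
A minimal numpy sketch of the first two methods (normalization and mean subtraction), assuming a batch of 8-bit grayscale images; the batch here is random data, used only to show the operations:

import numpy as np

# Illustrative batch of 64 grayscale 28x28 images with pixel values in [0, 255].
x = np.random.randint(0, 256, size=(64, 28, 28)).astype(np.float32)

x /= 255.0           # normalization: scale pixel values into [0, 1]
x -= x.mean(axis=0)  # mean subtraction: center each pixel across the batch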

Convolutional layer - Feature extraction#

The convolutional layer scans the entire image with a convolution kernel. This process can be understood as using a filter (the convolution kernel) to sweep over the small regions of the image, obtaining a feature value for each region.

Summary: through the filtering of its convolution kernels, the convolutional layer extracts local features from the image, much like feature extraction in human vision.
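
The following numpy sketch shows what this scanning looks like for a single kernel. The Sobel kernel here is just an example of an edge-detecting filter, not one taken from the trained model:

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and return the feature map
    # (strictly cross-correlation, which is what CNN "convolution" layers compute).
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])    # a classic vertical-edge filter
feature_map = conv2d(image, sobel_x)
print(feature_map.shape)            # (26, 26)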

Activation layer#

Activation applies a nonlinear mapping to the output of the convolutional layer.

If no activation function is used (which is equivalent to the activation function being f(x) = x), the output of each layer is a linear function of the previous layer's input. It is then easy to see that no matter how many layers the network has, the output is still a linear combination of the input. This is no better than having no hidden layers at all: the network degenerates into the most primitive perceptron.

Common activation functions include:

  • Sigmoid function (slow)
  • Tanh function (good for text and audio processing)
  • ReLU (fast, though not without flaws: negative inputs yield zero gradient)
  • Leaky ReLU
  • ELU
  • Maxout

In this experiment, the ReLU function was used.
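A minimal numpy sketch of ReLU, together with a quick check of the linear-collapse argument above (the layer shapes are arbitrary examples):

import numpy as np

def relu(x):
    # ReLU: keep positive values, zero out negatives.
    return np.maximum(0.0, x)

# Without a nonlinearity, two stacked linear layers collapse into one linear map:
W1, W2 = np.random.randn(4, 8), np.random.randn(3, 4)
x = np.random.randn(8)
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# Inserting ReLU between the layers breaks this collapse, which is what
# lets additional layers add expressive power.
y = W2 @ relu(W1 @ x)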

Pooling layer (downsampling) - Data dimensionality reduction to avoid overfitting#

The pooling layer simply performs downsampling (subsampling). It is mainly used for feature dimensionality reduction: it compresses the amount of data and the number of parameters, reduces overfitting, and improves the fault tolerance of the model. The main types are:

  • Max Pooling: maximum pooling
  • Average Pooling: average pooling
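
A minimal numpy sketch of 2x2 max pooling with stride 2 (replace max with mean for average pooling); the input sides are assumed even:

import numpy as np

def max_pool2x2(feature_map):
    # Keep the largest value in each non-overlapping 2x2 window,
    # halving each spatial dimension.
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))   # use .mean(axis=(1, 3)) for average pooling

fm = np.arange(16).reshape(4, 4)
print(max_pool2x2(fm))
# [[ 5  7]
#  [13 15]]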


Fully connected layer - Output result#

After several rounds of convolution + activation + pooling, the model connects the learned feature maps to the fully connected layer. Just before the fully connected layer, if the number of neurons is too large and the learning capacity too strong, overfitting may occur. Dropout can therefore be introduced to randomly drop neurons during training and counter this problem. Operations such as local response normalization (LRN) and data augmentation can also be applied to increase robustness.

Once the fully connected layer is reached, it can be understood as a simple multi-class neural network (such as a BP neural network) that produces the final output through the softmax function, completing the whole model.
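
A sketch of such a classification head in tensorflow.keras; the input shape and layer widths are illustrative assumptions, not the experiment's actual values:

from tensorflow.keras import layers, models

# Classification "head" after the conv/pool stages, with dropout to randomly
# drop neurons during training and reduce overfitting.
head = models.Sequential([
    layers.Flatten(input_shape=(7, 7, 64)),  # example shape of pooled features
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),                     # drop half the units each training step
    layers.Dense(10, activation='softmax'),  # class probabilities for digits 0-9
])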

Introduction to the Handwritten Digit Recognition Process#

  1. Obtain the dataset.
  2. Train the network on the dataset (see the sketch after this list).
  3. Run the GUI program.
  4. Use the mouse to write digits in the GUI program.
  5. The CNN outputs the recognition result through its fully connected layer.
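
For steps 1 and 2, here is a minimal tensorflow.keras sketch of obtaining MNIST and training on it. The deliberately tiny model, the optimizer, and the epoch count are illustrative assumptions; the actual course code (and the LeNet-5-style stack sketched earlier) may differ:

import tensorflow as tf

# Step 1: obtain the dataset. MNIST downloads on first use and is cached.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0  # add channel axis, scale to [0, 1]
x_test = x_test[..., None].astype('float32') / 255.0

# Step 2: train a small CNN on it (simplified for illustration).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer labels 0-9
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))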

Training Environment, Training Results, and GUI Interface Display (including random sampling of the test set and mouse input of handwritten digits)#

Training Environment#

System Environment#

WSL (Windows Subsystem for Linux) was chosen.

Python tends to run into fewer problems on Linux than on Windows, while Windows is more convenient for daily use, so I chose WSL (Windows Subsystem for Linux). Besides, I had already installed a Conda environment in WSL, so there was no need to reconfigure it.

Python Environment#

Miniconda was chosen.

Anaconda is too bloated: the installation exceeds 2 GB and bundles a lot of software that is rarely used, such as the Spyder Python IDE. Most of the time we use more modern editors, such as VS Code. Miniconda was therefore already the Python environment on my computer.

Python Dependency Installation#

Since the Conda environment was already installed, there is no need to manage dependencies with pip; Conda can manage them directly. I created a virtual environment configuration file, environment.yml, and installed all the required dependencies with conda env create --file environment.yml. Specifically, I referred to someone else's blog [1] at the time.

name: aiclass_handwriting
channels:
  - defaults
dependencies:
  - python=3.6  # Chose Python 3.6
  - tensorflow
  - numpy
  - matplotlib
  - pylint
  - autopep8
  - notebook
  - PyQt
  - PyQtGraph

Editor#

VS Code was chosen.

The editor also differs from the tutorial's, mainly because of my existing development environment. Specifically, I referred to the tutorial [2] at the time.

Computer Hardware#

Lenovo R9000P 2021 Edition

Training Results#

The following figure shows the loss curve of this experiment.

[Figure: loss curve of this experiment]

The final test accuracy was 98.7%.

GUI Interface Display#

Initial Interface#

[Screenshot: initial interface of the GUI]

Random Sampling#

[Screenshot: random sampling from the test set]

Handwritten Recognition#

[Screenshot: recognizing a mouse-drawn digit]

Issues and Insights during the Process#

Issues#

Possible overlap between the test set and the training set#

It seems that this experiment did not strictly separate the test set from the training set, resulting in some leakage: the softmax confidence for digits handwritten with the mouse was significantly lower than for samples randomly drawn from the MNIST dataset (though this may also be because my training did not reach the desired goal, or because my understanding of the code is insufficient).
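
For reference, the MNIST loader in tensorflow.keras already returns disjoint train and test splits; evaluating only on the held-out split avoids this kind of leakage:

import tensorflow as tf

# MNIST ships as separate train and test sets; if the test images are never
# used for fitting, reported accuracy is free of train/test overlap.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(len(x_train), len(x_test))  # 60000 10000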

Font display issue in WSL at the beginning#

The default configuration targets the Windows environment, and on Linux the default font lacks Chinese glyphs. The issue was resolved by replacing the font.

Slow training#

To mitigate this, a smaller subset of the dataset was used for training.

Insights#

The size of the dataset has a much smaller impact on accuracy than the depth (number of layers) of the network.

Coding is a very interesting process.

References#

[1] Don’t use Anaconda: How to setup a decent machine learning environment? https://blog.spencerwoo.com/2020/02/dont-use-anaconda

[2] Using Visual Studio Code in WSL https://dowww.spencerwoo.com/3-vscode/3-0-intro.html
