Background
This is the second article in the “Learn PyTorch by Examples” series. In the previous article, “Learn PyTorch by Examples (1): PyTorch Basics and MNIST Handwritten Digit Recognition (1)”, we introduced the basic concepts and usage of PyTorch and implemented MNIST handwritten digit recognition with a simple three-layer fully connected neural network, the “Hello World” of deep learning. In this article, we discuss how to choose the parameters of this simple three-layer network and compare the impact of different choices on model performance.
The code for this article can be found in the T02_mnist_parameters folder of my GitHub repository https://github.com/jin-li/pytorch-tutorial. The code is based on the official PyTorch example code https://github.com/pytorch/examples.
Parameter Selection
Based on the three-layer fully connected neural network in the previous article, we can adjust the parameters to observe the change in model performance.
The parameters in machine learning models can be divided into two categories: hyperparameters and model structure. Hyperparameters are parameters set before training the model, such as learning rate, number of iterations, batch size, etc. Model structure refers to the network structure of the model, such as the number of layers in the network, the number of neurons in each layer, activation functions, loss functions, regularization methods, etc.
Model Structure
In the previous article, we used a simple three-layer fully connected neural network. We can try to increase or decrease the number of layers in the network, the number of neurons in each layer, activation functions, etc. to observe the change in model performance. We can also try different loss functions and regularization methods.
Number of Layers and Number of Neurons in Each Layer
In machine learning, the number of layers and the number of neurons in each layer of a neural network are very important hyperparameters. Increasing the number of layers and the number of neurons in each layer can increase the expressive power of the model, but it will also increase the complexity of the model, which may lead to overfitting. Conversely, reducing the number of layers and the number of neurons in each layer can reduce the complexity of the model, but it may also lead to a decrease in the accuracy of the model. Therefore, we need to strike a balance between the two.
Here we try several different network structures, such as 1, 3, or 5 hidden layers, and different numbers of neurons in each layer, such as 64, 128, or 256.
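To make these choices concrete, here is a minimal sketch of how the network could be parameterized by a list of hidden-layer sizes. The helper name `make_mlp` and its arguments are illustrative, not the repository's exact code:

```python
import torch.nn as nn

def make_mlp(hidden_sizes, num_classes=10):
    """Build a fully connected net for flattened 28x28 MNIST images."""
    layers = [nn.Flatten()]
    in_features = 28 * 28
    for h in hidden_sizes:
        # each entry in hidden_sizes adds one Linear + activation pair
        layers += [nn.Linear(in_features, h), nn.ReLU()]
        in_features = h
    layers.append(nn.Linear(in_features, num_classes))
    return nn.Sequential(*layers)

model_1 = make_mlp([128])           # 1 hidden layer of 128 neurons
model_3 = make_mlp([256, 128, 64])  # 3 hidden layers of decreasing width
```

With this structure, the depth and the width of the network can be swept independently in the experiments below.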
Activation Functions
The activation function is a very important concept in neural networks; it is the key to their ability to fit all kinds of models. One way to understand it: whatever the model, its essence is to make judgments, deciding what the output should be for different inputs, where a judgment may be a single decision or a combination of many decisions. The activation function introduces a non-linear factor into the neuron, giving it this judgment ability.
Common activation functions include ReLU, Sigmoid, Tanh, etc. These functions differ considerably in shape, but they all introduce non-linearity. Here we also try some different activation functions.
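To get a quick feel for how these three functions differ, they can be evaluated directly on a small tensor:

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=7)  # [-3, -2, -1, 0, 1, 2, 3]
print(torch.relu(x))     # negatives clamped to 0, positives unchanged
print(torch.sigmoid(x))  # squashed into (0, 1)
print(torch.tanh(x))     # squashed into (-1, 1)
```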
Loss Functions
The loss function is used to measure the difference between the predicted value of the model and the true value, that is, the evaluation index of the model. The loss function is very important because it determines the optimization direction of the model.
Common loss functions include the cross-entropy loss function, the mean square error loss function, etc. Here we also try some different loss functions. Note that the mean square error loss function is generally used for regression problems, while the cross-entropy loss function is generally used for classification problems. But we can still try the mean square error loss function. Since its target argument (`target`) needs to be a one-hot encoded vector, and the labels in the original MNIST dataset are integers, we need to encode the labels as one-hot vectors. The code for this article checks which loss function is in use; if it is the mean square error loss function, the labels are one-hot encoded:

```python
if loss_function == F.mse_loss:
    target = F.one_hot(target, num_classes=10).float()
```
The so-called one-hot encoding converts an integer into a vector whose length equals the number of categories, with exactly one element set to 1 and the rest 0. For example, the MNIST dataset has 10 categories, so we can convert label 0 to `[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]`, label 1 to `[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]`, and so on.
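As a quick check, `F.one_hot` produces exactly these vectors:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 1, 9])
print(F.one_hot(labels, num_classes=10))
# tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
#         [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
```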
Regularization Methods
Regularization is mainly used to prevent overfitting. Why can regularization prevent overfitting?
First, let’s analyze the cause of overfitting. Overfitting means that the model performs well on the training set but poorly on the test set, because the model has learned the noise in the training set, which hurts its ability to generalize. Regularization adds a penalty on the model parameters to the loss function so that the parameters stay small, which reduces the complexity of the model and keeps it from fitting the noise in the training data.
Common regularization methods include L1 regularization, L2 regularization, Dropout, etc. Here we also try some different regularization methods.
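Although these methods are not tested in the experiments below, both Dropout and L2 regularization are one-liners in PyTorch. A minimal sketch (the layer sizes and optimizer settings are illustrative):

```python
import torch.nn as nn
import torch.optim as optim

# Dropout randomly zeroes a fraction of activations during training,
# discouraging the network from relying on any single neuron:
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # drop half of the activations each forward pass
    nn.Linear(128, 10),
)

# L2 regularization is exposed as the optimizer's weight_decay argument:
optimizer = optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-4)
```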
Hyperparameters
Hyperparameters are parameters set before training the model, such as learning rate, number of iterations, batch size, etc. These parameters have a great impact on the performance of the model, so we need to carefully select these parameters.
Learning Rate
The learning rate is the step size at which the model updates the parameters, which determines the speed at which the model parameters are updated. A learning rate that is too small may slow down the convergence of the model, while a learning rate that is too large may prevent the model from converging.
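For reference, the official PyTorch MNIST example that this code is based on pairs the Adadelta optimizer with a step-decay schedule; lr=1.0 is the baseline value tested below. A sketch (the stand-in model is mine, for illustration only):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

net = nn.Linear(28 * 28, 10)  # stand-in model for illustration
optimizer = optim.Adadelta(net.parameters(), lr=1.0)  # lr is the step size
scheduler = StepLR(optimizer, step_size=1, gamma=0.7)  # shrink lr each epoch
```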
Epochs
The number of epochs is the number of full passes the model makes over the training set. Too few epochs may lead to underfitting, while too many may lead to overfitting.
Batch Size
When training the model, we usually divide the training set into several batches, each batch containing several samples. This can reduce memory usage and speed up model training. A batch size that is too small may slow down the convergence of the model, while a batch size that is too large may lead to overfitting.
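The batch size is set when constructing the data loader, for example (the data path is illustrative):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("../data", train=True, download=True,
                           transform=transforms.ToTensor())
# Each iteration over train_loader yields a batch of 64 images and labels.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```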
Modify the Code to Make Parameters Customizable
In the previous article, the model structure was fixed and the hyperparameters could be specified through command-line arguments. Now we modify the code to extract a main function and pass all of these settings to it as arguments. This way, we can write another script that calls the main function with different arguments to realize different model structures and hyperparameter choices.
Here we won’t post the complete code, just the prototype of the modified main function.
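A sketch of what such a signature might look like; the parameter names here are my own choices, not necessarily the repository's exact ones:

```python
import torch.nn.functional as F

def main(batch_size=64, epochs=14, lr=1.0,              # hyperparameters
         use_cuda=True, seed=1, log_interval=10,        # auxiliary settings
         hidden_num=1, hidden_size=128,                 # model structure
         activation=F.relu, loss_function=F.nll_loss):  # model structure
    ...
```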
The first three parameters are hyperparameters, the last four are model structure parameters, and the few in between are auxiliary parameters.
Finally, because we want to compare model performance under different parameters, we have the main function return the loss and accuracy recorded during training, so that the calling script can plot a comparison of model performance across parameter settings. For this purpose, we define a class to store the training performance.
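One possible shape for such a class (the class, field, and method names are mine, for illustration):

```python
class TrainPerformance:
    """Collects per-epoch metrics so main() can return them for plotting."""

    def __init__(self):
        self.train_losses, self.train_accuracies = [], []
        self.test_losses, self.test_accuracies = [], []

    def log_train(self, loss, accuracy):
        self.train_losses.append(loss)
        self.train_accuracies.append(accuracy)

    def log_test(self, loss, accuracy):
        self.test_losses.append(loss)
        self.test_accuracies.append(accuracy)
```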
In the training process, we store the loss and accuracy of the training and test sets in this class, and finally return them from the main function. The specific code can be found in the code corresponding to this article in the GitHub repository.
Performance Comparison
We focus on three hyperparameters and four model parameters here, a total of seven parameters, and each parameter has multiple choices. Assuming we only choose three values for each parameter, then there are a total of $3^7=2187$ combinations, which is not a small number!
Due to the variety of parameter combinations, finding an optimal parameter combination is like finding a recipe through trial and error, so many people metaphorically call machine learning “alchemy”. Indeed, this is very similar to the work of ancient alchemists.
In practical applications, we can use some heuristic methods to select parameters, such as grid search, random search, Bayesian optimization, etc. These methods can help us find an optimal parameter combination more quickly.
Here we use the method of controlling variables: taking the model structure and hyperparameters from the previous article as the baseline, we adjust only one parameter at a time and observe the change in model performance. This helps us better understand the impact of each parameter. The specific code can be found in the parametric_study.py file in the T02_mnist_parameters folder of my GitHub repository https://github.com/jin-li/pytorch-tutorial.
After each set of parameters is trained, we plot the training and test loss and accuracy.
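A minimal sketch of this plotting step with matplotlib, assuming a dict mapping each parameter value to a TrainPerformance object as sketched above (names illustrative):

```python
import matplotlib.pyplot as plt

def plot_results(results, param_name):
    """results: {parameter value -> TrainPerformance} for one swept parameter."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for value, perf in results.items():
        ax1.plot(perf.test_losses, label=f"{param_name}={value}")
        ax2.plot(perf.test_accuracies, label=f"{param_name}={value}")
    ax1.set_xlabel("epoch")
    ax1.set_ylabel("test loss")
    ax2.set_xlabel("epoch")
    ax2.set_ylabel("test accuracy (%)")
    ax1.legend()
    ax2.legend()
    fig.savefig(f"{param_name}.png")
```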
Here you need the same Python virtual environment as in the previous article, activated with `conda activate pytorch-mnist`. Then you can run all the tests with the following command.
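Assuming the script takes no required arguments, the invocation would be along these lines:

```bash
conda activate pytorch-mnist
python parametric_study.py
```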
Model Structure Parameters
Number of Hidden Layers
Here I tried 1, 3, and 5 hidden layers. The number of neurons in the middle layers varies between 64, 128, and 256. The following figure shows the test results:
It can be seen that the model with 1 hidden layer already performs very well, and increasing the number of hidden layers does not significantly improve performance. Of course, this may be related to the other parameters of the model; perhaps a deeper network would need other parameters adjusted before the extra hidden layers pay off. But judging from this experiment, 1 hidden layer is enough for this problem.
Activation Functions
Here I tried the ReLU, Sigmoid, and Tanh activation functions. The following figure shows the test results:
It can be seen that the three activation functions perform similarly, though ReLU does relatively better and Sigmoid and Tanh slightly worse. This is in line with our expectations: ReLU is the most commonly used activation function today because it is cheap to compute, converges quickly, and is less prone to the vanishing gradient problem.
Loss Functions
In addition to the negative log-likelihood loss function `nll_loss()` used in the model, I also tried the cross-entropy loss function `cross_entropy()` and the mean square error loss function `mse_loss()`. The following figure shows the test results:

It can be seen that the cross-entropy and negative log-likelihood loss functions perform similarly and both do well, while the mean square error loss function performs worst. This is in line with our expectations, because the mean square error loss function is generally used for regression problems, while the cross-entropy loss function is generally used for classification problems.
Regularization Methods
Regularization methods are a special case: testing them would have required too many modifications to the code for this article, so I did not evaluate the effect of regularization here and leave it for future work.
Hyperparameters
Batch Size
Here I tried three batch sizes: 16, 64, and 256. The following figure shows the test results:
It can be seen that the model performs best with batch sizes of 64 and 256, and worst with a batch size of 16. I also recorded the training time: it is longest with a batch size of 16 and shorter with batch sizes of 64 and 256:

```
batch size 16:  164.6s
batch size 64:  133.9s
batch size 256: 125.0s
```
In the case of sufficient memory and computing resources, we can choose a larger batch size, which can speed up the training of the model.
Number of Epochs
The number of epochs is the number of full passes the model makes over the training set. I tried up to 25 epochs. The following figure shows the test results:
It can be seen that as the number of epochs increases, the loss value gradually decreases, and the accuracy gradually increases. However, when the number of epochs exceeds a certain value, the model performance no longer improves, and may even overfit. Therefore, we need to choose the appropriate number of epochs based on the performance of the model. For this problem, 14 epochs are enough.
Learning Rate
The learning rate is a very important hyperparameter. I tried three learning rates: 0.1, 1.0, and 10.0. The following figure shows the test results:
It can be seen that the model performs best with a learning rate of 1.0, and worst with a learning rate of 10.0. This is also in line with our expectations, because a learning rate that is too small may slow down the convergence of the model, while a learning rate that is too large may prevent the model from converging.
Computing Resources
The above experiments were run on my personal computer with an NVIDIA GeForce RTX 4060 Ti graphics card. Each run uses almost the same amount of computing resources and barely taxes the card: GPU utilization is about 5%, memory usage is about 180 MB, and the running time is about 2 minutes 15 seconds. On the CPU, the running time is about 3 minutes.
Summary
In this article, we discussed parameter selection in neural networks, covering both model structure and hyperparameters. We modified the code so that the model structure and hyperparameters can be passed in as arguments, and then adjusted one parameter at a time, holding the others fixed, to observe the change in model performance.
Due to limited computing resources, we only tested a part of the parameter combinations, but this is enough to illustrate the impact of parameter selection on model performance. In practical applications, we can use some heuristic methods to select parameters, such as grid search, random search, Bayesian optimization, etc. These methods can help us find an optimal parameter combination more quickly. We will continue to discuss these methods in subsequent articles.
Interested readers can use the code in the T02_mnist_parameters folder of the GitHub repository https://github.com/jin-li/pytorch-tutorial to try different parameter combinations and observe the changes in model performance.