Motivation
I used TensorFlow in a machine learning course many years ago, but my understanding of deep learning was shallow at the time, so I only used it briefly. I have not taken any machine learning courses since, but my research work still draws on machine learning, so I have kept learning it on and off.
Deep learning has since become the mainstream approach in machine learning, and PyTorch is a very popular deep learning framework. I have used PyTorch before but never studied it systematically, so I recently decided to do so.
Although there are already many PyTorch tutorials online, I still want to start a new series, “Learn PyTorch by Examples”, mainly to deepen my own understanding. The goal of this series is to start from the basics of PyTorch and use the official PyTorch examples to learn how to implement some classic machine learning models.
The official PyTorch examples include MNIST handwritten digit recognition, CIFAR10 image classification, IMDB sentiment analysis, and more; the GitHub repository is at https://github.com/pytorch/examples. These examples are the Hello World of deep learning and are well suited to beginners. However, the official repository provides only code, without explanation, which may not be friendly enough for beginners. So I plan to use this series to explain these examples and help beginners learn PyTorch better.
I have created a new GitHub repository whose main body is the official PyTorch examples, and I will add auxiliary code and documentation to each example to make it easier for beginners to learn from. The repository is at https://github.com/jinli/pytorchtutorial. Stars and forks are welcome.
PyTorch Basics
Introduction to PyTorch
PyTorch is an open-source deep learning framework developed by Facebook’s artificial intelligence research team. PyTorch provides two main features:
 A multidimensional tensor library, similar to NumPy, that can also run on GPUs.
 An automatic differentiation engine for building and training neural networks.
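As a small illustration of these two features, the sketch below builds a tensor, applies NumPy-style operations to it, and then lets the autograd engine compute a gradient. The variable names are illustrative only.

```python
import torch

# Feature 1: NumPy-like tensors
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = a @ a.T + 1.0          # matrix product, then the scalar is broadcast

# Feature 2: automatic differentiation
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()         # y = x1^2 + x2^2 + x3^2
y.backward()               # fills x.grad with dy/dx = 2x

print(b)                   # tensor([[ 6., 12.], [12., 26.]])
print(x.grad)              # tensor([2., 4., 6.])
```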
Advantages of PyTorch:
 PyTorch builds its computation graph dynamically at run time, which allows neural networks to be defined more flexibly.
 PyTorch’s API is Pythonic and easy to learn and use.
 PyTorch’s community is active, with many tutorials and examples.
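To make the dynamic graph point concrete, here is a minimal sketch in which the number of operations depends on the data itself. The function repeated_halving is a made-up example, not part of PyTorch; autograd simply records whatever operations actually ran.

```python
import torch


def repeated_halving(x: torch.Tensor) -> torch.Tensor:
    """Halve x until its norm drops below 1; the loop count depends on the data."""
    while x.norm() >= 1.0:
        x = x * 0.5
    return x


x = torch.tensor([3.0, 4.0], requires_grad=True)   # norm is 5, so the loop runs 3 times
y = repeated_halving(x).sum()
y.backward()

print(x.grad)   # tensor([0.1250, 0.1250]), i.e. 0.5 ** 3 for each element
```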
PyTorch Runtime Environment
PyTorch supports multiple operating systems (Linux, Windows, and macOS), multiple hardware devices (CPUs, GPUs, and TPUs), and multiple programming languages (Python, C++, and Java).
My runtime environment is a desktop computer running Ubuntu 22.04, with an Intel Core i5-9600K processor and an NVIDIA GeForce RTX 4060 Ti graphics card. It has 64GB of system memory and 8GB of GPU memory. In this series of tutorials, I will try to use both the CPU and the GPU, both to compare performance and so that readers can run the code on different hardware.
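A common way to write code that runs on either kind of hardware is to pick the device once and move tensors to it. The sketch below assumes nothing beyond a working PyTorch installation and falls back to the CPU when CUDA is not available.

```python
import torch

# Prefer the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 1, 28, 28)   # a batch shaped like MNIST images
x = x.to(device)                 # tensors (and models) are moved with .to(device)

print(torch.__version__, device, x.device)
```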
PyTorch Installation
Installing PyTorch is simple: the pip command is enough. However, installing directly with pip may mess up your Python environment, so we will use conda to create a virtual environment each time and run the PyTorch examples inside it.
For Python environment management, I wrote an article, “Python Environment Management Methods Summary”, which interested readers can refer to.
MNIST Handwritten Digit Recognition
MNIST is a classic handwritten digit recognition dataset containing 60,000 training images and 10,000 test images. Each image is a 28x28 pixel grayscale image, and each label is a digit from 0 to 9. MNIST has become the Hello World of deep learning, especially in computer vision (CV), and almost every deep learning framework ships an MNIST example. We start our PyTorch learning journey with MNIST.
Process Overview
The general process of implementing a deep learning model with PyTorch is as follows:
 Prepare the dataset: download the dataset, convert the dataset to a PyTorch dataset.
 Define the model: define the neural network model, including the network structure and parameters.
 Train the model: train the model using the training dataset, adjust the model parameters.
 Test the model: test the model using the test dataset, evaluate the model performance.
Here we will implement MNIST handwritten digit recognition according to this process.
Prepare the Dataset
For the MNIST handwritten digit recognition problem, preparing the dataset is very simple because PyTorch’s torchvision package has the MNIST dataset built in. We only need to use the torchvision.datasets.MNIST class, specifying train=True for the training dataset and train=False for the test dataset.


The MNIST dataset is a series of images and labels: each image is a 28x28 grayscale image, and each label is a digit from 0 to 9.
Here we use datasets.MNIST() to get the dataset, and then wrap it in torch.utils.data.DataLoader() so that it can be consumed in batches. A DataLoader is an iterable that batches the dataset for us; here we specify batch_size=64 so that 64 samples are taken at a time, and shuffle=True so that the order of the samples is reshuffled on every pass.
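A minimal sketch of this step is given below. The function build_mnist_loaders, the test batch size of 1000, and the '../data' directory are assumptions for illustration. Because the real dataset has to be downloaded, the batching behaviour is demonstrated on random stand-in tensors with the same shapes as MNIST.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_mnist_loaders(data_dir="../data", batch_size=64, test_batch_size=1000):
    """Return (train_loader, test_loader) for MNIST; downloads the data on first use."""
    # torchvision is imported here so the stand-in demo below runs without it.
    from torchvision import datasets, transforms

    transform = transforms.ToTensor()  # PIL image -> float tensor in [0, 1]
    train_set = datasets.MNIST(data_dir, train=True, download=True, transform=transform)
    test_set = datasets.MNIST(data_dir, train=False, download=True, transform=transform)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=test_batch_size, shuffle=False)
    return train_loader, test_loader


# Stand-in data with MNIST's shapes, so the batching can be inspected offline.
fake_images = torch.rand(256, 1, 28, 28)
fake_labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64, shuffle=True)

data, target = next(iter(loader))
print(data.shape, target.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])
```

With the real dataset, train_loader, test_loader = build_mnist_loaders() would yield batches of the same shape.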
Define the Model
Here we choose to use a neural network to implement handwritten digit recognition. Once the type of model is determined, we need to consider the specific structure of the model, including the number of layers in the network, the number of neurons in each layer, the activation function, etc. The choice of these hyperparameters depends on the specific problem and relies heavily on experience.

We can first determine the input and output. The input of this neural network is a 28x28 grayscale image, and the output is one score for each of the 10 digits from 0 to 9; the digit with the highest score is the prediction.

We need to choose a neural network type, such as a fully connected neural network, a convolutional neural network, a recurrent neural network, etc. Here we choose to use a fully connected neural network.

We need to determine the structure of the network, including the number of layers in the network, the number of neurons in each layer, the activation function, etc. Here we choose a simple neural network, including an input layer, a hidden layer, and an output layer.
The code to create this neural network is as follows:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)  # Input layer to hidden layer
        self.fc2 = nn.Linear(128, 10)       # Hidden layer to output layer

    def forward(self, x):
        x = x.view(-1, 28 * 28)         # Flatten each image into a 784-element vector
        x = F.relu(self.fc1(x))         # Apply ReLU activation
        x = self.fc2(x)                 # Output layer
        return F.log_softmax(x, dim=1)  # Apply log-softmax for classification
```
Here we define a class named SimpleNN that inherits from nn.Module. In PyTorch, all neural network models need to inherit from the nn.Module class and implement the __init__ and forward methods.
 In the __init__ method, we define two fully connected layers, fc1 and fc2, which map the input layer to the hidden layer and the hidden layer to the output layer, respectively. nn.Linear represents a fully connected layer: the first parameter is the number of input features, and the second is the number of output features.
 In the forward method, we define the forward propagation of the network, that is, how the output is calculated from the input data:
 x.view(-1, 28 * 28) flattens each 28x28 image into a 784-dimensional vector; the -1 lets PyTorch infer the batch dimension.
 F.relu(self.fc1(x)) passes the data through the first fully connected layer and then applies the ReLU activation function.
 self.fc2(x) passes the ReLU-activated data through the second fully connected layer to get the output.
 F.log_softmax(x, dim=1) converts the output into log-probabilities, that is, the logarithm of the softmax probability of each class.
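As a quick sanity check of this architecture, the sketch below redefines the same two-layer network in compact form, runs a random stand-in batch through it, and counts the parameters. The random input is only a placeholder for real MNIST images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)


model = SimpleNN()
batch = torch.rand(64, 1, 28, 28)          # random stand-in for 64 MNIST images
log_probs = model(batch)

print(log_probs.shape)                      # torch.Size([64, 10])
print(log_probs.exp().sum(dim=1)[:3])       # each row of probabilities sums to about 1

# Weights and biases of both layers: 784*128 + 128 + 128*10 + 10
num_params = sum(p.numel() for p in model.parameters())
print(num_params)                           # 101770
```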
Train the Model
With the dataset and model, we can start training the model. The training process is to use the gradient descent algorithm to continuously adjust the model parameters to minimize the error between the model’s prediction and the true result.
The general process of training the model is as follows:
 Initialize the model parameters.
 Take a batch of data from the dataset.
 Input the data into the model to get the model’s prediction.
 Calculate the error between the model’s prediction and the true result.
 Update the model parameters using the gradient descent algorithm.
 Repeat steps 2 to 5 until the model converges.
The most critical steps are steps 4 and 5, that is, calculating the error and updating the parameters. PyTorch provides the torch.optim module to implement gradient descent and its variants, and the torch.nn.functional module to implement the loss functions. The gradients needed for the update are computed by the backpropagation algorithm: PyTorch provides the loss.backward() method to calculate the gradients, and the optimizer.step() method to update the parameters.
The loss function and backpropagation algorithm are the core of deep learning because the loss function determines the optimization goal of the model, and the backpropagation algorithm determines how to adjust the model parameters.
Loss Function
The loss function is used to measure the difference between the model’s prediction and the true result, that is, the model’s error. PyTorch provides many common loss functions:
 Classification problem: the cross-entropy loss function torch.nn.CrossEntropyLoss(), the negative log-likelihood loss function torch.nn.NLLLoss(), etc.
 Regression problem: the mean squared error loss function torch.nn.MSELoss().
 Binary classification problem: the binary cross-entropy loss function torch.nn.BCELoss().
 Multilabel classification problem: torch.nn.BCEWithLogitsLoss(), which applies a sigmoid followed by binary cross-entropy to each label independently.
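The two classification losses are closely related: CrossEntropyLoss applied to raw scores gives the same value as NLLLoss applied to log-softmax outputs, which is why a model ending in log_softmax is paired with the negative log-likelihood loss. A small check on made-up scores:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores for 4 samples, 10 classes
target = torch.tensor([3, 0, 9, 1])    # true class indices

# CrossEntropyLoss works on raw scores ...
ce = nn.CrossEntropyLoss()(logits, target)
# ... while NLLLoss expects log-probabilities, e.g. from log_softmax.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

print(ce.item(), nll.item())           # the two values agree
```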
Backpropagation
The backpropagation algorithm is used to calculate the gradient of the model parameters, that is, the derivative of the model’s error with respect to the parameters. PyTorch provides the loss.backward() method to calculate the gradient, and then uses the optimizer.step() method to update the parameters.
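To see these two calls at work, here is a tiny worked example with a two-parameter linear model and plain SGD. The numbers are made up so that the gradient and the update can be checked by hand.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)   # parameters
x = torch.tensor([3.0, 4.0])                        # one input sample
target = torch.tensor(0.0)

optimizer = torch.optim.SGD([w], lr=0.1)

loss = ((w * x).sum() - target) ** 2   # squared error; w.x = 3 - 8 = -5, loss = 25
optimizer.zero_grad()
loss.backward()                        # w.grad = 2 * (w.x - target) * x
print(w.grad)                          # tensor([-30., -40.])

optimizer.step()                       # w <- w - lr * w.grad, giving [4., 2.]
print(w)
```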
Training Code
We encapsulate this process into a train function:
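A minimal sketch of such a function is given below, built from the calls discussed in this section. The exact signature, the log_interval parameter, and the returned average loss are assumptions; the usage at the end runs on random stand-in data with a small stand-in model rather than on MNIST.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def train(model, device, train_loader, optimizer, epoch, log_interval=100):
    """Run one epoch of training and return the average loss per batch."""
    model.train()
    total_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()                 # clear gradients from the previous step
        output = model(data)                  # forward pass: log-probabilities
        loss = F.nll_loss(output, target)     # negative log-likelihood loss
        loss.backward()                       # backpropagation
        optimizer.step()                      # parameter update
        total_loss += loss.item()
        if batch_idx % log_interval == 0:
            print(f"epoch {epoch} batch {batch_idx} loss {loss.item():.4f}")
    return total_loss / len(train_loader)


# Usage on random stand-in data (not MNIST), with a small stand-in model.
torch.manual_seed(0)
device = torch.device("cpu")
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1)).to(device)
loader = DataLoader(TensorDataset(torch.rand(128, 1, 28, 28),
                                  torch.randint(0, 10, (128,))), batch_size=64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
avg_loss = train(model, device, loader, optimizer, epoch=1)
```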


 model.train() sets the model to training mode, so that layers such as Dropout and BatchNorm behave as they should during training.
 optimizer.zero_grad() resets the gradients of the model parameters to zero, because PyTorch accumulates gradients by default.
 output = model(data) inputs the data into the model to get the model’s prediction.
 loss = F.nll_loss(output, target) calculates the error between the model’s prediction and the true result, using the negative log-likelihood loss function.
 loss.backward() calculates the gradient of the model parameters using the backpropagation algorithm.
 optimizer.step() updates the model parameters using the gradient descent algorithm.
Note that because the DataLoader yields the dataset one batch at a time, data is a tensor holding a whole batch. For the MNIST dataset, the shape of data is (batch_size, 1, 28, 28), and the shape of target is (batch_size,).
One full pass over the training data is called an epoch, and we can iterate over the training dataset for multiple epochs. However, more epochs do not necessarily improve the model’s performance, as they may lead to overfitting. Therefore, we need to monitor the model’s performance during training and stop training in time.
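One common way to stop in time is early stopping: track a validation loss and stop once it has not improved for a few epochs. The helper below is a hypothetical sketch of that idea, not something the code in this series is known to include, and the loss values are invented for the demonstration.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Invented per-epoch validation losses: improvement stalls after epoch 3.
losses = [0.50, 0.30, 0.25, 0.26, 0.27, 0.28, 0.24]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 6
```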
Why do we train in batches? Loading all the data at once may exceed the available memory, and updating the parameters after every batch, rather than once per full pass, usually speeds up training.
How to choose the batch size? The batch size is a hyperparameter that needs to be chosen based on the specific problem and hardware device. Generally speaking, a larger batch size processes the data faster but also consumes more memory. The choice of batch size also affects the convergence speed and generalization ability of the model.
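The batch size also fixes how many parameter updates happen per epoch. As a small worked example, the sketch below uses 60,000 dummy samples (the size of the MNIST training set) and a batch size of 64.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

num_samples, batch_size = 60000, 64

# A DataLoader over 60,000 dummy samples, mirroring the MNIST training set size.
dummy = TensorDataset(torch.zeros(num_samples, 1))
loader = DataLoader(dummy, batch_size=batch_size)

batches_per_epoch = len(loader)                                  # ceil(60000 / 64)
last_batch = num_samples - (batches_per_epoch - 1) * batch_size  # size of the final batch

print(batches_per_epoch, math.ceil(num_samples / batch_size))    # 938 938
print(last_batch)                                                # 32
```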
Test the Model
After training the model, we need to test the model’s performance. The general process of testing the model is as follows:
 Take a batch of data from the test dataset.
 Input the data into the model to get the model’s prediction.
 Calculate the error between the model’s prediction and the true result.
 Repeat steps 1 to 3 until the test dataset is traversed.
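The steps above can be sketched as a function like the one below. It is written here as evaluate and stands in for the test function of this section; the name, the signature, and the returned (loss, accuracy) pair are assumptions. The usage runs on random stand-in data with an untrained stand-in model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, device, test_loader):
    """Return (average loss per sample, accuracy) over the whole test loader."""
    model.eval()
    test_loss, correct = 0.0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # Sum the per-sample losses so the average is over samples, not batches.
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    n = len(test_loader.dataset)
    return test_loss / n, correct / n


# Usage on random stand-in data with an untrained stand-in model.
torch.manual_seed(0)
device = torch.device("cpu")
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
loader = DataLoader(TensorDataset(torch.rand(200, 1, 28, 28),
                                  torch.randint(0, 10, (200,))), batch_size=100)
avg_loss, accuracy = evaluate(model, device, loader)
print(f"loss {avg_loss:.4f}, accuracy {accuracy:.2%}")
```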


 model.eval() sets the model to evaluation mode, so that Dropout is disabled and BatchNorm uses its running statistics instead of batch statistics.
 with torch.no_grad(): disables gradient tracking, because during testing we only need the model’s predictions and do not update the model parameters. The error reported here is the average error over the entire test dataset.
 pred = output.argmax(dim=1, keepdim=True) takes the class with the highest probability from the prediction result.
 correct += pred.eq(target.view_as(pred)).sum().item() counts the number of samples predicted correctly.
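The shapes involved in these two lines are easy to get wrong, so here is a small worked example with made-up probabilities for 3 samples and 4 classes.

```python
import torch

# Log-probabilities for 3 samples over 4 classes (made-up numbers).
output = torch.log(torch.tensor([[0.1, 0.7, 0.1, 0.1],
                                 [0.6, 0.2, 0.1, 0.1],
                                 [0.2, 0.2, 0.5, 0.1]]))
target = torch.tensor([1, 2, 2])             # true labels, shape (3,)

pred = output.argmax(dim=1, keepdim=True)    # shape (3, 1): [[1], [0], [2]]
matches = pred.eq(target.view_as(pred))      # target is reshaped to (3, 1) before comparing
correct = matches.sum().item()

print(pred.squeeze(1).tolist(), correct)     # [1, 0, 2] 2
```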
Main Program
With the above preparation, we can start training and testing the model. Here we define a main function to call the train and test functions.
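A compact sketch of how such a main function might tie everything together is shown below. The SGD optimizer, the default values, and the compact train and evaluate helpers are assumptions made to keep the example self-contained; evaluate stands in for the test function. The usage runs on random stand-in data, so the reported accuracy is meaningless.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.view(-1, 28 * 28)))
        return F.log_softmax(self.fc2(x), dim=1)


def train(model, device, loader, optimizer):
    model.train()
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        F.nll_loss(model(data), target).backward()
        optimizer.step()


def evaluate(model, device, loader):
    model.eval()
    correct = 0
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            correct += (model(data).argmax(dim=1) == target).sum().item()
    return correct / len(loader.dataset)


def main(train_loader, test_loader, epochs=2, lr=0.1, use_cuda=True):
    """Train for `epochs` epochs, evaluating after each one; return the final accuracy."""
    use_cuda = use_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = SimpleNN().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # optimizer choice is an assumption
    accuracy = 0.0
    for epoch in range(1, epochs + 1):
        train(model, device, train_loader, optimizer)
        accuracy = evaluate(model, device, test_loader)
        print(f"epoch {epoch}: test accuracy {accuracy:.2%}")
    return accuracy


# Random stand-in data; with real MNIST the loaders would wrap datasets.MNIST instead.
torch.manual_seed(0)
images, labels = torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(images[:64], labels[:64]), batch_size=64)
final_accuracy = main(train_loader, test_loader)
```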



 transforms.Compose() combines multiple data transformation operations.
 transforms.ToTensor() converts the data to a tensor.
 transforms.Normalize() standardizes the data, that is, subtracts the mean and divides by the standard deviation. The mean and standard deviation used here are those of the MNIST dataset, which can be calculated with the following code:
```python
import torch
from torchvision import datasets, transforms

# Load the MNIST training set as tensors in [0, 1], without normalization
dataset = datasets.MNIST('../data', train=True, download=True,
                         transform=transforms.ToTensor())

# Compute the mean and standard deviation
loader = torch.utils.data.DataLoader(dataset, batch_size=60000, shuffle=False)
data = next(iter(loader))[0]  # Get all the images in a single batch
mean = data.mean().item()
std = data.std().item()
print(f'Mean: {mean}, Std: {std}')
```
I put this code in the file get_mnist_statistics.py, which can be run directly.
 model = SimpleNN().to(device) moves the model to the specified device, which can be the CPU or a GPU.
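To make the effect of the normalization concrete, the sketch below applies the same arithmetic to three pixel values. The mean 0.1307 and standard deviation 0.3081 are the values commonly quoted for the MNIST training set; they are assumed here rather than taken from this article.

```python
import torch

# Widely quoted MNIST training-set statistics (assumed here, not taken from the article).
mean, std = 0.1307, 0.3081

pixels = torch.tensor([0.0, 0.5, 1.0])   # black, mid-grey, white after ToTensor()
normalized = (pixels - mean) / std       # the arithmetic Normalize applies to each pixel

print(normalized)                        # roughly tensor([-0.4242, 1.1986, 2.8215])
```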
Complete Code
The above is all the main code needed to implement MNIST handwritten digit recognition. I have integrated this code into a single file, mnist_nn.py. All of the above code is in the T01_mnist_nn folder of my GitHub repository https://github.com/jinli/pytorchtutorial.
In addition to the above code, I also defined some code to parse command line parameters, so that we can specify some parameters through the command line, such as learning rate, batch size, number of iterations, etc. You can view all parameters by entering the following command:
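Command-line parsing of this kind is commonly done with argparse. The sketch below is an illustration only: the flag names and the learning-rate default are hypothetical and may differ from those in mnist_nn.py.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical flags; the real script's options may differ."""
    parser = argparse.ArgumentParser(description="MNIST with a fully connected network")
    parser.add_argument("--batch-size", type=int, default=64, help="training batch size")
    parser.add_argument("--epochs", type=int, default=14, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--no-cuda", action="store_true", help="train on the CPU only")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--batch-size", "128", "--no-cuda"])
print(args.batch_size, args.epochs, args.lr, args.no_cuda)  # 128 14 0.01 True
```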


Python Environment
Before running the code, we need to create a Python virtual environment and install PyTorch and other dependencies. There are many Python environment management tools; see my previous article “Python Environment Management with venv/conda/mamba”. Here I use conda to create a virtual environment for this series of tutorials and install PyTorch and other dependencies.


Run the Code
We run this code to see the model’s performance on the test dataset:


The above code uses the GPU by default; on my machine (NVIDIA GeForce RTX 4060 Ti) it takes about 2 minutes and 15 seconds. If you don’t have a GPU, you can specify --no-cuda to use only the CPU; on my machine (Intel Core i5-9600K) it takes about 3 minutes and 1 second. That is slower than the GPU, but not by much, because our model is relatively simple and the MNIST dataset is relatively small.
After running 14 epochs, you will get the following output:


That is, our simple three-layer neural network model has an accuracy of 98% on the MNIST dataset, which is already very good.
Summary
In this article, we introduced the basic concepts and usage of PyTorch, and then implemented MNIST handwritten digit recognition using a simple three-layer fully connected neural network. This is the “Hello World” of deep learning. However, the hyperparameters of this network, such as the size of the hidden layer, were simply given: we did not discuss how they were chosen, nor how those choices affect the model’s performance. In the next article, we will try different choices to see how they affect the model’s performance and how to tune them to improve it.