Background
This is the 7th article in the “Learn PyTorch by Examples” series. In the 6th article, “Learn PyTorch by Examples (6): Language Model (I) – Implementing a Word-Level Language Model with LSTM”, we briefly introduced how to implement a word-level language model using LSTM.
LSTM and other models based on Recurrent Neural Networks (RNN) have been widely used in natural language processing, but they struggle with long-distance dependencies because of vanishing and exploding gradients. To address this, researchers proposed the Transformer model, which uses the attention mechanism to handle long-distance dependencies better. In this article, we will briefly introduce the Transformer model and use it to implement a simple word-level language model. This article follows the word_language_model example in the official PyTorch examples.
The code for this article can be found in the T06_word_lstm folder in my GitHub repository https://github.com/jinli/pytorchtutorial.
Transformer Model and Attention Mechanism
In 2017, Google researchers published a paper “Attention is All You Need”, which proposed the Transformer model. Due to its excellent performance in language modeling, it quickly replaced LSTM and GRU models and became the mainstream model in the field of natural language processing.
The structure of the Transformer model is as follows:
As we can see, the Transformer model consists of an encoder and a decoder. Both are built by stacking multiple identical layers, each containing a multi-head self-attention mechanism and a feed-forward neural network.
Self-Attention Mechanism
Self-attention means that, when computing the representation of one position, the model can attend to every position in the input sequence at once and weight each one by its relevance. Multi-head self-attention runs several such attention computations (heads) in parallel, each with its own learned weights, so that different heads can capture different kinds of relationships in the sequence.
For example, suppose we have an input sequence [I, love, you], and we hope the model can predict love based on the relationship between I and you. An LSTM processes the words in the input sequence one by one, whereas a Transformer can attend to I and you simultaneously and capture the relationship between them directly. Because it can attend to all positions at once, the Transformer handles long-distance dependencies better. However, since the Transformer has no recurrent structure, it has no built-in notion of word order as an LSTM does, so we need to add a positional encoding to the input sequence to represent the positions of the words.
The calculation process of selfattention is as follows:

First, we calculate the query, key, and value vectors. All three are derived from the word vectors of the input sequence through separate linear transformations, i.e., $Q = XW^Q$, $K = XW^K$, and $V = XW^V$, where $X$ is the matrix of word vectors of the input sequence, and $W^Q$, $W^K$, and $W^V$ are the weights of the linear transformations.

Then, we calculate the attention scores $A$, which are the dot products of the query vectors $Q$ and the key vectors $K$, divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors (the same as that of the query vectors). That is, $A = \frac{QK^T}{\sqrt{d_k}}$.

Next, we calculate the attention weights $W$, which are obtained by applying the Softmax function to each row of the attention scores $A$. That is, $W = \text{Softmax}(A)$.

Finally, we calculate the self-attention output $O$, which is the sum of the value vectors $V$ weighted by the attention weights $W$. That is, $O = W \cdot V$.
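The four steps above can be sketched in a few lines of PyTorch. This is only an illustration under assumed toy sizes (3 words, 8-dimensional word vectors, $d_k = 4$) with random weights standing in for learned ones; the variable names are hypothetical and not taken from the article's repository.

```python
import math
import torch

torch.manual_seed(0)

# Assumed toy sizes: 3 words, word vectors of size 8, projections of size 4.
seq_len, d_model, d_k = 3, 8, 4
X = torch.randn(seq_len, d_model)      # word vectors of the input sequence

# Random matrices standing in for the learned weights W^Q, W^K, W^V.
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # step 1: query, key, value
A = Q @ K.T / math.sqrt(d_k)           # step 2: scaled attention scores
W = torch.softmax(A, dim=-1)           # step 3: attention weights (row-wise)
O = W @ V                              # step 4: weighted sum of the values

print(W.shape, O.shape)
```

Each row of W sums to 1, so each row of O is a weighted average of the value vectors.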
In practical applications, we usually use multi-head self-attention. The word vectors of the input sequence are projected by several independent linear transformations into multiple sets of query, key, and value vectors. Attention scores, attention weights, and a self-attention output are then calculated separately for each set. Finally, the outputs of all heads are concatenated and passed through one more linear transformation to obtain the final output.
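PyTorch ships a ready-made multi-head attention layer, torch.nn.MultiheadAttention. The sketch below is a minimal usage example with assumed sizes (embedding size 16, 4 heads, sequence length 5, batch size 2); it is not code from the article's repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_size, num_heads = 16, 4          # assumed values; embed_size must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim=embed_size, num_heads=num_heads)

# Default layout is (seq_len, batch, embed_size).
x = torch.randn(5, 2, embed_size)

# Self-attention: the same tensor serves as query, key, and value.
out, weights = mha(x, x, x)

print(out.shape)      # same shape as the input
print(weights.shape)  # (batch, seq_len, seq_len), averaged over the heads
```

Passing the same tensor three times is what makes this self-attention; the per-head projections, concatenation, and final linear layer all happen inside the module.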
Feed-Forward Neural Network
The feed-forward neural network is another important component of the Transformer model. It consists of two fully connected layers and an activation function. The calculation process of the feed-forward neural network is as follows:

First, we use a fully connected layer to obtain the intermediate representation $M$ of the self-attention output $O$, i.e., $M = O \cdot W_1 + b_1$, where $W_1$ and $b_1$ are the weights and biases of the fully connected layer.

Then, we use an activation function (usually ReLU) to obtain the output $F$ of the activation, i.e., $F = \text{ReLU}(M)$.

Finally, we use another fully connected layer to obtain the final output $O'$ of the feed-forward neural network, i.e., $O' = F \cdot W_2 + b_2$, where $W_2$ and $b_2$ are the weights and biases of the fully connected layer.
The feed-forward neural network performs a nonlinear transformation on the self-attention output $O$ to better capture the information in the input sequence.
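As a rough sketch, the three steps map onto two nn.Linear layers with a ReLU in between. The class name FeedForward and the sizes used below are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Hypothetical position-wise feed-forward block: Linear -> ReLU -> Linear."""

    def __init__(self, embed_size, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(embed_size, hidden_size)   # holds W_1 and b_1
        self.fc2 = nn.Linear(hidden_size, embed_size)   # holds W_2 and b_2

    def forward(self, o):
        m = self.fc1(o)          # M = O W_1 + b_1
        f = torch.relu(m)        # F = ReLU(M)
        return self.fc2(f)       # O' = F W_2 + b_2


ffn = FeedForward(embed_size=16, hidden_size=64)   # assumed sizes
o = torch.randn(5, 2, 16)                          # (seq_len, batch, embed_size)
o_prime = ffn(o)
print(o_prime.shape)
```

The same two layers are applied independently at every position, so the output has the same shape as the input.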
Encoder and Decoder
A self-attention sublayer plus a feed-forward neural network forms one layer of the Transformer model, and the entire Transformer model is composed of multiple such layers stacked together, as shown in the figure below:
The Transformer model is generally divided into an encoder and a decoder. The encoder encodes the input sequence into a context vector, and the decoder generates the output sequence based on that context vector. Both the encoder and decoder are built by stacking multiple identical layers, each containing a multi-head self-attention mechanism and a feed-forward neural network.
The input of the encoder is a word sequence, and the output is a context vector. The input of the decoder is a context vector and a word sequence, and the output is a word sequence. In tasks such as machine translation, we can use the word sequence of the source language as the input of the encoder and the word sequence of the target language as the input of the decoder to translate from the source language to the target language.
The encoder and decoder differ in two ways. First, in addition to self-attention, each decoder layer also attends to the encoder’s output (often called cross-attention), which captures the relationship between the input sequence and the output sequence. Second, the decoder uses masked self-attention: when calculating the attention weights, each position can attend only to itself and earlier positions, never to later ones. This is because the goal is to predict the following words, so using information from those words would be cheating.
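To make the masking concrete, here is a small sketch that builds such a mask by hand and applies it to dummy attention scores. The helper name causal_mask is hypothetical; the mask is additive, with -inf above the diagonal so that Softmax assigns those positions zero weight.

```python
import torch


def causal_mask(size):
    """Additive mask: 0 on and below the diagonal, -inf above it."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


scores = torch.zeros(4, 4)                          # dummy attention scores
weights = torch.softmax(scores + causal_mask(4), dim=-1)
print(weights)
```

With all scores equal, row i spreads its weight evenly over positions 0 to i and gives nothing to later positions, which is exactly the "no peeking ahead" behavior described above.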
Classification of Transformer Models
Although the standard Transformer model consists of an encoder and a decoder, in practical applications, we can also use only the encoder or decoder, or use both the encoder and decoder at the same time. Moreover, the model that uses both the encoder and decoder is not necessarily better than the model that uses only the encoder or decoder, which depends on the specific task and dataset.
Encoder Model
The encoder model only contains the encoder, which is used to encode the input sequence into a context vector. The encoder model is commonly used in tasks such as text classification and sentiment analysis. Commonly used encoder models include BERT and RoBERTa.
Decoder Model
The decoder model contains only the decoder. Since there is no encoder output to attend to, it generates the output sequence one word at a time, conditioned on the words that come before. The decoder model is commonly used in machine translation, text generation, and other tasks. Commonly used decoder models include the GPT series.
Encoder-Decoder Model
The encoder-decoder model contains both the encoder and decoder: it encodes the input sequence into a context vector and generates the output sequence based on that context vector. The encoder-decoder model is also commonly used in machine translation, text generation, and other tasks. Commonly used encoder-decoder models include the original Transformer, T5, and BART.
Implementing Word-Level Language Model with Transformer in PyTorch
PyTorch provides the torch.nn.Transformer module, which can be used to easily implement the Transformer model. In this article, we will replace the LSTM model in the previous article with the Transformer model to implement a simple word-level language model.
Prepare Data
As in the previous article, we will use the WikiText-2 dataset. The method of downloading and processing the dataset can be found in the previous article, so I won’t repeat it here.
Define Model
PyTorch has a built-in torch.nn.Transformer module, which we can use to implement the Transformer model. However, before using the torch.nn.Transformer module, we need to define an embedding layer and a positional encoding layer.
Embedding Layer
Word embeddings were introduced in the previous article. Here we can directly use the built-in torch.nn.Embedding module to define an embedding layer.


where vocab_size is the size of the dictionary in the dataset, and embed_size is the dimension of the word embedding.
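As a minimal sketch of how such a layer behaves, the snippet below creates an embedding layer and looks up a batch of word indices. The concrete values of vocab_size and embed_size and the (seq_len, batch) input layout are assumptions for illustration, not the values used in the repository.

```python
import torch
import torch.nn as nn

vocab_size, embed_size = 1000, 32      # assumed illustrative values
embedding = nn.Embedding(vocab_size, embed_size)

# A batch of word indices laid out as (seq_len, batch).
tokens = torch.randint(0, vocab_size, (35, 20))
vectors = embedding(tokens)            # one embed_size vector per word index

print(vectors.shape)
```

The layer is simply a lookup table with vocab_size rows, so the output gains one trailing dimension of size embed_size.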
Positional Encoding Layer
Positional encoding adds position information to each word in the input sequence so that the model can make use of word order. We can use the following formula to calculate the positional encoding:
$p_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})$
$p_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})$
where $d_{model}$ is the dimension of the word embedding, $pos$ is the position of the word in the sequence, and $i$ indexes pairs of embedding dimensions (dimension $2i$ uses the sine and dimension $2i+1$ uses the cosine). In simple terms, each dimension of the positional encoding is a sine or cosine wave whose wavelength depends on the dimension index, so every position receives a distinct pattern. A useful property of this choice is that, for a fixed offset $k$, the encoding of position $pos + k$ is a linear function of the encoding of position $pos$, which makes it easier for the model to attend by relative position.
We can use the following code to implement the positional encoding. Here we refer to the PositionalEncoding class in the official PyTorch example code:


In the above code, we define a PositionalEncoding class, which inherits from the nn.Module class and is used to implement the positional encoding. In the __init__ method, we first define a positional encoding matrix pe, then fill it with the sine and cosine values, and finally register pe as a buffer of the model. In the forward method, we add the positional encoding matrix pe to the input sequence x, and then pass the result through the Dropout layer to get the final output.
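For reference, a class matching this description could look roughly like the sketch below, which is modeled on the PositionalEncoding class in the official word_language_model example. The default values of dropout and max_len and the (seq_len, batch, d_model) input layout are assumptions and may differ from the code in the repository; d_model is assumed to be even.

```python
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Sketch of a sinusoidal positional encoding layer (d_model assumed even)."""

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model), computed in log space.
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 0, 1::2] = torch.cos(position * div_term)   # odd dimensions
        # A buffer is saved with the model but is not a trainable parameter.
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (seq_len, batch, d_model); pe broadcasts over the batch dimension.
        x = x + self.pe[: x.size(0)]
        return self.dropout(x)


pos_enc = PositionalEncoding(d_model=8, dropout=0.0)
y = pos_enc(torch.zeros(5, 2, 8))
print(y.shape)
```

Feeding zeros with dropout disabled returns the encoding itself, which is a convenient way to inspect it: at position 0 the even dimensions are 0 (sine) and the odd dimensions are 1 (cosine).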
Transformer Model
With the embedding layer and positional encoding layer defined above, we can use the built-in torch.nn.Transformer module to define a Transformer model. In addition to inheriting from the nn.Transformer class (itself a subclass of nn.Module) and implementing the __init__ and forward methods, we also need to define a generate_square_subsequent_mask method to generate a mask matrix, which will be used when calculating the attention weights. In addition, we define an init_weights method to initialize the model’s weights.


In the above code, we define a TransformerModel class, which inherits from the nn.Transformer class and is used to implement the Transformer model. In the __init__ method, we first call the super function to initialize the nn.Transformer class, then define the embedding layer, positional encoding layer, and linear layer, and finally call the init_weights method to initialize the model’s weights. In the forward method, we first generate a mask matrix, then pass the input sequence through the embedding layer and positional encoding layer, and finally pass the result through the encoder and decoder to get the final output.
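To show how these pieces fit together, here is a compact, self-contained sketch. Note that it is not the article's TransformerModel: for brevity it subclasses nn.Module and uses nn.TransformerEncoder with a final linear layer instead of subclassing nn.Transformer, and the class name TinyTransformerLM, the helper causal_mask, and all hyperparameter values are hypothetical.

```python
import math
import torch
import torch.nn as nn


class TinyTransformerLM(nn.Module):
    """Hypothetical word-level language model built on nn.TransformerEncoder."""

    def __init__(self, vocab_size, embed_size, nhead, hidden_size, num_layers,
                 dropout=0.2, max_len=5000):
        super().__init__()
        self.embed_size = embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(dropout)

        # Fixed sinusoidal positional encoding (embed_size assumed even).
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float()
                             * (-math.log(10000.0) / embed_size))
        pe = torch.zeros(max_len, 1, embed_size)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

        layer = nn.TransformerEncoderLayer(embed_size, nhead, hidden_size, dropout)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(embed_size, vocab_size)   # maps back to the vocabulary
        self.init_weights()

    def init_weights(self):
        nn.init.uniform_(self.embedding.weight, -0.1, 0.1)
        nn.init.uniform_(self.out.weight, -0.1, 0.1)
        nn.init.zeros_(self.out.bias)

    @staticmethod
    def causal_mask(size):
        # -inf above the diagonal: a position cannot attend to later positions.
        return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

    def forward(self, tokens):
        # tokens: (seq_len, batch) of word indices.
        seq_len = tokens.size(0)
        mask = self.causal_mask(seq_len).to(tokens.device)
        x = self.embedding(tokens) * math.sqrt(self.embed_size)
        x = self.dropout(x + self.pe[:seq_len])
        x = self.encoder(x, mask=mask)
        return torch.log_softmax(self.out(x), dim=-1)


torch.manual_seed(0)
vocab_size = 50
model = TinyTransformerLM(vocab_size, embed_size=16, nhead=4,
                          hidden_size=32, num_layers=2)
model.eval()                                    # disable dropout for the demo
tokens = torch.randint(0, vocab_size, (10, 3))
log_probs = model(tokens)
print(log_probs.shape)                          # (seq_len, batch, vocab_size)
```

The output holds, for every position, the log-probabilities of the next word; because of the mask, the prediction at a position depends only on that position and the ones before it.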
Run the Model
With the model defined above, we can define a training function and a testing function, and then encapsulate a main function to load data, train, and test the model. Note that compared to the LSTM model in the previous article, the Transformer model requires an additional parameter nhead to specify the number of attention heads. We can run our model by calling the following command:


After each epoch, the model calculates the loss value on the validation set. If it is smaller than the previous minimum loss value, the model is saved in the model.pt file. As a result, each time we run the training code, the existing model.pt file is overwritten as soon as the first epoch finishes. If you want to keep a previous model, you need to manually rename the model.pt file, or specify a different model file name (including the path) through the save parameter when training again.
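The "save only when the validation loss improves" logic can be sketched as follows. Everything here is a stand-in: the tiny nn.Linear model, the hard-coded validation losses, and the temporary save path are made up purely to show the control flow, and saving the state_dict rather than the whole model is an assumption.

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                                   # stand-in for the real model
save_path = os.path.join(tempfile.mkdtemp(), "model.pt")  # hypothetical save location

fake_val_losses = [5.2, 4.8, 4.9, 4.6]    # made-up validation losses, one per epoch
best_val_loss = None
saved_epochs = []

for epoch, val_loss in enumerate(fake_val_losses, start=1):
    # In real training, one epoch of training and evaluation would happen here.
    if best_val_loss is None or val_loss < best_val_loss:
        torch.save(model.state_dict(), save_path)         # overwrite the checkpoint
        best_val_loss = val_loss
        saved_epochs.append(epoch)

print(saved_epochs, best_val_loss)
```

Because best_val_loss starts empty on every run, the first epoch always writes the file, which is why an earlier model.pt is lost unless it is renamed or a different save path is used.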
On my personal computer, if training with GPU (Nvidia GeForce RTX 4060 Ti), each epoch takes about 28 seconds, and the memory usage is about 548MB; if training with CPU (Intel i5 9600K), each epoch takes about 516 seconds. Here I trained a total of 50 epochs, and the loss values of the training set and validation set are shown in the figure below:
As we can see, the result is similar to the LSTM model in the previous article. The Transformer model converges slightly faster than the LSTM model, but the loss value on the validation set is slightly larger.
Generate Text
After training the model, the code will save the model in the model.pt file by default. We can load this model and use it to generate some text. The method of generating text is similar to the previous article, and the code is in the generate_text.py file. We can call the following command to generate some text:


Oddly, on my computer text can only be generated with the GPU. If the nocuda parameter is specified to generate text on the CPU, my computer crashes and restarts immediately, without even writing a log (at least I couldn’t find one). One possible reason is that the CUDA version installed on my computer is 12.5, while PyTorch was built against CUDA 12.1. However, I’m not sure about this: training works on the CPU alone, and generating text on the CPU should not involve CUDA at all, so in theory the crash should be unrelated to CUDA. In addition, my computer has 64GB of memory, much more than the GPU memory, so running out of memory should not be the problem either. If you have encountered a similar problem, please leave a comment so that we can discuss it. Thank you!
The speed of generating text with the Transformer model is slightly slower than with the LSTM model. On my computer, generating 1000 words with the GPU takes about 6 seconds. Since generating text with the CPU causes the computer to crash and restart, I did not test the speed of generating text with the CPU.
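The generation loop itself can be sketched as below. This is not the code in generate_text.py: the function names, the temperature parameter, and the stand-in uniform_model are all hypothetical, and the sketch assumes the model returns log-probabilities of shape (seq_len, batch, vocab_size).

```python
import torch


def sample_next_word(log_probs, temperature=1.0):
    """Sample one word index from a vector of log-probabilities."""
    # Lower temperature sharpens the distribution, higher flattens it.
    weights = log_probs.div(temperature).exp()
    return torch.multinomial(weights, 1).item()


def generate(model, start_idx, num_words, temperature=1.0):
    tokens = torch.tensor([[start_idx]])            # (seq_len=1, batch=1)
    with torch.no_grad():
        for _ in range(num_words):
            # Feed the whole sequence so far; keep the prediction after the last word.
            log_probs = model(tokens)[-1, 0]
            next_idx = sample_next_word(log_probs, temperature)
            tokens = torch.cat([tokens, torch.tensor([[next_idx]])], dim=0)
    return tokens.squeeze(1).tolist()


# Stand-in "model" that predicts a uniform distribution over 5 words.
vocab_size = 5


def uniform_model(tokens):
    shape = (tokens.size(0), tokens.size(1), vocab_size)
    return torch.log(torch.full(shape, 1.0 / vocab_size))


torch.manual_seed(0)
words = generate(uniform_model, start_idx=0, num_words=8)
print(words)
```

In a loop of this shape the whole sequence generated so far is fed through the model at every step, whereas an LSTM only needs the last word plus its hidden state; if the real script works this way, that would be one plausible reason generation is somewhat slower with the Transformer.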
Summary
In this article, we briefly introduced the Transformer model and the attention mechanism, and then followed the word_language_model example in the official PyTorch examples to implement a simple word-level language model using the torch.nn.Transformer module in PyTorch. We also introduced the classification of Transformer models, including the encoder model, decoder model, and encoder-decoder model. The model’s training results are similar to those of the LSTM model in the previous article. Finally, we used the trained Transformer model to generate some text.