Learn PyTorch by Examples (6): Language Model -- Implementing a Word-Level Language Model with LSTM (I)

Learn PyTorch by examples, implement a word-level language model with LSTM

Background

This is the sixth article in the "Learn PyTorch by Examples" series. In the fourth and fifth articles, we introduced the sequence prediction problem and used RNN, GRU, and LSTM to predict a sine function.

In the fourth article, we mentioned that besides time series like the sine function, sequence data can also take the form of word sequences, as in language models. In this article, we will show how to implement a word-level language model with LSTM.

The code for this article can be found in the T06_word_lstm folder in my GitHub repository https://github.com/jin-li/pytorch-tutorial.

Language Model

Language modeling is an important problem in natural language processing. A language model estimates the probability of a sentence; it can be used to predict the next word or to generate text, and it has a wide range of applications in machine translation, speech recognition, text generation, and more.

Language modeling is also a sequence prediction problem: to predict the next word, we need to consider not only the current word but also the words before it. For example, if the current word is "apple", then to predict the next word we need to know whether this "apple" refers to the fruit or the company, and that information comes from the preceding words. If the preceding words include "eat", "banana", "pear", etc., then "apple" most likely refers to the fruit; if they include "phone", "computer", "Jobs", etc., then "apple" most likely refers to the company.

In the previous articles, we briefly introduced that sequence prediction problems can be solved with recurrent neural networks (RNN) and their variants, such as long short-term memory networks (LSTM) and gated recurrent units (GRU). The official PyTorch examples include a language model implemented with RNN and Transformer. Here we build a simple word-level language model with LSTM based on that example code. In addition, some of the images and code in this article are referenced and adapted from Donato Capitella's YouTube video "LLM Chronicles #4.4: Building a Word-Level Language Model in PyTorch using RNNs".

Of course, in recent years, with the advent of Transformer, the performance of language models has been greatly improved. However, LSTM, as a classic recurrent neural network, still has a wide range of applications. So here we will first introduce how to implement a simple language model using LSTM. In the following articles, we will introduce Transformer and its variants, and how to use Transformer to implement a language model.

Language models are divided into character-level, word-level, subword-level, etc., depending on how the sequence is segmented. Character-level means that single characters are used as units, word-level means that words are used as units, and subword-level means that words are segmented into morphemes. For example, the word “joyfulness” can be segmented into “joy”, “ful”, and “ness”.

Segmentation at different levels

Word-Level Language Model

A word-level language model predicts the next word. Its input is a sequence of words, and its output is a prediction of the next word at each position. During training, the target sequence is the input sequence shifted by one word: we feed the earlier words of a sentence as input, use the following words as targets, and train the model by minimizing the difference between each predicted word and the true next word. At test time, we can use the model to predict (or sample) the next word.
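
For example, the input/target pairs are simply the same sentence shifted by one word (a toy illustration):

sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
inputs  = sentence[:-1]   # ['the', 'cat', 'sat', 'on', 'the']
targets = sentence[1:]    # ['cat', 'sat', 'on', 'the', 'mat']
# at each position, the model reads inputs[i] (plus the history) and is trained to predict targets[i]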

The process of implementing a word-level language model with LSTM is similar to that of implementing a sine function prediction with LSTM or RNN:

  1. Data Preparation: We need to convert the word sequence into an integer sequence so that it can be input into the model.
  2. Model Construction: We need to build an LSTM model to predict the next word.
  3. Model Training: We need to train the model using the dataset so that the model can predict the next word.
  4. Model Testing: We need to use the model to predict the next word.

Data Preparation

Here we use the WikiText-2 dataset, a common language-modeling dataset built from Wikipedia articles. The contents of the dataset can be browsed here: WikiText-2.

WikiText-2

The difference between a language model and the sine-function sequence model lies in the input: for the sine function each element of the sequence is a number, while for the language model each element is a word. PyTorch operates on numerical tensors, so we first need to convert words into numbers.

Data Preprocessing

The WikiText-2 dataset is distributed as plain text files, which we need to convert into word sequences. PyTorch's torchtext library can be used to process text data, but it has not been actively maintained since April 2024, so here we write the preprocessing code ourselves.

The basic idea of preprocessing is:

  1. Read all the text and use a dictionary to save all the unique words. The keys of the dictionary are words, and the values are the numbers corresponding to the words.
  2. Replace all the words in the original text with numbers.
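
As a toy illustration of this mapping (not the actual WikiText-2 vocabulary):

sentence = "the cat sat".split() + ['<eos>']
word2idx = {}
ids = []
for word in sentence:
    if word not in word2idx:
        word2idx[word] = len(word2idx)   # assign the next free index to each new word
    ids.append(word2idx[word])
print(word2idx)   # {'the': 0, 'cat': 1, 'sat': 2, '<eos>': 3}
print(ids)        # [0, 1, 2, 3]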

The code we use here is from the official PyTorch example code:

import os
from io import open
import torch

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

Here we define a Dictionary class and a Corpus class. The Dictionary class is used to save the correspondence between words and numbers, and the Corpus class is used to read text files and convert them into integer sequences. Here we convert the training, validation, and test sets separately.
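
A minimal usage sketch (assuming the WikiText-2 files train.txt, valid.txt, and test.txt are stored under data/wikitext-2/):

corpus = Corpus('data/wikitext-2')
print(len(corpus.dictionary))                  # vocabulary size
print(corpus.train.size(0))                    # number of tokens in the training sequence
print(corpus.valid.size(0), corpus.test.size(0))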

After preprocessing, we get three integer sequences for the training, validation, and test sets, with lengths of 2088628, 217646, and 245569, respectively. The size of the dictionary is 33278.

Data Batching

After preprocessing, we have one very long integer sequence. To make training efficient, we cut this sequence into several shorter sub-sequences that are processed in parallel as a batch. For example, if the sequence has length 10000 and we want a batch size of 20, each of the 20 sub-sequences has length 500, and we obtain a matrix of shape $500 \times 20$.

Note:

  • Each column of the resulting matrix holds one sub-sequence: the first column is the first sub-sequence, the second column is the second, and so on.
  • If the length of the data is not evenly divisible by the batch size, the leftover elements at the end are discarded.
  • The sub-sequences are treated as independent of each other, i.e., no dependency is modeled between the end of one sub-sequence and the beginning of the next. This means some context information is lost by batching.

The code for batching is very simple:

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data
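
For example, applying batchify to the training sequence with a batch size of 20 (a sketch, assuming the corpus object from the previous section):

batch_size = 20
train_data = batchify(corpus.train, batch_size)
print(train_data.shape)   # torch.Size([104431, 20]): 2088628 // 20 = 104431 rows, 20 columns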

Word Embedding

In the previous articles, we used one-hot encoding to encode labels. But for words, one-hot encoding is not suitable because there are too many words (e.g., there are 33278 words in the dictionary of WikiText-2), and one-hot encoding will result in high dimensions and high computational complexity. Therefore, we need to use another method to represent words, mapping words to a low-dimensional space. This step is called word embedding.

In simple terms, the idea of word embedding is to use several features to represent a word, such as the part of speech, sentiment, semantics, etc. For example, we select 7 features: “living being”, “feline”, “human”, “gender”, “royalty”, “verb”, “plural”, and use a number between -1 and 1 to describe the degree of these features for a word, combining the values representing all features to get a 7-dimensional vector, which is the word embedding of this word. For example, for the word “man”, we can use the vector $[0.6, -0.2, 0.8, 0.9, -0.1, -0.9, -0.7]$ to represent its word embedding.

Similarly, we can represent all the words in the dictionary as word embeddings. The following figure shows some examples:

Word Embedding Examples

Word embeddings are an important concept in language models because they help the model capture the semantic information of words and reveal relationships between words. For example, if we take the vector for "king", subtract the vector for "man", and add the vector for "woman", the result should be very close to the vector for "queen".

Of course, in practice, we will not use only 7 features, but generally use hundreds of features, so that we can fully represent the semantic information of words. For this problem, we choose to use 200 features, so we need to map the 33278 words in the dictionary to a 200-dimensional space. If we use a matrix to represent this, the word embedding matrix is a $33278 \times 200$ matrix.

This idea sounds reasonable, but where does the embedding matrix come from? It is learned by a neural network: the embedding matrix is simply a table of trainable parameters that maps each word index to a vector. It can be trained on an auxiliary task such as predicting neighboring words (as in word2vec), or trained jointly with the language model itself, so that the gradients from the next-word prediction loss also shape the embedding vectors.

We also don't have to start from scratch: researchers have published many pretrained word embedding matrices, both general-purpose and task-specific, and they can be loaded into a PyTorch model (for example with torch.nn.Embedding.from_pretrained). In this article, however, we simply create a trainable embedding layer with torch.nn.Embedding and learn it together with the LSTM:

torch.nn.Embedding(ntoken, emsize)

The first argument is the size of the dictionary, and the second is the dimension of the word embedding. This creates an nn.Embedding module, which we will use as the first layer of the LSTM model.
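
A small sketch of what the embedding layer does (the numbers are just this article's settings):

import torch
import torch.nn as nn

ntoken, emsize = 33278, 200              # vocabulary size and embedding dimension
embedding = nn.Embedding(ntoken, emsize)

word_ids = torch.tensor([[10, 42, 7]])   # a batch containing one sequence of 3 word indices
vectors = embedding(word_ids)            # each index is mapped to a 200-dimensional vector
print(vectors.shape)                     # torch.Size([1, 3, 200])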

LSTM Model

With the word embedding, we can build an LSTM model. Here we use torch.nn.LSTM in PyTorch as the basis to build a simple LSTM model. The structure of this model is as follows:

import torch.nn as nn
import torch.nn.functional as F

class LanguageLSTM(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        super(LanguageLSTM, self).__init__()
        self.ntoken = ntoken
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, input, hidden):
        emb = self.drop(self.encoder(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        return F.log_softmax(decoded, dim=1), hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                weight.new_zeros(self.nlayers, bsz, self.nhid))

This model is similar to the LSTM model in the previous articles, except that it starts with a word embedding layer. The input is a sequence of word indices, and the output is a log-probability distribution over the vocabulary (produced by log_softmax), indicating how likely each word in the dictionary is to be the next word. The training objective is to minimize the difference between the predicted distribution and the true next word.
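
For example, the model can be instantiated like this (the hyperparameter values are illustrative defaults, not necessarily the exact ones used in language_lstm.py):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

ntokens = len(corpus.dictionary)   # 33278 for WikiText-2
model = LanguageLSTM(rnn_type='LSTM', ntoken=ntokens, ninp=200,
                     nhid=200, nlayers=2, dropout=0.5).to(device)
hidden = model.init_hidden(bsz=20)   # one (hidden, cell) state pair per sequence in the batch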

Training and Testing the Model

With the data and the model, we can start training. The training code is similar to that of the previous LSTM model; the main difference is the loss function: because the model's forward pass already applies log_softmax, we use the negative log-likelihood loss NLLLoss (equivalent to applying CrossEntropyLoss to raw logits).

In addition to printing the loss on the command line during training, the train function also returns the weighted average of the batch losses over the epoch, which is convenient for plotting later.

import math
import time

import numpy as np
import torch

def train(device, model, epoch, train_data, batch_size, criterion, lr, log_interval, seq_len):
    model.train()
    total_loss = 0.
    loss_all = []
    data_cnt = []
    start_time = time.time()
    hidden = model.init_hidden(batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, seq_len)):
        data, targets = get_batch(train_data, i)
        data, targets = data.to(device), targets.to(device)
        model.zero_grad()
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)
        total_loss += loss.item()
        loss_all.append(loss.item())
        data_cnt.append(len(data))
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // seq_len, lr,
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
    return np.average(loss_all, weights=data_cnt)
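
The training loop above relies on two helper functions, get_batch and repackage_hidden, which are not shown here. A minimal sketch of them, following the official PyTorch example (seq_len is the truncated-backpropagation length, 35 by default in that example):

def get_batch(source, i, seq_len=35):
    # Take seq_len rows starting at position i as the input; the target is the
    # same slice shifted forward by one position, flattened for the loss function.
    length = min(seq_len, len(source) - 1 - i)
    data = source[i:i + length]
    target = source[i + 1:i + 1 + length].view(-1)
    return data, target

def repackage_hidden(h):
    # Detach hidden states from the computation graph of the previous batch so
    # that backpropagation through time stops at batch boundaries.
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)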

The testing code is similar to the previous LSTM model, and we don’t need to change it much.
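
For completeness, here is a hedged sketch of what the evaluation loop might look like (the actual code is in language_lstm.py in the repository; the names and arguments here are assumptions):

def evaluate(device, model, data_source, batch_size, criterion, seq_len):
    model.eval()
    total_loss = 0.
    hidden = model.init_hidden(batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, seq_len):
            data, targets = get_batch(data_source, i, seq_len)
            data, targets = data.to(device), targets.to(device)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)

Whenever the validation loss improves, the model can be saved with torch.save(model, 'model.pt'); that is the file we load later for text generation.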

Model Performance

All the code can be found in the T06_word_lstm folder in my GitHub repository https://github.com/jin-li/pytorch-tutorial. After setting up the environment, we can run language_lstm.py to train the model.

python language_lstm.py

On my personal computer, if training with GPU (Nvidia GeForce RTX 4060 Ti), each epoch takes about 26 seconds, and the memory usage is about 540MB; if training with CPU (Intel i5 9600K), each epoch takes about 506 seconds. Here I trained a total of 50 epochs, and the loss values of the training and validation sets are shown in the following figure:

Loss values of the training and validation sets

As the figure shows, after about 20 epochs the training loss stabilizes at around 4.1 and the validation loss at around 4.7, which indicates that the model generalizes reasonably well. Moreover, our training data is not large, so training for about 20 epochs is basically enough.

Generating Text with the Model

When the model training is completed, the trained model will be saved as the model.pt file in the current directory. We can use this model to generate text. The code for generating text is as follows:

import torch

import data  # assumes the Dictionary/Corpus code above is saved as data.py

def generate_text(device, checkpoint, data_source, words, temperature, log_interval):
    with open(checkpoint, 'rb') as f:
        model = torch.load(f, map_location=device)
    model.eval()

    corpus = data.Corpus(data_source)
    ntokens = len(corpus.dictionary)

    hidden = model.init_hidden(1)
    input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

    generated_text = []
    with torch.no_grad():  # no tracking history
        for i in range(words):
            output, hidden = model(input, hidden)
            word_weights = output.squeeze().div(temperature).exp().cpu()
            word_idx = torch.multinomial(word_weights, 1)[0]
            input.fill_(word_idx)

            word = corpus.dictionary.idx2word[word_idx]
            generated_text.append(word)

            if i % log_interval == 0:
                print('| Generated {}/{} words'.format(i, words))
    
    return generated_text

In this function, checkpoint is the path to the model file, data_source is the path to the dataset, words is the number of words to generate, temperature is a parameter controlling the diversity of the generated text, and log_interval controls how often progress is printed during generation. The complete code for generating text and saving it can be found in generate_text.py. Run it with:

python generate_text.py

We can get the generated text. Here I generated 1000 words, and part of the generated text is as follows:

– <unk> , a year then with the software . It usually was sold for nearly half the day time . For this reason , the Nevermind run surpassed and a new group of canned <unk> . It had benefited from the unhealthy content , which have been leveled on the <unk> 's gates through the design the effects products associated with other birds and tested stewardship of those articles , ranging from an upright system with <unk> <unk> .

The generated text contains <unk> tokens because rare words were already replaced by <unk> when the WikiText-2 dataset was built, so <unk> is part of our dictionary and the model learns to produce it. In addition, the text is generated by random sampling, so it may not be coherent. We can adjust the temperature parameter to control the diversity of the generated text: the higher the temperature, the more diverse (and more random) the output; the lower the temperature, the more conservative the output.
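
To see the effect of temperature in isolation, here is a small standalone sketch (not part of the repository code) that samples from a toy distribution at different temperatures:

import torch

logits = torch.tensor([2.0, 1.0, 0.1])            # toy scores for three candidate words
for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])
# A low temperature concentrates probability on the top-scoring word (conservative output),
# while a high temperature flattens the distribution and makes sampling more diverse.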

Summary

This article introduced how to implement a simple word-level language model with LSTM. Language modeling is an important problem in natural language processing: a language model can be used to predict the next word or to generate text. It is a sequence prediction problem; here we solved it with LSTM, but other RNN variants could be used as well.

Although using RNN and its variants to implement language models can achieve good results, their performance on complex tasks is still limited. In recent years, with the advent of Transformer, the performance of language models has been greatly improved. In the following articles, we will introduce Transformer and its variants, and how to use Transformer to implement a language model.
