Transformers for Language Modeling and Sentiment Analysis

Matheus Schmitz
LinkedIn
Github Portfolio


In this problem we will learn how to implement the building blocks of "Transformer" models, implement a BERT-style language-modeling pre-training procedure for such models, and then fine-tune a pre-trained model on a sentiment analysis task using the IMDB movie review dataset. Typically, transformer models are very large and are pre-trained on language modeling tasks with massive datasets and huge computational resources. As such, we will only implement the pre-training procedure, without expecting you to pre-train a model to completion. We will then load in a pre-trained model for you to fine-tune on a sentiment analysis task.

We will complete the following steps in this problem:

  1. Implement a multi-head-attention (MHA) layer.
  2. Implement "Transformer block" layers which use MHA layers, linear layers, and residual connections.
  3. Implement a full Transformer model comprised of Transformer blocks.
  4. Implement BERT-style language model pre-training for the Transformer model.
  5. Fine-tune our trained language model on a sentiment analysis task.

In order to run on GPU in Colab go to Runtime -> Change runtime type and select GPU under the Hardware accelerator drop-down box.

1 - Scaled Dot Product Attention [8 points]

The attention mechanism describes a relatively new family of layers in neural networks that has attracted a lot of interest in the past few years, especially in sequence tasks. Here we use the following definition: the attention mechanism describes a weighted average of (sequence) elements with the weights dynamically computed based on an input query and the elements' keys. In other words, we want to dynamically decide which inputs we want to "attend" to more than others based on their values. In particular, an attention mechanism usually has 4 parts we need to specify:

  1. Query: a feature vector describing what we are looking for in the sequence.
  2. Keys: for each input element, a feature vector describing what that element "offers".
  3. Values: for each input element, the feature vector we want to average over.
  4. Score function: a function (here the scaled dot product) that takes a query and a key and returns the attention logit for that pair.

The weights of the average are calculated by a softmax over all score function outputs. Hence, value vectors whose corresponding keys are most similar to the query receive a higher weight.

The attention applied inside the Transformer architecture is called self-attention. In self-attention, each sequence element provides a key, a value, and a query. For each element, we perform an attention operation where, based on its query, we check the similarity of all sequence elements' keys and return a different, averaged value vector for each element.

The core concept behind self-attention is scaled dot product attention. Dot product attention takes as input a set of queries $Q \in \mathbb{R}^{T \times d_k}$, keys $K \in \mathbb{R}^{T \times d_k}$, and values $V \in \mathbb{R}^{T \times d_v}$, where $T$ is the sequence length, and $d_k$ and $d_v$ are the hidden dimensionalities of the queries/keys and values respectively. The attention weight from element $i$ to $j$ is based on the similarity of the query $Q_i$ and key $K_j$, using the dot product as the similarity metric. Mathematically:

$$Attention(Q,K,V)=\text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V $$

The matrix multiplication $Q K^T$ performs the dot product for every possible pair of queries and keys, resulting in a matrix of the shape $T \times T$. Each row represents the attention logits for a specific element $i$ to all other elements in the sequence. We apply a softmax and multiply with the value vector to obtain a weighted mean (the weights being determined by the attention). The computation graph is visualized below.

[Figure: computation graph of scaled dot product attention]
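The sketch below shows one way the equation above could be written in PyTorch; the function name, the optional mask argument, and the convention of returning both the output values and the attention weights are illustrative assumptions, not the assignment's required interface.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (..., T, d_k); v: (..., T, d_v)
    d_k = q.size(-1)
    attn_logits = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (..., T, T)
    if mask is not None:
        attn_logits = attn_logits.masked_fill(mask == 0, float('-inf'))
    attention = F.softmax(attn_logits, dim=-1)   # weights over the sequence
    values = torch.matmul(attention, v)          # weighted average of the values
    return values, attention
```

For example, calling it with random tensors of shape `(2, 5, 16)` for `q`, `k`, and `v` returns values of shape `(2, 5, 16)` and attention weights of shape `(2, 5, 5)`.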

Before you continue, run the test code listed below. It will generate random queries, keys, and value vectors, and calculate the attention outputs. Make sure you can follow the calculation of the specific values here, and also check it by hand.

2 - Build Multi-Head-Attention Layer [8 points]

Now we will implement multi-head attention, first introduced in Attention Is All You Need (Vaswani et al., 2017). Scaled dot product attention allows a network to attend over a sequence. However, there are often multiple different aspects of a sequence element one wants to attend to, and a single weighted average is not a good fit for that. This is why we extend the attention mechanism to multiple heads, i.e. multiple different query-key-value triplets computed from the same features.

A multi-head attention layer works by employing several self-attention layers in parallel. Given query, key, and value matrices, we transform them into $h$ sub-queries, sub-keys, and sub-values, which we pass through scaled dot product attention independently, where $h$ is the number of heads. Afterward, we concatenate the heads and combine them with a final weight matrix. Mathematically,

$$Multihead(Q,K,V)=Concat(head_1, ..., head_h)W^O,$$

where

$$head_i=Attention(QW^Q_i, KW^K_i, VW^V_i).$$

We refer to this as a Multi-Head Attention layer with the learnable parameters $W^Q_{1...h}\in \mathbb{R}^{d_{in}\times d_k}$, $W^K_{1...h}\in \mathbb{R}^{d_{in}\times d_k}$, $W^V_{1...h}\in \mathbb{R}^{d_{in}\times d_v}$, and $W^O\in \mathbb{R}^{h\cdot d_v \times d_{out}}$, where $d_{in}$ is the input dimensionality and $d_{out}$ is the output dimensionality. The computational graph is visualized below.

[Figure: computation graph of the multi-head attention layer]

Looking at the computation graph above, a simple but effective implementation is to use the current feature map $X\in \mathbb{R}^{B\times T\times d_{model}}$ in the network as the query, key, and value input, where $B$ is the batch size, $T$ is the sequence length, and $d_{model}$ is the hidden dimensionality of $X$. The linear projections $W^Q$, $W^K$, and $W^V$ then produce the per-head queries, keys, and values from $X$.
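A minimal sketch of such a layer is shown below, assuming $d_{in} = d_{out} = d_{model}$ and $d_k = d_v = d_{model}/h$, and reusing the scaled_dot_product_attention helper sketched in Section 1; the class and attribute names are placeholders rather than the assignment's required interface.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project X to per-head Q, K, V, attend per head, then recombine with W^O."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)   # fused W^Q, W^K, W^V for all heads
        self.out_proj = nn.Linear(d_model, d_model)       # W^O

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        qkv = self.qkv_proj(x)                                          # (B, T, 3*d_model)
        qkv = qkv.reshape(B, T, self.num_heads, 3 * self.d_head).permute(0, 2, 1, 3)
        q, k, v = qkv.chunk(3, dim=-1)                                  # each (B, h, T, d_head)
        values, _ = scaled_dot_product_attention(q, k, v, mask=mask)    # (B, h, T, d_head)
        values = values.permute(0, 2, 1, 3).reshape(B, T, -1)           # concatenate heads
        return self.out_proj(values)                                    # (B, T, d_model)
```

Fusing the three projections into one linear layer is just an implementation convenience; three separate `nn.Linear` layers are equally valid.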

Let's check that your MHA layer works and returns a tensor of the correct shape.

3 - Build Transformer Blocks [8 points]

Now we construct the blocks of which Transformer models are composed.

Originally, the Transformer model was designed for machine translation. Hence, it has an encoder-decoder structure: the encoder takes as input the sentence in the source language and generates an attention-based representation, while the decoder attends over the encoded information and generates the translated sentence in an autoregressive manner, as in a standard RNN. The computational graph is visualized below. Here we will focus on the encoder part and implement the encoder block.

[Figure: the original Transformer encoder-decoder architecture]

A Transformer encoder block consists of the following modules in this order:

  1. Multi-Head Attention (we implemented above)
  2. Dropout
  3. Residual connection to the input (simply add the input of the block to the output of the previous dropout layer).
  4. Layer Norm - https://arxiv.org/abs/1607.06450
  5. Linear layer
  6. Activation function (typically gelu - https://arxiv.org/abs/1606.08415)
  7. Linear layer
  8. Dropout
  9. Residual connection to 4 (add the output of 4 to 8)
  10. Layer Norm

According to the listed modules, please implement:

class TransformerBlock(nn.Module)
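Below is a minimal sketch of one way such a block could look, assuming the MultiHeadAttention sketch from Section 2 and a hypothetical feed-forward width d_ff; your own implementation may organize the modules differently as long as it follows the order listed above.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-LayerNorm encoder block following modules 1-10 listed above."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)  # from Section 2
        self.dropout1 = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # 1-4: attention, dropout, residual to the block input, layer norm
        x = self.norm1(x + self.dropout1(self.attn(x, mask=mask)))
        # 5-10: linear, GELU, linear, dropout, residual to step 4's output, layer norm
        x = self.norm2(x + self.ff(x))
        return x
```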

Let's once again check that the code runs without error and outputs the correct shape (note, this is not a guarantee that you have implemented it correctly).

4 - Position Encoding [0 points]

In tasks like language understanding, position is important for interpreting the input words. Position information can therefore be added via the input features. We could learn an embedding for every possible position, but this would not generalize to varying input sequence lengths. Hence, the better option is to use feature patterns that the network can identify in the features and potentially generalize to longer sequences. Mathematically:

$$PE(pos, i) = \left\{\begin{matrix} \sin (\frac{pos}{10000^{i/d_{model}}}) & \text{if } i\mod 2=0\\ \cos (\frac{pos}{10000^{(i-1)/d_{model}}}) & \text{otherwise} \end{matrix}\right.$$

$PE(pos,i)$ represents the position encoding at position $pos$ in the sequence and hidden dimension $i$. These values, concatenated over all hidden dimensions, are added to the original input features and constitute the position information. The intuition behind this encoding is that you can represent $PE(pos+k,:)$ as a linear function of $PE(pos,:)$, which might allow the model to easily attend to relative positions.
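This layer is provided for you in the assignment; the sketch below is just one common way the formula above is written in PyTorch, assuming an even $d_{model}$.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal position encodings added to the token embeddings."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)            # (max_len, 1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model), not a parameter

    def forward(self, x):                    # x: (B, T, d_model)
        return x + self.pe[:, :x.size(1)]
```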

5 - Build a BERT model [8 points]

A BERT model consists of:

  1. An input embedding layer. This converts a token index into a vector embedding. Make sure to include an extra embedding for the masked tokens! In other words, learn vocab_size + 1 embeddings.
  2. Positional encodings. This layer (implemented for you already) encodes the position of each token, since multi-head-attention layers have no notion of positional locality or order. It takes as input the token embeddings from (1) and returns them with positional embeddings added.
  3. Several stacked Transformer blocks (the number specified by n_layers)
  4. Output linear layer that predicts masked words for pre-training. Takes final embedding of last block and outputs probability distribution over the vocabulary.
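The sketch below wires these four components together, assuming the PositionalEncoding and TransformerBlock sketches above; the constructor arguments are illustrative, not the assignment's required signature.

```python
import torch.nn as nn

class BERT(nn.Module):
    """Token embeddings -> positional encodings -> n_layers Transformer blocks -> LM head."""
    def __init__(self, vocab_size, d_model, num_heads, d_ff, n_layers, max_len=512, dropout=0.1):
        super().__init__()
        # 1. token embeddings, with one extra row reserved for the mask token
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        # 2. positional encodings (provided / sketched above)
        self.pos_enc = PositionalEncoding(d_model, max_len=max_len)
        # 3. stacked Transformer blocks
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff, dropout) for _ in range(n_layers)]
        )
        # 4. output head predicting a distribution over the vocabulary
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, mask=None):       # tokens: (B, T) token indices
        x = self.pos_enc(self.embed(tokens))
        for block in self.blocks:
            x = block(x, mask=mask)
        return self.lm_head(x)                  # (B, T, vocab_size) logits
```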

Let's once again check that the code runs without error and outputs the correct shape (note, this is not a guarantee that you have implemented it correctly).

6 - Implement BERT Pre-Training [8 points]

In order to pre-train our language model, we randomly perturb mask_rate% of the tokens and attempt to predict the original tokens. Following the BERT paper, the selected tokens are perturbed as follows:

  1. 80% are replaced with the special mask token.
  2. 10% are replaced with a random token from the vocabulary.
  3. 10% are left unchanged.

The prediction task is then to predict the original token for only the perturbed tokens. You should use nn.CrossEntropyLoss. Note that this module has a keyword argument ignore_index which specifies a label index for which the loss is not computed (it is -100 by default). This can be used to compute the loss only for the perturbed tokens.

For more details, please look at Task 1 in Section 3.1 of the BERT paper. We do not consider the second pre-training task (Next Sentence Prediction) for this assignment.

We do not expect you to complete the pre-training procedure, which is not feasible given your computational resources. We are simply asking you to implement one step of training with synthetic data.
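Below is a hedged sketch of what a single pre-training step might look like, assuming a model like the BERT sketch above, the 80/10/10 corruption split from the BERT paper, and vocab_size as the index of the extra mask embedding; the exact specification expected by the grader may differ.

```python
import torch
import torch.nn as nn

def pretrain_step(model, optimizer, tokens, vocab_size, mask_rate=0.15):
    """One masked-language-modeling step on a (B, T) batch of synthetic token indices."""
    mask_token = vocab_size                                   # the extra embedding index
    labels = tokens.clone()
    selected = torch.rand_like(tokens, dtype=torch.float) < mask_rate
    labels[~selected] = -100                                  # ignored by nn.CrossEntropyLoss

    corrupted = tokens.clone()
    rand = torch.rand_like(tokens, dtype=torch.float)
    corrupted[selected & (rand < 0.8)] = mask_token           # 80% -> mask token
    random_tokens = torch.randint_like(tokens, vocab_size)
    replace = selected & (rand >= 0.8) & (rand < 0.9)         # 10% -> random token
    corrupted[replace] = random_tokens[replace]
    # remaining 10% of the selected tokens are left unchanged

    logits = model(corrupted)                                 # (B, T, vocab_size)
    loss = nn.CrossEntropyLoss(ignore_index=-100)(
        logits.reshape(-1, vocab_size), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```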

7 - Fine-Tune Pre-Trained Model on Sentiment Analysis [8 points]

In the previous section we implemented the pre-training procedure specified in the BERT paper. Now, we will take a fully-trained BERT model and use its learned representations for performing a sentiment analysis task.

We will use the transformers library to get pre-trained transformers and use them as our embedding layers. We will freeze (not train) the transformer and only train the remainder of the model which learns from the representations produced by the transformer. In this case we will be using a multi-layer bi-directional GRU, however any model can learn from these representations.

The goal of this sentiment analysis task is to predict the "sentiment" of a particular sequence. In this case the sequences are movie reviews, and we're predicting whether they are positive or negative. Our model outputs a probability of positive sentiment for each input sequence. Use nn.BCEWithLogitsLoss to fine-tune the model on this task.

Preparing Data

The transformer has already been trained with a specific vocabulary, which means we need to use that exact same vocabulary and also tokenize our data in the same way the data was tokenized when the transformer was initially trained.

Luckily, the transformers library has tokenizers for each of the transformer models provided. In this case we are using the BERT model which ignores casing (i.e. will lower case every word). We get this by loading the pre-trained bert-base-uncased tokenizer.

Set constants regarding text tokenization and processing such that we are consistent with how the model was trained.

Define tokenization functions and set up IMDB dataset
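As one illustration, the tokenizer and the associated constants could be set up roughly as follows; the variable names and the tokenize_and_cut helper are placeholders, not required names.

```python
from transformers import BertTokenizer

# Load the tokenizer matching the pre-trained model (lower-cases all text).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

init_token_idx = tokenizer.cls_token_id        # [CLS]
eos_token_idx = tokenizer.sep_token_id         # [SEP]
pad_token_idx = tokenizer.pad_token_id         # [PAD]
unk_token_idx = tokenizer.unk_token_id         # [UNK]
max_input_length = tokenizer.model_max_length  # 512 for bert-base-uncased

def tokenize_and_cut(sentence):
    # leave room for the [CLS] and [SEP] special tokens
    tokens = tokenizer.tokenize(sentence)
    return tokens[:max_input_length - 2]
```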

Create iterator to sample batches from the dataset.

Build the Model

Next, we'll load the pre-trained model, making sure to load the same model as we did for the tokenizer.

Next, we'll define our actual model.

Instead of using an embedding layer to get embeddings for our text, we'll be using the pre-trained transformer model. These embeddings will then be fed into a GRU to produce a prediction for the sentiment of the input sentence. We get the embedding dimension size (called the hidden_size) from the transformer via its config attribute. The rest of the initialization is standard.

Within the forward pass, we wrap the transformer in a no_grad block to ensure no gradients are calculated over this part of the model. The transformer actually returns the embeddings for the whole sequence as well as a pooled output. The documentation states that the pooled output is "usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence", hence we will not be using it. The rest of the forward pass is the standard implementation of a recurrent model, where we take the hidden state of the final time-step and pass it through a linear layer to get our predictions. When using a bidirectional GRU, we concatenate the final hidden states of the forward and backward directions.
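A minimal sketch of such a model is shown below, assuming the transformers BertModel instance is passed in as bert; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    """Frozen BERT embeddings -> multi-layer (bi)GRU -> linear head producing one logit."""
    def __init__(self, bert, hidden_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.hidden_size            # size of BERT's output embeddings
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers,
                          bidirectional=bidirectional, batch_first=True,
                          dropout=0 if n_layers < 2 else dropout)
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):                                # text: (B, T) token ids
        with torch.no_grad():                               # no gradients through BERT
            embedded = self.bert(text)[0]                   # (B, T, emb_dim) sequence output
        _, hidden = self.rnn(embedded)                      # (layers * dirs, B, hidden_dim)
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2], hidden[-1]), dim=-1))
        else:
            hidden = self.dropout(hidden[-1])
        return self.out(hidden)                             # (B, 1) sentiment logit
```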

Next, we create an instance of our model. You need to select hyperparameters.

In order to freeze the BERT parameters (not train them) we need to set their requires_grad attribute to False. To do this, we simply loop through all of the named_parameters in our model and, if they're part of the bert transformer model, set requires_grad = False.
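A short sketch of that loop, assuming the model instance is called model and the transformer is stored in an attribute named bert (as in the sketch above):

```python
# Freeze the BERT parameters so only the GRU and the linear head are trained.
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False
```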

Train the Model

As is standard, we define our optimizer and criterion (loss function).

Next, we'll define functions for calculating accuracy, performing a training epoch, performing an evaluation epoch, and measuring how long a training/evaluation epoch takes.
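As a sketch, the accuracy metric and a training epoch could look roughly like this, assuming the iterator yields (text, label) batches and criterion is nn.BCEWithLogitsLoss; the evaluation epoch is analogous but wrapped in model.eval() and torch.no_grad().

```python
import torch

def binary_accuracy(preds, y):
    """Fraction of predictions matching the labels; preds are raw logits."""
    rounded = torch.round(torch.sigmoid(preds))
    return (rounded == y).float().mean()

def train_epoch(model, iterator, optimizer, criterion, device):
    model.train()
    epoch_loss, epoch_acc = 0.0, 0.0
    for text, labels in iterator:                    # batch format is an assumption
        text, labels = text.to(device), labels.float().to(device)
        optimizer.zero_grad()
        predictions = model(text).squeeze(1)         # (B,) logits
        loss = criterion(predictions, labels)        # nn.BCEWithLogitsLoss
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += binary_accuracy(predictions, labels).item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
```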

Finally, we'll train our model.

Please train your model such that it reaches 90% validation accuracy. This is possible to accomplish within 15 minutes of training on GPU with the correct implementation and hyperparameters. Feel free to adjust the hyperparameters defined above in order to get the desired performance. Your points received will scale linearly from 0 for 50% accuracy to 8 for at least 90% accuracy.

Load the parameters that gave us the best validation loss and evaluate them on the test set.

Inference

We'll then use the model to test the sentiment of some sequences. We tokenize the input sequence, trim it down to the maximum length, add the special tokens to either side, convert it to a tensor, add a fake batch dimension and then pass it through our model. Feel free to add more test cases!
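One possible shape for such an inference helper, assuming the tokenizer, model, and device from earlier; the function name is a placeholder.

```python
import torch

def predict_sentiment(model, tokenizer, sentence, device, max_length=512):
    """Returns the predicted probability of positive sentiment for one sentence."""
    model.eval()
    tokens = tokenizer.tokenize(sentence)[:max_length - 2]            # trim to max length
    indexed = ([tokenizer.cls_token_id]
               + tokenizer.convert_tokens_to_ids(tokens)
               + [tokenizer.sep_token_id])                            # add special tokens
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)        # fake batch dimension
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor))
    return prediction.item()

# e.g. predict_sentiment(model, tokenizer, "This film is terrible", device)
```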

Conceptual Questions

  1. Why is the residual connection crucial in the Transformer architecture? [5 points]

Residual connections help the network train by allowing gradients to flow directly through the network without passing through the nonlinear activations, which mitigates vanishing gradients in deep stacks of blocks. They also let each block retain its original input information (including the positional encodings), so a block only has to learn a refinement of its input rather than a complete transformation.

  2. Why is Layer Normalization important in the Transformer architecture? [5 points]

Layer normalization is used to stabilize the network and can reduce training time. It also plays an important role in controlling the scale of activations and gradients across layers and reduces internal covariate shift.

  3. Why do we use the scaling factor of $1/\sqrt{d_k}$ in Scaled Dot Product Attention? If we remove it, what is going to happen? [5 points]

The scaling factor keeps the dot products from growing too large in magnitude as $d_k$ increases. Without it, the softmax saturates (its output approaches a one-hot distribution), so its gradients become extremely small, causing a vanishing-gradient problem that slows or destabilizes training.

End
