This repository implements the base Transformer and two variants, Performer and Longformer, which were proposed to reduce the time and memory cost of self-attention, particularly on longer sequences. The three models are compared on a sentiment classification task. Because classification does not need a decoder, only the encoder portion of the Transformer is used. The base model is derived from the Annotated Transformer tutorial, and the two variants are implemented by modifying only the self-attention mechanism. The focus is on improving efficiency (in both time and memory) and on comparing how well these models perform on a sentiment classification dataset.
For a deeper theoretical understanding, please refer to the blog post Benchmarking Efficient Transformer Variants, which covers both variants along with the base Transformer.
The dataset used in this project consists of fashion product reviews downloaded from Amazon Reviews 2023. The reviews were labeled for binary sentiment classification as follows:
- Positive (1): Ratings of 4 or 5.
- Negative (0): Ratings of 1, 2, or 3.
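The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's preprocessing code; the function name `rating_to_label` is hypothetical.

```python
def rating_to_label(rating: float) -> int:
    """Map a 1-5 star rating to a binary sentiment label (hypothetical helper)."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    # Ratings of 4 or 5 are positive (1); ratings of 1, 2, or 3 are negative (0).
    return 1 if rating >= 4 else 0


labels = [rating_to_label(r) for r in [1, 2, 3, 4, 5]]
print(labels)
```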
A total of 120,000 samples were randomly selected, divided into training (80k), testing (20k), and validation (20k) subsets.
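One way such a split could be produced is sketched below. The function name, the fixed seed, and the choice to split by shuffled index are all assumptions for illustration; the repository may build its subsets differently.

```python
import random


def split_indices(n_total, n_train, n_test, n_val, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into three disjoint subsets."""
    if n_train + n_test + n_val != n_total:
        raise ValueError("subset sizes must sum to n_total")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]
    return train, test, val


train_idx, test_idx, val_idx = split_indices(120_000, 80_000, 20_000, 20_000)
print(len(train_idx), len(test_idx), len(val_idx))
```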
This is the base Transformer architecture described in the Attention Is All You Need paper. A typical Transformer has an encoder-decoder structure. For classification, however, only the encoder is usually required, followed by a feed-forward classification head. The implementation follows The Annotated Transformer tutorial, which walks through implementing the Transformer from scratch in PyTorch, with extensive annotations and explanations for each part of the model.
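To make the cost of standard self-attention concrete, here is a minimal single-head sketch in NumPy (rather than PyTorch, to keep it dependency-light). It is not the repository's code: the learned query/key/value projections are omitted, and the mean-pooling classification head is an assumption chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v have shape (N, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (N, N): this matrix is the quadratic cost
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
N, d = 6, 16                            # toy sizes, chosen only for the example
x = rng.normal(size=(N, d))
out, weights = self_attention(x, x, x)  # learned Q/K/V projections omitted

# Hypothetical classification head: mean-pool the tokens, then one linear layer.
w_head = rng.normal(size=(d, 2))
logits = out.mean(axis=0) @ w_head
print(out.shape, weights.shape, logits.shape)
```

The `(N, N)` score matrix is what both variants below avoid materializing in full.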
The Performer addresses the quadratic complexity of self-attention with an efficient attention mechanism called FAVOR+ (Fast Attention Via positive Orthogonal Random features). It approximates standard softmax self-attention with complexity that is linear in the sequence length, O(N), as shown in the image below. Instead of directly computing the dot product between all pairs of tokens, FAVOR+ treats softmax attention as a kernel and maps the queries and keys through random feature maps, so that the keys and values can be combined first and the N x N attention matrix is never formed. This lets the Performer compute attention with a number of operations that grows linearly with the sequence length, significantly reducing the memory footprint and computational cost.
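The random-feature idea can be sketched in NumPy as follows. This is a simplified illustration, not the repository's implementation: it uses i.i.d. Gaussian projections and omits the orthogonalization step of FAVOR+, and the function names and toy sizes are assumptions.

```python
import numpy as np


def positive_random_features(x, w):
    """phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m), so E[phi(q).phi(k)] = exp(q.k)."""
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)


def linear_attention(q, k, v, w):
    """Approximate softmax attention without forming the (N, N) matrix."""
    d = q.shape[-1]
    q, k = q / d**0.25, k / d**0.25        # so q.k equals the scaled score q.k / sqrt(d)
    q_f = positive_random_features(q, w)   # (N, m)
    k_f = positive_random_features(k, w)   # (N, m)
    kv = k_f.T @ v                         # (m, d): keys and values combined first
    normalizer = q_f @ k_f.sum(axis=0)     # (N,): approximate softmax denominator
    return (q_f @ kv) / normalizer[:, None]


def softmax_attention(q, k, v):
    """Exact attention, used here only as a reference."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


rng = np.random.default_rng(0)
N, d, m = 32, 16, 4096                     # toy sizes; m is the number of random features
q = 0.5 * rng.normal(size=(N, d))
k = 0.5 * rng.normal(size=(N, d))
v = rng.normal(size=(N, d))
w = rng.normal(size=(m, d))                # i.i.d. Gaussian; FAVOR+ would orthogonalize these

approx = linear_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
print("max abs error:", np.abs(approx - exact).max())
```

The cost of `linear_attention` grows as N * m * d rather than N^2 * d, and the approximation error shrinks as `m` increases.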
The Longformer modifies the Transformer's attention to handle long sequences efficiently. It introduces sliding window attention, in which each token attends only to a fixed number of neighboring tokens, reducing the complexity from O(N^2) to O(N) for a fixed window size. It also supports dilated (or strided) attention, in which the window has gaps so that tokens can attend to positions at regular intervals outside their immediate neighborhood, capturing broader context while remaining efficient. For tasks that need some tokens to see the entire sequence (e.g., classification tokens, or question tokens in QA tasks), Longformer gives those tokens global attention, while all other tokens continue to use local sliding window attention. This hybrid approach balances efficiency with the ability to capture long-range dependencies when needed. As shown in the image below, every token attends to its neighboring tokens (the band along the diagonal) and to all global tokens (for example, CLS), and the global tokens attend to every token in the sequence.
The table below summarizes the key hyperparameters shared by all three models during training.
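The attention pattern described above can be visualized with a small mask-building sketch. The function name and parameters are hypothetical, and dilation is left out. A dense N x N mask like this one is only useful for illustrating the pattern: an efficient implementation computes just the banded and global entries instead of materializing the full matrix.

```python
def longformer_mask(seq_len, window, global_positions=()):
    """Return mask[i][j] = True where token i may attend to token j.

    `window` is the one-sided width: token i sees positions i - window .. i + window.
    """
    mask = [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]
    for g in set(global_positions):
        for j in range(seq_len):
            mask[g][j] = True  # the global token attends to every token
            mask[j][g] = True  # every token attends to the global token
    return mask


# Position 0 plays the role of a CLS token with global attention.
mask = longformer_mask(seq_len=8, window=1, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```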
Hyperparameter | Value |
---|---|
Number of Encoder Layers | 6 |
Number of Heads | 8 |
Embedding Dimension | 128 |
Dropout | 0.1 |
Optimizer | Adam |
Learning Rate | 0.0001 |
Early Stopping | True |
Batch Size | 16 |
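For reference, the values in the table can be collected in a small Python structure. The class and field names below are hypothetical and are not the keys used in the repository's config.yml; the sketch also shows the per-head dimension implied by the embedding dimension and the number of heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the table above (field names are illustrative)."""
    num_encoder_layers: int = 6
    num_heads: int = 8
    embedding_dim: int = 128
    dropout: float = 0.1
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    early_stopping: bool = True
    batch_size: int = 16

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the embedding evenly across heads.
        if self.embedding_dim % self.num_heads != 0:
            raise ValueError("embedding_dim must be divisible by num_heads")
        return self.embedding_dim // self.num_heads


config = TrainConfig()
print(config.head_dim)
```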
To evaluate the training performance of the three models (Transformer, Performer, and Longformer), their validation loss and validation accuracy were compared on the validation dataset across training epochs. Here's a pictorial summary of the findings:
The models were assessed primarily on accuracy, since they were trained on a balanced dataset.
The code was tested with Python 3.9.18, and the dependencies can be installed using pip. Follow these steps to use this implementation:
- Clone the repository:
  git clone https://github.com/infocusp/varformers.git
- Install the required packages:
  pip install -r requirements.txt
- Update config.yml to set the correct data path, select a model variant, and configure model parameters such as the number of encoder layers and the number of heads in an encoder layer.
- Run the training script from the root directory:
  python3 train.py
Contributions are welcome! Please fork this repository and submit a pull request for any enhancements or bug fixes.
- Attention Is All You Need: Transformer Paper
- Rethinking Attention with Performers: Performer Paper
- Longformer: The Long-Document Transformer: Longformer Paper
- Annotated Transformer: Transformer Implementation
- Phil Wang: Performer Implementation
- Transformers: State-of-the-Art Natural Language Processing: Huggingface implementation of Longformer