Transformers
Overview
The transformer architecture, introduced by Vaswani et al. (2017), revolutionised natural language processing by eliminating the sequential processing constraints of recurrent networks. Its self-attention mechanism enables fully parallel computation and superior handling of long-distance dependencies, and the architecture has become the foundation for subsequent large language models.
Key Components
Input Processing and Embeddings
Transformers process sequences of tokens, converting them into numerical representations through encoding and embedding. Tokens are first encoded as one-hot vectors, then transformed into dense embeddings that capture semantic information. For GPT-3, the vocabulary contains 50,257 tokens, each embedded into a 12,288-dimensional vector.
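The one-hot-to-dense-embedding step can be sketched in a few lines of NumPy. The sizes here are toy values for illustration (GPT-3 uses a 50,257-token vocabulary and 12,288-dimensional embeddings), and the embedding matrix is random rather than learned:

```python
import numpy as np

# Toy sizes; the embedding matrix is a random stand-in for learned weights.
vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # one row per token

# A one-hot vector for token id 3 ...
one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0

# ... multiplied by the embedding matrix selects row 3 of the table.
dense = one_hot @ embedding
```

In practice the matrix multiply is implemented as a direct table lookup (`embedding[3]`), since multiplying by a one-hot vector just selects a row.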
Positional Encoding
Since transformers process all positions in parallel, they require explicit positional information to understand word order. Positional encoding adds this information to the embeddings, using trigonometric functions to create position-dependent encodings that can generalise to sequences longer than those seen during training (Vaswani et al. 2017). The input matrix combines word embeddings and positional encodings: X = X_{\text{WordEmbedding}} + X_{\text{PositionalEncoding}}.
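The sinusoidal scheme from Vaswani et al. (2017) can be written directly from its definition, PE[pos, 2i] = sin(pos / 10000^{2i/d_model}) and PE[pos, 2i+1] = cos(pos / 10000^{2i/d_model}). A NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings (Vaswani et al. 2017)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# The first-layer input is the element-wise sum described above:
# X = word_embeddings + positional_encoding(seq_len, d_model)
pe = positional_encoding(seq_len=50, d_model=16)
```

Because the encodings are deterministic functions of position, the same function extends to positions beyond those seen during training.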
Self-Attention
Self-attention is the core innovation of transformers, allowing the model to weigh the importance of different words when processing each position. The mechanism uses three matrices—Query (Q), Key (K), and Value (V)—derived from the input through learned projections. Attention scores are computed as:
\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
This enables the model to focus on relevant parts of the input, similar to how humans concentrate on significant aspects of complex information.
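The formula above translates almost line-for-line into code. A minimal NumPy sketch with random Q, K, V as stand-ins for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the input positions: the model's "focus" when producing that position's output.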
Multi-Head Attention
Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows the model to capture various types of relationships simultaneously—syntactic, semantic, and long-range dependencies. GPT-3, for example, uses 96 attention heads, with each head focusing on different aspects of the sequence.
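The parallel heads can be sketched as independent attention computations whose outputs are concatenated and projected back to the model dimension. Toy sizes and random matrices stand in for the learned projections (GPT-3 uses 96 heads over a 12,288-dimensional model):

```python
import numpy as np

d_model, n_heads, seq_len = 16, 4, 6
d_k = d_model // n_heads                    # per-head dimension

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
# Per-head Q/K/V projections and a final output projection (random stand-ins).
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(n_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    A = softmax(Q @ K.T / np.sqrt(d_k))     # this head's attention pattern
    heads.append(A @ V)                     # (seq_len, d_k)

# Concatenate all heads, then mix them with the output projection.
out = np.concatenate(heads, axis=-1) @ W_o  # (seq_len, d_model)
```

Because each head has its own projections, the heads can learn to attend to different relationships in the same sequence.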
Masked Self-Attention
For autoregressive models like GPT, masked self-attention prevents each position from attending to later positions. This maintains the autoregressive property, where each token is predicted from the preceding tokens only, while still allowing parallel processing during training. It differs from bidirectional models like BERT (Devlin et al. 2018), which attend to all positions in both directions.
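The mask is typically applied by setting the scores for future positions to negative infinity before the softmax, so they receive exactly zero attention weight. A sketch extending the attention function above:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i may only attend to positions <= i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Strictly-upper-triangular entries correspond to future positions;
    # exp(-inf) = 0, so they get zero weight after the softmax.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
```

The resulting weight matrix is lower-triangular, which is what lets the model train on all positions of a sequence in parallel while respecting the left-to-right prediction order.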
Feedforward Networks and Normalisation
Each transformer layer includes a feedforward neural network that non-linearly transforms attention outputs, enhancing the model’s expressive capabilities. Residual connections and layer normalisation (Ba, Kiros, and Hinton 2016) stabilise training and enable deep stacking—GPT-3 uses 96 layers—by alleviating vanishing gradients and normalising layer inputs.
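A sub-layer with its residual connection and layer normalisation follows the pattern out = LayerNorm(x + Sublayer(x)), the post-norm arrangement of the original paper. A minimal sketch of the feedforward sub-layer, with random weights standing in for learned parameters and the learned scale/shift of layer normalisation omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # learned scale/shift omitted

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward network with a ReLU hidden layer."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection plus normalisation: out = LayerNorm(x + FFN(x)).
out = layer_norm(x + ffn(x, W1, b1, W2, b2))
```

The residual path gives gradients a direct route through the stack, which is part of what makes very deep configurations such as GPT-3's 96 layers trainable.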
Advantages
The transformer architecture offers several key advantages:
- Parallelisation: All sequence positions can be processed simultaneously, significantly speeding up training and inference compared to sequential RNNs
- Long-range Dependencies: Self-attention directly models relationships between any two positions, regardless of distance
- Computational Efficiency: Although self-attention scales quadratically with sequence length, the absence of sequential recurrence lets transformers exploit highly parallel hardware, making them efficient in practice
- Interpretability: Attention weights can be visualised to understand which parts of the input the model focuses on
Variants and Extensions
Numerous transformer variants have been developed to address specific challenges:
- Transformer-XL (Dai et al. 2019): Handles longer sequences through segment-level recurrence
- ALBERT (Lan et al. 2019): Reduces parameter count through cross-layer parameter sharing and factorised embeddings while maintaining performance
- XLNet (Yang et al. 2019): Combines benefits of autoregressive and bidirectional approaches
These innovations continue to advance transformer-based models, enabling increasingly capable language models (Zhao et al. 2023).
Impact
The transformer architecture has become the universal foundation for modern large language models. All state-of-the-art LLMs, including GPT-3 (Brown et al. 2020), GPT-4 (Achiam et al. 2023), PaLM (Chowdhery et al. 2023), and LLaMA (Touvron et al. 2023), are built on transformer architectures. Combined with advances in computational resources and training techniques, this has enabled models with hundreds of billions of parameters that demonstrate remarkable capabilities across diverse tasks.