Transformers
Overview
The transformer architecture, introduced by Vaswani et al. (2017), revolutionised natural language processing by eliminating the sequential processing constraints of recurrent networks. Its self-attention mechanism enables fully parallel computation and superior handling of long-distance dependencies, and the architecture has become the foundation for subsequent large language models.
Key Components
Input Processing and Embeddings
Transformers process sequences of tokens, converting them into numerical representations through encoding and embedding. Tokens are first encoded as one-hot vectors, then transformed into dense embeddings that capture semantic information. For GPT-3, the vocabulary contains 50,257 tokens, each embedded into a 12,288-dimensional vector.
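The one-hot-to-dense-embedding step can be sketched in a few lines of NumPy. The sizes here are toy values for illustration (GPT-3 uses a 50,257-token vocabulary and 12,288-dimensional embeddings), and the embedding matrix is random rather than learned:

```python
import numpy as np

# Toy sizes; the embedding matrix is a random stand-in for learned weights.
vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # one row per token

# A one-hot vector for token id 3 ...
one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0

# ... multiplied by the embedding matrix selects row 3 of the table.
dense = one_hot @ embedding
```

In practice the matrix multiply is implemented as a direct table lookup (`embedding[3]`), since multiplying by a one-hot vector just selects a row.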
Positional Encoding
Since transformers process all positions in parallel, they require explicit positional information to understand word order. Positional encoding adds this information to the embeddings, using trigonometric functions to create position-dependent encodings that can generalise to sequences longer than those seen during training (Vaswani et al. 2017). The input matrix combines word embeddings and positional encodings: X = X_{\text{WordEmbedding}} + X_{\text{PositionalEncoding}}.
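The sinusoidal scheme from Vaswani et al. (2017) can be written directly from its definition, PE[pos, 2i] = sin(pos / 10000^{2i/d_model}) and PE[pos, 2i+1] = cos(pos / 10000^{2i/d_model}). A NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings (Vaswani et al. 2017)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# The first-layer input is the element-wise sum described above:
# X = word_embeddings + positional_encoding(seq_len, d_model)
pe = positional_encoding(seq_len=50, d_model=16)
```

Because the encodings are deterministic functions of position, the same function extends to positions beyond those seen during training.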
Self-Attention
Self-attention is the core innovation of transformers, allowing the model to weigh the importance of different words when processing each position. The mechanism uses three matrices—Query (Q), Key (K), and Value (V)—derived from the input through learned projections. Attention scores are computed as:
\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
This enables the model to focus on relevant parts of the input, similar to how humans concentrate on significant aspects of complex information.
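The formula above translates almost line-for-line into code. A minimal NumPy sketch with random Q, K, V as stand-ins for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the input positions: the model's "focus" when producing that position's output.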
Multi-Head Attention
Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows the model to capture various types of relationships simultaneously—syntactic, semantic, and long-range dependencies. GPT-3, for example, uses 96 attention heads, with each head focusing on different aspects of the sequence.
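The parallel heads can be sketched as independent attention computations whose outputs are concatenated and projected back to the model dimension. Toy sizes and random matrices stand in for the learned projections (GPT-3 uses 96 heads over a 12,288-dimensional model):

```python
import numpy as np

d_model, n_heads, seq_len = 16, 4, 6
d_k = d_model // n_heads                    # per-head dimension

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
# Per-head Q/K/V projections and a final output projection (random stand-ins).
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(n_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    A = softmax(Q @ K.T / np.sqrt(d_k))     # this head's attention pattern
    heads.append(A @ V)                     # (seq_len, d_k)

# Concatenate all heads, then mix them with the output projection.
out = np.concatenate(heads, axis=-1) @ W_o  # (seq_len, d_model)
```

Because each head has its own projections, the heads can learn to attend to different relationships in the same sequence.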
Masked Self-Attention
For autoregressive models like GPT, masked self-attention prevents each position from attending to later positions. This maintains the autoregressive property, where each token is predicted from the preceding tokens only, while still allowing parallel processing during training. It differs from bidirectional models like BERT (Devlin et al. 2018), which attend to all positions in both directions.
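The mask is typically applied by setting the scores for future positions to negative infinity before the softmax, so they receive exactly zero attention weight. A sketch extending the attention function above:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i may only attend to positions <= i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Strictly-upper-triangular entries correspond to future positions;
    # exp(-inf) = 0, so they get zero weight after the softmax.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
```

The resulting weight matrix is lower-triangular, which is what lets the model train on all positions of a sequence in parallel while respecting the left-to-right prediction order.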
Feedforward Networks and Normalisation
Each transformer layer includes a feedforward neural network that non-linearly transforms attention outputs, enhancing the model’s expressive capabilities. Residual connections and layer normalisation (Ba, Kiros, and Hinton 2016) stabilise training and enable deep stacking—GPT-3 uses 96 layers—by alleviating vanishing gradients and normalising layer inputs.
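A sub-layer with its residual connection and layer normalisation follows the pattern out = LayerNorm(x + Sublayer(x)), the post-norm arrangement of the original paper. A minimal sketch of the feedforward sub-layer, with random weights standing in for learned parameters and the learned scale/shift of layer normalisation omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # learned scale/shift omitted

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward network with a ReLU hidden layer."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection plus normalisation: out = LayerNorm(x + FFN(x)).
out = layer_norm(x + ffn(x, W1, b1, W2, b2))
```

The residual path gives gradients a direct route through the stack, which is part of what makes very deep configurations such as GPT-3's 96 layers trainable.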
Advantages
The transformer architecture offers several key advantages:
- Parallelisation: All sequence positions can be processed simultaneously, significantly speeding up training and inference compared to sequential RNNs
- Long-range Dependencies: Self-attention directly models relationships between any two positions, regardless of distance
- Computational Efficiency: Although self-attention scales quadratically with sequence length, the absence of sequential recurrence lets transformers exploit highly parallel hardware, making them efficient in practice
- Interpretability: Attention weights can be visualised to understand which parts of the input the model focuses on
Variants and Extensions
Numerous transformer variants have been developed to address specific challenges:
- Transformer-XL (Dai et al. 2019): Handles longer sequences through segment-level recurrence
- ALBERT (Lan et al. 2019): Reduces parameter count through cross-layer parameter sharing and factorised embeddings while maintaining performance
- XLNet (Yang et al. 2019): Combines benefits of autoregressive and bidirectional approaches
These innovations continue to advance transformer-based models, enabling increasingly capable language models (Zhao et al. 2023).
Impact
The transformer architecture has become the universal foundation for modern large language models. All state-of-the-art LLMs, including GPT-3 (Brown et al. 2020), GPT-4 (Achiam et al. 2023), PaLM (Chowdhery et al. 2023), and LLaMA (Touvron et al. 2023), are built on transformer architectures. Combined with advances in computational resources and training techniques, this has enabled models with hundreds of billions of parameters that demonstrate remarkable capabilities across diverse tasks.