Week 6 — Transformers from Scratch

Overview

This is the architectural centerpiece of the course: you build a transformer from scratch. Having understood attention as query/key/value lookup (Week 5), you now drop recurrence entirely and build a model whose every layer is attention plus a feedforward network — the architecture behind GPT, BERT, and essentially every modern LLM. The transformer’s wins are that it processes all positions in parallel (no sequential bottleneck) and connects any two positions in one step (no vanishing long-range signal). Building it yourself — multi-head self-attention, positional encoding, residual connections, layer norm, and a causal decoder-only LM — is the single most valuable implementation exercise in NLP.

Course 5’s linear algebra (the projections and the scaled dot product) and the optimization/normalization foundations are applied throughout; the new content is the precise composition of components that makes the architecture trainable and powerful.

Readings

J&M Ch. 8: transformers — self-attention, multi-head attention, positional encoding, and the block structure. Extract: the exact computation of a transformer block.
CS224N: self-attention and transformer material (Assignment 3 style). Extract: the masking and multi-head details.
6.S191: LLM material. Extract: how the decoder-only transformer becomes a language model.

Key Concepts

Scaled dot-product self-attention

Project each token into queries, keys, and values; attention is

\[ \text{Attn}(Q,K,V)=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V. \]

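Every token attends to every token (self-attention) in one parallel operation. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance \(d_k\), so unscaled scores grow with dimension and push the softmax toward a near one-hot output where gradients vanish. Dividing by \(\sqrt{d_k}\) brings the score variance back to about 1 (a numerical-stability point connecting to Course 1/Course 5). This is Week 5’s attention with queries, keys, and values all derived from the same sequence.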
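The formula above can be sketched in a few lines. This is a minimal single-sequence, single-head version in NumPy (chosen here only to keep the example dependency-light; the course implementation is in PyTorch). The toy sizes, the random projection matrices W_q, W_k, W_v, and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                # toy sizes, chosen for illustration
X = rng.normal(size=(n, d_model))        # one sequence of n token vectors
# Hypothetical random projections standing in for learned parameters.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each row of weights sums to 1, and row t of out is the corresponding weighted average of the value vectors.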

Multi-head attention

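Run \(h\) attention operations in parallel on lower-dimensional projections (typically \(d_k = d/h\) per head), concatenate the head outputs, and apply an output projection. Different heads can learn different relations (syntactic, positional, coreference). With \(d_k = d/h\), the parameter count and compute match a single head of dimension \(d\); the gain is that each head has its own softmax, so one layer can attend to several positions for different reasons at once. In practice, multiple heads are important to performance.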
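A possible sketch of the split-attend-concatenate pattern, assuming per-head dimension d/h, square d-by-d projection matrices, and an output projection W_o. All names and sizes are illustrative, and the random weights stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d); W_q, W_k, W_v, W_o: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h

    def split(M):
        # (n, d) -> (h, n, d_k): one slice of the projection per head.
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # back to (n, d)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
out_1 = multi_head_attention(X, W_q, W_k, W_v, W_o, h=1)
out_4 = multi_head_attention(X, W_q, W_k, W_v, W_o, h=4)
n_params = sum(W.size for W in (W_q, W_k, W_v, W_o))   # 4 * d * d for any h
```

The same four matrices serve h=1 and h=4, so the parameter count does not depend on the number of heads; only the way the projected vectors are grouped before the softmax changes.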

Positional encoding

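Self-attention without positional information is permutation-equivariant: permuting the input tokens permutes the outputs in the same way, so the layer has no inherent notion of order. Positional encodings (sinusoidal or learned) are added to token embeddings to inject position. Modern variants are worth knowing: RoPE rotates queries and keys by a position-dependent angle, and ALiBi adds a distance-dependent bias to the attention scores. The principle is that order must be supplied explicitly because attention treats the input as a set.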
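The check below is a small sketch of both points: without positional information, permuting the input rows simply permutes the output rows, and adding a sinusoidal table breaks that symmetry. It assumes the standard sine/cosine table with base 10000 and, to stay short, uses identity projections in the attention function; the function names are illustrative.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """(n, d) table: sin on even columns, cos on odd columns; d must be even."""
    pos = np.arange(n)[:, None]            # (n, 1) positions
    i = np.arange(0, d, 2)[None, :]        # (1, d/2) even column indices
    angles = pos / (10000 ** (i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(X):
    # Identity projections (an assumption to keep the demo short).
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
perm = np.array([2, 0, 4, 1, 3])
pe = sinusoidal_positions(n, d)

# Without positions: shuffling the tokens only shuffles the outputs.
equivariant = np.allclose(self_attention(X[perm]), self_attention(X)[perm])
# With positions added: the same shuffle now changes the outputs themselves.
order_sensitive = not np.allclose(
    self_attention(X[perm] + pe), self_attention(X + pe)[perm]
)
```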

The block, and the decoder-only LM

A transformer block = multi-head attention + position-wise feedforward, each wrapped in a residual connection and layer normalization (these are what make deep stacks trainable, connecting to Course 1’s training stability). Stack \(N\) blocks. For a decoder-only LM, apply a causal mask so position \(t\) attends only to positions \(\le t\), and train with next-token cross-entropy (Week 1’s objective). This is the GPT-style architecture.
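One way to sketch the block and the causal mask is shown below. It makes several simplifying assumptions that are not from the text above: a single attention head, a pre-LN arrangement (normalize before each sublayer; the original Transformer normalized after the residual sum), a ReLU feedforward, and no learnable gain or bias in the layer norm. The parameter names in the dictionary p are hypothetical.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where key j > query t
    scores = np.where(future, -np.inf, scores)          # masked scores get weight 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # learnable gain/bias omitted

def block(x, p):
    # Pre-LN residual block (an assumption of this sketch).
    h = layer_norm(x)
    a, _ = causal_attention(h @ p["W_q"], h @ p["W_k"], h @ p["W_v"])
    x = x + a @ p["W_o"]                              # residual around attention
    h = layer_norm(x)
    x = x + np.maximum(0, h @ p["W_1"]) @ p["W_2"]    # residual around ReLU FFN
    return x

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32
shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d), "W_o": (d, d),
          "W_1": (d, d_ff), "W_2": (d_ff, d)}
p = {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(n, d))
y = block(x, p)

# Perturb only the last token: earlier outputs must not change.
x_future = x.copy()
x_future[-1, 0] += 1.0
y_future = block(x_future, p)
```

Comparing y and y_future row by row gives a quick causality test for a real implementation as well: any difference before the perturbed position means information is leaking from the future.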

Theory Exercises

Derive scaled dot-product attention and explain the \(1/\sqrt{d_k}\) factor in terms of variance/softmax saturation.
Show that multi-head attention’s parameter/compute cost equals single-head at model dim \(d\) when each head has dimension \(d/h\), and explain its expressiveness gain.
Explain why self-attention is permutation-equivariant and how positional encoding restores order; contrast sinusoidal vs learned vs RoPE.
Explain the role of residual connections and layer norm in training deep transformers (Course 1 stability link).
Derive the causal mask and explain how it makes the transformer a valid left-to-right LM; analyze the \(O(n^2)\) attention cost.
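
A numerical sanity check can accompany the first exercise without replacing the derivation. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization of what initialization provides) and compares the empirical variance of raw and scaled dot products; the sample size and the dimensions tried are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000
results = {}
for d_k in (16, 64, 256):
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=-1)                 # one dot product per row
    raw_var = dots.var()                        # expected to be close to d_k
    scaled_var = (dots / np.sqrt(d_k)).var()    # expected to be close to 1
    results[d_k] = (raw_var, scaled_var)

for d_k, (raw_var, scaled_var) in results.items():
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:.3f}")
```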

Implementation

Implement a decoder-only transformer LM from scratch in transformer_from_scratch/ (PyTorch, no nn.Transformer): token + positional embeddings, multi-head causal self-attention, position-wise FFN, residual + layer norm, and a final LM head. Train on the Week 1 corpus; verify against a reference implementation on a tiny config.
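
The training target is easy to get wrong by one position, so a sketch of the shifted next-token loss may help. It assumes the LM head returns one row of logits per input position, with row t scoring the token at position t+1; the function name and the toy token ids are illustrative. NumPy is used for brevity, and the same indexing carries over to PyTorch tensors.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """logits: (n, vocab), row t predicts token t+1; tokens: (n,) integer ids."""
    pred, target = logits[:-1], tokens[1:]            # shift by one position
    pred = pred - pred.max(axis=-1, keepdims=True)    # stable log-softmax
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]  # pick the true next token
    return nll.mean()

vocab, n = 10, 6
tokens = np.array([3, 1, 4, 1, 5, 9])
# All-zero logits give a uniform prediction over the vocabulary.
uniform_loss = next_token_loss(np.zeros((n, vocab)), tokens)
uniform_perplexity = np.exp(uniform_loss)
```

A uniform predictor has loss log(vocab) and perplexity equal to the vocabulary size, which makes a useful first check: a freshly initialized model should start near that value.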

Experiments

Perplexity vs the Week 3 LSTM LM (transformer should win, especially at longer context). Ablations: number of heads, number of layers, with/without positional encoding (expect worse perplexity; with a causal mask the model can still infer some position information, so the drop may be smaller than the set-like view of attention suggests), with/without residual+layernorm (deep stacks should fail to train). Attention-pattern visualization across heads/layers. Throughput vs sequence length to show the \(O(n^2)\) cost.

Expected baselines: the transformer beats the LSTM on perplexity and trains faster (parallelism); removing positional encoding hurts perplexity, though a causal decoder may degrade rather than collapse; removing residual/layernorm makes deep stacks untrainable; some attention heads show interpretable specialization; attention compute grows quadratically with context length.
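
Before measuring throughput, it can help to know what curve to expect. The estimate below is a rough analytic sketch, not a benchmark: it counts only the two n-by-n matrix products in one single-head attention layer and ignores projections, softmax, and the feedforward network. The function name and the sizes are illustrative.

```python
def attention_flops(n, d):
    """Rough FLOP count (2 per multiply-add) for one single-head attention layer.

    Counts only Q K^T and weights @ V; everything else is ignored.
    """
    scores = 2 * n * n * d   # Q K^T: n*n dot products of length d
    mix = 2 * n * n * d      # weights @ V: n outputs, each a sum over n values
    return scores + mix

for n in (128, 256, 512, 1024):
    print(n, attention_flops(n, d=64))
```

Doubling n multiplies the estimate by four. Measured throughput on real hardware may look closer to linear at short lengths, where the terms ignored here dominate.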

Connections

This is the architecture for everything that follows: the LLM (Week 7) is this scaled up and pretrained; BERT (Week 7) is the encoder variant; post-training (Week 8) fine-tunes it; RAG (Week 9) uses it as the generator. The QKV attention is Week 5’s idea; the math is Course 5’s linear algebra; the trainability tricks tie to Course 1. Implementing it by hand is what makes all later LLM work legible.