Week 5 — Loss Functions, Backprop, and a Tiny Learned Predictor

Overview

This week you train an actual model. Having defined the multimodal prediction representation and its metrics in Week 4, you now build a small neural predictor, derive and implement its training, and — most importantly — learn to debug training when it goes wrong. The emphasis is on understanding backpropagation as the chain rule applied to a computation graph, and on the practical craft of getting a small model to learn: loss design, initialization, learning rate, and the diagnostic loop. Course 5 supplied convex optimization, gradient methods, and the analysis behind them; here we apply gradient descent to a non-convex model and confront the realities theory abstracts away.

The deliverable is a tiny learned trajectory predictor that beats the Week 4 constant-velocity baseline on minADE/miss-rate. Keeping it small is deliberate: a model you can fully instrument teaches more than a large one you treat as a black box.

Readings

CS231n: neural networks, backpropagation, and training/debugging. Extract: the computation-graph view of backprop and the practical training checklist.
GBC (Deep Learning): deep feedforward networks and optimization for training. Extract: initialization, the role of depth/width, and why optimization for deep nets differs from convex optimization.
(Convex sets/functions, gradient descent, Taylor approximation, and the optimization analysis: assumed from Course 5.)

Key Concepts

Backprop as reverse-mode autodiff

A network is a composition of differentiable functions; the loss is a scalar at the end. Backprop computes \(\partial L/\partial \theta\) in a single reverse pass that applies the chain rule node by node and reuses each intermediate gradient, so the full gradient costs a small constant multiple of one forward pass, regardless of the number of parameters. Understanding it as reverse-mode automatic differentiation on a graph (not a special algorithm) demystifies every framework. You will implement it by hand for a small MLP to prove you can.
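The reverse pass described above can be sketched for a 2-layer ReLU MLP in NumPy. This is a minimal illustration, not the course's reference solution: the layer sizes, the random data, and the helper names (`forward`, `backward`, `numerical_grads`) are all assumptions made for the example, and the check uses central finite differences rather than a framework's autograd.

```python
import numpy as np

def forward(params, x, y):
    """Forward pass: linear -> ReLU -> linear, then 0.5 * mean squared error."""
    W1, b1, W2, b2 = params
    z1 = x @ W1 + b1              # hidden pre-activation, shape (N, H)
    h = np.maximum(z1, 0.0)       # ReLU
    y_hat = h @ W2 + b2           # prediction, shape (N, D_out)
    loss = 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))
    return loss, (z1, h, y_hat)

def backward(params, x, y, cache):
    """Reverse pass: apply the chain rule from the loss back to each parameter."""
    W1, b1, W2, b2 = params
    z1, h, y_hat = cache
    d_yhat = (y_hat - y) / x.shape[0]   # dL/dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    dh = d_yhat @ W2.T                  # reuse d_yhat for the layer below
    dz1 = dh * (z1 > 0)                 # ReLU passes gradient only where z1 > 0
    dW1 = x.T @ dz1
    db1 = dz1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def numerical_grads(params, x, y, eps=1e-6):
    """Central finite differences, one parameter entry at a time (slow, for checking only)."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = forward(params, x, y)
            p[idx] = old - eps
            loss_minus, _ = forward(params, x, y)
            p[idx] = old
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

# Hypothetical tiny problem: 5 samples, 4 inputs, 8 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
y = rng.normal(size=(5, 2))
params = [rng.normal(size=(4, 8)) * 0.5, rng.normal(size=8) * 0.1,
          rng.normal(size=(8, 2)) * 0.5, rng.normal(size=2) * 0.1]

loss, cache = forward(params, x, y)
analytic = backward(params, x, y, cache)
numeric = numerical_grads(params, x, y)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print("max abs difference between analytic and numerical gradients:", max_err)
```

Note that the reverse pass touches each layer once, while the finite-difference check needs two forward passes per parameter entry. That gap is the practical reason reverse mode is used for training.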

Loss design for the prediction task

The loss must match the Week 4 metrics. A common choice for \(K\)-mode prediction combines a regression term applied only to the mode closest to the ground truth (so you don’t average modes) with a classification (cross-entropy) term on the mode probabilities; this is often called a “winner-take-all” or mixture loss. The choice of L1/Huber vs L2 changes sensitivity to outliers: L2 penalizes errors quadratically, so rare large errors dominate, and its minimizer is the mean of the targets, which is the very failure Week 4 warned about.
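A toy example may make the mode-averaging point concrete. The sketch below assumes hypothetical 1-D targets (half the agents end at -1, half at +1) and compares the best single L2 prediction with a two-mode min-over-modes loss. The function names are illustrative only.

```python
import numpy as np

# Hypothetical 1-D endpoints: two agents turn left (-1), two turn right (+1).
targets = np.array([-1.0, -1.0, 1.0, 1.0])

def l2_loss(pred, y):
    """Mean squared error of a single scalar prediction against all targets."""
    return float(np.mean((pred - y) ** 2))

def min_over_modes_loss(modes, y):
    """Mean squared error to the closest of K modes, chosen per target."""
    err = (y[:, None] - modes[None, :]) ** 2   # shape (N, K)
    return float(np.mean(err.min(axis=1)))

# Scan candidate single predictions; the L2-optimal one lands at the mean.
candidates = np.linspace(-1.5, 1.5, 301)
best_single = float(candidates[np.argmin([l2_loss(c, targets) for c in candidates])])

single_loss = l2_loss(best_single, targets)
two_mode_loss = min_over_modes_loss(np.array([-1.0, 1.0]), targets)
print("best single L2 prediction:", best_single, "loss:", single_loss)
print("two-mode min-over-modes loss:", two_mode_loss)
```

The single L2 prediction sits between the two outcomes, where no agent actually goes, while the two-mode predictor can place one mode on each outcome.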

The training loop and its failure modes

The main knobs are initialization (avoid vanishing/exploding activations), learning rate (usually the single most important hyperparameter), batch size, and normalization. Diagnostic discipline means overfitting a single batch first (strong evidence that the model can learn and the gradient is correct), then watching the loss curve and checking gradient norms. Most “the model won’t learn” bugs are a wrong loss, a data/label mismatch, or a learning rate off by an order of magnitude.
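The single-batch overfit check can be sketched as follows. Everything here is an assumption for illustration: a random fixed batch stands in for real encoded scenes, the model is a small tanh MLP with hand-written gradients, and the learning rate and step count are arbitrary choices rather than recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))   # one fixed batch, reused at every step
y = rng.normal(size=(8, 2))

hidden = 32
W1 = rng.normal(size=(4, hidden)) / np.sqrt(4)        # scale by 1/sqrt(fan_in)
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, 2)) / np.sqrt(hidden)
b2 = np.zeros(2)

lr = 0.05
losses, grad_norms = [], []
for step in range(3000):
    # Forward pass on the same batch every step.
    h = np.tanh(x @ W1 + b1)
    err = h @ W2 + b2 - y
    losses.append(float(0.5 * np.mean(np.sum(err ** 2, axis=1))))

    # Backward pass (chain rule through the linear and tanh layers).
    d_out = err / x.shape[0]
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    dz = (d_out @ W2.T) * (1.0 - h ** 2)
    dW1, db1 = x.T @ dz, dz.sum(axis=0)

    # Log the global gradient norm: NaN/inf or a collapse to zero are red flags.
    grads = [dW1, db1, dW2, db2]
    grad_norms.append(float(np.sqrt(sum(np.sum(g ** 2) for g in grads))))

    # Plain gradient descent update.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print("initial loss:", losses[0], "final loss:", losses[-1])
print("final gradient norm:", grad_norms[-1])
```

If the loss on a single memorizable batch does not fall steadily, the problem is more likely in the loss, the gradients, or the data pipeline than in model capacity.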

Theory Exercises

Derive backprop for a 2-layer MLP with ReLU and an L2 loss; write the gradient for each parameter explicitly.
Show why an L2 loss on multimodal targets pulls the prediction toward the mean, and how a min-over-modes loss avoids it.
Derive the gradient of the softmax + cross-entropy mode-classification term.
Explain why overfitting a single batch is the first debugging step and what failing it implies.
Relate learning-rate choice to the local curvature (Hessian/Lipschitz constant) from Course 5’s optimization analysis.
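As a numerical companion to the last exercise, the sketch below runs gradient descent on the 1-D quadratic \(f(w) = \tfrac{1}{2} L w^2\), whose curvature is \(L\). The update is \(w \leftarrow (1 - \eta L)\,w\), so the iterates shrink only when \(\eta < 2/L\). The function name and the specific values are chosen for illustration.

```python
def gradient_descent(lr, curvature=10.0, w0=1.0, steps=50):
    """Gradient descent on f(w) = 0.5 * curvature * w**2, starting from w0."""
    w = w0
    for _ in range(steps):
        grad = curvature * w      # f'(w)
        w = w - lr * grad         # equivalent to w *= (1 - lr * curvature)
    return w

# With curvature L = 10, the stability threshold is lr = 2 / L = 0.2.
w_stable = gradient_descent(0.05)    # factor  0.5 per step: fast convergence
w_edge = gradient_descent(0.19)      # factor -0.9 per step: oscillates, converges slowly
w_diverged = gradient_descent(0.21)  # factor -1.1 per step: oscillates and grows

for name, w in [("lr=0.05", w_stable), ("lr=0.19", w_edge), ("lr=0.21", w_diverged)]:
    print(name, "-> |w| after 50 steps:", abs(w))
```

A neural network's loss is not a quadratic, but locally the largest Hessian eigenvalue plays the role of \(L\). This is one way to see why a learning rate that is an order of magnitude too large can blow up training.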

Implementation

Start by implementing backprop by hand for a tiny version of the model and checking it against autograd. Then implement a small MLP predictor in PyTorch that outputs \(K\) modes and their probabilities from the encoded scene (Week 2 representation), along with the mixture loss matching the Week 4 metrics. Train on the synthetic scenarios; log loss curves and metrics.
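One possible shape for the predictor and its loss is sketched below. The input dimension, hidden width, horizon, class name `TinyPredictor`, and function name `wta_loss` are all hypothetical, since the Week 2 encoding and Week 4 metric definitions are not reproduced here. The regression term uses a Huber-style loss (`smooth_l1_loss`) as one of the options mentioned above, and the closest mode is picked by that same per-mode error, which is an assumption rather than a fixed convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPredictor(nn.Module):
    """MLP mapping an encoded scene to K candidate trajectories plus mode logits."""

    def __init__(self, in_dim=16, hidden=64, num_modes=3, horizon=10):
        super().__init__()
        self.num_modes, self.horizon = num_modes, horizon
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.traj_head = nn.Linear(hidden, num_modes * horizon * 2)  # (x, y) per step
        self.logit_head = nn.Linear(hidden, num_modes)

    def forward(self, scene):
        feat = self.backbone(scene)
        modes = self.traj_head(feat).view(-1, self.num_modes, self.horizon, 2)
        return modes, self.logit_head(feat)

def wta_loss(modes, logits, target):
    """Winner-take-all: regress only the closest mode, classify which mode won."""
    tgt = target.unsqueeze(1).expand_as(modes)                   # (B, K, T, 2)
    per_mode = F.smooth_l1_loss(modes, tgt, reduction="none").sum(-1).mean(-1)  # (B, K)
    best = per_mode.argmin(dim=1)                                # winning mode index, (B,)
    reg = per_mode.gather(1, best.unsqueeze(1)).squeeze(1)       # gradient reaches winner only
    cls = F.cross_entropy(logits, best, reduction="none")        # push probability to winner
    return (reg + cls).mean()

# One training step on random stand-in data (not real scenarios).
torch.manual_seed(0)
model = TinyPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scene = torch.randn(4, 16)
target = torch.randn(4, 10, 2)

modes, logits = model(scene)
loss = wta_loss(modes, logits, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("modes:", tuple(modes.shape), "logits:", tuple(logits.shape), "loss:", loss.item())
```

Because `argmin` is not differentiable, the winning index is treated as a fixed label for that step: the regression gradient flows only into the selected mode, and the cross-entropy term trains the logits to predict which mode that is.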

Benchmark

Track training/val minADE, minFDE, miss-rate, and NLL across epochs against the Week 4 baselines. Run an ablation: L2 vs min-over-modes loss, with/without the classification term, two learning rates. Record wall-clock training time on Mac (MPS) vs Jetson (CUDA).
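The tracked metrics could be computed along the following lines. The definitions used here are common conventions and are assumptions with respect to Week 4: minADE and minFDE take the best mode per sample, and a miss is a sample whose minFDE exceeds a threshold. The 2.0 threshold and the function name are placeholders.

```python
import numpy as np

def prediction_metrics(modes, target, miss_threshold=2.0):
    """modes: (N, K, T, 2) predicted trajectories; target: (N, T, 2) ground truth."""
    disp = np.linalg.norm(modes - target[:, None], axis=-1)  # per-step error, (N, K, T)
    min_ade = disp.mean(axis=-1).min(axis=1)   # best average displacement per sample
    min_fde = disp[..., -1].min(axis=1)        # best final displacement per sample
    return {
        "minADE": float(min_ade.mean()),
        "minFDE": float(min_fde.mean()),
        "miss_rate": float((min_fde > miss_threshold).mean()),
    }

# Hand-built check: 2 samples, 2 modes, 2 timesteps.
target = np.array([[[1.0, 0.0], [2.0, 0.0]],
                   [[1.0, 0.0], [2.0, 0.0]]])
offset = np.array([0.0, 1.0])
modes = np.stack([
    np.stack([target[0], target[0] + 3.0 * offset]),        # sample 0: one mode is exact
    np.stack([target[1] + 3.0 * offset, target[1] + 4.0 * offset]),  # sample 1: both far off
])

metrics = prediction_metrics(modes, target)
print(metrics)
```

In the hand-built case, sample 0 has an exact mode and sample 1 has a best mode that is 3 units off at every step, so the values can be verified by hand before trusting the function on real data.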

Expected outcomes: the learned predictor beats constant-velocity on miss-rate at decision points; the L2 ablation visibly mode-averages; the single-batch overfit reaches near-zero loss, which is consistent with correct gradients.

Connections

The hand-derived backprop and the trained model become Week 7’s profiling and kernel target (a real ML workload to optimize). The predictor’s outputs feed Week 6’s planner. Mixed precision (Week 3) applies to its training. Course 5’s optimization theory is the scaffolding; this week is where non-convexity and the practical training loop live.