Week 8 — Post-Training: Instruction Tuning, RLHF, DPO, LoRA, Alignment
Overview
A pretrained LLM (Week 7) predicts plausible text, but it is not yet a helpful, harmless assistant — it will happily continue a prompt rather than answer it. Post-training is what closes that gap: instruction tuning teaches the model to follow instructions, and preference optimization (RLHF, DPO) aligns its outputs with human preferences. This week covers the post-training pipeline and the parameter-efficient methods (LoRA) that make fine-tuning feasible on modest hardware. You will instruction-tune a small model and implement a preference-optimization method.
Course 5’s optimization is the backbone: RLHF is policy-gradient reinforcement learning, and DPO reformulates the same objective as a classification-style loss. Course 1’s training and efficiency concerns apply directly, since LoRA is a memory and compute win. The new content is the objectives that turn a base model into an aligned assistant.
Readings
- J&M Ch. 9: post-training — instruction tuning, alignment (RLHF, DPO), and test-time compute. Extract: the three-stage pipeline (SFT → reward/preference → policy optimization).
- CS224N: instruction tuning, RLHF, DPO, LoRA. Extract: the DPO objective and LoRA’s low-rank decomposition.
- 6.S191: fine-tuning lab. (Policy gradients and preference objectives as optimization: Boyd review from Course 5.)
Key Concepts
Instruction tuning (SFT)
Supervised fine-tuning on (instruction, response) pairs teaches the base model the format of being an assistant: a prompt is to be answered, not continued. It is ordinary next-token training (Week 1 objective) on curated data. SFT alone produces a usable instruction follower; preference optimization then refines which of the plausible responses the model favors.
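A minimal sketch of the SFT loss, assuming the common practice of scoring only the response tokens (these notes do not say whether instruction tokens are masked). The function name `sft_loss` and the toy probabilities are illustrative and not taken from any library.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | tokens before t) at each position.
    is_response:    True where the token belongs to the response,
                    False where it belongs to the instruction.
    """
    if len(token_logprobs) != len(is_response):
        raise ValueError("token_logprobs and is_response must have the same length")
    scored = [lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not scored:
        raise ValueError("no response tokens to score")
    return -sum(scored) / len(scored)

# Toy sequence: 3 instruction tokens followed by 2 response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.7), math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = sft_loss(logprobs, mask)  # averages only the last two positions
```

Setting every mask entry to True recovers the plain Week 1 next-token objective over the whole sequence.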
RLHF
Reinforcement Learning from Human Feedback: (1) collect human preference comparisons between model outputs, (2) train a reward model to score outputs so that preferred ones score higher, (3) optimize the policy (the LLM) to maximize reward, typically with PPO, under a KL penalty that keeps it near the SFT model so it does not drift into degenerate text that merely exploits the reward model. This is policy-gradient RL (Course 5 optimization) applied to text generation; the KL term is a Course 5 information-theory regularizer.
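For reference, the objectives behind steps (2) and (3) are usually written as follows. The notation is an assumption of this sketch rather than something fixed by the readings: \(r_\phi\) is the reward model, \(\pi_\theta\) the policy, \(\pi_{\text{ref}}\) the frozen SFT model, \(y_w\) and \(y_l\) the preferred and dispreferred responses to prompt \(x\), \(\sigma\) the logistic function, and \(\beta > 0\) the KL coefficient.

\[
\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
\]

\[
\max_{\theta}\; \mathbb{E}_{x}\Big[\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)\Big]
\]

The first is a Bradley-Terry style pairwise loss. The second rewards high-scoring outputs while charging the policy for moving away from the SFT model; with \(\beta = 0\) nothing stops the policy from exploiting errors in \(r_\phi\).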
DPO
Direct Preference Optimization skips the explicit reward model and RL loop. It shows that the KL-regularized preference objective can be rewritten as a classification-style loss directly on the policy, comparing the log-probabilities of preferred and dispreferred responses relative to a frozen reference model. It is simpler to implement and typically more stable to train than PPO-based RLHF, and is now a common choice. Deriving DPO from the RLHF objective is the key theoretical exercise.
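A minimal sketch of the per-example DPO loss from Rafailov et al., \(-\log \sigma\big(\beta[(\log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x)) - (\log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x))]\big)\). It assumes the sequence log-probabilities have already been summed over response tokens; the function name, argument names, and the default \(\beta = 0.1\) are illustrative choices, not values given in these notes.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    *_logp_w: log-probability of the preferred response under policy / reference.
    *_logp_l: log-probability of the dispreferred response under policy / reference.
    """
    # How far the policy has moved from the reference on each response.
    shift_w = policy_logp_w - ref_logp_w
    shift_l = policy_logp_l - ref_logp_l
    z = beta * (shift_w - shift_l)
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated without overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the loss starts at log(2).
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the preferred response and lowered the dispreferred one.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
# Policy moved the wrong way.
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)
```

Only the policy log-probabilities carry gradients in training; the reference terms are constants computed once from the frozen SFT model.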
LoRA and parameter-efficient fine-tuning
Full fine-tuning updates all weights, which is expensive in memory and compute. LoRA freezes the base weights and learns a low-rank update \(\Delta W = BA\) for each adapted weight matrix, where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll \min(d, k)\). Only \(A\) and \(B\) are trained, a tiny fraction of the parameters, which cuts gradient and optimizer-state memory; adapters can be merged into the base weights or swapped at inference. This is a direct application of low-rank approximation (Course 5 linear algebra) and a Course 1-style efficiency win that makes fine-tuning possible on a single GPU.
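The parameter savings can be checked with simple arithmetic. A minimal sketch, assuming a single weight matrix of shape d_out × d_in adapted with \(B\) of shape d_out × r and \(A\) of shape r × d_in; the 4096 × 4096 size and the function name `lora_param_counts` are illustrative and not taken from any specific model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix W of shape (d_out, d_in)."""
    full = d_out * d_in            # full fine-tuning updates every entry of W
    lora = rank * (d_out + d_in)   # B is (d_out, rank), A is (rank, d_in)
    return full, lora

# Sweep the rank for a hypothetical 4096 x 4096 projection matrix.
for r in (1, 4, 8, 16, 64):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"rank={r:>2}  full={full:,}  lora={lora:,}  fraction={lora / full:.4%}")
```

At rank 8 this trains 65,536 parameters against 16,777,216 for the full matrix, about 0.39%. The frozen base weights and the activations still occupy memory; the saving comes mainly from not storing gradients and optimizer state for the frozen weights.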
Alignment and safety
Alignment aims for helpful, harmless, honest behavior; evaluation includes refusal of harmful requests, calibration, and red-teaming. Acknowledge the limits — alignment is partial and gameable — as part of responsible engineering.
Theory Exercises
- Explain the SFT → reward-model → policy-optimization pipeline and what each stage contributes.
- Write the RLHF objective with the KL-to-reference penalty; explain why the penalty is needed (Course 5 KL).
- Derive the DPO loss from the RLHF objective; explain why it removes the explicit reward model and RL loop.
- Derive LoRA’s parameter count for rank \(r\) vs full fine-tuning; explain the memory savings (Course 5 low-rank).
- Describe how to evaluate alignment (helpfulness, harmlessness, calibration) and a way each metric can be gamed.
Implementation
In post_training/: instruction-tune a small base model (SFT) on an instruction dataset using LoRA. Implement DPO on a preference dataset and compare to the SFT-only model. (Optionally sketch the RLHF reward-model + PPO path.) Use the Course 1 mixed-precision/efficiency tooling to fit training on available hardware.
Experiments
- Instruction-following quality: base vs SFT vs SFT+DPO on held-out instructions (win-rate via a judge or human eval).
- LoRA rank sweep: quality vs trainable-parameter count and memory.
- Alignment probes: helpfulness and harmful-request refusal before/after.
Expected baselines: SFT substantially improves instruction-following over the base model; DPO further improves win-rate and trains more stably than a PPO-based RLHF attempt; LoRA at moderate rank matches or comes close to full fine-tuning quality at a fraction of the trainable parameters and memory; alignment probes improve but still reveal residual failures.
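The win-rate comparisons among the base, SFT, and SFT+DPO models can be scored with a few lines. This is a sketch under stated assumptions: ties count as half a win (one common convention, not one prescribed here), each held-out instruction yields one independent judgment, and the name `win_rate` and the example outcomes are hypothetical.

```python
import math

def win_rate(outcomes):
    """Win rate of model A over model B, with a standard error.

    outcomes: list of "win", "tie", or "loss", one per held-out instruction,
              from A's point of view.
    """
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}  # assumption: a tie is half a win
    values = [score[o] for o in outcomes]
    n = len(values)
    if n == 0:
        raise ValueError("need at least one judged comparison")
    rate = sum(values) / n
    # Sample variance of the per-instruction scores (0 when only one comparison).
    var = sum((v - rate) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return rate, math.sqrt(var / n)

# Hypothetical judgments for SFT+DPO vs SFT on 10 instructions.
outcomes = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
rate, stderr = win_rate(outcomes)
```

With only a handful of instructions the standard error is large, so a gap between two models is only meaningful once it clearly exceeds that error. When an LLM acts as judge, it is also worth repeating each comparison with the two responses in swapped order, since the judge may favor one position.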
Connections
The post-trained model is the assistant that powers the RAG system (Weeks 9–10) — a RAG assistant needs an instruction-following, aligned generator. LoRA and mixed precision tie to Course 1’s efficiency engineering; the optimization objectives are Course 5’s. This is the step that turns Week 7’s raw capability into a usable product.
Further Reading
- J&M Ch. 9.
- Ouyang et al. (InstructGPT/RLHF); Rafailov et al. (DPO); Hu et al. (LoRA).
- CS224N post-training lectures.