Week 9 — Convex Problems, Duality, and Optimization Algorithms
Overview
With convexity in hand, this week assembles the optimization toolkit. First, the standard problem forms (LP, QP, and the general convex program) and why putting a problem in one of them matters. Then Lagrangian duality and the KKT conditions — the certificate of optimality that turns “I found a point” into “I proved it is optimal.” Then the modeling layers of approximation/fitting and statistical estimation, which is where most ML objectives come from. Finally the core algorithms: gradient descent, Newton’s method, and their equality-constrained versions.
Readings
- Boyd Ch 4 — Convex optimization problems. Standard form, LP, QP, geometric programming, generalized-inequality constraints.
- Boyd Ch 5 — Duality. Lagrangian and dual function, weak/strong duality, geometric interpretation, KKT optimality conditions, sensitivity.
- Boyd Ch 6 — Approximation and fitting. Norm/least-norm problems, regularization, robust approximation.
- Boyd Ch 7 — Statistical estimation. Maximum-likelihood and MAP as convex problems, optimal detector design, experiment design.
- Boyd Ch 9–10 — Unconstrained and equality-constrained minimization. Gradient descent, Newton’s method, self-concordance, equality-constrained Newton.
Key Concepts
Standard problem forms
A convex program minimizes a convex \(f_0\) subject to convex inequality constraints \(f_i(x)\le 0\) and affine equality constraints \(Ax=b\). Special cases: linear programs (linear objective and constraints), quadratic programs (convex quadratic objective), second-order cone and semidefinite programs. Recognizing which form your problem fits determines which solver and which theory apply — and least squares (Week 2) is the simplest QP.
Lagrangian duality
Form the Lagrangian \(L(x,\lambda,\nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \nu^\top(Ax-b)\) and the dual function \(g(\lambda,\nu) = \inf_x L\). The dual is always concave and gives a lower bound (weak duality): \(g(\lambda,\nu) \le p^\star\). Under a constraint qualification (e.g. Slater’s condition) strong duality holds, \(d^\star = p^\star\), and the duality gap closes. The dual variables are exactly the sensitivities — the shadow prices of the constraints.
KKT conditions
For a convex problem with strong duality, \(x^\star,\lambda^\star,\nu^\star\) are optimal iff they satisfy the KKT conditions: primal feasibility, dual feasibility \(\lambda^\star \ge 0\), complementary slackness \(\lambda_i^\star f_i(x^\star)=0\), and stationarity
\[\nabla f_0(x^\star) + \sum_i \lambda_i^\star \nabla f_i(x^\star) + A^\top \nu^\star = 0.\]
These generalize “set the gradient to zero” to constrained problems and are the optimality certificate used everywhere downstream.
Approximation, fitting, estimation
Norm approximation \(\min \|Ax-b\|\), regularized forms \(\min \|Ax-b\|_2^2 + \lambda\|x\|\) (ridge/Tikhonov for \(\ell_2\), lasso/sparsity for \(\ell_1\)), and robust formulations all sit here. Maximum likelihood and MAP estimation become convex problems for log-concave models — so the inference of Week 7 is solved by the algorithms below, and ML regularization is a Bayesian prior in disguise.
Algorithms
For unconstrained smooth convex \(f\):
- Gradient descent \(x_{k+1} = x_k - t_k \nabla f(x_k)\). Convergence rate depends on the condition number \(\kappa\) of the Hessian (Week 3) — ill-conditioning means zig-zagging and slow progress.
- Newton’s method \(x_{k+1} = x_k - t_k\,[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)\) — affine-invariant, quadratically convergent near the optimum, analyzed cleanly via self-concordance. The Newton step solves a linear system (Weeks 3–4).
For equality-constrained problems, the KKT system is solved at each step (equality-constrained Newton, infeasible-start Newton); inequality constraints are handled by interior-point/barrier methods (read lightly). Step sizes come from backtracking line search.
Connections
- Forward: The maximum-entropy problems of Week 10 are convex programs solved by exactly this duality machinery.
- Across courses: This is the engine under training (Courses 1, 3) — losses, regularizers, and the optimizers that minimize them — and under any model-predictive or trajectory optimization (Courses 1, 2).
Further Reading
- Boyd & Vandenberghe, Convex Optimization, Chapters 4–7, 9–10.
- Nocedal & Wright, Numerical Optimization, for the algorithmic details and line-search theory.