Week 1 — GPU Compute First Principles: CUDA and Metal

Overview

Every ML system that matters runs on a parallel accelerator, and the difference between a fast pipeline and a slow one is almost always whether the engineer understands the execution model underneath the framework. This week establishes that model from first principles on both platforms you own: CUDA on the Jetson Orin Nano and Metal on the Mac. The goal is not to memorize APIs but to internalize the mental model — thousands of threads grouped into blocks/threadgroups, a memory hierarchy where bandwidth is the scarce resource, and a launch cost that punishes tiny kernels. By the end you should be able to write vector_add, saxpy, dot_product, and a reduction, and explain every performance number you measure.

This is the platform the rest of the course’s systems work stands on. Week 3’s mixed-precision tradeoffs, Week 7’s custom kernels, and the inference budget of the Jetson capstone all assume you can reason about occupancy, coalescing, and memory traffic. The linear-algebra and floating-point foundations are assumed from Course 5 — here we apply them to silicon.

Readings

CUDA Book (Motta): threads, blocks, grids, global vs shared memory, synchronization, and reductions. Extract: the thread/block/grid hierarchy and why shared memory exists.
MBT (Metal by Tutorials): the compute pipeline, buffers, threadgroups, and dispatching. Extract: the Metal analogues of CUDA concepts (threadgroup ≈ block, threadgroup memory ≈ shared memory).
CA (Computer Architecture): memory hierarchy and parallel architectures. Extract: why DRAM bandwidth and latency dominate, and what a cache line buys you.
Optional CS231n skim: CNN workloads as matrix/conv workloads — motivation for why these primitives matter.

Key Concepts

The execution hierarchy

A kernel launches a grid of blocks (CUDA) or threadgroups (Metal), each containing many threads. Threads within a block execute in lockstep groups (warps of 32 on NVIDIA, SIMD-groups of 32 on Apple) and can share fast on-chip memory and synchronize cheaply; threads in different blocks cannot. This hierarchy is the entire reason GPU code is structured the way it is: data that must be shared and synchronized has to live within a block.

The memory hierarchy and bandwidth wall

Registers → shared/threadgroup memory → L2 → global (DRAM), in increasing size and decreasing speed. Most ML kernels are memory-bound: limited by how fast bytes move, not by arithmetic. The key metric is arithmetic intensity (FLOPs per byte). The roofline model says performance is min(peak_FLOPs, intensity × bandwidth) — below a threshold intensity you are bandwidth-limited no matter how fast the ALUs are. vector_add (one add per 12 bytes) is hopelessly bandwidth-bound; tiled matmul can be compute-bound.

Coalescing and occupancy

Coalesced access — consecutive threads reading consecutive addresses — lets the hardware service a warp with a few wide memory transactions; strided or random access wastes most of each transaction. Occupancy is the ratio of active warps to the maximum; enough warps in flight lets the scheduler hide memory latency by switching to ready warps. Both are levers you will measure directly.

Reductions and the launch-overhead floor

A reduction (sum, max) is the canonical non-trivial parallel pattern: tree-reduce within a block using shared memory, then combine block partials. It exposes synchronization, shared memory, and the fact that a kernel launch has a fixed cost (microseconds) that makes tiny launches inefficient — the reason small-batch inference often loses to CPU.

Theory Exercises

Compute the arithmetic intensity (FLOPs/byte) of vector_add, saxpy, dot_product, and naive matmul. Predict which are memory-bound.
Using a roofline with the Jetson Orin Nano’s peak bandwidth and FLOPs, predict the achievable throughput of each kernel.
Explain why a warp of 32 threads reading a column of a row-major matrix is uncoalesced, and how to fix it.
Derive the step count and shared-memory traffic of a tree reduction over n elements in a block of b threads.
Estimate the break-even problem size where GPU beats CPU given a fixed launch overhead.

Implementation

Write the four kernels in three backends: CPU C++ (baseline), CUDA (Jetson), and Metal (Mac). Keep a single host harness that fills inputs, runs each backend, and verifies results against the CPU reference within a floating-point tolerance. For the reduction, implement the shared-memory tree version and a naive global-atomic version to contrast them.

Benchmark

Sweep five problem sizes spanning L2-resident to DRAM-bound. Record wall time, derived GB/s, and GFLOP/s for each kernel × backend, plus CPU↔︎GPU transfer time separately (it often dominates for small sizes). Plot achieved GB/s against the roofline. Note where launch overhead, transfer cost, or bandwidth is the binding constraint.

Expected baselines: vector_add/saxpy saturate bandwidth well below peak FLOPs; the shared-memory reduction beats the atomic version by a wide margin; small sizes are dominated by launch + transfer, confirming the CPU break-even prediction.

Connections

This execution model is the substrate for Week 3 (mixed precision changes bytes-per-element, hence bandwidth) and Week 7 (writing a fused kernel to beat the framework). The roofline and break-even reasoning recur whenever you decide what runs on the Jetson in the Week 10 capstone. Course 5 supplied the linear algebra and floating-point model; this week is where those abstractions meet measured bandwidth.