Week 10 — Capstone: Performance Engineering, Final Demo, and Packaging

Overview

The capstone makes the simulation scale and turns it into a polished portfolio demo. It consolidates the performance-engineering and final-integration weeks: you first attack the bottlenecks the earlier weeks exposed (draw-call count from Week 3, sensor readback from Week 7, agent/collision cost from Weeks 5–6) with GPU instancing, frustum culling, data-oriented agent storage, and multithreaded command recording, then integrate everything into a single demo world with a benchmark report and portfolio packaging.

The themes are measure-driven optimization and systems integration. A reviewer should be able to run one demo that shows the road world, traffic, ego driving under planning, camera sensors with ground truth, perception baselines, and a performance overlay — backed by a benchmark report that quantifies every system. This is the headline artifact of the course.

Readings

Vulkan: instancing, descriptor management, command buffers, and synchronization (multithreaded recording). Extract: how to draw many objects cheaply and record commands in parallel.
CA: parallelism, memory hierarchy, and cache behavior. Extract: why data-oriented (SoA) agent storage is cache-friendly.
HLW: processes and threads (skim). Extract: the threading model for parallel command recording.
Comparison: CARLA/Gazebo/Isaac-style concepts; Waymo/Figure system-design points. Extract: how to frame the project against real simulators.

Key Concepts

Instancing and culling

Many identical objects (road segments, vehicles, signs) should be drawn with instanced rendering: one draw call per mesh, with per-instance data (such as the transform) in a buffer, instead of one draw per object (the Week 3 bottleneck). Frustum culling skips objects whose bounds lie entirely outside the camera's view volume before submission. Together they reduce draw calls from one per object to roughly one per mesh type and limit GPU work to what is actually visible in a populated world.
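A minimal sketch of the CPU side of culling plus instancing, assuming bounding spheres and a frustum stored as six inward-facing planes. All type and function names here (`Vec3`, `Plane`, `BoundingSphere`, `InstanceData`, `buildVisibleInstances`) are hypothetical and not taken from the course code.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical math and scene types, for illustration only.
struct Vec3 { float x, y, z; };

// Plane stored so that dot(n, p) + d >= 0 for points on the inside.
struct Plane { Vec3 n; float d; };

struct BoundingSphere { Vec3 center; float radius; };

// Per-instance data that would be copied into an instance buffer.
struct InstanceData { Vec3 position; };

using Frustum = std::array<Plane, 6>;

inline float signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// Conservative test: reject a sphere only if it lies entirely outside
// at least one plane. Spheres near a frustum corner may pass even
// though they are not visible, which is safe but not tight.
inline bool isVisible(const Frustum& f, const BoundingSphere& s) {
    for (const Plane& p : f) {
        if (signedDistance(p, s.center) < -s.radius) return false;
    }
    return true;
}

// Gather per-instance data for the visible objects of one mesh type.
std::vector<InstanceData> buildVisibleInstances(
        const Frustum& f, const std::vector<BoundingSphere>& bounds) {
    std::vector<InstanceData> visible;
    visible.reserve(bounds.size());
    for (const BoundingSphere& b : bounds) {
        if (isVisible(f, b)) visible.push_back(InstanceData{b.center});
    }
    return visible;
}
```

In a Vulkan renderer, the returned list would typically be uploaded to a per-instance vertex or storage buffer and its size passed as the `instanceCount` argument of `vkCmdDrawIndexed`, giving one draw call per mesh type.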

Data-oriented agent storage

Storing agents as an array-of-structs (AoS) interleaves the fields a hot loop touches with fields it ignores, so each cache line fetched carries unused bytes; structure-of-arrays (SoA) packs each field contiguously so the agent-update and collision loops stream through memory cache-efficiently (Course 4 territory, applied here). This is the classic data-oriented-design win and shows up clearly in the agent-count benchmark.
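The two layouts can be sketched side by side. The field names below (`routeId`, `length`, `width`, and so on) are hypothetical stand-ins for whatever the agent record holds in the course code; the point is only that the integration loop touches four of the seven fields.

```cpp
#include <cstddef>
#include <vector>

// AoS: hot fields share each cache line with fields the update loop ignores.
struct AgentAoS {
    float x, y;           // hot: read and written every update
    float vx, vy;         // hot: read every update
    int   routeId;        // cold for the integration loop
    float length, width;  // cold for the integration loop
};

// SoA: one contiguous array per field, so the loop streams only hot data.
struct AgentsSoA {
    std::vector<float> x, y, vx, vy;
    std::vector<int>   routeId;
    std::vector<float> length, width;

    explicit AgentsSoA(std::size_t n)
        : x(n), y(n), vx(n), vy(n), routeId(n), length(n), width(n) {}

    std::size_t size() const { return x.size(); }
};

// Position integration over the AoS layout.
void integrateAoS(std::vector<AgentAoS>& agents, float dt) {
    for (AgentAoS& a : agents) {
        a.x += a.vx * dt;
        a.y += a.vy * dt;
    }
}

// The same integration over the SoA layout; the cold arrays are never touched.
void integrateSoA(AgentsSoA& a, float dt) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        a.x[i] += a.vx[i] * dt;
        a.y[i] += a.vy[i] * dt;
    }
}
```

Both functions compute the same result; only the memory traffic differs. The simple indexed loop in `integrateSoA` is also a form that compilers can often auto-vectorize, though whether that happens depends on the compiler and flags.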

Multithreaded command recording

Vulkan command buffers can be recorded on multiple threads in parallel and submitted together, spreading the CPU cost of recording across cores. Command pools are externally synchronized, so each recording thread needs its own pool. Combined with pipelined sensor readback (Week 7), this hides latency that a single-threaded loop exposes. Mind the synchronization model from Week 2.
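The partitioning pattern behind parallel recording can be sketched without any Vulkan calls. In the sketch below, `DrawCommand` and `CommandList` are hypothetical stand-ins: in a real renderer each worker would record into its own secondary `VkCommandBuffer` allocated from its own `VkCommandPool`, and the primary command buffer would collect them with `vkCmdExecuteCommands`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for one recorded draw (hypothetical; not a Vulkan type).
struct DrawCommand { std::uint32_t objectIndex; };
using CommandList = std::vector<DrawCommand>;

// Split [0, objectCount) into contiguous chunks and record each chunk
// on its own thread into its own list.
std::vector<CommandList> recordInParallel(std::size_t objectCount,
                                          unsigned workerCount) {
    workerCount = std::max(1u, workerCount);
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);

    // Ceiling division so every object lands in exactly one chunk.
    const std::size_t chunk = (objectCount + workerCount - 1) / workerCount;

    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(objectCount, w * chunk);
        const std::size_t end   = std::min(objectCount, begin + chunk);
        // Each thread writes only lists[w], so no lock is needed here.
        workers.emplace_back([&lists, w, begin, end] {
            CommandList& out = lists[w];
            out.reserve(end - begin);
            for (std::size_t i = begin; i < end; ++i) {
                out.push_back(DrawCommand{static_cast<std::uint32_t>(i)});
            }
        });
    }

    // Join before anything reads the lists (the "submit together" point).
    for (std::thread& t : workers) t.join();
    return lists;
}
```

Spawning threads every frame is itself costly; a persistent worker pool is the more likely design in practice, and this sketch omits it for brevity. A reasonable starting value for `workerCount` is `std::thread::hardware_concurrency()`, which may return 0 when the count is unknown.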

Integration and packaging

Bring every system into one demo scene (the four-way intersection with traffic, ego under planning, active camera sensors, perception overlay, performance HUD). Produce a benchmark report (the metrics collected throughout the course), a README, and a recorded demo. Frame it honestly against CARLA/Gazebo: small and understandable, not a replacement.

Theory Exercises

Compute the draw-call reduction from instancing \(n\) identical objects; relate it to the Week 3 scaling curve.
Explain frustum culling and estimate the fraction of objects culled for a typical camera in a large world.
Contrast AoS vs SoA cache behavior for the agent-update loop; predict the speedup and tie it to cache-line size.
Describe a safe multithreaded command-recording scheme and the synchronization it requires (Week 2).
Build the end-to-end per-frame time budget for the full demo and identify the dominant cost.
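
As a starting point for the estimates above, the following worked sketch uses illustrative numbers that are assumptions, not course measurements: \(n = 10\,000\) objects of \(k = 5\) mesh types, a \(90^\circ\) horizontal field of view, a 64-byte agent struct of which 16 bytes are hot, and a 60 Hz frame-rate target.

Draw calls without and with instancing (one draw per mesh type):

\[
D_{\text{naive}} = n, \qquad D_{\text{inst}} = k, \qquad \frac{D_{\text{naive}}}{D_{\text{inst}}} = \frac{n}{k} = \frac{10\,000}{5} = 2000
\]

Culled fraction, assuming objects are spread uniformly around the camera and ignoring the near and far planes:

\[
f_{\text{visible}} \approx \frac{\theta}{360^\circ} = \frac{90^\circ}{360^\circ} = 0.25, \qquad f_{\text{culled}} \approx 1 - f_{\text{visible}} = 0.75
\]

Cache use for a loop that reads \(H\) hot bytes out of an \(S\)-byte struct. For a memory-bound loop the SoA speedup is bounded above by the ratio of bytes fetched:

\[
\text{useful fraction}_{\text{AoS}} = \frac{H}{S} = \frac{16}{64} = 0.25, \qquad \text{speedup}_{\text{SoA}} \le \frac{S}{H} = 4
\]

Per-frame budget, where \(t_i\) are the per-system costs (simulation, command recording, rendering, sensor readback, perception):

\[
\sum_i t_i \le \frac{1000\ \text{ms}}{60} \approx 16.7\ \text{ms}
\]

Measured speedups will usually fall below these bounds, since real loops are rarely purely memory-bound and real worlds are not uniformly populated.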

Implementation

Add instanced rendering and frustum culling; convert agent storage to SoA; record command buffers on multiple threads; pipeline sensor readback (Week 7). Integrate all systems into one demo world. Build the benchmark-analysis tooling, the performance HUD, a README, and a recorded demo.

Benchmark

Before/after for each optimization: frame time vs object count (instancing/culling), agent-update time AoS vs SoA, command-record time single- vs multi-threaded, sensor readback serial vs pipelined. Final integrated frame-time at the demo’s object/agent count, with the full benchmark report assembled.
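
A minimal sketch of a CPU-side timing helper for these before/after comparisons, assuming wall-clock medians over repeated runs are an acceptable measure. The names `medianMillis` and `BeforeAfter` are hypothetical. GPU time is not captured this way and would need timestamp queries (for example `vkCmdWriteTimestamp`) instead.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run `work` `runs` times and return the median wall-clock time in
// milliseconds. The median is less sensitive to one-off stalls than the mean.
template <typename F>
double medianMillis(F&& work, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    // Upper median for even counts; 0.0 if nothing was run.
    return samples.empty() ? 0.0 : samples[samples.size() / 2];
}

// One row of a before/after table in the benchmark report.
struct BeforeAfter {
    double beforeMs;
    double afterMs;

    // Speedup factor; returns 0.0 if the "after" time is not positive.
    double speedup() const {
        return afterMs > 0.0 ? beforeMs / afterMs : 0.0;
    }
};
```

Each optimization then contributes one `BeforeAfter` row per object or agent count, for example the AoS update timed as `beforeMs` and the SoA update as `afterMs`. Running a few untimed warm-up iterations first helps keep cold-cache effects out of the samples.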

Expected baselines: instancing+culling flattens the Week 3 draw-call scaling; SoA gives a clear agent-loop speedup; multithreaded recording reduces CPU frame time on multicore; pipelined readback hides sensor latency. The integrated demo holds a real-time frame rate at the target scene complexity.

Connections

This capstone composes every prior week into one demo and quantifies it; that demo is the portfolio deliverable. The performance work draws on Course 4 (cache/data-oriented design) and Course 1 (profiling discipline). The honest framing against CARLA/Gazebo and the Waymo/Figure alignment together form the staff-level story. The optional directions for Weeks 11+ extend toward richer physics, sensors, or scenario coverage.