Week 7 — Camera Sensor Simulation: RGB, Depth, and Semantic Ground Truth

Course 2 syllabus

Overview

This is the week the engine becomes a sensor simulator, which is what makes it useful for testing autonomy. You render the world from an ego-mounted virtual camera into offscreen targets: an RGB image (what a real camera sees), a depth buffer (per-pixel depth, which the hardware typically stores nonlinearly and which must be linearized to get metric distance along the optical axis), and a semantic-segmentation buffer (per-pixel class/instance ID, perfect ground truth). This week consolidates the RGB-camera and depth/semantic weeks of the longer course, because once you can render to one offscreen target you can render to several in one pass with multiple render targets (MRT).

The synthetic ground truth is the key asset: a real dataset needs hand labeling, but here every pixel’s class and depth is known exactly, for free. Course 5’s geometrical optics and image formation underlie the camera model; here we implement it and export labeled data that Week 8’s perception consumes.

Readings

  • FCG: camera model, projection, render targets, framebuffers, and image readback. Extract: configuring an offscreen framebuffer and reading it back to CPU.
  • MIT Machine Vision: image formation and perspective projection. Extract: the pinhole model and intrinsics.
  • CS231n: detection/segmentation skim. Extract: the label formats perception expects.
  • (Geometrical optics and the pinhole/image-formation model: assumed from Course 5.)

Key Concepts

The virtual camera

Model a pinhole camera with intrinsics (focal length, principal point) and an extrinsic pose (the camera’s mount on the ego, Week 6). The projection is the standard perspective transform (Week 1 pipeline), with the inverse of the camera pose as the view matrix. Render into an offscreen framebuffer sized to the sensor resolution, at the sensor’s cadence (e.g. 30 Hz, which is every 4 ticks of a 120 Hz simulation, per Week 1).
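The projection described above can be sketched on the CPU as a reference for checking the GPU result. This is a minimal illustration, not the engine's code: the names Intrinsics and world_to_pixel are hypothetical, the extrinsic is assumed to be given as a world-to-camera rotation R and translation t, and the camera frame is assumed to be +x right, +y down, +z forward.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point x, in pixels
    cy: float  # principal point y, in pixels


def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def world_to_pixel(p_world, R, t, K):
    """Project a world point to (u, v, z_cam), or None if behind the camera.

    Stage 1 (extrinsic): p_cam = R * p_world + t   (world -> camera)
    Stage 2 (perspective divide): x/z, y/z
    Stage 3 (intrinsic): scale by focal length, shift by principal point
    """
    x, y, z = (a + b for a, b in zip(mat_vec(R, p_world), t))
    if z <= 0.0:
        return None  # point is behind the image plane
    u = K.fx * x / z + K.cx
    v = K.fy * y / z + K.cy
    return u, v, z


# Usage: identity pose, hypothetical 1280x720 intrinsics.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = Intrinsics(fx=800.0, fy=800.0, cx=640.0, cy=360.0)
pixel = world_to_pixel([1.0, 0.5, 10.0], IDENTITY, [0.0, 0.0, 0.0], K)
```

If the engine uses a different axis convention (for example a -z forward view space), only the sign handling in stages 1 and 2 changes; the three-stage structure stays the same.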

Multiple render targets

RGB, depth, and semantic buffers can be produced in one render pass with MRT: the fragment shader writes color to attachment 0 and the semantic ID to attachment 1 (an integer buffer), using a per-object semantic_id/instance_id set via push constants, while depth comes from the depth attachment. One pass yields three pixel-aligned outputs and processes the geometry once rather than three times, so it is far cheaper than three separate passes.
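To make the "one pass, coherent outputs" idea concrete, here is a CPU-side toy model of what the pass does per fragment. It is not GPU code and not how the engine is implemented; render_pass and the fragment tuple layout are hypothetical, and 0 is assumed to be the background semantic ID. The point it illustrates is that a single depth test decides all three writes, so the buffers can never disagree about which object owns a pixel.

```python
def render_pass(width, height, fragments, far=1.0):
    """Toy single-pass MRT: one depth test gates writes to every target.

    fragments: iterable of (x, y, depth, rgb, semantic_id), in any order.
    Returns (color, depth, semantic) as row-major lists of rows.
    """
    color = [[(0, 0, 0)] * width for _ in range(height)]   # "attachment 0"
    depth = [[far] * width for _ in range(height)]          # depth attachment
    semantic = [[0] * width for _ in range(height)]         # "attachment 1"

    for x, y, d, rgb, sem_id in fragments:
        if d < depth[y][x]:          # nearer fragment wins
            depth[y][x] = d
            color[y][x] = rgb
            semantic[y][x] = sem_id  # written in the same step as color
    return color, depth, semantic


# Usage: three overlapping fragments on pixel (1, 0); the nearest is ID 9.
frags = [
    (1, 0, 0.8, (255, 0, 0), 7),
    (1, 0, 0.3, (0, 255, 0), 9),
    (1, 0, 0.5, (0, 0, 255), 4),
]
color, depth, semantic = render_pass(4, 2, frags)
```

With three separate passes the same agreement would depend on each pass rasterizing identically; here it holds by construction.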

Ground-truth export

Read the buffers back to CPU and export: the RGB image (PNG), a depth map (float), and a semantic/instance map (integer) plus a JSON manifest (camera intrinsics/extrinsics, object list with classes and 2D/3D boxes). This labeled output is exactly what Week 8’s detector is evaluated against and what a real perception pipeline would consume.
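One possible shape for the JSON manifest is sketched below. The schema is an assumption made for illustration: every key name, the file-naming pattern, the box encodings, and the example object are hypothetical and should be replaced by whatever Week 8's loader actually expects.

```python
import json


def build_manifest(frame, width, height, fx, fy, cx, cy, cam_to_world, objects):
    """Assemble one frame's ground-truth record as a JSON-serializable dict."""
    return {
        "frame": frame,
        "files": {  # hypothetical naming scheme
            "rgb": f"rgb_{frame:06d}.png",
            "depth": f"depth_{frame:06d}.f32",
            "semantic": f"semantic_{frame:06d}.u32",
        },
        "camera": {
            "width": width,
            "height": height,
            "intrinsics": {"fx": fx, "fy": fy, "cx": cx, "cy": cy},
            "extrinsic_cam_to_world": cam_to_world,  # 4x4, row-major
        },
        "objects": objects,
    }


# Illustrative object record: IDs, class name, and boxes are made up.
example_objects = [{
    "instance_id": 17,
    "semantic_id": 3,
    "class": "car",
    "bbox_2d": [412, 288, 530, 361],  # xmin, ymin, xmax, ymax in pixels
    "bbox_3d": {"center": [12.0, 0.0, 1.5], "size": [4.5, 1.8, 1.5], "yaw": 0.0},
}]
identity4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

manifest = build_manifest(42, 1280, 720, 800.0, 800.0, 640.0, 360.0,
                          identity4, example_objects)
manifest_text = json.dumps(manifest, indent=2)
```

Keeping the intrinsics and extrinsic in the same record as the object list means each frame can be back-projected and evaluated without consulting any other file.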

Readback cost

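GPU→CPU readback is the expensive part: a 32-bit float depth buffer costs 4 bytes per pixel, which is 4× a single 8-bit channel and the same as RGBA8 (or 4/3 of tightly packed RGB8), and a synchronous readback stalls the pipeline. Copy each attachment into its own staging buffer and wait on a fence (Week 2 synchronization) before mapping it; pipelining hides latency but not bandwidth.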
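The byte counts are easy to estimate before measuring. The sketch below assumes three particular formats (RGBA8 color, 32-bit float depth, 32-bit unsigned semantic IDs); the format names and the helper functions are hypothetical, and the real figures depend on the attachment formats the engine actually chooses.

```python
# Assumed attachment formats and their sizes in bytes per pixel.
BYTES_PER_PIXEL = {"rgba8": 4, "depth_f32": 4, "semantic_u32": 4}


def readback_bytes(width, height, formats=BYTES_PER_PIXEL):
    """Bytes copied GPU -> CPU per frame, per buffer."""
    pixels = width * height
    return {name: pixels * bpp for name, bpp in formats.items()}


def readback_bandwidth(width, height, hz, formats=BYTES_PER_PIXEL):
    """Sustained readback bandwidth in bytes per second at a given cadence."""
    return sum(readback_bytes(width, height, formats).values()) * hz


per_buffer = readback_bytes(1280, 720)
per_frame = sum(per_buffer.values())
per_second = readback_bandwidth(1280, 720, 30)
```

Under these assumptions a 1280×720 frame is 3,686,400 bytes per buffer and 11,059,200 bytes in total, or about 332 MB/s at 30 Hz. Pipelining can overlap that transfer with rendering, but the bytes still have to cross the bus.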

Theory Exercises

  1. Write the pinhole projection from a world point to a pixel given intrinsics and the camera extrinsic; identify each transform stage (Week 1).
  2. Explain how a single MRT render pass produces RGB + depth + semantic outputs and why it beats three passes.
  3. Derive how to recover a 3D point from a depth-buffer value and the camera intrinsics (back-projection); state what else you need if the stored depth is nonlinear.
  4. Compute the per-frame readback bytes for RGB + depth + semantic at a given resolution; identify the bandwidth bottleneck.
  5. Explain why the semantic buffer is exact ground truth here but expensive/noisy to obtain from real data.

Implementation

Add a sensors/ module: an offscreen camera that renders the scene via MRT to RGB, depth, and semantic/instance attachments at a configurable resolution and cadence. Tag objects with semantic/instance IDs. Read back the buffers and export images + a JSON ground-truth manifest. Visualize all three buffers in the ImGui overlay.
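The "configurable resolution and cadence" part can be captured in a small configuration object. This is a sketch under assumptions: CameraSensorConfig, its field names, and its defaults are hypothetical, and it assumes the sensor rate divides the simulation tick rate evenly (as 30 Hz divides 120 Hz).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraSensorConfig:
    width: int = 1280
    height: int = 720
    sensor_hz: int = 30   # how often the camera captures
    sim_hz: int = 120     # fixed simulation tick rate

    @property
    def ticks_per_frame(self) -> int:
        """Number of simulation ticks between consecutive captures."""
        if self.sim_hz % self.sensor_hz != 0:
            raise ValueError("sensor_hz must divide sim_hz evenly")
        return self.sim_hz // self.sensor_hz

    def should_capture(self, tick: int) -> bool:
        """True on ticks where the sensor renders and reads back."""
        return tick % self.ticks_per_frame == 0


cfg = CameraSensorConfig()
capture_ticks = [t for t in range(10) if cfg.should_capture(t)]
```

Driving capture from the tick counter rather than wall-clock time keeps sensor frames reproducible across runs.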

Benchmark

Sensor render time and readback time per buffer; RGB vs depth vs semantic cost; total sensor cost vs render resolution. Verify back-projected depth matches known geometry, and that semantic IDs match the rendered objects.

Expected baselines: the MRT render adds modest cost over the main pass; readback dominates and scales with byte count (at 32 bits per pixel, the depth and instance buffers each cost at least as much as the 8-bit color image); back-projected depth matches ground-truth geometry to within floating-point and depth-buffer precision. At 720p with all three buffers, readback is the clear bottleneck, and it is the cost Week 10 will pipeline.
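The depth verification can be prototyped as a round trip: encode a known view-space depth the way the depth buffer would, linearize it, and back-project the pixel. The sketch assumes a Vulkan-style depth range of [0, 1] with a standard (non-reversed) perspective projection; the function names and the near/far values are hypothetical, and a reversed-Z or OpenGL-style [-1, 1] setup needs different formulas.

```python
def encode_depth(z, near, far):
    """View-space depth z -> stored depth in [0, 1] (assumed convention)."""
    return far * (z - near) / (z * (far - near))


def linearize_depth(d, near, far):
    """Inverse of encode_depth: stored depth in [0, 1] -> view-space depth."""
    return near * far / (far - d * (far - near))


def back_project(u, v, z, fx, fy, cx, cy):
    """Pixel (u, v) with view-space depth z -> camera-space point."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


# Round trip for a point known to sit at (1.0, 0.5, 10.0) in camera space,
# which projects to pixel (720, 400) with fx = fy = 800, cx = 640, cy = 360.
NEAR, FAR = 0.1, 100.0
stored = encode_depth(10.0, NEAR, FAR)
z_lin = linearize_depth(stored, NEAR, FAR)
point = back_project(720.0, 400.0, z_lin, 800.0, 800.0, 640.0, 360.0)
```

Because the stored value is compressed toward 1.0 for distant geometry, the recoverable precision drops with range, which is worth keeping in mind when choosing the tolerance for this check.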

Connections

The exported RGB/depth/semantic buffers plus the ground-truth manifest are both the input and the evaluation oracle for Week 8’s perception baselines. The offscreen-rendering and synchronization machinery extends Week 2; the camera math is Course 5’s optics applied. Week 10 pipelines the readback to keep the sensor affordable at scale.

Further Reading

  • FCG camera and framebuffer chapters.
  • MIT Machine Vision image-formation lectures (Course 5 optics context).
  • CARLA sensor documentation (conceptual comparison for camera/depth/semantic sensors).