Week 8 — Virtual Memory, mmap, Caches, Filesystems, and I/O

Overview

This week opens up two illusions the OS maintains: that each process has its own large, contiguous memory, and that files are simple byte streams. It consolidates the virtual-memory week with the filesystems/I/O week because both are about the layers between a program’s simple view and the hardware reality — page tables and the MMU for memory, inodes and the block layer for files, with the CPU cache and the page cache as the performance-critical middle. You will draw a page-table walk, watch a page fault, write a cache-stride microbenchmark that exposes the L1/L2/DRAM hierarchy, and trace a file read from open down to blocks.

This is where the layout knowledge of Week 6 meets measured performance, and where the /proc/[pid]/maps reading of Week 7 is explained. It is the most performance-relevant systems week and the direct foundation for Week 10’s optimization work and Course 2’s cache-aware design.

Readings

HLW Ch. 1 (memory/kernel) and Ch. 8 (memory/resource): the process address space and how the kernel manages memory. Extract: virtual vs physical, the page cache.
CA Ch. 10 and ARM Ch. 8 (memory): the memory hierarchy, caches, and the MMU. Extract: page-table translation, TLB, and cache organization.
HLW Ch. 3–4: devices, disks, and filesystems. Extract: the block layer, inodes, and the file-to-blocks path.

Key Concepts

Virtual memory and the page-table walk

Each process sees a private virtual address space; the MMU translates virtual to physical addresses via multi-level page tables. AArch64 with 4 KB pages and 48-bit addresses uses a 4-level walk: the address splits into four 9-bit table indices plus a 12-bit page offset, and the table base is held in TTBR0/TTBR1. Each level of the walk is a memory load, so an uncached translation costs up to four loads before the data access itself. The TLB caches recent translations so the walk isn’t repeated on every access: a TLB hit is nearly free, while a miss pays for the walk. mmap maps files or anonymous memory into the address space; the mapping is set up lazily, and the page fault on first touch is what makes the kernel allocate or load the page (demand paging; copy-on-write, from Week 7’s fork, relies on the same fault mechanism on the first write to a shared page).
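The address split described above can be checked with a few lines of bit arithmetic. This is a minimal sketch, assuming the 4 KB granule and 48-bit addresses from the text; the names split_va and join_va are illustrative helpers, not part of any real API, and the sample address is arbitrary.

```python
PAGE_SHIFT = 12   # 4 KB pages -> 12-bit byte offset
INDEX_BITS = 9    # 512 entries per table -> 9-bit index per level
LEVELS = 4        # 4 * 9 + 12 = 48 translated bits


def split_va(va):
    """Return ([top-level .. last-level] table indices, page offset)."""
    va &= (1 << 48) - 1                       # keep only the 48 translated bits
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS):
        shift = PAGE_SHIFT + INDEX_BITS * (LEVELS - 1 - level)
        indices.append((va >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


def join_va(indices, offset):
    """Inverse of split_va: rebuild the 48-bit address from its parts."""
    va = offset
    for level, idx in enumerate(indices):
        va |= idx << (PAGE_SHIFT + INDEX_BITS * (LEVELS - 1 - level))
    return va


if __name__ == "__main__":
    idx, off = split_va(0x7F123456789A)       # arbitrary example address
    print([hex(i) for i in idx], hex(off))
```

Each entry in the returned list selects one table entry at its level, which is why an uncached translation needs one load per level.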

This is one of the places ARM64 and x86-64 are nearly identical: same idea, different names. x86-64 also uses a 4-level walk (PML4 → PDPT → PD → PT) over 4 KB pages, with 48-bit canonical virtual addresses (the top 16 bits sign-extended) and the page-table base in the CR3 register instead of TTBR. Both have a TLB, both support huge pages (2 MB / 1 GB with 4 KB base pages), and both deliver a page fault to the kernel as a synchronous exception. The translation mechanism is essentially the same; beyond register names and table acronyms, the real differences are details such as AArch64’s two base registers (TTBR0 for the lower part of the address space, TTBR1 for the upper) against x86-64’s single CR3, and AArch64’s optional 16 KB and 64 KB granules. It is a good illustration that virtual memory is an architectural concept, not an ISA quirk.

The cache hierarchy and stride

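Between registers and DRAM sit the L1/L2(/L3) caches, which move data in cache-line units (typically 64 bytes). Sequential (stride-1) access uses every byte of each line it fetches; once the stride reaches the line size, every access pulls in a new line and most of it is wasted. A microbenchmark that sums arrays of increasing size at increasing strides reveals both effects: time per element rises with stride until it flattens at the line size, and it steps up each time the working set outgrows L1, then L2, then the last-level cache and spills to DRAM. That turns Week 6’s AoS/SoA result into a measured phenomenon.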
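Before measuring, the line-waste argument can be worked out on paper. The sketch below only does the arithmetic; it is not a benchmark (a real one belongs in C, where interpreter overhead does not hide the effect). It assumes a 64-byte line, 4-byte elements, and a line-aligned array; lines_touched and useful_fraction are illustrative names.

```python
LINE_BYTES = 64   # assumed cache-line size; check your own CPU
ELEM_BYTES = 4    # assumed element size, e.g. a 32-bit int


def lines_touched(n_elems, stride_elems):
    """Distinct cache lines hit when visiting elements 0, s, 2s, ... < n."""
    return len({(i * ELEM_BYTES) // LINE_BYTES
                for i in range(0, n_elems, stride_elems)})


def useful_fraction(n_elems, stride_elems):
    """Bytes actually summed divided by bytes fetched as whole lines."""
    visited = len(range(0, n_elems, stride_elems))
    fetched = lines_touched(n_elems, stride_elems) * LINE_BYTES
    return visited * ELEM_BYTES / fetched


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 16, 32):
        print(stride, lines_touched(4096, stride), useful_fraction(4096, stride))
```

Under these assumptions the useful fraction stops falling once the stride equals one line (16 elements), which is the knee the measured curve should show.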

Allocators

malloc is a user-space allocator over kernel memory (brk/mmap); it manages free lists and has overhead and fragmentation. An arena/bump allocator trades generality for speed by allocating linearly and freeing en masse — a common systems optimization worth implementing to feel the difference.
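The shape of a bump allocator is small enough to sketch. The version below is illustrative only: it carves slices out of one bytearray to show the bump pointer, the alignment round-up, and the free-everything reset. It says nothing about speed; the comparison against malloc has to be done in C. The class name Arena and its methods are assumptions for this sketch.

```python
class Arena:
    """Bump allocator over one fixed buffer; hands out memoryview slices."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._view = memoryview(self._buf)
        self._top = 0                       # offset of the next free byte

    def alloc(self, size, align=8):
        """Return a writable slice of `size` bytes; align must be a power of two."""
        start = (self._top + align - 1) & ~(align - 1)   # round up to alignment
        end = start + size
        if end > len(self._buf):
            raise MemoryError("arena exhausted")
        self._top = end                     # the whole allocation is one bump
        return self._view[start:end]

    def reset(self):
        """Free every allocation at once by rewinding the bump pointer."""
        self._top = 0

    @property
    def used(self):
        return self._top
```

There is no per-block free and no free list, which is exactly the generality the arena gives up: allocation is an add and a bounds check, and release is a single store.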

Filesystems and I/O

A file name resolves through directories to an inode (metadata + block pointers); the kernel’s page cache holds recently used file pages so repeated reads avoid the disk. A read goes: syscall → VFS → filesystem → page cache (hit) or block layer → device. File descriptors are per-process handles into the system-wide open-file table. Buffered (stdio) and unbuffered (raw read) I/O differ by orders of magnitude in syscall count when the program reads in small units, because the buffered layer fetches a whole buffer per syscall and serves later reads from user space.
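The syscall gap can be seen without strace by counting reads at the raw layer. This is a rough sketch, assuming a 64 KiB temporary file and a 4096-byte buffer; CountingFile, count_unbuffered, count_buffered, and demo are illustrative names. It counts calls to the raw readinto method, each of which normally issues one read(2), so treat the numbers as a proxy and confirm with strace.

```python
import io
import os
import tempfile


class CountingFile(io.FileIO):
    """Raw file that counts readinto() calls made by the buffered layer."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.calls = 0

    def readinto(self, buf):
        self.calls += 1
        return super().readinto(buf)


def count_unbuffered(path):
    """Read one byte per os.read() call; return the number of calls."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            calls += 1
            if not os.read(fd, 1):          # empty result means end of file
                break
    finally:
        os.close(fd)
    return calls


def count_buffered(path, buffer_size=4096):
    """Read one byte at a time through a buffer; return raw read calls."""
    raw = CountingFile(path, "rb")
    with io.BufferedReader(raw, buffer_size=buffer_size) as f:
        while f.read(1):
            pass
        return raw.calls


def demo(size=64 * 1024):
    """Return (unbuffered calls, buffered calls) for a temp file of `size` bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * size)
        path = tmp.name
    try:
        return count_unbuffered(path), count_buffered(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The unbuffered count grows with the file size in bytes, while the buffered count grows with the file size divided by the buffer size.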

Theory Exercises

Decompose an AArch64 virtual address into the 4-level page-table indices + offset; count the memory accesses per translation and explain the TLB’s role. Then map the x86-64 equivalent (PML4→PDPT→PD→PT, CR3, 48-bit canonical addresses) and state what is genuinely different versus just renamed.
Explain demand paging and copy-on-write; describe what a page fault does and when it occurs after mmap/fork.
Predict the time-per-element curves for an array sum as stride and array size vary; mark the cache-line knee and where the L1, L2, and DRAM transitions appear.
Contrast malloc with an arena allocator; explain fragmentation and when an arena wins.
Trace a buffered vs unbuffered read of a file; estimate the syscall count of each for a 1 MB file read byte-by-byte.

Implementation

Write a cache-stride microbenchmark (sum an array at varying strides) and an arena allocator compared against malloc. Use mmap to map a file and read it; trigger and observe page faults. Parse /proc/[pid]/maps and smaps to account a process’s memory (continuity with Week 7). Implement a small file utility using raw syscalls and compare to a buffered version.
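One way to watch demand paging from user space is to compare the process’s minor-fault counter before and after first touching a fresh mapping. The sketch below assumes a Unix system (the resource module is not available on Windows) and uses an anonymous mapping rather than a file to keep it short; touch_pages is an illustrative name. The counter is process-wide, so the delta includes faults from the interpreter itself and should be read as approximate.

```python
import mmap
import resource


def touch_pages(n_pages):
    """Map anonymous memory, write one byte per page, return (fault delta, mapping)."""
    page = mmap.PAGESIZE
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    m = mmap.mmap(-1, n_pages * page)       # anonymous mapping, no pages backed yet
    for i in range(n_pages):
        m[i * page] = 1                     # first touch of each page
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return after - before, m


if __name__ == "__main__":
    faults, mapping = touch_pages(1024)
    print("pages touched:", 1024, "minor faults observed:", faults)
    mapping.close()
```

On a typical Linux system the delta is close to the number of pages touched, but huge pages, sandboxes, and other kernels can change it, so no particular value is assumed here.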

Measurement / Inspection

Plot the stride microbenchmark across array sizes and identify the L1/L2/DRAM cliffs and the cache-line size. Compare malloc vs arena allocation throughput. Measure buffered vs unbuffered read time and syscall count (strace). Verify /proc/[pid]/maps accounting against actual allocations.
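For the maps accounting, a small parser is enough to total mapped bytes per backing file. This is a sketch that assumes the field layout documented in proc(5) (address range, permissions, offset, device, inode, optional pathname); parse_maps_line and summarize are illustrative names, and the "[anon]" label for unnamed mappings is a choice made here, not something the kernel prints.

```python
import os


def parse_maps_line(line):
    """Split one /proc/[pid]/maps line into a dict; pathname may be empty."""
    fields = line.split(None, 5)            # pathname may itself contain spaces
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {"start": start, "end": end, "size": end - start,
            "perms": perms, "offset": int(offset, 16),
            "dev": dev, "inode": int(inode), "path": path}


def summarize(lines):
    """Total mapped bytes per pathname ('[anon]' for unnamed mappings)."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        m = parse_maps_line(line)
        key = m["path"] or "[anon]"
        totals[key] = totals.get(key, 0) + m["size"]
    return totals


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        top = sorted(summarize(f).items(), key=lambda kv: -kv[1])[:5]
    for name, size in top:
        print(f"{size // 1024:>10} KiB  {name}")
```

These totals are virtual sizes. Resident memory has to come from the Rss fields in smaps, which count only pages that have actually been touched.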

Expected baselines: time per element steps up as the working set outgrows each cache level and stops rising with stride at about 64 bytes, the line size; the arena allocator beats malloc for many small allocations; buffered I/O makes far fewer syscalls and is much faster than byte-by-byte unbuffered reads. Mapping sizes summed from maps/smaps agree with the allocations the program made, with smaps Rss counting only the pages actually touched.

Connections

The cache-stride result is Week 6’s AoS/SoA layout measured, and the foundation for Week 10’s optimization and Course 2’s data-oriented design. /proc/.../maps parsing feeds the syslens capstone. The address space here is the one Week 7’s processes inhabit; demand paging builds on Week 7’s copy-on-write. This is the systems substrate under Course 1’s Jetson memory budgeting.