Course 3 — Computer Architecture & ARM Assembly

A short bridge course: digital design and computer architecture theory, hands-on ARM64 (A64) assembly on Apple Silicon, and the Cortex-M architecture underneath Course 2’s firmware

This is a deliberately short supporting course that sits beside Course 2 (Embedded DSP): it opens the black box between the C I write and the silicon it runs on. It teaches the digital-design and computer-architecture theory of Harris & Harris, then puts it to work in hands-on ARM64 (A64) assembly programming on my Apple Silicon Mac, where every exercise in this course is assembled, run, and debugged locally with clang, as, and lldb. The goal is the mental model that makes Course 2’s firmware, and embedded systems and driver development in general, legible: what a register file, a pipeline, a cache, a stack frame, a calling convention, and an exception actually are, and what the compiler did with my C.

One theory, two ARMs. “ARM” on this site means two different instruction-set worlds, and this course teaches the difference explicitly:

A-profile / AArch64 — the A64 ISA of Apple Silicon (M-series), the Raspberry Pi 5, and the Jetson: 31 × 64-bit general registers, virtual memory and deep cache hierarchies, wide out-of-order pipelines. This is where all hands-on exercises run.
M-profile / Cortex-M4 — the STM32’s world, targeted by Course 2. Not ARM64: the Cortex-M4 is a 32-bit ARMv7E-M core running the Thumb-2 (T32) instruction set, with an NVIC, a fixed memory map, and no MMU. It is covered in theory (Yiu) and through cross-compiled disassembly read on the Mac — the bench work stays in Course 2.

Harris & Harris teaches with a third dialect, classic 32-bit A32 — fine, because Chapters 6–8 are about concepts (ISA design, pipelining, memory hierarchy) that transfer directly to both targets, and translating between the three dialects is itself part of the learning.

Note on AI use: As across this site, the module themes, reading map, and exercise statements are drafted with AI assistance for consistent structure. The substance is mine: every line of assembly, every measurement, every disassembly annotation, and every reconciliation note is written, run, and debugged by me on my own machine.

Prerequisites: comfortable C (Course 2’s level is plenty; H&H Appendix C is the refresher). No Course 1 dependency; only Module 5 nods to Course 1 §3 (numerical linear algebra / floating point).

Book abbreviations:

H&H — Digital Design and Computer Architecture, ARM Edition (Harris & Harris). Chapters 1–8 each carry an exercise set; I work select exercises by hand.
Smith — Programming with 64-Bit ARM Assembly Language (Stephen Smith, 1st ed.). Linux/GNU-first; each exercises page carries the macOS adaptation notes (Mach-O, @PAGE/@PAGEOFF, lldb, Apple ABI). Chapter numbers past Ch 7 follow the 1st-edition list and should be confirmed against the copy in hand. Ch 8 (Pi GPIO) and Ch 10 (Kotlin/Swift) are skipped.
Yiu — The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd ed. (Joseph Yiu). Reference/theory reading — no exercise sets; its material is exercised through cross-disassembly and the Module 6 write-ups.

Workflow: the code lives beside the Course 2 lab work in the companion repo diivanand/diiv_course2_embedded_dsp_labs, under a top-level course3/ folder: one folder per module (course3/m0/ … course3/m6/), each with a Makefile, its .s/.c sources, and a notes.md recording predicted vs. observed (register dumps, disassembly annotations, timings). Predicted-vs-observed tables on the exercise pages leave the observed cells as “…” — I fill them at the machine, same convention as Course 2’s bench labs.

Part I · Modules 0–2 — Foundations: Tools, Logic, and Architecture

Module 0 — Toolchain & First Program

Theme: The assemble–link–run–debug loop on macOS: clang driver vs. explicit as + ld; Mach-O vs. ELF at a glance; symbol naming (_main), PIE and ad-hoc code signing; reading disassembly with objdump -d / otool -tv; lldb essentials (breakpoints, stepping by instruction, register and memory inspection); a Makefile for .s files.

Read: Smith 1 (Getting Started), 3 (Tooling Up — read the GNU make and GDB sections, then work in lldb; the cross-compiling and CI sections are context).

Reference: Apple, Writing ARM64 code for Apple platforms (developer.apple.com) — the authoritative page for Apple’s ABI divergences, referenced throughout the course.

Practice: Module 0 exercises — toolchain bring-up, hello world three ways, first disassembly, first lldb session, deliberate build failures.

Module 1 — Bits, Logic, and Arithmetic Circuits

Theme: Number systems and two’s complement; logic gates and Boolean algebra; combinational design, K-maps, and building blocks; latches, flip-flops, FSMs, and synchronous timing; adders, ALUs, shifters, and memory arrays — the hardware vocabulary every later chapter assumes.

Read: H&H 1 (From Zero to One), 2 (Combinational Logic), 3 (Sequential Logic), 5.1–5.5 (Digital Building Blocks). Ch 4 (HDLs) is deliberately skipped; it’s a separate rabbit hole.

Practice: Module 1 exercises — select H&H exercises worked by hand, plus A64 tie-ins: flag-prediction tables, a ripple-carry adder in pure bitwise assembly verified against ADD, a saturating add, and a table-driven FSM.
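
As an illustration of the ripple-carry and saturating-add tie-ins, here is a minimal C model of the idea, not the A64 solution itself. It chains one full adder per bit using only AND/OR/XOR and shifts, then checks the result against the built-in +. The function names (ripple_add, sat_add, check_ripple_add) are hypothetical, and the final cast in sat_add assumes a two's-complement target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64 chained full adders built from AND/OR/XOR only.
   carry_out may be NULL; it receives the carry out of bit 63. */
uint64_t ripple_add(uint64_t a, uint64_t b, unsigned *carry_out)
{
    uint64_t sum = 0;
    uint64_t carry = 0;                          /* carry into bit 0 */
    for (unsigned i = 0; i < 64; i++) {
        uint64_t ai = (a >> i) & 1u;
        uint64_t bi = (b >> i) & 1u;
        uint64_t si = ai ^ bi ^ carry;           /* sum bit of this stage */
        carry = (ai & bi) | (carry & (ai ^ bi)); /* carry out of this stage */
        sum |= si << i;
    }
    if (carry_out != NULL)
        *carry_out = (unsigned)carry;
    return sum;
}

/* Signed saturating add: clamp instead of wrapping on overflow. */
int64_t sat_add(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t us = ripple_add(ua, ub, NULL);
    /* Overflow iff both operands share a sign and the result's sign differs. */
    uint64_t overflow = (~(ua ^ ub) & (ua ^ us)) >> 63;
    if (overflow)
        return (a < 0) ? INT64_MIN : INT64_MAX;
    return (int64_t)us;                          /* two's-complement assumed */
}

/* Compare against the hardware adder on pseudo-random inputs (simple LCG). */
void check_ripple_add(unsigned trials)
{
    uint64_t x = 0x0123456789abcdefULL;
    for (unsigned i = 0; i < trials; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        uint64_t a = x;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        uint64_t b = x;
        assert(ripple_add(a, b, NULL) == a + b);
    }
}
```

Calling check_ripple_add(1000) from a small C harness exercises the adder; the overflow test in sat_add is the same sign rule that defines the V flag for an addition.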

Module 2 — ISA & Microarchitecture

Theme: What an ISA is: registers, addressing modes, machine encodings, and the hardware/software contract; single-cycle → multicycle → pipelined processors, hazards, and forwarding; performance analysis; the memory hierarchy — caches, associativity, and virtual memory. A32 (H&H) vs. A64 (Mac) vs. T32 (Cortex-M) kept explicitly side by side.

Read: H&H 6 (Architecture), 7 (Microarchitecture), 8 (Memory Systems); H&H Appendix B open as the A32 instruction reference.

Also read: Yiu 4 (Architecture — the M-profile programmer’s model), 5.1–5.6 (the Thumb-2 instruction set and UAL), 6.1–6.6 (memory system) — skim now for contrast, deep-read in Module 6.

Practice: Module 2 exercises — select H&H exercises by hand, plus measurement on the M-series: -O0 vs. -O2 disassembly forensics, a pointer-chase cache-latency ladder, a branch-predictor experiment, and a dependent-chain ILP probe.
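
A pointer-chase latency probe can be sketched as follows. This is a minimal C illustration under stated assumptions, not the measured exercise code: the names build_cycle, chase, and ns_per_load are hypothetical, timing uses the standard clock() function, and the nodes are bare 8-byte indices, so neighbouring nodes share a cache line (a real ladder would likely pad each node to the cache-line size). The key property is that each load's address depends on the previous load's result, so the loads cannot overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with a single random cycle through all n slots
   (Sattolo's algorithm), so a chase visits every slot before repeating. */
void build_cycle(size_t *next, size_t n, uint64_t seed)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t j = (size_t)((seed >> 33) % i);   /* j in [0, i-1] */
        size_t t = next[i];
        next[i] = next[j];
        next[j] = t;
    }
}

/* Follow the chain for `steps` dependent loads, starting at slot 0. */
size_t chase(const size_t *next, size_t steps)
{
    size_t p = 0;
    while (steps--)
        p = next[p];          /* address of each load comes from the last */
    return p;
}

/* Average nanoseconds of CPU time per dependent load over an n-slot
   working set; returns -1.0 on bad arguments or allocation failure. */
double ns_per_load(size_t n, size_t steps)
{
    if (n == 0 || steps == 0)
        return -1.0;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_cycle(next, n, 42);
    volatile size_t sink = chase(next, n);       /* warm-up pass */
    clock_t t0 = clock();
    sink = chase(next, steps);                   /* volatile keeps the loop alive */
    clock_t t1 = clock();
    (void)sink;
    free(next);
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

Sweeping n across powers of two and tabulating ns_per_load(n, steps) is what produces the ladder: the time per load should step up each time the working set outgrows a cache level.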

Part II · Modules 3–5 — A64 Assembly on Apple Silicon

Module 3 — A64 Essentials: Data, Memory, and Control Flow

Theme: The A64 programmer’s model: w/x registers, MOV/MOVK/MOVN, operand shifts, ADD/ADC/SUB/SBC and the NZCV flags; condition codes, compares, branches, loops, and if/else patterns; logical and bit-field operations; data sections and alignment; ADRP + page-offset addressing; LDR/STR addressing modes and pre/post-indexing; endianness.
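
The NZCV rules for 64-bit flag-setting adds and subtracts can be modelled in a few lines of C. This is a sketch for building predict-verify tables, with hypothetical names (nzcv_t, flags_adds, flags_subs); it models only the flag outcome of ADDS and SUBS/CMP on x registers, not the instructions' encodings. Note the A64 convention that C after a subtraction means "no borrow".

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool n;   /* result is negative (bit 63 set) */
    bool z;   /* result is zero */
    bool c;   /* carry out (ADDS) or no borrow (SUBS) */
    bool v;   /* signed overflow */
} nzcv_t;

/* Flags produced by a 64-bit ADDS of a and b. */
nzcv_t flags_adds(uint64_t a, uint64_t b)
{
    uint64_t r = a + b;
    nzcv_t f;
    f.n = (r >> 63) != 0;
    f.z = (r == 0);
    f.c = (r < a);                              /* wrapped => carry out */
    f.v = ((~(a ^ b) & (a ^ r)) >> 63) != 0;    /* same signs in, other out */
    return f;
}

/* Flags produced by a 64-bit SUBS (or CMP) computing a - b. */
nzcv_t flags_subs(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    nzcv_t f;
    f.n = (r >> 63) != 0;
    f.z = (r == 0);
    f.c = (a >= b);                             /* C set means no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 63) != 0;     /* different signs in, flip */
    return f;
}
```

For example, flags_subs(5, 5) yields Z and C both set, which is why an unsigned "higher or same" branch (HS) is taken after comparing equal values.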

Read: Smith 2 (Loading and Adding), 4 (Controlling Program Flow), 5 (Thanks for the Memories).

Reference: Arm, Arm Architecture Reference Manual for A-profile (the A64 instruction descriptions); H&H App. B for the A32 ↔ A64 mental map.

Practice: Module 3 exercises — register-width and shift predict-verify tables, string case conversion and in-place reversal, integer→decimal ASCII, bit counting, min/max scans with conditional select.
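
Two of these exercises have compact reference algorithms that are useful to hold in C before writing the assembly. The sketch below is an assumption about a reasonable approach, not the exercise solution: u64_to_dec and popcount64 are hypothetical names. The decimal conversion peels digits with divide-and-remainder; A64 has no remainder instruction, so in assembly the remainder comes from UDIV followed by MSUB.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the decimal digits of v, plus a terminating NUL, into buf.
   buf must hold at least 21 bytes (20 digits for UINT64_MAX + NUL).
   Returns the number of digits written. */
size_t u64_to_dec(uint64_t v, char *buf)
{
    char tmp[20];
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + (v % 10));   /* least significant digit first */
        v /= 10;
    } while (v != 0);
    for (size_t i = 0; i < n; i++)           /* reverse into the caller's buffer */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}

/* Count set bits by clearing the lowest set bit until none remain. */
unsigned popcount64(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;                          /* clears the lowest set bit */
        count++;
    }
    return count;
}
```

The bit-clearing loop runs once per set bit, which makes its iteration count easy to predict in a register-dump table.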

Module 4 — Functions, the Stack, and the C Boundary

Theme: BL/BLR/RET and the link register; AAPCS64 (argument and result registers, caller- vs. callee-saved sets, 16-byte stack alignment); stack frames, STP/LDP, frame pointer chains, leaf vs. non-leaf functions, recursion; calling libc from assembly and assembly from C; Apple’s documented ABI divergences (x18 reserved, variadic arguments on the stack); why raw syscalls are off-limits on macOS and what to do instead.

Read: Smith 6 (Functions and the Stack), 7 (Linux Operating System Services — for the concepts; on macOS the libc boundary replaces raw svc syscalls), 9 (Interacting with C — 1st-ed. numbering).

Also read: Apple’s ABI page (Module 0 reference), §“Pass arguments to functions correctly”.

Practice: Module 4 exercises — recursion with real stack frames, an assembly routine called from a C harness and benchmarked against libc, a C callback driven from assembly, the Apple variadic-ABI gotcha demonstrated, and a by-hand frame-pointer backtrace in lldb.
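
The variadic gotcha is easiest to see with a pair of C callees that do the same work. In the sketch below (hypothetical names sum3 and sum_i64), the C compiler handles both calls correctly, but an assembly caller has to know the difference: under standard AAPCS64 the three values would travel in registers for both functions, whereas Apple's ABI page states that variadic arguments are placed on the stack in 8-byte slots. Loading x1–x3 and calling sum_i64 from hand-written assembly on macOS would therefore read whatever happens to be on the stack.

```c
#include <stdarg.h>
#include <stdint.h>

/* Fixed-arity callee: all three arguments arrive in registers. */
int64_t sum3(int64_t a, int64_t b, int64_t c)
{
    return a + b + c;
}

/* Variadic callee: `count` is a named argument; the values after it
   are variadic and must each be passed as int64_t by the caller. */
int64_t sum_i64(int count, ...)
{
    va_list ap;
    int64_t total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int64_t);   /* fetch the next variadic value */
    va_end(ap);
    return total;
}
```

Compiling a C caller of both functions and disassembling it shows the two argument-passing sequences side by side, which is a convenient template for the assembly version.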

Module 5 — Multiply, Floating Point, and NEON

Theme: Integer multiply/divide and multiply-accumulate (MADD, UMULH); the floating-point register file, arithmetic, conversions, and comparisons; fused multiply-add and its rounding behavior; NEON/ASIMD — vectors, lanes, LD1/ST1, vector arithmetic and horizontal reductions; saturating fixed-point multiplies (SQDMULH) and the Q15 world of Course 2; reading optimized compiler output; simple benchmarking discipline.

Read: Smith 11 (Multiply, Divide, and Accumulate), 12 (Floating-Point Operations), 13 (Neon Coprocessor), 14–15 (Optimizing / Reading and Understanding Code); 1st-ed. numbering, to be confirmed against the copy in hand.

Also read: H&H 5.3 (number systems for FP refresher). Ties to Course 1 §3 (floating point, conditioning) and Course 1 §11 (fixed-point DSP).

Practice: Module 5 exercises — the dot product four ways (scalar integer, scalar double, hand NEON, clang -O2 autovectorized C) with a benchmark table; an FMA rounding demonstration; a Q15 saturating MAC.
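
For the Q15 saturating MAC, a scalar C reference model is a useful oracle to check the NEON version against. The sketch below models one 16-bit lane of a doubling, high-half, saturating multiply in the style of SQDMULH, followed by a saturating accumulate; the names sat16, q15_sqdmulh, q15_sat_add, and q15_mac_sat are hypothetical, and the code assumes that right-shifting a negative value is an arithmetic shift (true of clang and gcc).

```c
#include <stdint.h>

/* Clamp a 32-bit value into the int16_t range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Q15 multiply: double the product, keep the high 16 bits, saturate.
   Only (-1.0) * (-1.0) overflows, and it clamps to 0x7FFF. */
int16_t q15_sqdmulh(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    return sat16((int32_t)(doubled >> 16));     /* arithmetic shift assumed */
}

/* Saturating 16-bit add, as a Q15 accumulator would need. */
int16_t q15_sat_add(int16_t a, int16_t b)
{
    return sat16((int32_t)a + (int32_t)b);
}

/* One saturating multiply-accumulate step: acc + a*b in Q15. */
int16_t q15_mac_sat(int16_t acc, int16_t a, int16_t b)
{
    return q15_sat_add(acc, q15_sqdmulh(a, b));
}
```

With 16384 representing 0.5 in Q15, q15_sqdmulh(16384, 16384) gives 8192, which is 0.25, a handy first row for a predicted-vs-observed table.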

Part III · Module 6 — The Cortex-M Bridge

Module 6 — The Other ARM: Cortex-M for Embedded & Driver Work

Theme: M-profile vs. A-profile, systematically: Thumb-2/T32 vs. A64; R0–R15 + xPSR + MSP/PSP vs. x0–x30 + SP + PSTATE; the fixed memory map, bit-banding, and MPU vs. MMU-backed virtual memory; the exception model — vector table, NVIC, automatic stacking, EXC_RETURN, tail-chaining, faults; the FPU and lazy stacking; mixed C/assembly in firmware and CMSIS conventions; SVC/PendSV and why every RTOS uses them; the Cortex-M4 DSP extensions and CMSIS-DSP — the direct bridge into Course 2’s firmware modules.

Read: Yiu 3 (Technical Overview), 4 (Architecture), 5.1–5.6 (Instruction Set), 6 (Memory System), 7 (Exceptions and Interrupts).

Also read: Yiu 8 (Exception Handling in Detail), 10.1–10.4 (OS Support Features), 12 (Fault Exceptions), 13 (Floating Point), 20 (Assembly and Mixed Language Projects), 21 (Cortex-M4 and DSP Applications); skim 22 (CMSIS-DSP). H&H 9 / eChapter (I/O Systems) optional.

Practice: Module 6 exercises — all Mac-runnable via the arm-none-eabi cross-toolchain already installed for Course 2: A64 vs. T32 side-by-side disassembly of the same routine, exception-entry stack frames traced on paper, the Course 2 STM32 startup file annotated end to end, a CMSIS-DSP Q15 kernel read against Module 5’s NEON version, and the capstone note: one algorithm, two ARMs.
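
For the paper trace of exception entry, it helps to have the frame layout written down as a data structure. The sketch below is a host-side C model with hypothetical names (cm_basic_frame_t, exc_return_info_t, decode_exc_return); it describes the eight words a Cortex-M4 stacks automatically when no floating-point context is saved, in ascending address order from the stacked SP, and decodes the three EXC_RETURN bits that matter for reading a handler's LR. It is meant for annotating dumps, not for running on the target as-is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic (non-FP) exception frame, lowest address first. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;     /* LR of the interrupted code */
    uint32_t pc;     /* return address into the interrupted code */
    uint32_t xpsr;
} cm_basic_frame_t;

_Static_assert(sizeof(cm_basic_frame_t) == 32, "basic frame is 8 words");
_Static_assert(offsetof(cm_basic_frame_t, pc) == 24, "PC is the 7th word");

typedef struct {
    bool thread_mode;   /* returning to Thread mode (else Handler mode) */
    bool uses_psp;      /* frame is on the process stack (else main stack) */
    bool fp_frame;      /* extended frame with FP registers was stacked */
} exc_return_info_t;

/* Decode the EXC_RETURN value found in LR on exception entry. */
exc_return_info_t decode_exc_return(uint32_t value)
{
    exc_return_info_t info;
    info.thread_mode = (value & (1u << 3)) != 0;
    info.uses_psp    = (value & (1u << 2)) != 0;
    info.fp_frame    = (value & (1u << 4)) == 0;   /* bit 4 clear => FP frame */
    return info;
}
```

A fault handler that wants the faulting PC uses exactly this decode: pick MSP or PSP from bit 2, then read the word at offset 24 of the frame.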

Where this course leads

Module 6’s capstone note is the course deliverable: a short staff-level write-up reconciling everything — the same dot product on an out-of-order A-profile core and an in-order M-profile core, compared across ISA, register file, pipeline, memory system, floating point, and interrupt model. After that, Course 2’s ADC interrupts, DMA buffers, CMSIS-DSP calls, and hard-fault handlers all read as applications of a machine model I actually hold — and driver-development material (memory-mapped I/O, volatile, barriers, ISRs) has the architecture underneath it already in place.