Week 4 — Sequence Labeling and Parsing: POS, NER, HMMs, Dependency Parsing

Overview

A great deal of practical NLP is not generation but structure: tagging each token with a part of speech or entity type, and recovering the grammatical skeleton of a sentence. This week consolidates sequence labeling (POS tagging, named-entity recognition, HMMs, BiLSTM/CRF taggers) and parsing (constituency and dependency) — two halves of “structured prediction over text” that belong together. You will implement a tagger and a dependency parser and evaluate them with structured metrics.

The classical algorithms here — Viterbi for HMMs, CKY for constituency parsing, transition-based dependency parsing — are dynamic-programming and graph algorithms (Course 5’s DPV material) applied to language. The neural taggers reuse the LSTMs from Week 3. So this week is largely composition of tools you have, aimed at a new problem class.

Readings

J&M Ch. 17 + Appendix A: sequence labeling for POS/NER, Hidden Markov Models, and the Viterbi algorithm. Extract: the HMM, Viterbi decoding, and the BiLSTM-CRF tagger.
J&M Ch. 18: constituency parsing and CKY. Extract: CFGs and the CKY dynamic program.
J&M Ch. 19: dependency parsing. Extract: transition-based (arc-standard) parsing and its evaluation.
CS224N: sequence models and dependency parsing. (Viterbi/CKY as dynamic programming: DPV review from Course 5.)

Key Concepts

Sequence labeling and the HMM

POS/NER assign a label to each token. The classical generative model is the HMM: hidden label states emit observed words, with transition \(P(y_t\mid y_{t-1})\) and emission \(P(w_t\mid y_t)\) probabilities. Decoding the best label sequence is the Viterbi algorithm — dynamic programming over the trellis (Course 5 DP). NER uses a tagging scheme (BIO) to mark entity spans.

Neural taggers: BiLSTM and CRF

A BiLSTM reads the sequence both directions (Week 3) so each token’s representation sees full context, then a per-token classifier emits labels. Adding a CRF layer models label-transition constraints jointly (e.g. an I-PER can’t follow a B-ORG), decoded again with Viterbi. This BiLSTM-CRF was the strong pre-transformer tagger and cleanly combines Week 3’s LSTM with the HMM’s structured decoding.

Constituency parsing and CKY

A context-free grammar describes phrase structure; CKY is the dynamic program that finds the best parse over all binary splits of every span — \(O(n^3)\). It connects to Sipser’s formal-language theory (Course 5): CFGs, Chomsky normal form, and the limits of context-free models.

Dependency parsing

Dependency grammar represents syntax as labeled head→dependent arcs. Transition-based parsing (arc-standard) processes the sentence with a stack and a buffer, choosing shift/left-arc/right-arc actions — a fast, greedy (or beam) parser whose action classifier can be neural. Evaluated by attachment score (UAS/LAS).

Theory Exercises

Write the HMM joint probability and derive the Viterbi recurrence; run it by hand on a short tagged sentence.
Explain the BIO tagging scheme and why a CRF transition layer improves NER over independent per-token classification.
Trace CKY on a short sentence with a small CFG; explain the \(O(n^3)\) cost.
Define the arc-standard transition system; parse a short sentence by hand, listing the action sequence.
Define UAS/LAS and contrast with token-level accuracy; explain why parsing needs structured metrics.

Implementation

Implement an HMM POS tagger with Viterbi and a BiLSTM(-CRF) NER tagger (reusing Week 3 LSTMs) in sequence_labeling_parsing/. Implement a transition-based dependency parser with a (neural) action classifier. Evaluate on standard datasets (e.g. CoNLL).

Experiments

Tagging: HMM vs BiLSTM vs BiLSTM-CRF accuracy/F1; show the CRF’s gain on entity-boundary consistency. Parsing: UAS/LAS for the dependency parser; CKY parse accuracy on a toy grammar. Error analysis on hard cases (rare words, long-distance dependencies).

Expected baselines: BiLSTM-CRF beats plain BiLSTM beats HMM on NER F1; the CRF mainly fixes boundary/transition errors; the dependency parser reaches respectable UAS/LAS, weaker on long arcs — foreshadowing why attention-based parsers and contextual embeddings (Week 7) help.

Connections

Viterbi/CKY are Course 5’s dynamic programming applied to language; BiLSTM taggers reuse Week 3. Structured prediction and its metrics reappear in information extraction (Week 9). Contextual embeddings (Week 7, BERT) later replace much of this pipeline — but understanding the structured-prediction foundation is what lets you use those models well.