docs: transcript formatting design — NLP enrichment pipeline with layers.pub

+296

1 changed file

docs/superpowers/specs/2026-04-12-transcript-formatting-design.md

# Transcript Formatting: NLP Enrichment Pipeline

**Date:** 2026-04-12
**Status:** Approved

## Problem

Transcripts are currently rendered as an infinitely long, unbroken run of text. Word-level timing and concept facets exist, but there is no structural formatting — no sentences, no paragraphs, no visual hierarchy. The goal is for transcripts to read as though they were essays.

## Constraints

- **Text is immutable.** The pipeline adds structural annotations only — no words are modified, added, or removed. Transcript editing is a separate future concern.
- **Reliability over ambition.** Each enrichment pass must be dependable enough to run unsupervised across all transcripts. Noisy output is worse than no output.
- **Build-time processing.** NLP runs once in the batch pipeline; results are published as AT Protocol records. Zero runtime cost.
- **Python NLP stack.** spaCy for sentence detection and NER; sentence-transformers for topic segmentation.

## Schema Design: layers.pub Integration

The enrichment pipeline uses [layers.pub](https://layers.pub) (`pub.layers.*`) lexicons — composable AT Protocol schemas for linguistic annotation. This gives us a standard, interoperable representation with built-in support for multiple annotation passes, provenance tracking, and manual overrides.

**Vendoring strategy:** layers.pub is at v0.5.0 draft. We vendor the specific lexicon definitions we use into `lexicons/pub/layers/` in this repo. Panproto lenses provide forward-compatibility — when layers.pub evolves, we define migrations rather than rewriting our pipeline. This follows the project's principle of prioritizing the lens layer for forward-compat.

### Record Architecture

#### 1. Source transcript (existing)

`tv.ionosphere.transcript` — compact storage format with `text`, `startMs`, and `timings` array. Stays as-is.
Source of truth for raw transcription output.

#### 2. Expression record

`pub.layers.expression.expression` (kind: `"transcript"`) — the transcript text published as a layers.pub expression. Links back to the ionosphere transcript via `sourceRef`. This is the anchoring point for all annotations.

#### 3. Segmentation record

`pub.layers.segmentation.segmentation` — word-level tokenization derived from the transcript's compact timing data. Each token carries:
- `textSpan`: UTF-8 byte offsets (`byteStart`, `byteEnd`)
- `temporalSpan`: timing in milliseconds (`start`, `ending`)

This replaces the per-word timestamp facets with a standard representation.

#### 4. Annotation layers

`pub.layers.annotation.annotationLayer` — one record per enrichment pass:

| Pass | `kind` | `subkind` | `sourceMethod` |
|------|--------|-----------|----------------|
| Sentence detection | `span` | `sentence-boundary` | `automatic` |
| Paragraph segmentation | `span` | `paragraph-boundary` | `automatic` |
| Topic segmentation (future) | `span` | `topic-segment` | `automatic` |
| Named entity recognition (future) | `span` | `ner` | `automatic` |
| Concept linking (future) | `span` | `concept` | `automatic` |
| Speaker attribution (future) | `span` | `speaker` | `automatic` |
| Manual corrections (future) | varies | varies | `manual-native` |

Each layer includes `metadata` (agent, tool, confidence, timestamp) for provenance. Pipeline parameters (e.g., paragraph pause threshold) are stored in `metadata.features` so provenance is complete and results are reproducible.

#### 5. Manual override layer (future)

A separate annotation layer with `sourceMethod: "manual-native"` and higher `rank`. The merge step prefers higher-ranked layers.
Example: correcting "Blue Sky" to link to the Bluesky concept record is an annotation in this layer that supersedes the auto-detected concept. Published as first-class AT Protocol records — auditable, attributable, and preservable across pipeline re-runs.

### Relationship to existing `tv.ionosphere.annotation`

The existing `tv.ionosphere.annotation` record type (concept mentions anchored to byte ranges) continues to work as-is for Phase 1. The NLP pipeline produces layers.pub annotation layers for structural annotations (sentences, paragraphs) — these are a different concern and do not conflict.

In Phase 2, when we add NLP-based concept/entity detection, the layers.pub annotation system becomes the canonical source for all enrichment annotations. At that point, existing `tv.ionosphere.annotation` records are migrated to layers.pub annotation layers via a panproto migration. The appview indexer reads both formats during the transition period.

### Panproto Integration

- **Lenses:** Transform between compact transcript format (`tv.ionosphere.transcript`) and layers.pub expression + segmentation format. Lens definitions live in `formats/tv.ionosphere/lenses/`.
- **Schema validation:** Validate all layers.pub records before publishing to PDS. Runs in the TypeScript publish step (after the Python NLP pipeline outputs JSON).
- **Migration support:** As layers.pub evolves from v0.5.0, panproto migrations keep ionosphere records compatible. Vendored lexicons in `lexicons/pub/layers/` are the pinned source of truth.
- **Pipeline boundary:** The Python NLP pipeline outputs annotation layer JSON files. The TypeScript publish step validates them against panproto-parsed lexicons and publishes to PDS. This reuses the existing panproto TypeScript integration.

## Pipeline Architecture

```
transcript record (text + timings)
          |
          v
+-------------------+
| Pass 1: Sentences |  <-- spaCy sentence boundary detection
+---------+---------+
          |
          v
+--------------------+
| Pass 2: Paragraphs |  <-- pause data + sentence boundaries
+---------+----------+
          |
          v
+--------------------+
| Pass N: (future)   |  <-- topics, entities, speaker linking
+---------+----------+
          |
          v
+--------------------+
| Override layer     |  <-- manual corrections (higher rank)
+---------+----------+
          |
          v
+--------------------+
| Merge & publish    |  <-- assemble RelationalText document
+--------------------+
```

Properties:
- **Each pass is a standalone Python module** with a consistent interface: takes transcript text + timings + prior layer output, returns a new annotation layer.
- **Passes are additive** — they never modify text, only emit new annotations.
- **Override layer applies last** — manual corrections supersede auto-generated annotations at matching byte ranges via `rank`.
- **Idempotent** — re-running the pipeline produces the same output; manual overrides are preserved because they are separate records.

### Pass 1: Sentence Boundary Detection

**Tool:** spaCy with `en_core_web_sm`. The small model is nearly as accurate as the transformer model for sentence boundary detection (its most battle-tested feature), and runs without GPU on a standard dev machine. If accuracy proves insufficient on speech transcripts, upgrade to `en_core_web_trf` in a later pass.

spaCy's sentence segmenter uses dependency parsing, which is significantly more robust than punctuation-splitting for speech transcripts where Whisper's punctuation can be unreliable.

**Output:** An annotation layer with one annotation per sentence, anchored by byte span.
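
One detail of the output step that may be worth sketching: spaCy reports sentence boundaries as character offsets (`Span.start_char` / `Span.end_char`), while the annotation layers anchor on UTF-8 byte offsets. The following is a minimal sketch of that conversion, assuming sentence spans arrive as `(start_char, end_char)` pairs. The function name `char_spans_to_byte_spans` is hypothetical and not part of the pipeline or of layers.pub; only the `byteStart`/`byteEnd` key names come from the spec above.

```python
def char_spans_to_byte_spans(text, char_spans):
    """Convert (start_char, end_char) pairs into UTF-8 byte spans.

    Hypothetical helper for illustration; the input shape is an assumption.
    """
    # prefix[i] holds the number of UTF-8 bytes in text[:i].
    prefix = [0]
    for ch in text:
        prefix.append(prefix[-1] + len(ch.encode("utf-8")))
    return [
        {"byteStart": prefix[start], "byteEnd": prefix[end]}
        for start, end in char_spans
    ]


text = "Café is open. Naïve users ask why."
# Character offsets of the two sentences (what a sentence segmenter would report).
spans = char_spans_to_byte_spans(text, [(0, 13), (14, 34)])
```

Because "é" and "ï" each occupy two bytes, the byte offsets drift away from the character offsets after the first non-ASCII character, which is why the conversion cannot be skipped for transcripts containing accented names or typographic punctuation.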

**Reliability:** Very high (95%+ accuracy on messy speech text).

### Pass 2: Paragraph Segmentation

**Tool:** Custom algorithm combining two signals.

**Signal 1 — Pause duration:** The transcript's timing data encodes silence gaps as negative values. Pauses above a tunable threshold are paragraph boundary candidates. Default threshold: **2.0 seconds** (a conservative starting point — most speech pauses are under 1s; pauses over 2s reliably indicate topic transitions).

**Signal 2 — Sentence alignment:** Paragraph breaks only occur at sentence boundaries (from Pass 1). A long pause mid-sentence is a speaker thinking, not a paragraph break.

**Algorithm:**
```
for each silence gap > pause_threshold_ms (default: 2000):
    find the nearest sentence boundary (from Pass 1)
    if the sentence boundary is within proximity_words (default: 5) of the pause:
        emit paragraph break at that sentence boundary
```

The proximity constraint of 5 words allows for the common case where a speaker finishes a thought (pause), says a brief connective phrase ("so", "and then"), and starts the next topic — the paragraph break lands at the sentence boundary closest to the actual pause.

Both `pause_threshold_ms` and `proximity_words` are stored in the annotation layer's `metadata.features` for reproducibility.

**Reliability:** High. Pause duration is a genuine speech signal, and constraining to sentence boundaries eliminates false positives.
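
The Pass 2 algorithm above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the pipeline module itself: it assumes silence gaps have already been decoded from the timing data into `(word_index, gap_ms)` pairs (a silence of `gap_ms` immediately before that word), and that Pass 1 output has been reduced to a sorted list of word indices at which sentences start. The function name `paragraph_breaks` and both input shapes are hypothetical.

```python
from bisect import bisect_left


def paragraph_breaks(sentence_starts, gaps, pause_threshold_ms=2000, proximity_words=5):
    """Return sorted word indices where a paragraph break should be emitted.

    sentence_starts: sorted word indices at which a sentence begins (assumed shape).
    gaps: (word_index, gap_ms) pairs, silence immediately before word_index (assumed shape).
    """
    breaks = set()
    for word_index, gap_ms in gaps:
        if gap_ms <= pause_threshold_ms:
            continue  # only pauses strictly above the threshold are candidates
        # The nearest boundary is one of the two sentence starts around the pause.
        pos = bisect_left(sentence_starts, word_index)
        candidates = sentence_starts[max(pos - 1, 0):pos + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda start: abs(start - word_index))
        if abs(nearest - word_index) <= proximity_words:
            breaks.add(nearest)
    return sorted(breaks)


sentence_starts = [0, 12, 30, 47]
gaps = [(12, 2500), (20, 3000), (32, 2100), (40, 1500)]
result = paragraph_breaks(sentence_starts, gaps)
```

In this example the pause before word 12 coincides with a sentence start, the pause before word 32 snaps back to the sentence start at word 30, the long pause before word 20 is discarded because no sentence boundary lies within 5 words, and the 1.5 s pause is below the threshold. Collecting breaks in a set keeps the output stable when two pauses resolve to the same boundary.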

## Rendering

### Format Lexicon Updates

Two new facet types added to `tv.ionosphere.facet`:

| Facet type | `featureClass` | Description |
|---|---|---|
| `tv.ionosphere.facet#sentence` | `inline` | Wraps all words in a sentence as a contiguous inline span |
| `tv.ionosphere.facet#paragraph` | `block` | Groups sentences into a block-level paragraph container |

Note: the annotation _storage_ format is layers.pub annotation layers (on the PDS). The _rendering_ format is ionosphere facets in the RelationalText document. The document assembly step bridges these — it reads layers.pub annotations and emits ionosphere facets. This separation means the renderer does not need to know about layers.pub.

### DOM Structure

The renderer groups words into sentence spans and sentences into paragraph blocks:

```html
<div>
  <span>
    <span>word</span> <span>word</span> <span>word</span>
  </span>
  <span>
    <span>word</span> <span>word</span>
  </span>
</div>
<div>
  <span>
    <span>word</span> <span>word</span>
  </span>
</div>
```

This mirrors the layers.pub expression hierarchy (transcript > paragraph > sentence) and maps directly to the format lexicon's `featureClass` system (`block` for paragraphs, `inline` for sentences).

Sentence spans provide styling hooks for hover, selection, and transitions at sentence granularity. Paragraph blocks provide natural vertical whitespace.

### Data Model Changes: `extractData` → Hierarchical Structure

The current `extractData` function in `src/lib/transcript.ts` returns a flat `{ words: WordSpan[], concepts, wordConcepts }`.
This must change to return a hierarchical structure:

```typescript
interface ParagraphSpan {
  byteStart: number;
  byteEnd: number;
  sentences: SentenceSpan[];
}

interface SentenceSpan {
  byteStart: number;
  byteEnd: number;
  words: WordSpan[]; // existing WordSpan type, unchanged
}

interface TranscriptStructure {
  paragraphs: ParagraphSpan[];
  concepts: ConceptSpan[];
  // wordConcepts lookup remains flat (indexed by global word index)
  wordConcepts: ConceptSpan[][];
}
```

`extractData` builds this hierarchy by:
1. Extracting all word spans from `#timestamp` facets (existing logic, unchanged).
2. Reading `#paragraph` facets to get paragraph byte ranges. Sorting by `byteStart`.
3. Reading `#sentence` facets to get sentence byte ranges. Sorting by `byteStart`.
4. Assigning each word to its containing sentence (by byte range overlap).
5. Assigning each sentence to its containing paragraph (by byte range overlap).
6. Words not covered by any sentence facet form singleton sentences. Sentences not covered by any paragraph facet form singleton paragraphs. This graceful degradation means the renderer works identically on transcripts that have not yet been enriched.

The brightness gradient system (`boundaryStartTime`/`boundaryEndTime`) continues to use the global word ordering — paragraph visual gaps do not affect the temporal continuity of the gradient. The existing `WordSpanComponent` is reused unchanged inside the sentence/paragraph wrappers.

### Document Assembly

Document assembly is a **build-time step** that runs after the NLP pipeline and before publishing. It:
1. Reads the compact transcript record (`tv.ionosphere.transcript`).
2. Reads all layers.pub annotation layer records for this transcript.
3. Converts layers.pub sentence/paragraph annotations into `#sentence` and `#paragraph` ionosphere facets.
4. Merges with existing `#timestamp` and `#concept-ref` facets from `decodeToDocument`.
5. Writes the assembled RelationalText document onto the `tv.ionosphere.talk` record's `document` field.

This replaces the current runtime assembly in the appview serve path with a pre-computed document. The appview serves the pre-assembled document directly — zero runtime cost.

Annotation layers of different `subkind` values naturally have overlapping byte ranges (a paragraph span contains sentence spans, which contain word spans). This is expected and correct — they represent different levels of the hierarchy, not conflicting annotations.

### Scroll/Time Mapping

Both `TranscriptView` and `WindowedTranscriptView` must account for paragraph whitespace in their scroll-to-time and time-to-scroll mappings.

**TranscriptView:** The line-map computation already handles variable-height content. Paragraph `<div>` elements with margin/padding become part of the natural layout — no special handling needed beyond the existing line grouping logic.

**WindowedTranscriptView:** The `computeMonospaceLayout` function currently returns `LineEntry[]` with uniform `LINE_HEIGHT`. Changes:
- Accept an additional `paragraphBreaks: Set<number>` parameter (set of word indices where a paragraph starts).
- When a word is a paragraph start, insert a gap of `PARAGRAPH_GAP` pixels (default: `LINE_HEIGHT`, i.e., one blank line) before its line entry.
- `LineEntry` gains `isParagraphStart: boolean` for rendering the gap spacer.
- Gap entries have no time range — `timeToScrollY` and `scrollYToTime` skip gaps by treating them as extensions of the preceding line's time range (scrolling through a gap seeks to the end of the previous paragraph).
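
The gap handling for `WindowedTranscriptView` is mostly arithmetic, so it can be sketched independently of the real TypeScript `computeMonospaceLayout`. The Python sketch below is illustrative only: the `LINE_HEIGHT` value, the `(first_word_index, start_ms, end_ms)` line shape, and the function names `layout_lines` and `scroll_y_to_time` are all assumptions, and line wrapping is taken as already done. It shows one way a scroll position inside a gap could resolve to the end of the preceding line.

```python
LINE_HEIGHT = 24             # assumed pixel value, for illustration only
PARAGRAPH_GAP = LINE_HEIGHT  # one blank line, per the default described above


def layout_lines(lines, paragraph_breaks):
    """lines: ordered (first_word_index, start_ms, end_ms) tuples (assumed shape)."""
    entries, y = [], 0
    for i, (first_word, start_ms, end_ms) in enumerate(lines):
        # Assumption: the very first line never gets a gap above it.
        is_start = i > 0 and first_word in paragraph_breaks
        if is_start:
            y += PARAGRAPH_GAP
        entries.append({"y": y, "start_ms": start_ms, "end_ms": end_ms,
                        "is_paragraph_start": is_start})
        y += LINE_HEIGHT
    return entries


def scroll_y_to_time(entries, scroll_y):
    """Map a y offset to a time; a y inside a gap seeks to the previous line's end."""
    previous = None
    for entry in entries:
        if scroll_y < entry["y"]:
            # scroll_y falls in the gap above this line.
            return previous["end_ms"] if previous else entry["start_ms"]
        if scroll_y < entry["y"] + LINE_HEIGHT:
            fraction = (scroll_y - entry["y"]) / LINE_HEIGHT
            return entry["start_ms"] + fraction * (entry["end_ms"] - entry["start_ms"])
        previous = entry
    return previous["end_ms"] if previous else 0


entries = layout_lines([(0, 0, 4000), (9, 4000, 8000), (17, 10500, 14000)], {17})
```

Here the third line starts a paragraph, so it sits at y = 72 instead of 48, and any scroll offset between 48 and 72 maps to 8000 ms, the end of the previous paragraph.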

## Testing Strategy

**Python pipeline (pytest):**
- Golden-file tests: run the sentence/paragraph pipeline on 2-3 known transcripts, compare output annotation layers to curated expected output. These transcripts should cover: a clean well-punctuated talk, a messy conversational panel, and a lightning talk with rapid transitions.
- Unit tests for the paragraph algorithm: verify that paragraph breaks only land at sentence boundaries, that pauses below threshold produce no breaks, and that the proximity constraint works correctly.

**TypeScript rendering (vitest):**
- Unit tests for the updated `extractData`: verify hierarchical output from facets, and verify graceful degradation when sentence/paragraph facets are absent (flat word array wrapped in singleton sentence/paragraph).
- Snapshot tests for `computeMonospaceLayout` with paragraph gaps.

**Manual validation:**
- After running the pipeline on all transcripts, spot-check 5-10 talks across different rooms/talk types. Verify paragraph breaks land at natural topic transitions, not mid-thought. Measure average sentences-per-paragraph (expect 3-8 for well-structured talks).

## Phase Roadmap

### Phase 1 — Structural formatting (this work)

- Python NLP pipeline: sentence detection (spaCy) + paragraph segmentation (pause + sentence alignment)
- layers.pub expression + segmentation records for each transcript
- Sentence and paragraph annotation layers
- Panproto lenses: compact transcript <-> layers.pub expression + segmentation
- Document assembly reads annotation layers, emits structural facets
- Renderer: sentences as inline spans, paragraphs as block elements
- **Goal:** Transcripts read as paragraphed prose

### Phase 2 — Entity recognition + record linking

- spaCy NER pass in the pipeline
- AT Protocol record resolver: people -> Bluesky profiles (DID resolution via handle/display name lookup), projects -> `tv.ionosphere.concept` records
- Entity annotation layer with `knowledgeRefs` to resolved records
- Renderer: entity spans as links/tooltips to profiles and concept pages
- **Goal:** People and projects mentioned in talks are clickable, linked to real AT Protocol identities

### Phase 3 — Topic segmentation

- Sentence-transformer embedding pass (e.g., `all-MiniLM-L6-v2`)
- Sliding-window cosine similarity topic boundary detection
- Topic segment annotation layer
- Renderer: section dividers or topic labels at major transitions
- UI: topic-based navigation within a talk (jump to "Q&A", "Demo", etc.)
- **Goal:** Long talks become navigable by topic

### Phase 4 — Manual curation layer

- UI for creating manual override annotations (correct a concept link, fix an entity, adjust a paragraph break)
- Published as AT Protocol records with `sourceMethod: "manual-native"`, higher `rank`
- Pipeline respects overrides on re-run
- Multi-user: anyone with write access can contribute corrections
- **Goal:** Community-curated enrichment that improves over time

### Phase 5 — Concept enrichment + cross-talk linking

- Supersede auto-detected concepts with curated concept records
- Cross-reference talks that mention the same entities/concepts
- `tv.ionosphere.facet#talk-xref` links between related talks
- Knowledge graph across the entire conference
- **Goal:** The archive becomes a connected knowledge base, not just isolated transcripts
