Ionosphere.tv

docs: transcript formatting implementation plan — 15 tasks across 7 chunks

Python NLP pipeline (spaCy sentences + pause-based paragraphs), hierarchical
extractData, paragraph/sentence DOM structure in TranscriptView and
WindowedTranscriptView, document assembly at publish time, old annotation
system removal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

+1138
docs/superpowers/plans/2026-04-12-transcript-formatting.md
# Transcript Formatting Implementation Plan

> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add NLP-based sentence and paragraph detection to transcripts so they render as structured prose instead of a wall of text.

**Architecture:** A Python NLP pipeline (spaCy) produces sentence/paragraph annotation layers as JSON files. A TypeScript publish step validates and publishes these as layers.pub AT Protocol records. The document assembly step reads annotation layers and emits structural facets (`#sentence`, `#paragraph`) that the React renderer consumes as inline spans and block elements. The old `tv.ionosphere.annotation` system is removed entirely.

**Tech Stack:** Python 3.12+, spaCy (`en_core_web_sm`), vitest, layers.pub lexicons, panproto, React/Next.js

**Spec:** `docs/superpowers/specs/2026-04-12-transcript-formatting-design.md`

**Scope note:** This plan implements the rendering pipeline (NLP → facets → renderer) end-to-end. Publishing layers.pub records to the PDS and creating panproto lenses are deferred to a follow-up; the vendored lexicons and spec are forward preparation. The immediate goal is formatted transcripts in the browser.

**UX note:** Removing the old annotation system (Task 13) will temporarily remove concept highlighting from talk pages. Concepts return via NLP in Phase 2. If this is unacceptable, Task 13 can be deferred and the old overlay path kept alongside the new structural facets.
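
The architecture above says the TypeScript side validates the NLP output before it becomes facets. As a rough illustration only, the sketch below checks that sentence and paragraph byte spans are in range, ordered, non-overlapping, and nested. The names `ByteSpan`, `checkSpans`, and `validateAnnotations` are hypothetical and do not come from the plan or the spec; only the `byteStart`/`byteEnd` field names are taken from the plan.

```typescript
// Hypothetical helper types and functions; not part of the plan itself.
interface ByteSpan {
  byteStart: number;
  byteEnd: number;
}

// Check one layer: every span in range, non-empty, sorted, and non-overlapping.
function checkSpans(label: string, spans: ByteSpan[], textBytes: number): string[] {
  const problems: string[] = [];
  let prevEnd = 0;
  for (let i = 0; i < spans.length; i++) {
    const s = spans[i];
    if (s.byteStart < 0 || s.byteEnd > textBytes || s.byteStart >= s.byteEnd) {
      problems.push(`${label}[${i}] is empty or outside the text`);
    }
    if (s.byteStart < prevEnd) {
      problems.push(`${label}[${i}] overlaps or is out of order`);
    }
    prevEnd = Math.max(prevEnd, s.byteEnd);
  }
  return problems;
}

// Returns a list of human-readable problems; an empty list means the layers look sane.
function validateAnnotations(
  text: string,
  sentences: ByteSpan[],
  paragraphs: ByteSpan[],
): string[] {
  // Offsets are UTF-8 byte offsets, so measure the encoded length, not text.length.
  const textBytes = new TextEncoder().encode(text).length;
  const problems = [
    ...checkSpans("sentence", sentences, textBytes),
    ...checkSpans("paragraph", paragraphs, textBytes),
  ];
  if (paragraphs.length > 0) {
    sentences.forEach((s, i) => {
      const inside = paragraphs.some(
        (p) => s.byteStart >= p.byteStart && s.byteEnd <= p.byteEnd,
      );
      if (!inside) problems.push(`sentence[${i}] is not contained in any paragraph`);
    });
  }
  return problems;
}
```

Returning a list of problems rather than throwing is one possible design; it lets a publish step report every issue for a talk in a single pass and then decide whether to skip that talk or fall back to the unstructured document.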

---

## Chunk 1: Vendor layers.pub lexicons and define new facet types

### Task 1: Vendor layers.pub lexicon definitions

**Files:**
- Create: `lexicons/pub/layers/defs.json`
- Create: `lexicons/pub/layers/expression/expression.json`
- Create: `lexicons/pub/layers/segmentation/segmentation.json`
- Create: `lexicons/pub/layers/annotation/annotationLayer.json`

- [ ] **Step 1: Create the `pub.layers.defs` shared definitions lexicon**

Vendor the subset of `pub.layers.defs` that we use: `span`, `temporalSpan`, `uuid`, `tokenRef`, `anchor`, `annotationMetadata`, `featureMap`, `feature`. Pull the field definitions from https://docs.layers.pub/lexicons/defs. These are the shared types referenced by the other lexicons.

- [ ] **Step 2: Create the `pub.layers.expression.expression` lexicon**

Vendor the expression record schema from https://docs.layers.pub/lexicons/expression. Required fields: `id`, `kindUri`, `kind`, `text`, `language`, `createdAt`. Optional: `sourceRef`, `parentRef`, `anchor`, `metadata`, `features`.

- [ ] **Step 3: Create the `pub.layers.segmentation.segmentation` lexicon**

Vendor the segmentation record schema from https://docs.layers.pub/lexicons/segmentation. This includes the `segmentation` record and the `tokenization` and `token` object defs.

- [ ] **Step 4: Create the `pub.layers.annotation.annotationLayer` lexicon**

Vendor the annotation layer record schema from https://docs.layers.pub/lexicons/annotation. This includes `annotationLayer` record and the `annotation` object def.
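
Before committing the vendored files, a quick shape check can catch copy errors. The sketch below assumes each file is a standard AT Protocol Lexicon document with a top-level `lexicon: 1`, an `id` NSID, and a `defs` object. `checkLexiconShape` is a hypothetical helper, not something the plan defines, and it does not verify the individual field definitions.

```typescript
// Hypothetical sanity check for a parsed lexicon JSON document.
// Assumes the AT Protocol Lexicon layout: { lexicon: 1, id: "<nsid>", defs: { ... } }.
function checkLexiconShape(raw: unknown, expectedId: string): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["document is not a JSON object"];
  }
  const doc = raw as Record<string, unknown>;
  const problems: string[] = [];
  if (doc.lexicon !== 1) {
    problems.push("`lexicon` should be the number 1");
  }
  if (doc.id !== expectedId) {
    problems.push(`\`id\` is ${String(doc.id)}, expected ${expectedId}`);
  }
  if (typeof doc.defs !== "object" || doc.defs === null) {
    problems.push("`defs` is missing or not an object");
  }
  return problems;
}
```

In practice this would be called once per vendored file, for example with the parsed contents of `lexicons/pub/layers/defs.json` and the expected id `pub.layers.defs`.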

- [ ] **Step 5: Commit**

```bash
git add lexicons/pub/
git commit -m "feat: vendor layers.pub lexicons for transcript enrichment"
```

### Task 2: Add sentence and paragraph facet types to the format lexicon

**Files:**
- Modify: `formats/tv.ionosphere/ionosphere.lexicon.json`

- [ ] **Step 1: Add `#sentence` (inline) and `#paragraph` (block) facet entries**

Add to the `features` array in `formats/tv.ionosphere/ionosphere.lexicon.json`:

```json
{
  "typeId": "tv.ionosphere.facet#sentence",
  "featureClass": "inline",
  "expandStart": false,
  "expandEnd": false
},
{
  "typeId": "tv.ionosphere.facet#paragraph",
  "featureClass": "block",
  "expandStart": false,
  "expandEnd": false
}
```

- [ ] **Step 2: Commit**

```bash
git add formats/tv.ionosphere/ionosphere.lexicon.json
git commit -m "feat: add sentence (inline) and paragraph (block) facet types"
```

---

## Chunk 2: Python NLP pipeline

### Task 3: Set up the Python enrichment project

**Files:**
- Create: `pipeline/pyproject.toml`
- Create: `pipeline/nlp/__init__.py`

- [ ] **Step 1: Create `pipeline/pyproject.toml`**

```toml
[project]
name = "ionosphere-nlp"
version = "0.1.0"
description = "NLP enrichment pipeline for ionosphere transcripts"
requires-python = ">=3.12"
dependencies = [
    "spacy>=3.7",
]

[project.optional-dependencies]
dev = ["pytest>=8.0"]

[tool.pytest.ini_options]
testpaths = ["tests"]
```

- [ ] **Step 2: Create the package init**

Create `pipeline/nlp/__init__.py` (empty file).

- [ ] **Step 3: Create `pipeline/tests/__init__.py`**

Empty file (needed for pytest discovery).

- [ ] **Step 4: Add `.gitignore` entries for Python artifacts**

Add to the repo root `.gitignore`:
```
pipeline/.venv/
pipeline/data/
__pycache__/
*.pyc
```

- [ ] **Step 5: Install dependencies**

```bash
cd pipeline
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python -m spacy download en_core_web_sm
```

- [ ] **Step 6: Commit**

```bash
git add pipeline/pyproject.toml pipeline/nlp/__init__.py pipeline/tests/__init__.py .gitignore
git commit -m "feat: scaffold Python NLP pipeline project"
```

### Task 4: Implement sentence boundary detection (Pass 1)

**Files:**
- Create: `pipeline/tests/test_sentences.py`
- Create: `pipeline/nlp/sentences.py`

- [ ] **Step 1: Write the failing test**

Create `pipeline/tests/test_sentences.py`:

```python
from nlp.sentences import detect_sentences


def test_basic_sentences():
    text = "Hello world. This is a test. And another sentence."
    sentences = detect_sentences(text)
    assert len(sentences) == 3
    # Each sentence is a dict with byteStart and byteEnd
    assert sentences[0]["byteStart"] == 0
    assert sentences[0]["byteEnd"] == len("Hello world.".encode("utf-8"))
    assert sentences[1]["byteStart"] == len("Hello world. ".encode("utf-8"))


def test_speech_without_punctuation():
    """spaCy should detect sentence boundaries even with poor punctuation."""
    text = "so the thing is we need to think about this carefully and then we can move on to the next topic which is about protocols"
    sentences = detect_sentences(text)
    # spaCy should find at least 1 sentence (the whole text if no clear boundary)
    assert len(sentences) >= 1
    # All sentences should cover the full text
    assert sentences[0]["byteStart"] == 0
    assert sentences[-1]["byteEnd"] == len(text.encode("utf-8"))


def test_empty_text():
    sentences = detect_sentences("")
    assert sentences == []


def test_byte_offsets_for_unicode():
    text = "Caf\u00e9 is great. Let\u2019s go."
    sentences = detect_sentences(text)
    # Byte offsets must account for multi-byte characters
    full_bytes = text.encode("utf-8")
    assert sentences[-1]["byteEnd"] == len(full_bytes)
```

- [ ] **Step 2: Run test to verify it fails**

```bash
cd pipeline && source .venv/bin/activate && pytest tests/test_sentences.py -v
```
Expected: FAIL with `ModuleNotFoundError`

- [ ] **Step 3: Implement `detect_sentences`**

Create `pipeline/nlp/sentences.py`:

```python
"""Pass 1: Sentence boundary detection using spaCy."""

import spacy

_nlp = None


def _get_nlp():
    global _nlp
    if _nlp is None:
        _nlp = spacy.load("en_core_web_sm")
    return _nlp


def detect_sentences(text: str) -> list[dict]:
    """Detect sentence boundaries and return byte-range spans.

    Returns a list of dicts, each with:
        byteStart: int, UTF-8 byte offset of sentence start
        byteEnd: int, UTF-8 byte offset of sentence end (exclusive)
    """
    if not text.strip():
        return []

    nlp = _get_nlp()
    doc = nlp(text)
    text_bytes = text.encode("utf-8")
    sentences = []

    for sent in doc.sents:
        # spaCy gives character offsets; convert to byte offsets
        byte_start = len(text[:sent.start_char].encode("utf-8"))
        byte_end = len(text[:sent.end_char].encode("utf-8"))
        sentences.append({
            "byteStart": byte_start,
            "byteEnd": byte_end,
        })

    return sentences
```

- [ ] **Step 4: Run tests to verify they pass**

```bash
cd pipeline && source .venv/bin/activate && pytest tests/test_sentences.py -v
```
Expected: all PASS

- [ ] **Step 5: Commit**

```bash
git add pipeline/nlp/sentences.py pipeline/tests/test_sentences.py
git commit -m "feat: sentence boundary detection via spaCy"
```

### Task 5: Implement paragraph segmentation (Pass 2)

**Files:**
- Create: `pipeline/tests/test_paragraphs.py`
- Create: `pipeline/nlp/paragraphs.py`

- [ ] **Step 1: Write the failing test**

Create `pipeline/tests/test_paragraphs.py`:

```python
from nlp.paragraphs import detect_paragraphs


def test_basic_paragraphs():
    text = "Hello world. This is sentence two. After a long pause here. New topic starts."
    sentences = [
        {"byteStart": 0, "byteEnd": 12},
        {"byteStart": 13, "byteEnd": 34},
        {"byteStart": 35, "byteEnd": 59},
        {"byteStart": 60, "byteEnd": 77},
    ]
    # Words: "Hello"=0, "world."=1, "This"=2, "is"=3, "sentence"=4, "two."=5,
    #        "After"=6, "a"=7, "long"=8, "pause"=9, "here."=10,
    #        "New"=11, "topic"=12, "starts."=13
    # Big pause gap (3000ms) between word index 5 and 6 (between sentence 2 and 3)
    # One positive duration per word (14 words), plus one negative pause entry
    timings = [100, 100, 100, 100, 100, 100, -3000, 100, 100, 100, 100, 100, 100, 100, 100]
    start_ms = 0

    paragraphs = detect_paragraphs(
        text=text,
        timings=timings,
        start_ms=start_ms,
        sentences=sentences,
        pause_threshold_ms=2000,
        proximity_words=5,
    )
    # Should detect a paragraph break at the sentence boundary near the 3s pause
    assert len(paragraphs) == 2
    assert paragraphs[0]["byteStart"] == 0
    assert paragraphs[1]["byteStart"] == 35  # "After a long pause..."


def test_no_long_pauses_single_paragraph():
    text = "One sentence. Two sentence."
    sentences = [
        {"byteStart": 0, "byteEnd": 13},
        {"byteStart": 14, "byteEnd": 27},
    ]
    timings = [100, 100, 100, 100]
    paragraphs = detect_paragraphs(
        text=text, timings=timings, start_ms=0,
        sentences=sentences,
    )
    assert len(paragraphs) == 1


def test_empty_input():
    paragraphs = detect_paragraphs(
        text="", timings=[], start_ms=0, sentences=[],
    )
    assert paragraphs == []
```

- [ ] **Step 2: Run test to verify it fails**

```bash
cd pipeline && source .venv/bin/activate && pytest tests/test_paragraphs.py -v
```
Expected: FAIL with `ModuleNotFoundError`

- [ ] **Step 3: Implement `detect_paragraphs`**

Create `pipeline/nlp/paragraphs.py`:

```python
"""Pass 2: Paragraph segmentation using pause duration + sentence boundaries."""


def detect_paragraphs(
    text: str,
    timings: list[int],
    start_ms: int,
    sentences: list[dict],
    pause_threshold_ms: int = 2000,
    proximity_words: int = 5,
) -> list[dict]:
    """Detect paragraph boundaries from timing gaps and sentence boundaries.

    Returns a list of paragraph dicts with byteStart and byteEnd.
    """
    if not text.strip() or not sentences:
        return []

    text_bytes = text.encode("utf-8")

    # Find word indices where long pauses occur
    pause_word_indices: list[int] = []
    word_index = 0
    for value in timings:
        if value < 0:
            if abs(value) >= pause_threshold_ms:
                pause_word_indices.append(word_index)
        else:
            word_index += 1

    if not pause_word_indices:
        # No long pauses: entire text is one paragraph
        return [{"byteStart": sentences[0]["byteStart"],
                 "byteEnd": sentences[-1]["byteEnd"]}]

    # Build a char→byte offset map once, then compute word byte starts
    text_bytes = text.encode("utf-8")
    char_to_byte = []
    byte_pos = 0
    for ch in text:
        char_to_byte.append(byte_pos)
        byte_pos += len(ch.encode("utf-8"))
    char_to_byte.append(byte_pos)  # sentinel for end of text

    # Find word start char offsets using split positions
    words = text.split()
    word_byte_starts: list[int] = []
    char_offset = 0
    for w in words:
        idx = text.index(w, char_offset)
        word_byte_starts.append(char_to_byte[idx])
        char_offset = idx + len(w)

    def word_index_for_byte(byte_pos: int) -> int:
        """Find the word index closest to a byte position."""
        best = 0
        for i, wb in enumerate(word_byte_starts):
            if wb <= byte_pos:
                best = i
        return best

    # Find sentence boundaries (byte positions where one sentence ends
    # and the next begins)
    sentence_break_byte_positions: list[int] = []
    sentence_break_word_indices: list[int] = []
    for i in range(1, len(sentences)):
        bp = sentences[i]["byteStart"]
        sentence_break_byte_positions.append(bp)
        sentence_break_word_indices.append(word_index_for_byte(bp))

    # For each long pause, find the nearest sentence boundary
    paragraph_break_bytes: set[int] = set()
    for pause_wi in pause_word_indices:
        best_dist = float("inf")
        best_bp = None
        for sb_wi, sb_bp in zip(
            sentence_break_word_indices, sentence_break_byte_positions
        ):
            dist = abs(sb_wi - pause_wi)
            if dist <= proximity_words and dist < best_dist:
                best_dist = dist
                best_bp = sb_bp
        if best_bp is not None:
            paragraph_break_bytes.add(best_bp)

    # Build paragraph spans from the break points
    sorted_breaks = sorted(paragraph_break_bytes)
    paragraphs: list[dict] = []
    current_start = sentences[0]["byteStart"]

    for brk in sorted_breaks:
        # Find the sentence that ends just before this break
        para_end = brk
        for s in sentences:
            if s["byteEnd"] <= brk:
                para_end = s["byteEnd"]
        paragraphs.append({"byteStart": current_start, "byteEnd": para_end})
        current_start = brk

    # Final paragraph
    paragraphs.append({
        "byteStart": current_start,
        "byteEnd": sentences[-1]["byteEnd"],
    })

    return paragraphs
```

- [ ] **Step 4: Run tests to verify they pass**

```bash
cd pipeline && source .venv/bin/activate && pytest tests/test_paragraphs.py -v
```
Expected: all PASS

- [ ] **Step 5: Commit**

```bash
git add pipeline/nlp/paragraphs.py pipeline/tests/test_paragraphs.py
git commit -m "feat: paragraph segmentation from pause data + sentence boundaries"
```

### Task 6: Pipeline orchestrator (process all transcripts)

**Files:**
- Create: `pipeline/nlp/run.py`
- Create: `pipeline/tests/test_run.py`

- [ ] **Step 1: Write the failing test**

Create `pipeline/tests/test_run.py`:

```python
import json
import os
from pathlib import Path
from nlp.run import process_transcript


def test_process_transcript_produces_output(tmp_path):
    """Integration test: full pipeline on a simple transcript."""
    transcript = {
        "text": "Hello world. This is a test. After a long pause. New topic here.",
        "startMs": 0,
        "timings": [100, 100, 100, 100, 100, 100, -3000, 100, 100, 100, 100, 100, 100, 100],
    }

    result = process_transcript(transcript, talk_rkey="test-talk")

    # Should have sentences and paragraphs
    assert "sentences" in result
    assert "paragraphs" in result
    assert len(result["sentences"]) >= 2
    assert len(result["paragraphs"]) >= 1
    # Each sentence has byte ranges
    for s in result["sentences"]:
        assert "byteStart" in s
        assert "byteEnd" in s
    # Metadata present
    assert "metadata" in result
    assert result["metadata"]["tool"] == "spacy/en_core_web_sm"
```

- [ ] **Step 2: Run test to verify it fails**

```bash
cd pipeline && source .venv/bin/activate && pytest tests/test_run.py -v
```

- [ ] **Step 3: Implement the orchestrator**

Create `pipeline/nlp/run.py`:

```python
"""Pipeline orchestrator: run all NLP passes on a transcript."""

import json
import sys
from pathlib import Path
from nlp.sentences import detect_sentences
from nlp.paragraphs import detect_paragraphs


def process_transcript(
    transcript: dict,
    talk_rkey: str,
    pause_threshold_ms: int = 2000,
    proximity_words: int = 5,
) -> dict:
    """Run all NLP passes on a single transcript.

    Args:
        transcript: dict with text, startMs, timings
        talk_rkey: the talk's record key (for output naming)

    Returns:
        dict with sentences, paragraphs, and metadata
    """
    text = transcript["text"]
    timings = transcript["timings"]
    start_ms = transcript["startMs"]

    # Pass 1: Sentence detection
    sentences = detect_sentences(text)

    # Pass 2: Paragraph segmentation
    paragraphs = detect_paragraphs(
        text=text,
        timings=timings,
        start_ms=start_ms,
        sentences=sentences,
        pause_threshold_ms=pause_threshold_ms,
        proximity_words=proximity_words,
    )

    return {
        "talkRkey": talk_rkey,
        "sentences": sentences,
        "paragraphs": paragraphs,
        "metadata": {
            "tool": "spacy/en_core_web_sm",
            "pauseThresholdMs": pause_threshold_ms,
            "proximityWords": proximity_words,
        },
    }


def main():
    """CLI: read transcripts from appview data/transcripts/, write results to pipeline/data/nlp/."""
    # Match the path used by publish.ts: apps/ionosphere-appview/data/transcripts/
    transcripts_dir = Path(__file__).resolve().parent.parent.parent / "apps" / "ionosphere-appview" / "data" / "transcripts"
    output_dir = Path(__file__).resolve().parent.parent / "data" / "nlp"
    output_dir.mkdir(parents=True, exist_ok=True)

    if not transcripts_dir.exists():
        print(f"Transcripts directory not found: {transcripts_dir}")
        sys.exit(1)

    transcript_files = sorted(transcripts_dir.glob("*.json"))
    print(f"Processing {len(transcript_files)} transcripts...")

    for tf in transcript_files:
        talk_rkey = tf.stem
        transcript = json.loads(tf.read_text())

        # The cached transcript files contain TranscriptResult format
        # (text + words array). We need to encode to compact format first.
        # But the pipeline needs text + timings. Let's derive timings from words.
        if "words" in transcript and "timings" not in transcript:
            from nlp.encoding import words_to_compact
            compact = words_to_compact(transcript)
        else:
            compact = transcript

        result = process_transcript(compact, talk_rkey=talk_rkey)

        out_path = output_dir / f"{talk_rkey}.json"
        out_path.write_text(json.dumps(result, indent=2))
        print(f"  {talk_rkey}: {len(result['sentences'])} sentences, {len(result['paragraphs'])} paragraphs")

    print("Done.")


if __name__ == "__main__":
    main()
```

- [ ] **Step 4: Create `pipeline/nlp/encoding.py`, a helper to convert word-level transcripts to compact format**

```python
"""Convert word-level transcript format to compact (text + timings) format."""


def words_to_compact(transcript: dict) -> dict:
    """Convert TranscriptResult {text, words[{word, start, end}]} to compact {text, startMs, timings}."""
    words = transcript.get("words", [])
    if not words:
        return {"text": transcript.get("text", ""), "startMs": 0, "timings": []}

    start_ms = round(words[0]["start"] * 1000)
    timings = []
    cursor = start_ms

    for w in words:
        word_start_ms = round(w["start"] * 1000)
        word_end_ms = round(w["end"] * 1000)
        duration = word_end_ms - word_start_ms

        gap = word_start_ms - cursor
        if gap > 0:
            timings.append(-gap)

        timings.append(max(duration, 1))
        cursor = word_end_ms

    return {
        "text": transcript["text"],
        "startMs": start_ms,
        "timings": timings,
    }
```

- [ ] **Step 5: Run tests to verify they pass**

```bash
cd pipeline && source .venv/bin/activate && pytest tests/ -v
```
Expected: all PASS

- [ ] **Step 6: Commit**

```bash
git add pipeline/nlp/run.py pipeline/nlp/encoding.py pipeline/tests/test_run.py
git commit -m "feat: NLP pipeline orchestrator with CLI entry point"
```

---

## Chunk 3: TypeScript: update `extractData` for hierarchical structure

### Task 7: Add `ParagraphSpan` and `SentenceSpan` types and update `extractData`

**Files:**
- Modify: `apps/ionosphere/src/lib/transcript.ts`
- Modify: `apps/ionosphere/src/lib/transcript.test.ts`

- [ ] **Step 1: Write failing tests for hierarchical extraction**

Add to `apps/ionosphere/src/lib/transcript.test.ts`:

```typescript
describe("extractData: hierarchical structure", () => {
  it("groups words into sentences and paragraphs when facets present", () => {
    const doc = makeDoc([
      { text: "Hello", startNs: 1000, endNs: 2000 },
      { text: "world.", startNs: 2000, endNs: 3000 },
      { text: "New", startNs: 4000, endNs: 5000 },
      { text: "sentence.", startNs: 5000, endNs: 6000 },
    ]);
    const encoder = new TextEncoder();
    const text = "Hello world. New sentence.";
    // Add sentence facets
    doc.facets.push({
      index: {
        byteStart: 0,
        byteEnd: encoder.encode("Hello world.").length,
      },
      features: [{ $type: "tv.ionosphere.facet#sentence" }],
    });
    doc.facets.push({
      index: {
        byteStart: encoder.encode("Hello world. ").length,
        byteEnd: encoder.encode(text).length,
      },
      features: [{ $type: "tv.ionosphere.facet#sentence" }],
    });
    // Add paragraph facet (one paragraph covering everything)
    doc.facets.push({
      index: { byteStart: 0, byteEnd: encoder.encode(text).length },
      features: [{ $type: "tv.ionosphere.facet#paragraph" }],
    });

    const result = extractData(doc);
    expect(result.paragraphs).toHaveLength(1);
    expect(result.paragraphs[0].sentences).toHaveLength(2);
    expect(result.paragraphs[0].sentences[0].words).toHaveLength(2);
    expect(result.paragraphs[0].sentences[1].words).toHaveLength(2);
  });

  it("gracefully degrades to singleton paragraph/sentence when no structural facets", () => {
    const doc = makeDoc([
      { text: "Hello", startNs: 1000, endNs: 2000 },
      { text: "world", startNs: 2000, endNs: 3000 },
    ]);

    const result = extractData(doc);
    // Should still have paragraphs/sentences structure
    expect(result.paragraphs).toHaveLength(1);
    expect(result.paragraphs[0].sentences).toHaveLength(1);
    expect(result.paragraphs[0].sentences[0].words).toHaveLength(2);
    // Legacy flat access still works
    expect(result.words).toHaveLength(2);
  });
});
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts
```
Expected: FAIL (`paragraphs` property does not exist)

- [ ] **Step 3: Add types and update `extractData`**

Add these types to `apps/ionosphere/src/lib/transcript.ts`:

```typescript
export interface SentenceSpan {
  byteStart: number;
  byteEnd: number;
  words: WordSpan[];
}

export interface ParagraphSpan {
  byteStart: number;
  byteEnd: number;
  sentences: SentenceSpan[];
}
```

Update `extractData` to return `paragraphs: ParagraphSpan[]` alongside the existing flat `words` array. The function extracts `#sentence` and `#paragraph` facets, groups words into sentences by byte range overlap, groups sentences into paragraphs, and falls back to singleton wrappers when structural facets are absent.

Key logic:
1. Extract words and concepts as before (existing code unchanged).
2. Extract sentence facets (byteStart/byteEnd from `#sentence` features). Sort by byteStart.
3. Extract paragraph facets (byteStart/byteEnd from `#paragraph` features). Sort by byteStart.
4. If no sentence facets: wrap all words in one sentence. If no paragraph facets: wrap all sentences in one paragraph.
5. Assign each word to its sentence (word.byteStart >= sentence.byteStart && word.byteEnd <= sentence.byteEnd).
6. Assign each sentence to its paragraph (sentence.byteStart >= paragraph.byteStart && sentence.byteEnd <= paragraph.byteEnd).

- [ ] **Step 4: Run tests to verify they pass**

```bash
cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts
```
Expected: all PASS (both new and existing tests)

- [ ] **Step 5: Commit**

```bash
git add apps/ionosphere/src/lib/transcript.ts apps/ionosphere/src/lib/transcript.test.ts
git commit -m "feat: hierarchical extractData with paragraph/sentence grouping"
```

---

## Chunk 4: Update the renderer

### Task 8: Update `TranscriptView` to render paragraphs and sentences

**Files:**
- Modify: `apps/ionosphere/src/app/components/TranscriptView.tsx`

- [ ] **Step 1: Update the render tree**

Replace the flat `words.map(...)` rendering with a nested structure:

```tsx
{paragraphs.map((para, pi) => (
  <div key={pi} className="mb-4">
    {para.sentences.map((sent, si) => (
      <span key={si} className="sentence">
        {sent.words.map((word, wi) => {
          const globalIdx = /* compute global word index */;
          return (
            <WordSpanComponent
              key={globalIdx}
              ref={(el) => setWordRef(globalIdx, el)}
              word={word}
              concept={wordConcepts[globalIdx]?.[0] || null}
              currentTimeNs={currentTimeNs}
              onSeek={handleSeek}
              hasComment={wordHasComment.has(globalIdx)}
            />
          );
        })}
      </span>
    ))}
  </div>
))}
```

The `useMemo` call to `extractData` now destructures `paragraphs` alongside `words` and `wordConcepts`. The global word index is computed by maintaining a running counter across paragraphs and sentences.

The comment system, reaction groups, text selection, and scroll/time mappings continue to use the flat `words` array (unchanged). Only the DOM structure changes to add the paragraph/sentence grouping.

- [ ] **Step 2: Verify in browser**

Start the dev server and load a talk page. Verify:
- Transcripts without structural facets render identically to before (graceful degradation)
- No console errors
- Scroll-to-time and click-to-seek still work
- Comments and reactions still work

- [ ] **Step 3: Commit**

```bash
git add apps/ionosphere/src/app/components/TranscriptView.tsx
git commit -m "feat: render transcripts with paragraph/sentence DOM structure"
```

### Task 9: Update `WindowedTranscriptView` for paragraph gaps

**Files:**
- Modify: `apps/ionosphere/src/app/components/WindowedTranscriptView.tsx`

- [ ] **Step 1: Update `computeMonospaceLayout` to accept paragraph breaks**

Add a `paragraphStartIndices: Set<number>` parameter. When a word is a paragraph start (its global index is in the set), insert a gap of `LINE_HEIGHT` pixels before that line entry. Add `isParagraphStart: boolean` to `LineEntry`.
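
To make the `paragraphStartIndices` parameter concrete, here is a minimal sketch under stated assumptions. The plan does not show the real `computeMonospaceLayout` signature or the existing `LineEntry` fields, so `Word`, `Sentence`, `Paragraph`, `SketchLineEntry`, `paragraphStartIndices()`, `layoutLines()`, and the `LINE_HEIGHT` value of 24 are all hypothetical stand-ins. The sketch also skips the gap before the very first paragraph, which is one possible reading of the step above.

```typescript
// Hypothetical minimal shapes; the real WordSpan/LineEntry types have more fields.
interface Word { text: string }
interface Sentence { words: Word[] }
interface Paragraph { sentences: Sentence[] }

interface SketchLineEntry {
  y: number;               // top of the line in pixels
  firstWord: number;       // global index of the first word on the line
  lastWord: number;        // global index of the last word on the line
  isParagraphStart: boolean;
}

const LINE_HEIGHT = 24; // assumed value for illustration

// Walk paragraphs with a running counter to find each paragraph's first global word index.
function paragraphStartIndices(paragraphs: Paragraph[]): Set<number> {
  const starts = new Set<number>();
  let globalIdx = 0;
  for (const para of paragraphs) {
    const count = para.sentences.reduce((n, s) => n + s.words.length, 0);
    if (count > 0) starts.add(globalIdx);
    globalIdx += count;
  }
  return starts;
}

// Greedy monospace wrapping that forces a new line, plus one blank line of gap,
// whenever a word starts a paragraph (except at the very top).
function layoutLines(
  words: Word[],
  charsPerLine: number,
  starts: Set<number>,
): SketchLineEntry[] {
  const lines: SketchLineEntry[] = [];
  let current: SketchLineEntry | null = null;
  let y = 0;
  let col = 0;
  for (let i = 0; i < words.length; i++) {
    const len = words[i].text.length;
    const isStart = starts.has(i);
    if (current === null || isStart || col + 1 + len > charsPerLine) {
      if (current !== null) {
        y += LINE_HEIGHT;               // advance past the previous line
        if (isStart) y += LINE_HEIGHT;  // paragraph gap
      }
      current = { y, firstWord: i, lastWord: i, isParagraphStart: isStart };
      lines.push(current);
      col = len;
    } else {
      current.lastWord = i;
      col += 1 + len; // one space plus the word
    }
  }
  return lines;
}
```

Because the gap is folded into each line's `y`, a renderer could either draw an explicit spacer above lines flagged `isParagraphStart` or rely on absolute positioning; which of those fits `WindowedTranscriptView` depends on code not shown here.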

- [ ] **Step 2: Update the rendering to add paragraph gap spacers**

For each visible line with `isParagraphStart: true`, render a gap spacer `div` above it.

- [ ] **Step 3: Update `timeToScrollY` and `scrollYToTime`**

Gap entries have no time range. Scrolling through a gap seeks to the end of the preceding line's time range (treating the gap as an extension of the previous paragraph's final time).

- [ ] **Step 4: Verify in browser**

Load the track view (which uses `WindowedTranscriptView`). Verify paragraph gaps appear and scroll behavior is smooth.

- [ ] **Step 5: Commit**

```bash
git add apps/ionosphere/src/app/components/WindowedTranscriptView.tsx
git commit -m "feat: WindowedTranscriptView paragraph gap support"
```

---

## Chunk 5: Document assembly and publish pipeline

### Task 10: Update document assembly to include structural facets

**Files:**
- Modify: `formats/tv.ionosphere/ts/transcript-encoding.ts`
- Modify: `formats/tv.ionosphere/ts/transcript-encoding.test.ts`

- [ ] **Step 1: Write the failing test**

Add to `formats/tv.ionosphere/ts/transcript-encoding.test.ts`:

```typescript
describe("decodeToDocumentWithStructure", () => {
  it("adds sentence and paragraph facets from NLP annotations", () => {
    const compact = encode(contiguous);
    const annotations = {
      sentences: [
        { byteStart: 0, byteEnd: 11 }, // "hello world"
        { byteStart: 12, byteEnd: 26 }, // "this is a test"
      ],
      paragraphs: [
        { byteStart: 0, byteEnd: 26 },
      ],
    };
    const doc = decodeToDocumentWithStructure(compact, annotations);

    const sentenceFacets = doc.facets.filter(f =>
      f.features.some(feat => feat.$type === "tv.ionosphere.facet#sentence")
    );
    const paragraphFacets = doc.facets.filter(f =>
      f.features.some(feat => feat.$type === "tv.ionosphere.facet#paragraph")
    );
    expect(sentenceFacets).toHaveLength(2);
    expect(paragraphFacets).toHaveLength(1);
  });

  it("produces valid document without annotations (backward compatible)", () => {
    const compact = encode(contiguous);
    const doc = decodeToDocumentWithStructure(compact, null);
    // Same as decodeToDocument
    expect(doc.facets.length).toBe(6); // just timestamp facets
  });
});
```

- [ ] **Step 2: Run test to verify it fails**

```bash
cd formats/tv.ionosphere && npx vitest run ts/transcript-encoding.test.ts
```

- [ ] **Step 3: Implement `decodeToDocumentWithStructure`**

Add to `formats/tv.ionosphere/ts/transcript-encoding.ts`:

```typescript
export interface NlpAnnotations {
  sentences: Array<{ byteStart: number; byteEnd: number }>;
  paragraphs: Array<{ byteStart: number; byteEnd: number }>;
}

export function decodeToDocumentWithStructure(
  compact: CompactTranscript,
  annotations: NlpAnnotations | null,
): Document {
  // Start with the base document (timestamp facets)
  const doc = decodeToDocument(compact);

  if (!annotations) return doc;

  // Add sentence facets
  for (const s of annotations.sentences) {
    doc.facets.push({
      index: { byteStart: s.byteStart, byteEnd: s.byteEnd },
      features: [{ $type: "tv.ionosphere.facet#sentence" }],
    });
  }

  // Add paragraph facets
  for (const p of annotations.paragraphs) {
    doc.facets.push({
      index: { byteStart: p.byteStart, byteEnd: p.byteEnd },
      features: [{ $type: "tv.ionosphere.facet#paragraph" }],
    });
  }

  return doc;
}
```

- [ ] **Step 4: Run tests to verify they pass**

```bash
cd formats/tv.ionosphere && npx vitest run ts/transcript-encoding.test.ts
```

- [ ] **Step 5: Commit**

```bash
git add formats/tv.ionosphere/ts/transcript-encoding.ts formats/tv.ionosphere/ts/transcript-encoding.test.ts
git commit -m "feat: decodeToDocumentWithStructure for NLP annotations"
```

### Task 11: Update publish.ts to include assembled documents on talk records

**Files:**
- Modify: `apps/ionosphere-appview/src/publish.ts`

- [ ] **Step 1: Update the talk publishing step**

After publishing transcripts (step 4 in publish.ts), add a step that:
1. For each talk, checks if NLP output exists at `pipeline/data/nlp/{rkey}.json`
2. If it does, reads the NLP annotations
3. Calls `decodeToDocumentWithStructure` with the compact transcript + annotations
4. Includes the assembled `document` field on the `tv.ionosphere.talk` record

This moves document assembly from serve time to publish time, as specified in the design.

- [ ] **Step 2: Verify by running publish in dry-run or against local PDS**

Check that talk records now include the `document` field with sentence/paragraph facets.

- [ ] **Step 3: Commit**

```bash
git add apps/ionosphere-appview/src/publish.ts
git commit -m "feat: publish assembled documents with structural facets on talk records"
```

### Task 12: Update appview routes to serve pre-assembled documents

**Files:**
- Modify: `apps/ionosphere-appview/src/routes.ts`

- [ ] **Step 1: Remove `overlayAnnotations` and serve pre-assembled document**

In the `getTalk` route handler:
1. Remove the `overlayAnnotations` function entirely (lines 17-59).
2. Remove the annotation overlay logic in the route (lines 173-185).
3. If the talk record has a `document` field in the DB, serve it directly.
4. Fall back to `decodeToDocument` from the compact transcript if no pre-assembled document exists (backward compatibility during transition).

- [ ] **Step 2: Update the indexer to store the document field**

In `apps/ionosphere-appview/src/indexer.ts`, update the `indexTalk` function's INSERT statement (line 176-197). The `talks` table already has a `document TEXT` column (line 54 of db.ts), but the INSERT does not include it. Add `document` to the column list and bind `record.document ? JSON.stringify(record.document) : null` as the value. This is a SQL change: the column list and VALUES placeholders must both be updated.

- [ ] **Step 3: Commit**

```bash
git add apps/ionosphere-appview/src/routes.ts apps/ionosphere-appview/src/indexer.ts
git commit -m "feat: serve pre-assembled documents, remove overlayAnnotations"
```

---

## Chunk 6: Remove old enrichment system

### Task 13: Remove old annotation/enrichment code

**Files:**
- Delete: `apps/ionosphere-appview/src/enrich.ts`
- Delete: `apps/ionosphere-appview/src/enrich-all.ts`
- Delete: `apps/ionosphere-appview/src/publish-annotations.ts`
- Modify: `apps/ionosphere-appview/src/indexer.ts` (remove `tv.ionosphere.annotation` handling)
- Modify: `apps/ionosphere-appview/src/routes.ts` (remove annotation-related queries from `getTalk`)

- [ ] **Step 1: Delete enrichment files**

```bash
rm apps/ionosphere-appview/src/enrich.ts
rm apps/ionosphere-appview/src/enrich-all.ts
rm apps/ionosphere-appview/src/publish-annotations.ts
```

- [ ] **Step 2: Remove annotation indexing from `indexer.ts`**

Remove `"tv.ionosphere.annotation"` from `IONOSPHERE_COLLECTIONS` array (line 28). Remove the annotation delete case (lines 72-75). Remove the annotation create/update case (lines 116-117).
Remove the `indexAnnotation` function and `rebuildTalkConcepts` helper. 1038 + 1039 + - [ ] **Step 3: Remove annotation queries from `routes.ts`** 1040 + 1041 + In the `getTalk` route, remove the concepts query (lines 149-157) and the annotation overlay logic. The concepts data will return via layers.pub in Phase 2. 1042 + 1043 + - [ ] **Step 4: Remove annotation publishing from `publish.ts`** 1044 + 1045 + Remove step 6 (lines 158-177) that publishes `tv.ionosphere.annotation` records. 1046 + 1047 + - [ ] **Step 5: Verify the appview still starts and serves talks** 1048 + 1049 + ```bash 1050 + cd apps/ionosphere-appview && npx tsx src/appview.ts 1051 + ``` 1052 + Hit the `/xrpc/tv.ionosphere.getTalk?rkey=<some-rkey>` endpoint and verify it returns a talk with a document. 1053 + 1054 + - [ ] **Step 6: Commit** 1055 + 1056 + ```bash 1057 + git add -A apps/ionosphere-appview/src/ 1058 + git commit -m "chore: remove old enrichment system (enrich.ts, annotations, overlayAnnotations)" 1059 + ``` 1060 + 1061 + --- 1062 + 1063 + ## Chunk 7: End-to-end integration and verification 1064 + 1065 + **IMPORTANT:** Tasks 11-12 create the publish-time document assembly path, but existing talks in the appview DB will have NULL documents until a full re-publish is done. Task 14 performs this re-publish. Do NOT deploy Tasks 11-12 without running Task 14, or existing talks will lose concept overlays with no replacement. 1066 + 1067 + ### Task 14: Run the full pipeline end-to-end 1068 + 1069 + - [ ] **Step 1: Run the Python NLP pipeline on all transcripts** 1070 + 1071 + ```bash 1072 + cd pipeline && source .venv/bin/activate && python -m nlp.run 1073 + ``` 1074 + 1075 + Verify output files appear in `pipeline/data/nlp/` with sentence and paragraph data. 1076 + 1077 + - [ ] **Step 2: Spot-check 3-5 NLP output files** 1078 + 1079 + Open output JSON files for talks of different types (presentation, panel, lightning talk). 
Verify: 1080 + - Sentence count is reasonable (expect 50-300 for a 20-min talk) 1081 + - Paragraph count is reasonable (expect 5-30) 1082 + - Byte ranges are valid (byteStart < byteEnd, monotonically increasing) 1083 + - Paragraph boundaries fall at sentence boundaries 1084 + 1085 + - [ ] **Step 3: Run the TypeScript publish pipeline** 1086 + 1087 + ```bash 1088 + cd apps/ionosphere-appview && npx tsx src/publish.ts 1089 + ``` 1090 + 1091 + Verify talk records now include the `document` field with structural facets. 1092 + 1093 + - [ ] **Step 4: Start the appview and frontend, verify in browser** 1094 + 1095 + Start the dev environment and load several talk pages. Verify: 1096 + - Paragraphs have visible vertical spacing 1097 + - Sentences are grouped as inline spans 1098 + - Scroll-to-time and click-to-seek work correctly 1099 + - The playhead brightness gradient is smooth across paragraph breaks 1100 + - Comments and reactions still work 1101 + - Talks without NLP data still render correctly (graceful degradation) 1102 + 1103 + - [ ] **Step 5: Commit any fixes found during verification** 1104 + 1105 + ```bash 1106 + git add -A && git commit -m "fix: integration fixes from end-to-end verification" 1107 + ``` 1108 + 1109 + ### Task 15: Final cleanup 1110 + 1111 + - [ ] **Step 1: Run all tests** 1112 + 1113 + ```bash 1114 + # Python 1115 + cd pipeline && source .venv/bin/activate && pytest -v 1116 + 1117 + # TypeScript 1118 + cd ../.. && npx vitest run 1119 + ``` 1120 + 1121 + All tests should pass. 1122 + 1123 + - [ ] **Step 2: Update `.gitignore` for Python artifacts** 1124 + 1125 + Add to `.gitignore`: 1126 + ``` 1127 + pipeline/.venv/ 1128 + pipeline/data/ 1129 + __pycache__/ 1130 + *.pyc 1131 + ``` 1132 + 1133 + - [ ] **Step 3: Final commit** 1134 + 1135 + ```bash 1136 + git add -A 1137 + git commit -m "chore: final cleanup — tests passing, gitignore updated" 1138 + ```
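
The byte-range checks listed in Task 14, Step 2 could also be automated rather than eyeballed. The following is a minimal sketch, not part of the plan itself: `validateAnnotations` is a hypothetical helper, the interface simply mirrors the `NlpAnnotations` shape from Task 10, and it assumes that each paragraph starts exactly at some sentence's `byteStart` and ends exactly at some sentence's `byteEnd`.

```typescript
// Hypothetical helper; mirrors the NlpAnnotations shape defined in Task 10.
interface ByteRange {
  byteStart: number;
  byteEnd: number;
}

interface NlpAnnotations {
  sentences: ByteRange[];
  paragraphs: ByteRange[];
}

// Returns a list of human-readable problems; an empty list means the input passed.
function validateAnnotations(a: NlpAnnotations): string[] {
  const problems: string[] = [];

  // Each range must be non-empty and must not start before the previous one ended.
  const checkRanges = (label: string, ranges: ByteRange[]): void => {
    let prevEnd = 0;
    ranges.forEach((r, i) => {
      if (r.byteStart >= r.byteEnd) {
        problems.push(`${label}[${i}]: byteStart must be < byteEnd`);
      }
      if (r.byteStart < prevEnd) {
        problems.push(`${label}[${i}]: overlaps or precedes the previous range`);
      }
      prevEnd = Math.max(prevEnd, r.byteEnd);
    });
  };

  checkRanges("sentences", a.sentences);
  checkRanges("paragraphs", a.paragraphs);

  // Assumption: paragraph edges coincide exactly with sentence edges.
  const starts = new Set(a.sentences.map((s) => s.byteStart));
  const ends = new Set(a.sentences.map((s) => s.byteEnd));
  a.paragraphs.forEach((p, i) => {
    if (!starts.has(p.byteStart)) {
      problems.push(`paragraphs[${i}]: start is not a sentence start`);
    }
    if (!ends.has(p.byteEnd)) {
      problems.push(`paragraphs[${i}]: end is not a sentence end`);
    }
  });

  return problems;
}
```

If the files under `pipeline/data/nlp/` do parse into this shape (an assumption; the on-disk JSON layout is defined earlier in the plan), each could be read, parsed, and passed to the helper, with any non-empty result treated as a failed spot-check.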