Ionosphere.tv

chore: remove superpowers docs and panproto wishlist

-3905
-123
docs/panproto-wishlist.md
- # Panproto Wishlist for Ionosphere
-
- What we need from panproto to make the ionosphere lens layer fully declarative — no mechanical shims, all transforms expressed as protolens chains stored as AT Protocol records.
-
- Filed upstream: panproto/panproto#15
-
- ---
-
- ## 1. Pipeline combinator API in the TypeScript SDK
-
- **Priority: critical**
-
- The tutorial (Chapter 6) shows:
- ```typescript
- const lens = pipeline([
-   renameField('displayName', 'name'),
-   addField('bio', 'string', ''),
-   removeField('legacyField'),
-   hoistField('additionalData.room'),
- ]);
- ```
-
- This is exactly what we need for cross-schema transforms (calendar event → talk, VOD → talk). The Rust engine has these combinators, but they're not exported from `@panproto/core`. The WASM exposes `apply_protolens_step` with elementary steps (`rename_sort`, `drop_sort`, `add_sort`, `rename_op`, `drop_op`, `add_op`), but:
-
- - `rename_sort` renames the vertex, not the prop edge `name` (which controls the JSON key)
- - `rename_op` renames the edge kind, not the edge name
- - There's no hoist step (move nested field to top level)
-
- **What we need:** `renameField(oldPropName, newPropName)` that renames the JSON property key — which means renaming the prop edge's `name` attribute, not just the vertex.
-
- **Our workaround:** Existing `applyLens` field mapper with JSON lens specs.
-
- ## 2. Prop edge name rename step
-
- **Priority: critical**
-
- The elementary step vocabulary has 6 types: `{add,drop,rename}_{sort,op}`. None of these rename a prop edge's `name` attribute. In ATProto lexicons, the JSON property name comes from the prop edge name (e.g., `edge('body', 'body.name', 'prop', { name: 'name' })`). To rename a JSON key from `name` to `title`, we need to rename this edge attribute.
-
- Either:
- - A new step type: `rename_prop_name` (or `rename_edge_name`)
- - Or make `rename_op` also handle the edge `name` attribute when the edge kind is `prop`
-
- ## 3. Hoist / restructure steps
-
- **Priority: high**
-
- ATmosphereConf calendar events store metadata in `additionalData.room`, `additionalData.category`, etc. Ionosphere talks have these as top-level fields: `room`, `category`, `talkType`. This is a hoist: move a property from a nested object to the parent.
-
- The auto-lens generation (Chapter 17) has a "Restructuring Pass" that handles this, but it requires finding a morphism first — which fails when the schemas are too different.
-
- We need either:
- - A `hoist_field` elementary step
- - Or the ability to express this as a composed rename_sort + edge rearrangement
-
- ## 4. Morphism hints for cross-namespace schemas
-
- **Priority: high**
-
- `pp.lens(calendarSchema, talkSchema)` fails with "no morphism found" because the schemas have completely different NSID namespaces. The morphism search uses name similarity (edit distance) for scoring, but `community.lexicon.calendar.event:body.name` has no name similarity to `tv.ionosphere.talk:body.title`.
-
- The overlap discovery (`discover_overlap`) also fails because `find_best_morphism` returns empty.
-
- We need a way to provide explicit vertex correspondence hints:
- ```typescript
- const lens = pp.lens(calendarSchema, talkSchema, {
-   hints: {
-     'community.lexicon.calendar.event:body.name': 'tv.ionosphere.talk:body.title',
-     'community.lexicon.calendar.event:body.description': 'tv.ionosphere.talk:body.description',
-   }
- });
- ```
-
- Or equivalently, the ability to seed the morphism search with known correspondences.
-
- ## 5. `lift_json` exposed in the TypeScript SDK
-
- **Priority: medium**
-
- The WASM has `lift_json(migration, json_bytes, root_vertex)` which takes JSON in and returns JSON out. The SDK's `CompiledMigration.lift()` requires an `Instance` object (from `pp.parseJson()`), which is less ergonomic for the common case of transforming plain JSON records.
-
- Exposing `lift_json` in the SDK would eliminate the `parseJson` → `lift` → `toJson` round-trip.
-
- ## 6. `parseLexicon` metadata format fix
-
- **Priority: medium**
-
- `schema_metadata` WASM function returns positional arrays `[protocol, vertices[], edges[]]` but the SDK's `parseLexicon` method expects named object keys (`meta.vertices.map(...)`). We work around this with a direct WASM call that handles both formats.
-
- ## 7. WASM binary in npm package
-
- **Priority: medium**
-
- `@panproto/core@0.22.0` ships the TypeScript SDK shell but not the WASM binary (`panproto_wasm.js` + `panproto_wasm_bg.wasm`). We build from source with `wasm-pack build crates/panproto-wasm --target web --release` and copy the output into `node_modules`.
-
- Additionally, the web-target glue module uses `fetch()` to load the `.wasm` file, which doesn't work with `file://` URLs in Node.js. We work around this by pre-loading the binary with `readFileSync` and using `initSync({ module: wasmBytes })` via a wrapped glue module.
-
- Options:
- - Ship the WASM in the npm tarball (adds ~6MB)
- - Publish a separate `@panproto/wasm` package
- - Ship a Node.js-target build alongside the web-target build
-
- ## 8. Migration builder for partial cross-schema transforms
-
- **Priority: low (covered by items 1-4)**
-
- `pp.migration(src, tgt).map(...)` works for structurally similar schemas but fails with "root node was pruned during restriction" when the source schema has many unmapped vertices. This is the expected behavior for a morphism-based migration, but it means the migration builder can't be used for cross-schema transforms where most source fields are dropped.
-
- If items 1-4 land, this becomes unnecessary — the pipeline combinator API is the right tool for cross-schema transforms.
-
- ---
-
- ## What works great today
-
- - `parseLexicon()` — parses any ATProto lexicon into a schema graph
- - `diff()` / `diffFull()` — accurate structural diffs between schema versions
- - `validateSchema()` — validates schemas against protocol rules
- - `pp.lens(v1, v2)` — auto-generates lenses between structurally similar schemas
- - `migration().map().compile()` — explicit vertex mapping for version migration
- - `protolensChain().toJson()` / `fromJson()` — serialization for storage as AT Protocol records
- - `protocol('atproto')` + `SchemaBuilder` — manual schema construction
- - Schema diffing with 20+ change categories
-
- The algebraic foundations are excellent. These SDK gaps are the last mile to making cross-schema transforms fully declarative.
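For readers skimming this removed wishlist, the record-level behavior it asks for in items 1 and 3 can be sketched on plain JSON. This is only an illustration under stated assumptions: it is not the `@panproto/core` API, it never touches the schema graph, and `renameField`, `addField`, `removeField`, `hoistField`, and `pipeline` here are hypothetical stand-ins that merely borrow the names from the tutorial snippet quoted above. The sample record and the `"presentation"` default are invented for the example.

```typescript
// Hypothetical record-level sketch; NOT the @panproto/core API.
type Json = Record<string, unknown>;
type Step = (record: Json) => Json;

// Rename a top-level JSON key; records without the key pass through unchanged.
const renameField = (from: string, to: string): Step => (record) => {
  if (!(from in record)) return record;
  const { [from]: value, ...rest } = record;
  return { ...rest, [to]: value };
};

// Add a key with a default value unless the record already has it.
// The second argument mirrors the tutorial's type tag and is ignored here.
const addField = (name: string, _kind: string, defaultValue: unknown): Step =>
  (record) => (name in record ? record : { ...record, [name]: defaultValue });

// Drop a top-level key.
const removeField = (name: string): Step => (record) => {
  const { [name]: _dropped, ...rest } = record;
  return rest;
};

// Move `parent.child` to the top level as `child`; drop the parent if it is left empty.
const hoistField = (path: string): Step => (record) => {
  const [parent, child] = path.split(".");
  const nested = record[parent];
  if (typeof nested !== "object" || nested === null || !(child in nested)) {
    return record;
  }
  const { [child]: value, ...remaining } = nested as Json;
  const { [parent]: _old, ...rest } = record;
  const hoisted: Json = { ...rest, [child]: value };
  return Object.keys(remaining).length > 0
    ? { ...hoisted, [parent]: remaining }
    : hoisted;
};

// Compose steps left to right.
const pipeline = (steps: Step[]): Step => (record) =>
  steps.reduce((acc, step) => step(acc), record);

// Calendar event -> talk, following the field names used in the wishlist.
const calendarEventToTalk = pipeline([
  renameField("name", "title"),
  addField("talkType", "string", "presentation"),
  removeField("legacyField"),
  hoistField("additionalData.room"),
]);

const talk = calendarEventToTalk({
  name: "Lenses for lexicons",
  legacyField: 1,
  additionalData: { room: "Main Hall", category: "protocol" },
});
console.log(talk);
```

Each step is a pure function, so `pipeline` is plain left-to-right composition. A real protolens chain would also carry schema information and a backward direction, both of which this sketch leaves out.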
-1097
docs/superpowers/plans/2026-04-12-enrichment-phases-2-3.md
- # Enrichment Phases 2-3 Implementation Plan
-
- > **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
-
- **Goal:** Add named entity recognition with AT Protocol record linking and topic segmentation to the transcript enrichment pipeline, achieving feature parity before deployment.
-
- **Architecture:** Two new Python NLP passes (entity detection + topic segmentation) extend the existing pipeline. Entity resolution matches against speaker/concept records from the SQLite database, with diarization context for disambiguation. Topic boundaries use sentence embeddings with cosine similarity. The TypeScript document assembly and React renderer extend to handle entity links and topic dividers.
-
- **Tech Stack:** Python 3.12+, spaCy NER (`en_core_web_sm`), sentence-transformers (`all-MiniLM-L6-v2`), sqlite3 (stdlib), vitest, React/Next.js
-
- **Spec:** `docs/superpowers/specs/2026-04-12-enrichment-phases-2-3-design.md`
-
- ---
-
- ## Chunk 1: New facet types + NlpAnnotations extension
-
- ### Task 1: Add topic-break and entity facet types to format lexicon
-
- **Files:**
- - Modify: `formats/tv.ionosphere/ionosphere.lexicon.json`
-
- - [ ] **Step 1: Add `#topic-break` (block) and `#entity` (inline) to the features array**
-
- ```json
- {
-   "typeId": "tv.ionosphere.facet#topic-break",
-   "featureClass": "block",
-   "expandStart": false,
-   "expandEnd": false
- },
- {
-   "typeId": "tv.ionosphere.facet#entity",
-   "featureClass": "inline",
-   "expandStart": false,
-   "expandEnd": false
- }
- ```
-
- - [ ] **Step 2: Commit**
-
- ```bash
- git add formats/tv.ionosphere/ionosphere.lexicon.json
- git commit -m "feat: add topic-break (block) and entity (inline) facet types"
- ```
-
- ### Task 2: Extend NlpAnnotations and decodeToDocumentWithStructure
-
- **Files:**
- - Modify: `formats/tv.ionosphere/ts/transcript-encoding.ts`
- - Modify: `formats/tv.ionosphere/ts/transcript-encoding.test.ts`
-
- - [ ] **Step 1: Write the failing test**
-
- Add to `formats/tv.ionosphere/ts/transcript-encoding.test.ts`, in the `decodeToDocumentWithStructure` describe block:
-
- ```typescript
- it("adds entity, speaker-segment, and topic-break facets", () => {
-   const compact = encode(contiguous);
-   const annotations = {
-     sentences: [{ byteStart: 0, byteEnd: 26 }],
-     paragraphs: [{ byteStart: 0, byteEnd: 26 }],
-     entities: [
-       {
-         byteStart: 0, byteEnd: 5, label: "hello", nerType: "PERSON",
-         speakerDid: "did:plc:abc123",
-       },
-       {
-         byteStart: 6, byteEnd: 11, label: "world", nerType: "ORG",
-         conceptUri: "at://did:plc:xyz/tv.ionosphere.concept/test",
-       },
-       {
-         byteStart: 12, byteEnd: 16, label: "this", nerType: "PRODUCT",
-       },
-     ],
-     speakerSegments: [
-       {
-         byteStart: 0, byteEnd: 26, speakerDid: "did:plc:abc123",
-         speakerName: "Test Speaker",
-       },
-     ],
-     topicBreaks: [{ byteStart: 12 }],
-   };
-   const doc = decodeToDocumentWithStructure(compact, annotations);
-
-   const speakerRefs = doc.facets.filter(f =>
-     f.features.some(feat => feat.$type === "tv.ionosphere.facet#speaker-ref")
-   );
-   const conceptRefs = doc.facets.filter(f =>
-     f.features.some(feat => feat.$type === "tv.ionosphere.facet#concept-ref")
-   );
-   const entities = doc.facets.filter(f =>
-     f.features.some(feat => feat.$type === "tv.ionosphere.facet#entity")
-   );
-   const speakerSegs = doc.facets.filter(f =>
-     f.features.some(feat => feat.$type === "tv.ionosphere.facet#speaker-segment")
-   );
-   const topicBreaks = doc.facets.filter(f =>
-     f.features.some(feat => feat.$type === "tv.ionosphere.facet#topic-break")
-   );
-
-   expect(speakerRefs).toHaveLength(1);
-   expect(speakerRefs[0].features[0].speakerDid).toBe("did:plc:abc123");
-   expect(conceptRefs).toHaveLength(1);
-   expect(conceptRefs[0].features[0].conceptUri).toBe("at://did:plc:xyz/tv.ionosphere.concept/test");
-   expect(entities).toHaveLength(1); // unresolved entity
-   expect(entities[0].features[0].label).toBe("this");
-   expect(speakerSegs).toHaveLength(1);
-   expect(topicBreaks).toHaveLength(1);
- });
-
- it("handles missing optional annotation fields gracefully", () => {
-   const compact = encode(contiguous);
-   const annotations = {
-     sentences: [{ byteStart: 0, byteEnd: 26 }],
-     paragraphs: [{ byteStart: 0, byteEnd: 26 }],
-     // No entities, speakerSegments, or topicBreaks
-   };
-   const doc = decodeToDocumentWithStructure(compact, annotations);
-   // Should have sentences + paragraphs + timestamps, nothing else
-   const tsFacets = doc.facets.filter(f =>
-     f.features.some(feat => feat.$type === "tv.ionosphere.facet#timestamp")
-   );
-   expect(tsFacets).toHaveLength(6);
- });
- ```
-
- - [ ] **Step 2: Run test to verify it fails**
-
- ```bash
- cd formats/tv.ionosphere && npx vitest run ts/transcript-encoding.test.ts
- ```
-
- - [ ] **Step 3: Extend NlpAnnotations interface and decodeToDocumentWithStructure**
-
- Update `formats/tv.ionosphere/ts/transcript-encoding.ts`:
-
- ```typescript
- export interface NlpAnnotations {
-   sentences: Array<{ byteStart: number; byteEnd: number }>;
-   paragraphs: Array<{ byteStart: number; byteEnd: number }>;
-   entities?: Array<{
-     byteStart: number; byteEnd: number;
-     label: string; nerType: string;
-     speakerDid?: string; conceptUri?: string;
-   }>;
-   speakerSegments?: Array<{
-     byteStart: number; byteEnd: number;
-     speakerDid: string; speakerName: string;
-   }>;
-   topicBreaks?: Array<{ byteStart: number }>;
- }
- ```
-
- Add to `decodeToDocumentWithStructure`, after the paragraph facet loop:
-
- ```typescript
- // Entity facets — route to speaker-ref, concept-ref, or generic entity
- for (const e of annotations.entities ?? []) {
-   if (e.speakerDid) {
-     doc.facets.push({
-       index: { byteStart: e.byteStart, byteEnd: e.byteEnd },
-       features: [{
-         $type: "tv.ionosphere.facet#speaker-ref",
-         speakerDid: e.speakerDid,
-         label: e.label,
-       }],
-     });
-   } else if (e.conceptUri) {
-     doc.facets.push({
-       index: { byteStart: e.byteStart, byteEnd: e.byteEnd },
-       features: [{
-         $type: "tv.ionosphere.facet#concept-ref",
-         conceptUri: e.conceptUri,
-         conceptName: e.label,
-       }],
-     });
-   } else {
-     doc.facets.push({
-       index: { byteStart: e.byteStart, byteEnd: e.byteEnd },
-       features: [{
-         $type: "tv.ionosphere.facet#entity",
-         label: e.label,
-         nerType: e.nerType,
-       }],
-     });
-   }
- }
-
- // Speaker segment facets
- for (const seg of annotations.speakerSegments ?? []) {
-   doc.facets.push({
-     index: { byteStart: seg.byteStart, byteEnd: seg.byteEnd },
-     features: [{
-       $type: "tv.ionosphere.facet#speaker-segment",
-       speakerDid: seg.speakerDid,
-       speakerName: seg.speakerName,
-     }],
-   });
- }
-
- // Topic break facets
- for (const tb of annotations.topicBreaks ?? []) {
-   doc.facets.push({
-     index: { byteStart: tb.byteStart, byteEnd: tb.byteStart },
-     features: [{ $type: "tv.ionosphere.facet#topic-break" }],
-   });
- }
- ```
-
- - [ ] **Step 4: Run tests to verify all pass**
-
- ```bash
- cd formats/tv.ionosphere && npx vitest run ts/transcript-encoding.test.ts
- ```
-
- - [ ] **Step 5: Commit**
-
- ```bash
- git add formats/tv.ionosphere/ts/transcript-encoding.ts formats/tv.ionosphere/ts/transcript-encoding.test.ts
- git commit -m "feat: extend NlpAnnotations with entities, speakers, topic breaks"
- ```
-
- ---
-
- ## Chunk 2: Python NER + entity linking
-
- ### Task 3: Install sentence-transformers dependency
-
- **Files:**
- - Modify: `pipeline/pyproject.toml`
-
- - [ ] **Step 1: Add sentence-transformers to dependencies**
-
- ```toml
- dependencies = [
-     "spacy>=3.7",
-     "sentence-transformers>=2.0",
- ]
- ```
-
- - [ ] **Step 2: Install**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pip install -e ".[dev]"
- ```
-
- - [ ] **Step 3: Verify sentence-transformers loads**
-
- ```bash
- python -c "from sentence_transformers import SentenceTransformer; print('OK')"
- ```
-
- - [ ] **Step 4: Commit**
-
- ```bash
- git add pipeline/pyproject.toml
- git commit -m "feat: add sentence-transformers dependency for topic segmentation"
- ```
-
- ### Task 4: Implement speaker lookup builder
-
- **Files:**
- - Create: `pipeline/nlp/speaker_lookup.py`
- - Create: `pipeline/tests/test_speaker_lookup.py`
-
- - [ ] **Step 1: Write the failing test**
-
- Create `pipeline/tests/test_speaker_lookup.py`:
-
- ```python
- from nlp.speaker_lookup import build_speaker_lookup
-
-
- def test_build_lookup_from_rows():
-     """Build lookup from speaker database rows."""
-     rows = [
-         ("Matt Akamatsu", "matsulab.com", "did:plc:matt123"),
-         ("Rowan Cockett", "row1.ca", "did:plc:rowan456"),
-         ("Jay Graber", "jay.bsky.team", "did:plc:jay789"),
-     ]
-     lookup = build_speaker_lookup(rows)
-
-     # Full name match (case-insensitive)
-     assert lookup.resolve("Matt Akamatsu") is not None
-     assert lookup.resolve("Matt Akamatsu")["did"] == "did:plc:matt123"
-     assert lookup.resolve("matt akamatsu")["did"] == "did:plc:matt123"
-
-     # First name match
-     assert lookup.resolve("Matt") is not None
-     assert lookup.resolve("Matt")["did"] == "did:plc:matt123"
-
-     # Handle match
-     assert lookup.resolve("row1.ca") is not None
-     assert lookup.resolve("row1.ca")["did"] == "did:plc:rowan456"
-
-     # No match
-     assert lookup.resolve("Unknown Person") is None
-
-
- def test_first_name_collision_returns_none():
-     """When multiple speakers share a first name, first-name lookup returns None (ambiguous)."""
-     rows = [
-         ("Matt Akamatsu", "matsulab.com", "did:plc:matt1"),
-         ("Matt Jones", "mattj.com", "did:plc:matt2"),
-     ]
-     lookup = build_speaker_lookup(rows)
-
-     # Full name still works
-     assert lookup.resolve("Matt Akamatsu")["did"] == "did:plc:matt1"
-     # First name alone is ambiguous
-     assert lookup.resolve("Matt") is None
-
-
- def test_empty_speakers():
-     lookup = build_speaker_lookup([])
-     assert lookup.resolve("Anyone") is None
- ```
-
- - [ ] **Step 2: Run test to verify it fails**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_speaker_lookup.py -v
- ```
-
- - [ ] **Step 3: Implement speaker_lookup.py**
-
- Create `pipeline/nlp/speaker_lookup.py`:
-
- ```python
- """Build a speaker lookup table from database records for entity resolution."""
-
-
- class SpeakerLookup:
-     def __init__(self):
-         self._by_full_name: dict[str, dict] = {}
-         self._by_first_name: dict[str, dict | None] = {}
-         self._by_handle: dict[str, dict] = {}
-
-     def add(self, name: str, handle: str | None, did: str | None):
-         entry = {"name": name, "handle": handle, "did": did}
-         self._by_full_name[name.lower()] = entry
-
-         if handle:
-             self._by_handle[handle.lower()] = entry
-
-         first = name.split()[0].lower()
-         if first in self._by_first_name:
-             # Collision — mark as ambiguous
-             self._by_first_name[first] = None
-         else:
-             self._by_first_name[first] = entry
-
-     def resolve(self, name: str) -> dict | None:
-         key = name.lower().strip()
-         # Try full name first
-         if key in self._by_full_name:
-             return self._by_full_name[key]
-         # Try handle
-         if key in self._by_handle:
-             return self._by_handle[key]
-         # Try first name (returns None if ambiguous)
-         if key in self._by_first_name:
-             return self._by_first_name[key]
-         return None
-
-
- def build_speaker_lookup(rows: list[tuple]) -> SpeakerLookup:
-     """Build lookup from database rows of (name, handle, speaker_did)."""
-     lookup = SpeakerLookup()
-     for name, handle, did in rows:
-         lookup.add(name, handle, did)
-     return lookup
- ```
-
- - [ ] **Step 4: Run tests to verify they pass**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_speaker_lookup.py -v
- ```
-
- - [ ] **Step 5: Commit**
-
- ```bash
- git add pipeline/nlp/speaker_lookup.py pipeline/tests/test_speaker_lookup.py
- git commit -m "feat: speaker lookup builder for entity resolution"
- ```
-
- ### Task 5: Implement NER + entity linking pass
-
- **Files:**
- - Create: `pipeline/nlp/entities.py`
- - Create: `pipeline/tests/test_entities.py`
-
- - [ ] **Step 1: Write the failing test**
-
- Create `pipeline/tests/test_entities.py`:
-
- ```python
- from nlp.entities import detect_entities
- from nlp.speaker_lookup import build_speaker_lookup
-
-
- def test_detects_person_entities():
-     text = "Matt Akamatsu is presenting today."
-     rows = [("Matt Akamatsu", "matsulab.com", "did:plc:matt123")]
-     lookup = build_speaker_lookup(rows)
-
-     entities = detect_entities(text, speaker_lookup=lookup)
-
-     persons = [e for e in entities if e["nerType"] == "PERSON"]
-     assert len(persons) >= 1
-     # Should resolve to the speaker
-     resolved = [e for e in persons if e.get("speakerDid")]
-     assert len(resolved) >= 1
-     assert resolved[0]["speakerDid"] == "did:plc:matt123"
-
-
- def test_detects_org_entities():
-     text = "The work at Bluesky is impressive."
-     concepts = [{"name": "Bluesky", "uri": "at://did/concept/bluesky", "aliases": "[]"}]
-
-     entities = detect_entities(text, concept_rows=concepts)
-
-     orgs = [e for e in entities if e.get("conceptUri")]
-     assert len(orgs) >= 1
-
-
- def test_unresolved_entities_have_label():
-     text = "Barack Obama spoke at the conference."
-     entities = detect_entities(text)
-
-     persons = [e for e in entities if e["nerType"] == "PERSON"]
-     assert len(persons) >= 1
-     assert persons[0].get("speakerDid") is None
-     assert persons[0]["label"] == "Barack Obama"
-
-
- def test_byte_offsets_correct():
-     text = "Matt Akamatsu presented."
-     entities = detect_entities(text)
-     if entities:
-         e = entities[0]
-         # Verify byte range matches the label
-         text_at_range = text.encode("utf-8")[e["byteStart"]:e["byteEnd"]].decode("utf-8")
-         assert e["label"] in text_at_range or text_at_range in e["label"]
-
-
- def test_empty_text():
-     entities = detect_entities("")
-     assert entities == []
- ```
-
- - [ ] **Step 2: Run test to verify it fails**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_entities.py -v
- ```
-
- - [ ] **Step 3: Implement entities.py**
-
- Create `pipeline/nlp/entities.py`:
-
- ```python
- """Pass 3: Named entity recognition + entity linking."""
-
- import json
- import spacy
-
- _nlp = None
-
-
- def _get_nlp():
-     global _nlp
-     if _nlp is None:
-         _nlp = spacy.load("en_core_web_sm")
-     return _nlp
-
-
- def detect_entities(
-     text: str,
-     speaker_lookup=None,
-     concept_rows: list[dict] | None = None,
- ) -> list[dict]:
-     """Detect named entities and resolve against speaker/concept records.
-
-     Returns a list of entity dicts with:
-         byteStart, byteEnd: UTF-8 byte offsets
-         label: entity text
-         nerType: spaCy entity type (PERSON, ORG, PRODUCT, etc.)
-         speakerDid: (optional) resolved speaker DID
-         conceptUri: (optional) resolved concept URI
-     """
-     if not text.strip():
-         return []
-
-     nlp = _get_nlp()
-     doc = nlp(text)
-
-     # Build concept lookup
-     concept_lookup: dict[str, str] = {}
-     if concept_rows:
-         for c in concept_rows:
-             concept_lookup[c["name"].lower()] = c["uri"]
-             aliases = c.get("aliases", "[]")
-             if isinstance(aliases, str):
-                 try:
-                     aliases = json.loads(aliases)
-                 except (json.JSONDecodeError, TypeError):
-                     aliases = []
-             for alias in aliases:
-                 concept_lookup[alias.lower()] = c["uri"]
-
-     entities = []
-     for ent in doc.ents:
-         if ent.label_ not in ("PERSON", "ORG", "PRODUCT", "WORK_OF_ART", "GPE", "EVENT"):
-             continue
-
-         byte_start = len(text[:ent.start_char].encode("utf-8"))
-         byte_end = len(text[:ent.end_char].encode("utf-8"))
-
-         entity = {
-             "byteStart": byte_start,
-             "byteEnd": byte_end,
-             "label": ent.text,
-             "nerType": ent.label_,
-         }
-
-         # Try to resolve
-         if ent.label_ == "PERSON" and speaker_lookup:
-             match = speaker_lookup.resolve(ent.text)
-             if match and match.get("did"):
-                 entity["speakerDid"] = match["did"]
-         elif ent.label_ in ("ORG", "PRODUCT", "WORK_OF_ART"):
-             uri = concept_lookup.get(ent.text.lower())
-             if uri:
-                 entity["conceptUri"] = uri
-
-         entities.append(entity)
-
-     return entities
- ```
-
- - [ ] **Step 4: Run tests to verify they pass**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_entities.py -v
- ```
-
- - [ ] **Step 5: Commit**
-
- ```bash
- git add pipeline/nlp/entities.py pipeline/tests/test_entities.py
- git commit -m "feat: NER + entity linking against speakers and concepts"
- ```
-
- ---
-
- ## Chunk 3: Topic segmentation
-
- ### Task 6: Implement topic segmentation pass
-
- **Files:**
- - Create: `pipeline/nlp/topics.py`
- - Create: `pipeline/tests/test_topics.py`
-
- - [ ] **Step 1: Write the failing test**
-
- Create `pipeline/tests/test_topics.py`:
-
- ```python
- from nlp.topics import detect_topic_breaks
-
-
- def test_detects_topic_change():
-     """Distinct topics should produce at least one break."""
-     sentences = [
-         # Topic 1: cooking
-         {"byteStart": 0, "byteEnd": 30, "text": "Today we will make a cake."},
-         {"byteStart": 31, "byteEnd": 65, "text": "First mix the flour and sugar."},
-         {"byteStart": 66, "byteEnd": 100, "text": "Then add the eggs and butter."},
-         {"byteStart": 101, "byteEnd": 135, "text": "Bake at 350 degrees for 30 minutes."},
-         {"byteStart": 136, "byteEnd": 170, "text": "Let it cool before adding frosting."},
-         # Topic 2: space exploration (very different)
-         {"byteStart": 171, "byteEnd": 210, "text": "NASA launched a new rocket to Mars."},
-         {"byteStart": 211, "byteEnd": 250, "text": "The spacecraft will orbit the red planet."},
-         {"byteStart": 251, "byteEnd": 295, "text": "Astronauts may visit Mars within ten years."},
-         {"byteStart": 296, "byteEnd": 340, "text": "The mission costs billions of dollars."},
-         {"byteStart": 341, "byteEnd": 380, "text": "Space exploration advances human knowledge."},
-     ]
-     breaks = detect_topic_breaks(sentences)
-     assert len(breaks) >= 1
-     # Break should be near the topic transition (around sentence 5)
-     assert any(4 <= b["sentenceIndex"] <= 6 for b in breaks)
-
-
- def test_no_breaks_for_single_topic():
-     sentences = [
-         {"byteStart": 0, "byteEnd": 30, "text": "The cat sat on the mat."},
-         {"byteStart": 31, "byteEnd": 60, "text": "The cat was very happy."},
-         {"byteStart": 61, "byteEnd": 90, "text": "It purred loudly all day."},
-     ]
-     breaks = detect_topic_breaks(sentences, min_segment_sentences=2)
-     # Might have 0 breaks or very few — should not over-segment
-     assert len(breaks) <= 1
-
-
- def test_empty_input():
-     assert detect_topic_breaks([]) == []
-
-
- def test_too_few_sentences():
-     sentences = [{"byteStart": 0, "byteEnd": 10, "text": "Hello."}]
-     assert detect_topic_breaks(sentences) == []
- ```
-
- - [ ] **Step 2: Run test to verify it fails**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_topics.py -v
- ```
-
- - [ ] **Step 3: Implement topics.py**
-
- Create `pipeline/nlp/topics.py`:
-
- ```python
- """Pass 4: Topic segmentation using sentence embeddings."""
-
- import numpy as np
-
- _model = None
-
-
- def _get_model():
-     global _model
-     if _model is None:
-         from sentence_transformers import SentenceTransformer
-         _model = SentenceTransformer("all-MiniLM-L6-v2")
-     return _model
-
-
- def detect_topic_breaks(
-     sentences: list[dict],
-     window_size: int = 3,
-     similarity_threshold: float = 0.3,
-     min_segment_sentences: int = 5,
- ) -> list[dict]:
-     """Detect topic boundaries using sentence embedding similarity.
-
-     Args:
-         sentences: list of dicts with byteStart, byteEnd, text
-         window_size: number of sentences per comparison window
-         similarity_threshold: cosine similarity below this = topic break
-         min_segment_sentences: minimum sentences between breaks
-
-     Returns:
-         list of dicts with byteStart and sentenceIndex
-     """
-     if len(sentences) < window_size * 2:
-         return []
-
-     model = _get_model()
-     texts = [s["text"] for s in sentences]
-     embeddings = model.encode(texts, show_progress_bar=False)
-
-     # Compute cosine similarity between adjacent windows
-     similarities = []
-     for i in range(window_size, len(sentences) - window_size + 1):
-         left = np.mean(embeddings[i - window_size:i], axis=0)
-         right = np.mean(embeddings[i:i + window_size], axis=0)
-         cos_sim = np.dot(left, right) / (np.linalg.norm(left) * np.linalg.norm(right))
-         similarities.append((i, float(cos_sim)))
-
-     # Find drops below threshold, respecting minimum segment length
-     breaks = []
-     last_break = 0
-     for sent_idx, sim in similarities:
-         if sim < similarity_threshold and (sent_idx - last_break) >= min_segment_sentences:
-             breaks.append({
-                 "byteStart": sentences[sent_idx]["byteStart"],
-                 "sentenceIndex": sent_idx,
-             })
-             last_break = sent_idx
-
-     return breaks
- ```
-
- - [ ] **Step 4: Run tests to verify they pass**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_topics.py -v
- ```
-
- Note: the first run will download the `all-MiniLM-L6-v2` model (~80MB).
-
- - [ ] **Step 5: Commit**
-
- ```bash
- git add pipeline/nlp/topics.py pipeline/tests/test_topics.py
- git commit -m "feat: topic segmentation via sentence-transformer embeddings"
- ```
-
- ---
-
- ## Chunk 4: Update pipeline orchestrator
-
- ### Task 7: Extend run.py with NER and topic passes
-
- **Files:**
- - Modify: `pipeline/nlp/run.py`
- - Modify: `pipeline/tests/test_run.py`
-
- - [ ] **Step 1: Update the integration test**
-
- Add to `pipeline/tests/test_run.py`:
-
- ```python
- def test_process_transcript_with_enrichment():
-     """Full pipeline with entity and topic passes."""
-     transcript = {
-         "text": "Matt Akamatsu is presenting. The AT Protocol is decentralized. Now we discuss a completely different topic about cooking recipes and baking.",
-         "startMs": 0,
-         "timings": [100, 100, 100, 100, 100, 100, 100, 100, 100, -3000, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
-     }
-     speaker_rows = [("Matt Akamatsu", "matsulab.com", "did:plc:matt123")]
-     concept_rows = [{"name": "AT Protocol", "uri": "at://did/concept/atp", "aliases": "[]"}]
-
-     result = process_transcript(
-         transcript, talk_rkey="test",
-         speaker_rows=speaker_rows,
-         concept_rows=concept_rows,
-     )
-
-     assert "entities" in result
-     assert "topicBreaks" in result
-     assert len(result["entities"]) >= 1
-     # At least one resolved entity
-     resolved = [e for e in result["entities"] if e.get("speakerDid") or e.get("conceptUri")]
-     assert len(resolved) >= 1
- ```
-
- - [ ] **Step 2: Run test to verify it fails**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/test_run.py -v
- ```
-
- - [ ] **Step 3: Update process_transcript in run.py**
-
- Add imports at top:
- ```python
- from nlp.entities import detect_entities
- from nlp.speaker_lookup import build_speaker_lookup
- from nlp.topics import detect_topic_breaks
- ```
-
- Update `process_transcript` signature to accept optional speaker/concept data:
- ```python
- def process_transcript(
-     transcript: dict,
-     talk_rkey: str,
-     pause_threshold_ms: int = 2000,
-     proximity_words: int = 5,
-     speaker_rows: list[tuple] | None = None,
-     concept_rows: list[dict] | None = None,
- ) -> dict:
- ```
-
- Add after the paragraph detection:
- ```python
-     # Pass 3: NER + entity linking
-     speaker_lookup = build_speaker_lookup(speaker_rows) if speaker_rows else None
-     entities = detect_entities(text, speaker_lookup=speaker_lookup, concept_rows=concept_rows)
-
-     # Pass 4: Topic segmentation
-     # Enrich sentences with text for embedding
-     sentences_with_text = []
-     for s in sentences:
-         sent_text = text.encode("utf-8")[s["byteStart"]:s["byteEnd"]].decode("utf-8")
-         sentences_with_text.append({**s, "text": sent_text})
-     topic_breaks = detect_topic_breaks(sentences_with_text)
- ```
-
- Update the return dict to include:
- ```python
-     return {
-         "talkRkey": talk_rkey,
-         "sentences": sentences,
-         "paragraphs": paragraphs,
-         "entities": entities,
-         "topicBreaks": [{"byteStart": tb["byteStart"]} for tb in topic_breaks],
-         "metadata": { ... },
-     }
- ```
-
- - [ ] **Step 4: Update main() to load speaker/concept data from SQLite**
-
- Add to the `main()` function, before the transcript loop:
-
- ```python
-     import sqlite3
-     db_path = Path(__file__).resolve().parent.parent.parent / "apps" / "data" / "ionosphere.sqlite"
-     speaker_rows = []
-     concept_rows = []
-     if db_path.exists():
-         conn = sqlite3.connect(str(db_path))
-         speaker_rows = conn.execute(
-             "SELECT name, handle, speaker_did FROM speakers"
-         ).fetchall()
-         concept_rows = [
-             {"name": r[0], "uri": r[1], "aliases": r[2] or "[]"}
-             for r in conn.execute("SELECT name, uri, aliases FROM concepts").fetchall()
-         ]
-         conn.close()
-         print(f"Loaded {len(speaker_rows)} speakers, {len(concept_rows)} concepts from DB")
-     else:
-         print(f"Warning: database not found at {db_path}, entity linking disabled")
- ```
-
- Pass these to `process_transcript`:
- ```python
-     result = process_transcript(
-         compact, talk_rkey=talk_rkey,
-         speaker_rows=speaker_rows,
-         concept_rows=concept_rows,
-     )
- ```
-
- Update the log line:
- ```python
-     entity_count = len(result.get("entities", []))
-     topic_count = len(result.get("topicBreaks", []))
-     print(f" {talk_rkey}: {len(result['sentences'])} sentences, {len(result['paragraphs'])} paragraphs, {entity_count} entities, {topic_count} topics")
- ```
-
- - [ ] **Step 5: Run ALL tests**
-
- ```bash
- cd pipeline && source .venv/bin/activate && pytest tests/ -v
- ```
-
- - [ ] **Step 6: Commit**
-
- ```bash
- git add pipeline/nlp/run.py pipeline/tests/test_run.py
- git commit -m "feat: extend pipeline with NER, entity linking, and topic segmentation"
- ```
-
- ---
-
- ## Chunk 5: Renderer — entities and topic dividers
-
- ### Task 8: Extend extractData for entities and topic breaks
-
**Files:** 856 - - Modify: `apps/ionosphere/src/lib/transcript.ts` 857 - - Modify: `apps/ionosphere/src/lib/transcript.test.ts` 858 - 859 - - [ ] **Step 1: Write failing tests** 860 - 861 - Add to `apps/ionosphere/src/lib/transcript.test.ts`: 862 - 863 - ```typescript 864 - describe("extractData — entities and topic breaks", () => { 865 - it("extracts entity spans from facets", () => { 866 - const doc = makeDoc([ 867 - { text: "Matt", startNs: 1000, endNs: 2000 }, 868 - { text: "presented.", startNs: 2000, endNs: 3000 }, 869 - ]); 870 - const encoder = new TextEncoder(); 871 - // Add a speaker-ref entity facet 872 - doc.facets.push({ 873 - index: { byteStart: 0, byteEnd: encoder.encode("Matt").length }, 874 - features: [{ 875 - $type: "tv.ionosphere.facet#speaker-ref", 876 - speakerDid: "did:plc:matt123", 877 - label: "Matt", 878 - }], 879 - }); 880 - 881 - const result = extractData(doc); 882 - expect(result.entities).toHaveLength(1); 883 - expect(result.entities[0].speakerDid).toBe("did:plc:matt123"); 884 - expect(result.entities[0].byteStart).toBe(0); 885 - }); 886 - 887 - it("extracts topic breaks as paragraph indices", () => { 888 - const doc = makeDoc([ 889 - { text: "A.", startNs: 1000, endNs: 2000 }, 890 - { text: "B.", startNs: 3000, endNs: 4000 }, 891 - ]); 892 - const encoder = new TextEncoder(); 893 - const text = "A. B."; 894 - const s1End = encoder.encode("A.").length; 895 - const s2Start = encoder.encode("A. 
").length; 896 - const s2End = encoder.encode(text).length; 897 - 898 - // Two paragraphs 899 - doc.facets.push({ index: { byteStart: 0, byteEnd: s1End }, features: [{ $type: "tv.ionosphere.facet#paragraph" }] }); 900 - doc.facets.push({ index: { byteStart: s2Start, byteEnd: s2End }, features: [{ $type: "tv.ionosphere.facet#paragraph" }] }); 901 - // Topic break at second paragraph 902 - doc.facets.push({ index: { byteStart: s2Start, byteEnd: s2Start }, features: [{ $type: "tv.ionosphere.facet#topic-break" }] }); 903 - // Sentences 904 - doc.facets.push({ index: { byteStart: 0, byteEnd: s1End }, features: [{ $type: "tv.ionosphere.facet#sentence" }] }); 905 - doc.facets.push({ index: { byteStart: s2Start, byteEnd: s2End }, features: [{ $type: "tv.ionosphere.facet#sentence" }] }); 906 - 907 - const result = extractData(doc); 908 - expect(result.topicBreaks.has(1)).toBe(true); // second paragraph 909 - expect(result.topicBreaks.has(0)).toBe(false); // first paragraph 910 - }); 911 - 912 - it("returns empty entities and topicBreaks when no such facets exist", () => { 913 - const doc = makeDoc([ 914 - { text: "Hello", startNs: 1000, endNs: 2000 }, 915 - ]); 916 - const result = extractData(doc); 917 - expect(result.entities).toHaveLength(0); 918 - expect(result.topicBreaks.size).toBe(0); 919 - }); 920 - }); 921 - ``` 922 - 923 - - [ ] **Step 2: Run tests to verify they fail** 924 - 925 - ```bash 926 - cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts 927 - ``` 928 - 929 - - [ ] **Step 3: Add EntitySpan type and update extractData** 930 - 931 - Add to `apps/ionosphere/src/lib/transcript.ts`: 932 - 933 - ```typescript 934 - export interface EntitySpan { 935 - byteStart: number; 936 - byteEnd: number; 937 - label: string; 938 - nerType?: string; 939 - speakerDid?: string; 940 - conceptUri?: string; 941 - conceptName?: string; 942 - } 943 - ``` 944 - 945 - In `extractData`, add extraction of entity facets (speaker-ref, concept-ref, entity) and topic-break 
facets. Map topic breaks to paragraph indices by finding which paragraph contains each break's byte position.

Return `entities: EntitySpan[]` and `topicBreaks: Set<number>` alongside existing fields.

- [ ] **Step 4: Run tests to verify all pass**

```bash
cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts
```

- [ ] **Step 5: Commit**

```bash
git add apps/ionosphere/src/lib/transcript.ts apps/ionosphere/src/lib/transcript.test.ts
git commit -m "feat: extract entities and topic breaks from facets"
```

### Task 9: Update TranscriptView for entity rendering and topic dividers

**Files:**
- Modify: `apps/ionosphere/src/app/components/TranscriptView.tsx`

- [ ] **Step 1: Extract entities and topicBreaks from extractData**

In the `useMemo` that calls `extractData`, also destructure `entities` and `topicBreaks`.

- [ ] **Step 2: Build word-to-entity lookup**

```typescript
const wordEntities = useMemo(() => {
  const map = new Map<number, EntitySpan>();
  for (const entity of entities) {
    for (let i = 0; i < words.length; i++) {
      const w = words[i];
      if (w.byteStart >= entity.byteStart && w.byteEnd <= entity.byteEnd) {
        // Every word inside the entity span maps to that entity;
        // if spans overlap, the first entity to claim a word wins.
        if (!map.has(i)) map.set(i, entity);
      }
    }
  }
  return map;
}, [words, entities]);
```

- [ ] **Step 3: Add entity link styling to WordSpanComponent or a wrapper**

For words that are part of an entity, add visual treatment:
- `speaker-ref`: blue underline, link to `/speakers/{rkey}` or Bluesky profile
- `concept-ref`: amber underline (matching existing concept style)
- `entity` (unresolved): dotted underline, no link

The simplest approach: check `wordEntities.get(globalIdx)` when rendering each
word, and wrap entity-start words in an `<a>` or styled `<span>`.

- [ ] **Step 4: Add topic dividers between paragraphs**

In the paragraph rendering loop, check `topicBreaks.has(paragraphIndex)` and insert an `<hr>` before that paragraph. The short `<>` fragment syntax cannot carry a `key`, so use `Fragment` (imported from `react`) and key the fragment itself rather than its children:

```tsx
{paragraphs.map((para, pi) => (
  <Fragment key={pi}>
    {topicBreaks.has(pi) && (
      <hr className="border-neutral-800 my-6" />
    )}
    <div className="mb-4">
      {/* sentences... */}
    </div>
  </Fragment>
))}
```

- [ ] **Step 5: Verify in browser**

Load a talk page. Verify:
- Entity names have visible styling (underlines)
- Resolved entities are clickable
- Topic dividers appear as subtle horizontal rules
- Existing functionality (scroll, brightness, comments) unchanged

- [ ] **Step 6: Commit**

```bash
git add apps/ionosphere/src/app/components/TranscriptView.tsx
git commit -m "feat: render entity links and topic dividers in transcript"
```

---

## Chunk 6: End-to-end integration

### Task 10: Run the full enriched pipeline

- [ ] **Step 1: Run the pipeline on all transcripts**

```bash
cd pipeline && source .venv/bin/activate && python -m nlp.run
```

Verify output includes entity and topic data. Spot-check 3-5 output files.

- [ ] **Step 2: Inject documents into local database**

Use the same inject pattern as Phase 1: create a temporary script that reads NLP output + transcripts, calls `decodeToDocumentWithStructure`, and updates the talks table.

- [ ] **Step 3: Verify in browser**

Load several talk pages.
Check:
- Entity links point to correct profiles/concepts
- Topic dividers land at natural transitions
- Talks without diarization still work (entities just don't have speaker context)
- All existing features (paragraphs, sentences, timing, comments) unchanged

- [ ] **Step 4: Run all tests**

```bash
# Run from the repo root. Each suite runs in a subshell so the
# working directory is unchanged for the next command.

# Python
(cd pipeline && source .venv/bin/activate && pytest -v)

# TypeScript
(cd formats/tv.ionosphere && npx vitest run)
(cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts)
```

- [ ] **Step 5: Commit any integration fixes**

```bash
git add -A && git commit -m "fix: integration fixes from end-to-end verification"
```

### Task 11: Update publish.ts for enriched annotations

**Files:**
- Modify: `apps/ionosphere-appview/src/publish.ts`

- [ ] **Step 1: Update the NLP data reading to include new fields**

The publish.ts code that reads NLP JSON and passes it to `decodeToDocumentWithStructure` needs to include the new fields from the NLP output: `entities` and `topicBreaks` now, and `speakerSegments` once diarization is integrated. Since `NlpAnnotations` now has these as optional fields, this is just passing them through:

```typescript
const doc = decodeToDocumentWithStructure(compact, {
  sentences: nlpData.sentences,
  paragraphs: nlpData.paragraphs,
  entities: nlpData.entities,
  topicBreaks: nlpData.topicBreaks,
  // speakerSegments will come when diarization is integrated
});
```

- [ ] **Step 2: Commit**

```bash
git add apps/ionosphere-appview/src/publish.ts
git commit -m "feat: pass entity and topic data through publish pipeline"
```
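
Task 8, Step 3 describes the topic-break mapping in prose only ("finding which paragraph contains each break's byte position"). The lookup could be sketched roughly as below. This is written in Python for brevity, whereas the plan places the real logic in the TypeScript `extractData`. The function name `topic_break_paragraph_indices` is hypothetical, and the half-open containment rule (`byteStart <= pos < byteEnd`) is an assumption rather than something the plan specifies.

```python
from bisect import bisect_right


def topic_break_paragraph_indices(paragraphs: list[dict], topic_breaks: list[dict]) -> set[int]:
    """Return the indices of paragraphs that contain a topic break.

    Assumes paragraphs are sorted by byteStart and do not overlap
    (hypothetical helper, not part of the plan's codebase).
    """
    starts = [p["byteStart"] for p in paragraphs]
    indices: set[int] = set()
    for tb in topic_breaks:
        pos = tb["byteStart"]
        # Last paragraph whose start is <= pos
        i = bisect_right(starts, pos) - 1
        # Assumed rule: the break must fall inside [byteStart, byteEnd)
        if i >= 0 and pos < paragraphs[i]["byteEnd"]:
            indices.add(i)
    return indices


# Mirrors the "A. B." fixture from the Task 8 test: two paragraphs,
# one break at the start of the second.
paragraphs = [{"byteStart": 0, "byteEnd": 2}, {"byteStart": 3, "byteEnd": 5}]
breaks = [{"byteStart": 3}]
result = topic_break_paragraph_indices(paragraphs, breaks)
```

A break that lands in the gap between two paragraphs (for example on the separating space) matches no paragraph under this rule; snapping it to the next paragraph would be an equally reasonable choice.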
-1138
docs/superpowers/plans/2026-04-12-transcript-formatting.md
··· 1 - # Transcript Formatting Implementation Plan 2 - 3 - > **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking. 4 - 5 - **Goal:** Add NLP-based sentence and paragraph detection to transcripts so they render as structured prose instead of a wall of text. 6 - 7 - **Architecture:** A Python NLP pipeline (spaCy) produces sentence/paragraph annotation layers as JSON files. A TypeScript publish step validates and publishes these as layers.pub AT Protocol records. The document assembly step reads annotation layers and emits structural facets (`#sentence`, `#paragraph`) that the React renderer consumes as inline spans and block elements. The old `tv.ionosphere.annotation` system is removed entirely. 8 - 9 - **Tech Stack:** Python 3.12+, spaCy (`en_core_web_sm`), vitest, layers.pub lexicons, panproto, React/Next.js 10 - 11 - **Spec:** `docs/superpowers/specs/2026-04-12-transcript-formatting-design.md` 12 - 13 - **Scope note:** This plan implements the rendering pipeline (NLP → facets → renderer) end-to-end. Publishing layers.pub records to the PDS and creating panproto lenses are deferred to a follow-up — the vendored lexicons and spec are forward preparation. The immediate goal is formatted transcripts in the browser. 14 - 15 - **UX note:** Removing the old annotation system (Task 13) will temporarily remove concept highlighting from talk pages. Concepts return via NLP in Phase 2. If this is unacceptable, Task 13 can be deferred and the old overlay path kept alongside the new structural facets. 
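
To make the architecture concrete, here is a minimal sketch of how the NLP output (byte-range spans for sentences and paragraphs) might map onto structural facets. The plan performs this step in TypeScript during document assembly, so the Python function `annotations_to_facets` is purely illustrative and hypothetical. Only the field names (`byteStart`, `byteEnd`, `index`, `features`, `$type`) and the two facet type IDs come from the plan itself.

```python
SENTENCE_TYPE = "tv.ionosphere.facet#sentence"    # inline facet
PARAGRAPH_TYPE = "tv.ionosphere.facet#paragraph"  # block facet


def annotations_to_facets(nlp_output: dict) -> list[dict]:
    """Turn NLP byte-range spans into facet dicts (illustrative only).

    Paragraph facets are emitted before sentence facets; that ordering
    is an assumption, not something the plan requires.
    """
    facets: list[dict] = []
    for key, type_id in (("paragraphs", PARAGRAPH_TYPE), ("sentences", SENTENCE_TYPE)):
        for span in nlp_output.get(key, []):
            facets.append({
                "index": {"byteStart": span["byteStart"], "byteEnd": span["byteEnd"]},
                "features": [{"$type": type_id}],
            })
    return facets


# Example input shaped like the pipeline's per-talk JSON output
example = {
    "talkRkey": "test-talk",
    "sentences": [{"byteStart": 0, "byteEnd": 12}, {"byteStart": 13, "byteEnd": 26}],
    "paragraphs": [{"byteStart": 0, "byteEnd": 26}],
}
facets = annotations_to_facets(example)
```

Keeping the Python side limited to plain byte ranges means the renderer only ever needs to understand facets, which is the separation the architecture paragraph describes.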
16 - 17 - --- 18 - 19 - ## Chunk 1: Vendor layers.pub lexicons and define new facet types 20 - 21 - ### Task 1: Vendor layers.pub lexicon definitions 22 - 23 - **Files:** 24 - - Create: `lexicons/pub/layers/defs.json` 25 - - Create: `lexicons/pub/layers/expression/expression.json` 26 - - Create: `lexicons/pub/layers/segmentation/segmentation.json` 27 - - Create: `lexicons/pub/layers/annotation/annotationLayer.json` 28 - 29 - - [ ] **Step 1: Create the `pub.layers.defs` shared definitions lexicon** 30 - 31 - Vendor the subset of `pub.layers.defs` that we use: `span`, `temporalSpan`, `uuid`, `tokenRef`, `anchor`, `annotationMetadata`, `featureMap`, `feature`. Pull the field definitions from https://docs.layers.pub/lexicons/defs. These are the shared types referenced by the other lexicons. 32 - 33 - - [ ] **Step 2: Create the `pub.layers.expression.expression` lexicon** 34 - 35 - Vendor the expression record schema from https://docs.layers.pub/lexicons/expression. Required fields: `id`, `kindUri`, `kind`, `text`, `language`, `createdAt`. Optional: `sourceRef`, `parentRef`, `anchor`, `metadata`, `features`. 36 - 37 - - [ ] **Step 3: Create the `pub.layers.segmentation.segmentation` lexicon** 38 - 39 - Vendor the segmentation record schema from https://docs.layers.pub/lexicons/segmentation. This includes the `segmentation` record and the `tokenization` and `token` object defs. 40 - 41 - - [ ] **Step 4: Create the `pub.layers.annotation.annotationLayer` lexicon** 42 - 43 - Vendor the annotation layer record schema from https://docs.layers.pub/lexicons/annotation. This includes `annotationLayer` record and the `annotation` object def. 
44 - 45 - - [ ] **Step 5: Commit** 46 - 47 - ```bash 48 - git add lexicons/pub/ 49 - git commit -m "feat: vendor layers.pub lexicons for transcript enrichment" 50 - ``` 51 - 52 - ### Task 2: Add sentence and paragraph facet types to the format lexicon 53 - 54 - **Files:** 55 - - Modify: `formats/tv.ionosphere/ionosphere.lexicon.json` 56 - 57 - - [ ] **Step 1: Add `#sentence` (inline) and `#paragraph` (block) facet entries** 58 - 59 - Add to the `features` array in `formats/tv.ionosphere/ionosphere.lexicon.json`: 60 - 61 - ```json 62 - { 63 - "typeId": "tv.ionosphere.facet#sentence", 64 - "featureClass": "inline", 65 - "expandStart": false, 66 - "expandEnd": false 67 - }, 68 - { 69 - "typeId": "tv.ionosphere.facet#paragraph", 70 - "featureClass": "block", 71 - "expandStart": false, 72 - "expandEnd": false 73 - } 74 - ``` 75 - 76 - - [ ] **Step 2: Commit** 77 - 78 - ```bash 79 - git add formats/tv.ionosphere/ionosphere.lexicon.json 80 - git commit -m "feat: add sentence (inline) and paragraph (block) facet types" 81 - ``` 82 - 83 - --- 84 - 85 - ## Chunk 2: Python NLP pipeline 86 - 87 - ### Task 3: Set up the Python enrichment project 88 - 89 - **Files:** 90 - - Create: `pipeline/pyproject.toml` 91 - - Create: `pipeline/nlp/__init__.py` 92 - 93 - - [ ] **Step 1: Create `pipeline/pyproject.toml`** 94 - 95 - ```toml 96 - [project] 97 - name = "ionosphere-nlp" 98 - version = "0.1.0" 99 - description = "NLP enrichment pipeline for ionosphere transcripts" 100 - requires-python = ">=3.12" 101 - dependencies = [ 102 - "spacy>=3.7", 103 - ] 104 - 105 - [project.optional-dependencies] 106 - dev = ["pytest>=8.0"] 107 - 108 - [tool.pytest.ini_options] 109 - testpaths = ["tests"] 110 - ``` 111 - 112 - - [ ] **Step 2: Create the package init** 113 - 114 - Create `pipeline/nlp/__init__.py` (empty file). 115 - 116 - - [ ] **Step 3: Create `pipeline/tests/__init__.py`** 117 - 118 - Empty file (needed for pytest discovery). 
119 - 120 - - [ ] **Step 4: Add `.gitignore` entries for Python artifacts** 121 - 122 - Add to the repo root `.gitignore`: 123 - ``` 124 - pipeline/.venv/ 125 - pipeline/data/ 126 - __pycache__/ 127 - *.pyc 128 - ``` 129 - 130 - - [ ] **Step 5: Install dependencies** 131 - 132 - ```bash 133 - cd pipeline 134 - python -m venv .venv 135 - source .venv/bin/activate 136 - pip install -e ".[dev]" 137 - python -m spacy download en_core_web_sm 138 - ``` 139 - 140 - - [ ] **Step 6: Commit** 141 - 142 - ```bash 143 - git add pipeline/pyproject.toml pipeline/nlp/__init__.py pipeline/tests/__init__.py .gitignore 144 - git commit -m "feat: scaffold Python NLP pipeline project" 145 - ``` 146 - 147 - ### Task 4: Implement sentence boundary detection (Pass 1) 148 - 149 - **Files:** 150 - - Create: `pipeline/tests/test_sentences.py` 151 - - Create: `pipeline/nlp/sentences.py` 152 - 153 - - [ ] **Step 1: Write the failing test** 154 - 155 - Create `pipeline/tests/test_sentences.py`: 156 - 157 - ```python 158 - from nlp.sentences import detect_sentences 159 - 160 - 161 - def test_basic_sentences(): 162 - text = "Hello world. This is a test. And another sentence." 163 - sentences = detect_sentences(text) 164 - assert len(sentences) == 3 165 - # Each sentence is a dict with byteStart and byteEnd 166 - assert sentences[0]["byteStart"] == 0 167 - assert sentences[0]["byteEnd"] == len("Hello world.".encode("utf-8")) 168 - assert sentences[1]["byteStart"] == len("Hello world. 
".encode("utf-8")) 169 - 170 - 171 - def test_speech_without_punctuation(): 172 - """spaCy should detect sentence boundaries even with poor punctuation.""" 173 - text = "so the thing is we need to think about this carefully and then we can move on to the next topic which is about protocols" 174 - sentences = detect_sentences(text) 175 - # spaCy should find at least 1 sentence (the whole text if no clear boundary) 176 - assert len(sentences) >= 1 177 - # All sentences should cover the full text 178 - assert sentences[0]["byteStart"] == 0 179 - assert sentences[-1]["byteEnd"] == len(text.encode("utf-8")) 180 - 181 - 182 - def test_empty_text(): 183 - sentences = detect_sentences("") 184 - assert sentences == [] 185 - 186 - 187 - def test_byte_offsets_for_unicode(): 188 - text = "Caf\u00e9 is great. Let\u2019s go." 189 - sentences = detect_sentences(text) 190 - # Byte offsets must account for multi-byte characters 191 - full_bytes = text.encode("utf-8") 192 - assert sentences[-1]["byteEnd"] == len(full_bytes) 193 - ``` 194 - 195 - - [ ] **Step 2: Run test to verify it fails** 196 - 197 - ```bash 198 - cd pipeline && source .venv/bin/activate && pytest tests/test_sentences.py -v 199 - ``` 200 - Expected: FAIL with `ModuleNotFoundError` 201 - 202 - - [ ] **Step 3: Implement `detect_sentences`** 203 - 204 - Create `pipeline/nlp/sentences.py`: 205 - 206 - ```python 207 - """Pass 1: Sentence boundary detection using spaCy.""" 208 - 209 - import spacy 210 - 211 - _nlp = None 212 - 213 - 214 - def _get_nlp(): 215 - global _nlp 216 - if _nlp is None: 217 - _nlp = spacy.load("en_core_web_sm") 218 - return _nlp 219 - 220 - 221 - def detect_sentences(text: str) -> list[dict]: 222 - """Detect sentence boundaries and return byte-range spans. 
223 - 224 - Returns a list of dicts, each with: 225 - byteStart: int — UTF-8 byte offset of sentence start 226 - byteEnd: int — UTF-8 byte offset of sentence end (exclusive) 227 - """ 228 - if not text.strip(): 229 - return [] 230 - 231 - nlp = _get_nlp() 232 - doc = nlp(text) 233 - text_bytes = text.encode("utf-8") 234 - sentences = [] 235 - 236 - for sent in doc.sents: 237 - # spaCy gives character offsets; convert to byte offsets 238 - byte_start = len(text[:sent.start_char].encode("utf-8")) 239 - byte_end = len(text[:sent.end_char].encode("utf-8")) 240 - sentences.append({ 241 - "byteStart": byte_start, 242 - "byteEnd": byte_end, 243 - }) 244 - 245 - return sentences 246 - ``` 247 - 248 - - [ ] **Step 4: Run tests to verify they pass** 249 - 250 - ```bash 251 - cd pipeline && source .venv/bin/activate && pytest tests/test_sentences.py -v 252 - ``` 253 - Expected: all PASS 254 - 255 - - [ ] **Step 5: Commit** 256 - 257 - ```bash 258 - git add pipeline/nlp/sentences.py pipeline/tests/test_sentences.py 259 - git commit -m "feat: sentence boundary detection via spaCy" 260 - ``` 261 - 262 - ### Task 5: Implement paragraph segmentation (Pass 2) 263 - 264 - **Files:** 265 - - Create: `pipeline/tests/test_paragraphs.py` 266 - - Create: `pipeline/nlp/paragraphs.py` 267 - 268 - - [ ] **Step 1: Write the failing test** 269 - 270 - Create `pipeline/tests/test_paragraphs.py`: 271 - 272 - ```python 273 - from nlp.paragraphs import detect_paragraphs 274 - 275 - 276 - def test_basic_paragraphs(): 277 - text = "Hello world. This is sentence two. After a long pause here. New topic starts." 
278 - sentences = [ 279 - {"byteStart": 0, "byteEnd": 12}, 280 - {"byteStart": 13, "byteEnd": 34}, 281 - {"byteStart": 35, "byteEnd": 59}, 282 - {"byteStart": 60, "byteEnd": 77}, 283 - ] 284 - # Words: "Hello"=0, "world."=1, "This"=2, "is"=3, "sentence"=4, "two."=5, 285 - # "After"=6, "a"=7, "long"=8, "pause"=9, "here."=10, 286 - # "New"=11, "topic"=12, "starts."=13 287 - # Big pause gap (3000ms) between word index 5 and 6 (between sentence 2 and 3) 288 - timings = [100, 100, 100, 100, 100, 100, -3000, 100, 100, 100, 100, 100, 100, 100] 289 - start_ms = 0 290 - 291 - paragraphs = detect_paragraphs( 292 - text=text, 293 - timings=timings, 294 - start_ms=start_ms, 295 - sentences=sentences, 296 - pause_threshold_ms=2000, 297 - proximity_words=5, 298 - ) 299 - # Should detect a paragraph break at the sentence boundary near the 3s pause 300 - assert len(paragraphs) == 2 301 - assert paragraphs[0]["byteStart"] == 0 302 - assert paragraphs[1]["byteStart"] == 35 # "After a long pause..." 303 - 304 - 305 - def test_no_long_pauses_single_paragraph(): 306 - text = "One sentence. Two sentence." 
307 - sentences = [ 308 - {"byteStart": 0, "byteEnd": 13}, 309 - {"byteStart": 14, "byteEnd": 27}, 310 - ] 311 - timings = [100, 100, 100, 100] 312 - paragraphs = detect_paragraphs( 313 - text=text, timings=timings, start_ms=0, 314 - sentences=sentences, 315 - ) 316 - assert len(paragraphs) == 1 317 - 318 - 319 - def test_empty_input(): 320 - paragraphs = detect_paragraphs( 321 - text="", timings=[], start_ms=0, sentences=[], 322 - ) 323 - assert paragraphs == [] 324 - ``` 325 - 326 - - [ ] **Step 2: Run test to verify it fails** 327 - 328 - ```bash 329 - cd pipeline && source .venv/bin/activate && pytest tests/test_paragraphs.py -v 330 - ``` 331 - Expected: FAIL with `ModuleNotFoundError` 332 - 333 - - [ ] **Step 3: Implement `detect_paragraphs`** 334 - 335 - Create `pipeline/nlp/paragraphs.py`: 336 - 337 - ```python 338 - """Pass 2: Paragraph segmentation using pause duration + sentence boundaries.""" 339 - 340 - 341 - def detect_paragraphs( 342 - text: str, 343 - timings: list[int], 344 - start_ms: int, 345 - sentences: list[dict], 346 - pause_threshold_ms: int = 2000, 347 - proximity_words: int = 5, 348 - ) -> list[dict]: 349 - """Detect paragraph boundaries from timing gaps and sentence boundaries. 350 - 351 - Returns a list of paragraph dicts with byteStart and byteEnd. 
352 - """ 353 - if not text.strip() or not sentences: 354 - return [] 355 - 356 - text_bytes = text.encode("utf-8") 357 - 358 - # Find word indices where long pauses occur 359 - pause_word_indices: list[int] = [] 360 - word_index = 0 361 - for value in timings: 362 - if value < 0: 363 - if abs(value) >= pause_threshold_ms: 364 - pause_word_indices.append(word_index) 365 - else: 366 - word_index += 1 367 - 368 - if not pause_word_indices: 369 - # No long pauses — entire text is one paragraph 370 - return [{"byteStart": sentences[0]["byteStart"], 371 - "byteEnd": sentences[-1]["byteEnd"]}] 372 - 373 - # Build a char→byte offset map once, then compute word byte starts 374 - text_bytes = text.encode("utf-8") 375 - char_to_byte = [] 376 - byte_pos = 0 377 - for ch in text: 378 - char_to_byte.append(byte_pos) 379 - byte_pos += len(ch.encode("utf-8")) 380 - char_to_byte.append(byte_pos) # sentinel for end of text 381 - 382 - # Find word start char offsets using split positions 383 - words = text.split() 384 - word_byte_starts: list[int] = [] 385 - char_offset = 0 386 - for w in words: 387 - idx = text.index(w, char_offset) 388 - word_byte_starts.append(char_to_byte[idx]) 389 - char_offset = idx + len(w) 390 - 391 - def word_index_for_byte(byte_pos: int) -> int: 392 - """Find the word index closest to a byte position.""" 393 - best = 0 394 - for i, wb in enumerate(word_byte_starts): 395 - if wb <= byte_pos: 396 - best = i 397 - return best 398 - 399 - # Find sentence boundaries (byte positions where one sentence ends 400 - # and the next begins) 401 - sentence_break_byte_positions: list[int] = [] 402 - sentence_break_word_indices: list[int] = [] 403 - for i in range(1, len(sentences)): 404 - bp = sentences[i]["byteStart"] 405 - sentence_break_byte_positions.append(bp) 406 - sentence_break_word_indices.append(word_index_for_byte(bp)) 407 - 408 - # For each long pause, find the nearest sentence boundary 409 - paragraph_break_bytes: set[int] = set() 410 - for pause_wi in 
pause_word_indices: 411 - best_dist = float("inf") 412 - best_bp = None 413 - for sb_wi, sb_bp in zip( 414 - sentence_break_word_indices, sentence_break_byte_positions 415 - ): 416 - dist = abs(sb_wi - pause_wi) 417 - if dist <= proximity_words and dist < best_dist: 418 - best_dist = dist 419 - best_bp = sb_bp 420 - if best_bp is not None: 421 - paragraph_break_bytes.add(best_bp) 422 - 423 - # Build paragraph spans from the break points 424 - sorted_breaks = sorted(paragraph_break_bytes) 425 - paragraphs: list[dict] = [] 426 - current_start = sentences[0]["byteStart"] 427 - 428 - for brk in sorted_breaks: 429 - # Find the sentence that ends just before this break 430 - para_end = brk 431 - for s in sentences: 432 - if s["byteEnd"] <= brk: 433 - para_end = s["byteEnd"] 434 - paragraphs.append({"byteStart": current_start, "byteEnd": para_end}) 435 - current_start = brk 436 - 437 - # Final paragraph 438 - paragraphs.append({ 439 - "byteStart": current_start, 440 - "byteEnd": sentences[-1]["byteEnd"], 441 - }) 442 - 443 - return paragraphs 444 - ``` 445 - 446 - - [ ] **Step 4: Run tests to verify they pass** 447 - 448 - ```bash 449 - cd pipeline && source .venv/bin/activate && pytest tests/test_paragraphs.py -v 450 - ``` 451 - Expected: all PASS 452 - 453 - - [ ] **Step 5: Commit** 454 - 455 - ```bash 456 - git add pipeline/nlp/paragraphs.py pipeline/tests/test_paragraphs.py 457 - git commit -m "feat: paragraph segmentation from pause data + sentence boundaries" 458 - ``` 459 - 460 - ### Task 6: Pipeline orchestrator — process all transcripts 461 - 462 - **Files:** 463 - - Create: `pipeline/nlp/run.py` 464 - - Create: `pipeline/tests/test_run.py` 465 - 466 - - [ ] **Step 1: Write the failing test** 467 - 468 - Create `pipeline/tests/test_run.py`: 469 - 470 - ```python 471 - import json 472 - import os 473 - from pathlib import Path 474 - from nlp.run import process_transcript 475 - 476 - 477 - def test_process_transcript_produces_output(tmp_path): 478 - """Integration 
test: full pipeline on a simple transcript.""" 479 - transcript = { 480 - "text": "Hello world. This is a test. After a long pause. New topic here.", 481 - "startMs": 0, 482 - "timings": [100, 100, 100, 100, 100, 100, -3000, 100, 100, 100, 100, 100, 100, 100], 483 - } 484 - 485 - result = process_transcript(transcript, talk_rkey="test-talk") 486 - 487 - # Should have sentences and paragraphs 488 - assert "sentences" in result 489 - assert "paragraphs" in result 490 - assert len(result["sentences"]) >= 2 491 - assert len(result["paragraphs"]) >= 1 492 - # Each sentence has byte ranges 493 - for s in result["sentences"]: 494 - assert "byteStart" in s 495 - assert "byteEnd" in s 496 - # Metadata present 497 - assert "metadata" in result 498 - assert result["metadata"]["tool"] == "spacy/en_core_web_sm" 499 - ``` 500 - 501 - - [ ] **Step 2: Run test to verify it fails** 502 - 503 - ```bash 504 - cd pipeline && source .venv/bin/activate && pytest tests/test_run.py -v 505 - ``` 506 - 507 - - [ ] **Step 3: Implement the orchestrator** 508 - 509 - Create `pipeline/nlp/run.py`: 510 - 511 - ```python 512 - """Pipeline orchestrator: run all NLP passes on a transcript.""" 513 - 514 - import json 515 - import sys 516 - from pathlib import Path 517 - from nlp.sentences import detect_sentences 518 - from nlp.paragraphs import detect_paragraphs 519 - 520 - 521 - def process_transcript( 522 - transcript: dict, 523 - talk_rkey: str, 524 - pause_threshold_ms: int = 2000, 525 - proximity_words: int = 5, 526 - ) -> dict: 527 - """Run all NLP passes on a single transcript. 
528 - 529 - Args: 530 - transcript: dict with text, startMs, timings 531 - talk_rkey: the talk's record key (for output naming) 532 - 533 - Returns: 534 - dict with sentences, paragraphs, and metadata 535 - """ 536 - text = transcript["text"] 537 - timings = transcript["timings"] 538 - start_ms = transcript["startMs"] 539 - 540 - # Pass 1: Sentence detection 541 - sentences = detect_sentences(text) 542 - 543 - # Pass 2: Paragraph segmentation 544 - paragraphs = detect_paragraphs( 545 - text=text, 546 - timings=timings, 547 - start_ms=start_ms, 548 - sentences=sentences, 549 - pause_threshold_ms=pause_threshold_ms, 550 - proximity_words=proximity_words, 551 - ) 552 - 553 - return { 554 - "talkRkey": talk_rkey, 555 - "sentences": sentences, 556 - "paragraphs": paragraphs, 557 - "metadata": { 558 - "tool": "spacy/en_core_web_sm", 559 - "pauseThresholdMs": pause_threshold_ms, 560 - "proximityWords": proximity_words, 561 - }, 562 - } 563 - 564 - 565 - def main(): 566 - """CLI: read transcripts from appview data/transcripts/, write results to pipeline/data/nlp/.""" 567 - # Match the path used by publish.ts: apps/ionosphere-appview/data/transcripts/ 568 - transcripts_dir = Path(__file__).resolve().parent.parent.parent / "apps" / "ionosphere-appview" / "data" / "transcripts" 569 - output_dir = Path(__file__).resolve().parent.parent / "data" / "nlp" 570 - output_dir.mkdir(parents=True, exist_ok=True) 571 - 572 - if not transcripts_dir.exists(): 573 - print(f"Transcripts directory not found: {transcripts_dir}") 574 - sys.exit(1) 575 - 576 - transcript_files = sorted(transcripts_dir.glob("*.json")) 577 - print(f"Processing {len(transcript_files)} transcripts...") 578 - 579 - for tf in transcript_files: 580 - talk_rkey = tf.stem 581 - transcript = json.loads(tf.read_text()) 582 - 583 - # The cached transcript files contain TranscriptResult format 584 - # (text + words array). We need to encode to compact format first. 585 - # But the pipeline needs text + timings. 
Let's derive timings from words. 586 - if "words" in transcript and "timings" not in transcript: 587 - from nlp.encoding import words_to_compact 588 - compact = words_to_compact(transcript) 589 - else: 590 - compact = transcript 591 - 592 - result = process_transcript(compact, talk_rkey=talk_rkey) 593 - 594 - out_path = output_dir / f"{talk_rkey}.json" 595 - out_path.write_text(json.dumps(result, indent=2)) 596 - print(f" {talk_rkey}: {len(result['sentences'])} sentences, {len(result['paragraphs'])} paragraphs") 597 - 598 - print("Done.") 599 - 600 - 601 - if __name__ == "__main__": 602 - main() 603 - ``` 604 - 605 - - [ ] **Step 4: Create `pipeline/nlp/encoding.py` — helper to convert word-level transcripts to compact format** 606 - 607 - ```python 608 - """Convert word-level transcript format to compact (text + timings) format.""" 609 - 610 - 611 - def words_to_compact(transcript: dict) -> dict: 612 - """Convert TranscriptResult {text, words[{word, start, end}]} to compact {text, startMs, timings}.""" 613 - words = transcript.get("words", []) 614 - if not words: 615 - return {"text": transcript.get("text", ""), "startMs": 0, "timings": []} 616 - 617 - start_ms = round(words[0]["start"] * 1000) 618 - timings = [] 619 - cursor = start_ms 620 - 621 - for w in words: 622 - word_start_ms = round(w["start"] * 1000) 623 - word_end_ms = round(w["end"] * 1000) 624 - duration = word_end_ms - word_start_ms 625 - 626 - gap = word_start_ms - cursor 627 - if gap > 0: 628 - timings.append(-gap) 629 - 630 - timings.append(max(duration, 1)) 631 - cursor = word_end_ms 632 - 633 - return { 634 - "text": transcript["text"], 635 - "startMs": start_ms, 636 - "timings": timings, 637 - } 638 - ``` 639 - 640 - - [ ] **Step 5: Run tests to verify they pass** 641 - 642 - ```bash 643 - cd pipeline && source .venv/bin/activate && pytest tests/ -v 644 - ``` 645 - Expected: all PASS 646 - 647 - - [ ] **Step 6: Commit** 648 - 649 - ```bash 650 - git add pipeline/nlp/run.py 
pipeline/nlp/encoding.py pipeline/tests/test_run.py 651 - git commit -m "feat: NLP pipeline orchestrator with CLI entry point" 652 - ``` 653 - 654 - --- 655 - 656 - ## Chunk 3: TypeScript — update `extractData` for hierarchical structure 657 - 658 - ### Task 7: Add `ParagraphSpan` and `SentenceSpan` types and update `extractData` 659 - 660 - **Files:** 661 - - Modify: `apps/ionosphere/src/lib/transcript.ts` 662 - - Modify: `apps/ionosphere/src/lib/transcript.test.ts` 663 - 664 - - [ ] **Step 1: Write failing tests for hierarchical extraction** 665 - 666 - Add to `apps/ionosphere/src/lib/transcript.test.ts`: 667 - 668 - ```typescript 669 - describe("extractData — hierarchical structure", () => { 670 - it("groups words into sentences and paragraphs when facets present", () => { 671 - const doc = makeDoc([ 672 - { text: "Hello", startNs: 1000, endNs: 2000 }, 673 - { text: "world.", startNs: 2000, endNs: 3000 }, 674 - { text: "New", startNs: 4000, endNs: 5000 }, 675 - { text: "sentence.", startNs: 5000, endNs: 6000 }, 676 - ]); 677 - const encoder = new TextEncoder(); 678 - const text = "Hello world. New sentence."; 679 - // Add sentence facets 680 - doc.facets.push({ 681 - index: { 682 - byteStart: 0, 683 - byteEnd: encoder.encode("Hello world.").length, 684 - }, 685 - features: [{ $type: "tv.ionosphere.facet#sentence" }], 686 - }); 687 - doc.facets.push({ 688 - index: { 689 - byteStart: encoder.encode("Hello world. 
").length, 690 - byteEnd: encoder.encode(text).length, 691 - }, 692 - features: [{ $type: "tv.ionosphere.facet#sentence" }], 693 - }); 694 - // Add paragraph facet (one paragraph covering everything) 695 - doc.facets.push({ 696 - index: { byteStart: 0, byteEnd: encoder.encode(text).length }, 697 - features: [{ $type: "tv.ionosphere.facet#paragraph" }], 698 - }); 699 - 700 - const result = extractData(doc); 701 - expect(result.paragraphs).toHaveLength(1); 702 - expect(result.paragraphs[0].sentences).toHaveLength(2); 703 - expect(result.paragraphs[0].sentences[0].words).toHaveLength(2); 704 - expect(result.paragraphs[0].sentences[1].words).toHaveLength(2); 705 - }); 706 - 707 - it("gracefully degrades to singleton paragraph/sentence when no structural facets", () => { 708 - const doc = makeDoc([ 709 - { text: "Hello", startNs: 1000, endNs: 2000 }, 710 - { text: "world", startNs: 2000, endNs: 3000 }, 711 - ]); 712 - 713 - const result = extractData(doc); 714 - // Should still have paragraphs/sentences structure 715 - expect(result.paragraphs).toHaveLength(1); 716 - expect(result.paragraphs[0].sentences).toHaveLength(1); 717 - expect(result.paragraphs[0].sentences[0].words).toHaveLength(2); 718 - // Legacy flat access still works 719 - expect(result.words).toHaveLength(2); 720 - }); 721 - }); 722 - ``` 723 - 724 - - [ ] **Step 2: Run tests to verify they fail** 725 - 726 - ```bash 727 - cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts 728 - ``` 729 - Expected: FAIL — `paragraphs` property does not exist 730 - 731 - - [ ] **Step 3: Add types and update `extractData`** 732 - 733 - Add these types to `apps/ionosphere/src/lib/transcript.ts`: 734 - 735 - ```typescript 736 - export interface SentenceSpan { 737 - byteStart: number; 738 - byteEnd: number; 739 - words: WordSpan[]; 740 - } 741 - 742 - export interface ParagraphSpan { 743 - byteStart: number; 744 - byteEnd: number; 745 - sentences: SentenceSpan[]; 746 - } 747 - ``` 748 - 749 - Update `extractData` 
to return `paragraphs: ParagraphSpan[]` alongside the existing flat `words` array. The function extracts `#sentence` and `#paragraph` facets, groups words into sentences by byte range overlap, groups sentences into paragraphs, and falls back to singleton wrappers when structural facets are absent. 750 - 751 - Key logic: 752 - 1. Extract words and concepts as before (existing code unchanged). 753 - 2. Extract sentence facets (byteStart/byteEnd from `#sentence` features). Sort by byteStart. 754 - 3. Extract paragraph facets (byteStart/byteEnd from `#paragraph` features). Sort by byteStart. 755 - 4. If no sentence facets: wrap all words in one sentence. If no paragraph facets: wrap all sentences in one paragraph. 756 - 5. Assign each word to its sentence (word.byteStart >= sentence.byteStart && word.byteEnd <= sentence.byteEnd). 757 - 6. Assign each sentence to its paragraph (sentence.byteStart >= paragraph.byteStart && sentence.byteEnd <= paragraph.byteEnd). 758 - 759 - - [ ] **Step 4: Run tests to verify they pass** 760 - 761 - ```bash 762 - cd apps/ionosphere && npx vitest run src/lib/transcript.test.ts 763 - ``` 764 - Expected: all PASS (both new and existing tests) 765 - 766 - - [ ] **Step 5: Commit** 767 - 768 - ```bash 769 - git add apps/ionosphere/src/lib/transcript.ts apps/ionosphere/src/lib/transcript.test.ts 770 - git commit -m "feat: hierarchical extractData with paragraph/sentence grouping" 771 - ``` 772 - 773 - --- 774 - 775 - ## Chunk 4: Update the renderer 776 - 777 - ### Task 8: Update `TranscriptView` to render paragraphs and sentences 778 - 779 - **Files:** 780 - - Modify: `apps/ionosphere/src/app/components/TranscriptView.tsx` 781 - 782 - - [ ] **Step 1: Update the render tree** 783 - 784 - Replace the flat `words.map(...)` rendering with a nested structure: 785 - 786 - ```tsx 787 - {paragraphs.map((para, pi) => ( 788 - <div key={pi} className="mb-4"> 789 - {para.sentences.map((sent, si) => ( 790 - <span key={si} className="sentence"> 791 - 
{sent.words.map((word, wi) => { 792 - const globalIdx = nextWordIdx++; /* running counter; declare `let nextWordIdx = 0` before the returned JSX */ 793 - return ( 794 - <WordSpanComponent 795 - key={globalIdx} 796 - ref={(el) => setWordRef(globalIdx, el)} 797 - word={word} 798 - concept={wordConcepts[globalIdx]?.[0] || null} 799 - currentTimeNs={currentTimeNs} 800 - onSeek={handleSeek} 801 - hasComment={wordHasComment.has(globalIdx)} 802 - /> 803 - ); 804 - })} 805 - </span> 806 - ))} 807 - </div> 808 - ))} 809 - ``` 810 - 811 - The `useMemo` call to `extractData` now destructures `paragraphs` alongside `words` and `wordConcepts`. The global word index is computed by maintaining a running counter across paragraphs and sentences (shown here as `nextWordIdx`, an illustrative name: it starts at 0 and is incremented once per rendered word, so it matches the word's position in the flat `words` array). 812 - 813 - The comment system, reaction groups, text selection, and scroll/time mappings continue to use the flat `words` array (unchanged). Only the DOM structure changes to add the paragraph/sentence grouping. 814 - 815 - - [ ] **Step 2: Verify in browser** 816 - 817 - Start the dev server and load a talk page. Verify: 818 - - Transcripts without structural facets render identically to before (graceful degradation) 819 - - No console errors 820 - - Scroll-to-time and click-to-seek still work 821 - - Comments and reactions still work 822 - 823 - - [ ] **Step 3: Commit** 824 - 825 - ```bash 826 - git add apps/ionosphere/src/app/components/TranscriptView.tsx 827 - git commit -m "feat: render transcripts with paragraph/sentence DOM structure" 828 - ``` 829 - 830 - ### Task 9: Update `WindowedTranscriptView` for paragraph gaps 831 - 832 - **Files:** 833 - - Modify: `apps/ionosphere/src/app/components/WindowedTranscriptView.tsx` 834 - 835 - - [ ] **Step 1: Update `computeMonospaceLayout` to accept paragraph breaks** 836 - 837 - Add a `paragraphStartIndices: Set<number>` parameter. When a word is a paragraph start (its global index is in the set), insert a gap of `LINE_HEIGHT` pixels before that line entry. Add `isParagraphStart: boolean` to `LineEntry`.
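One way to derive `paragraphStartIndices` is to reuse the running word counter from Task 8. The sketch below is a minimal illustration, not the actual `computeMonospaceLayout` change. `WordSpan` is reduced to a stand-in, and `LineSketch`, `firstWordIndex`, `collectParagraphStartIndices`, `lineTops`, and the `LINE_HEIGHT` value of 24 are all assumptions made for the example. It also assumes that every paragraph begins on a new line and that no gap is wanted above the very first line.

```typescript
// Stand-ins for the plan's types; the real WordSpan carries more fields.
interface WordSpan { text: string }
interface SentenceSpan { byteStart: number; byteEnd: number; words: WordSpan[] }
interface ParagraphSpan { byteStart: number; byteEnd: number; sentences: SentenceSpan[] }

// Walk paragraphs in order with a running word count. The count at the moment
// a paragraph begins is the global index of that paragraph's first word.
function collectParagraphStartIndices(paragraphs: ParagraphSpan[]): Set<number> {
  const starts = new Set<number>();
  let globalIdx = 0;
  for (const para of paragraphs) {
    let wordCount = 0;
    for (const sentence of para.sentences) wordCount += sentence.words.length;
    if (wordCount > 0) starts.add(globalIdx); // empty paragraphs contribute no start
    globalIdx += wordCount;
  }
  return starts;
}

// Hypothetical line shape: only the global index of the line's first word matters here.
interface LineSketch { firstWordIndex: number }

const LINE_HEIGHT = 24; // assumed pixel value, for illustration only

// Compute the top offset of every line, adding one LINE_HEIGHT gap above each
// line that opens a paragraph (except the first line of the transcript).
function lineTops(lines: LineSketch[], starts: Set<number>): number[] {
  const tops: number[] = [];
  let y = 0;
  lines.forEach((line, i) => {
    if (i > 0 && starts.has(line.firstWordIndex)) y += LINE_HEIGHT;
    tops.push(y);
    y += LINE_HEIGHT;
  });
  return tops;
}
```

Because the set is keyed by global word index, the same value can feed both the layout gaps here and the `isParagraphStart` flag on each line entry.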
838 - 839 - - [ ] **Step 2: Update the rendering to add paragraph gap spacers** 840 - 841 - For each visible line with `isParagraphStart: true`, render a gap spacer `div` above it. 842 - 843 - - [ ] **Step 3: Update `timeToScrollY` and `scrollYToTime`** 844 - 845 - Gap entries have no time range. Scrolling through a gap seeks to the end of the preceding line's time range (treating the gap as an extension of the previous paragraph's final time). 846 - 847 - - [ ] **Step 4: Verify in browser** 848 - 849 - Load the track view (which uses `WindowedTranscriptView`). Verify paragraph gaps appear and scroll behavior is smooth. 850 - 851 - - [ ] **Step 5: Commit** 852 - 853 - ```bash 854 - git add apps/ionosphere/src/app/components/WindowedTranscriptView.tsx 855 - git commit -m "feat: WindowedTranscriptView paragraph gap support" 856 - ``` 857 - 858 - --- 859 - 860 - ## Chunk 5: Document assembly and publish pipeline 861 - 862 - ### Task 10: Update document assembly to include structural facets 863 - 864 - **Files:** 865 - - Modify: `formats/tv.ionosphere/ts/transcript-encoding.ts` 866 - - Modify: `formats/tv.ionosphere/ts/transcript-encoding.test.ts` 867 - 868 - - [ ] **Step 1: Write the failing test** 869 - 870 - Add to `formats/tv.ionosphere/ts/transcript-encoding.test.ts`: 871 - 872 - ```typescript 873 - describe("decodeToDocumentWithStructure", () => { 874 - it("adds sentence and paragraph facets from NLP annotations", () => { 875 - const compact = encode(contiguous); 876 - const annotations = { 877 - sentences: [ 878 - { byteStart: 0, byteEnd: 11 }, // "hello world" 879 - { byteStart: 12, byteEnd: 26 }, // "this is a test" 880 - ], 881 - paragraphs: [ 882 - { byteStart: 0, byteEnd: 26 }, 883 - ], 884 - }; 885 - const doc = decodeToDocumentWithStructure(compact, annotations); 886 - 887 - const sentenceFacets = doc.facets.filter(f => 888 - f.features.some(feat => feat.$type === "tv.ionosphere.facet#sentence") 889 - ); 890 - const paragraphFacets = doc.facets.filter(f 
=> 891 - f.features.some(feat => feat.$type === "tv.ionosphere.facet#paragraph") 892 - ); 893 - expect(sentenceFacets).toHaveLength(2); 894 - expect(paragraphFacets).toHaveLength(1); 895 - }); 896 - 897 - it("produces valid document without annotations (backward compatible)", () => { 898 - const compact = encode(contiguous); 899 - const doc = decodeToDocumentWithStructure(compact, null); 900 - // Same as decodeToDocument 901 - expect(doc.facets.length).toBe(6); // just timestamp facets 902 - }); 903 - }); 904 - ``` 905 - 906 - - [ ] **Step 2: Run test to verify it fails** 907 - 908 - ```bash 909 - cd formats/tv.ionosphere && npx vitest run ts/transcript-encoding.test.ts 910 - ``` 911 - 912 - - [ ] **Step 3: Implement `decodeToDocumentWithStructure`** 913 - 914 - Add to `formats/tv.ionosphere/ts/transcript-encoding.ts`: 915 - 916 - ```typescript 917 - export interface NlpAnnotations { 918 - sentences: Array<{ byteStart: number; byteEnd: number }>; 919 - paragraphs: Array<{ byteStart: number; byteEnd: number }>; 920 - } 921 - 922 - export function decodeToDocumentWithStructure( 923 - compact: CompactTranscript, 924 - annotations: NlpAnnotations | null, 925 - ): Document { 926 - // Start with the base document (timestamp facets) 927 - const doc = decodeToDocument(compact); 928 - 929 - if (!annotations) return doc; 930 - 931 - // Add sentence facets 932 - for (const s of annotations.sentences) { 933 - doc.facets.push({ 934 - index: { byteStart: s.byteStart, byteEnd: s.byteEnd }, 935 - features: [{ $type: "tv.ionosphere.facet#sentence" }], 936 - }); 937 - } 938 - 939 - // Add paragraph facets 940 - for (const p of annotations.paragraphs) { 941 - doc.facets.push({ 942 - index: { byteStart: p.byteStart, byteEnd: p.byteEnd }, 943 - features: [{ $type: "tv.ionosphere.facet#paragraph" }], 944 - }); 945 - } 946 - 947 - return doc; 948 - } 949 - ``` 950 - 951 - - [ ] **Step 4: Run tests to verify they pass** 952 - 953 - ```bash 954 - cd formats/tv.ionosphere && npx vitest run 
ts/transcript-encoding.test.ts 955 - ``` 956 - 957 - - [ ] **Step 5: Commit** 958 - 959 - ```bash 960 - git add formats/tv.ionosphere/ts/transcript-encoding.ts formats/tv.ionosphere/ts/transcript-encoding.test.ts 961 - git commit -m "feat: decodeToDocumentWithStructure for NLP annotations" 962 - ``` 963 - 964 - ### Task 11: Update publish.ts to include assembled documents on talk records 965 - 966 - **Files:** 967 - - Modify: `apps/ionosphere-appview/src/publish.ts` 968 - 969 - - [ ] **Step 1: Update the talk publishing step** 970 - 971 - After publishing transcripts (step 4 in publish.ts), add a step that: 972 - 1. For each talk, checks if NLP output exists at `pipeline/data/nlp/{rkey}.json` 973 - 2. If it does, reads the NLP annotations 974 - 3. Calls `decodeToDocumentWithStructure` with the compact transcript + annotations 975 - 4. Includes the assembled `document` field on the `tv.ionosphere.talk` record 976 - 977 - This moves document assembly from serve time to publish time, as specified in the design. 978 - 979 - - [ ] **Step 2: Verify by running publish in dry-run or against local PDS** 980 - 981 - Check that talk records now include the `document` field with sentence/paragraph facets. 982 - 983 - - [ ] **Step 3: Commit** 984 - 985 - ```bash 986 - git add apps/ionosphere-appview/src/publish.ts 987 - git commit -m "feat: publish assembled documents with structural facets on talk records" 988 - ``` 989 - 990 - ### Task 12: Update appview routes to serve pre-assembled documents 991 - 992 - **Files:** 993 - - Modify: `apps/ionosphere-appview/src/routes.ts` 994 - 995 - - [ ] **Step 1: Remove `overlayAnnotations` and serve pre-assembled document** 996 - 997 - In the `getTalk` route handler: 998 - 1. Remove the `overlayAnnotations` function entirely (lines 17-59). 999 - 2. Remove the annotation overlay logic in the route (lines 173-185). 1000 - 3. If the talk record has a `document` field in the DB, serve it directly. 1001 - 4. 
Fall back to `decodeToDocument` from the compact transcript if no pre-assembled document exists (backward compatibility during transition). 1002 - 1003 - - [ ] **Step 2: Update the indexer to store the document field** 1004 - 1005 - In `apps/ionosphere-appview/src/indexer.ts`, update the `indexTalk` function's INSERT statement (lines 176-197). The `talks` table already has a `document TEXT` column (line 54 of db.ts), but the INSERT does not include it. Add `document` to the column list and bind `record.document ? JSON.stringify(record.document) : null` as the value. This is a SQL change — the column list and VALUES placeholders must both be updated. 1006 - 1007 - - [ ] **Step 3: Commit** 1008 - 1009 - ```bash 1010 - git add apps/ionosphere-appview/src/routes.ts apps/ionosphere-appview/src/indexer.ts 1011 - git commit -m "feat: serve pre-assembled documents, remove overlayAnnotations" 1012 - ``` 1013 - 1014 - --- 1015 - 1016 - ## Chunk 6: Remove old enrichment system 1017 - 1018 - ### Task 13: Remove old annotation/enrichment code 1019 - 1020 - **Files:** 1021 - - Delete: `apps/ionosphere-appview/src/enrich.ts` 1022 - - Delete: `apps/ionosphere-appview/src/enrich-all.ts` 1023 - - Delete: `apps/ionosphere-appview/src/publish-annotations.ts` 1024 - - Modify: `apps/ionosphere-appview/src/indexer.ts` — remove `tv.ionosphere.annotation` handling 1025 - - Modify: `apps/ionosphere-appview/src/routes.ts` — remove annotation-related queries from `getTalk` 1026 - 1027 - - [ ] **Step 1: Delete enrichment files** 1028 - 1029 - ```bash 1030 - rm apps/ionosphere-appview/src/enrich.ts 1031 - rm apps/ionosphere-appview/src/enrich-all.ts 1032 - rm apps/ionosphere-appview/src/publish-annotations.ts 1033 - ``` 1034 - 1035 - - [ ] **Step 2: Remove annotation indexing from `indexer.ts`** 1036 - 1037 - Remove `"tv.ionosphere.annotation"` from the `IONOSPHERE_COLLECTIONS` array (line 28). Remove the annotation delete case (lines 72-75). Remove the annotation create/update case (lines 116-117).
Remove the `indexAnnotation` function and `rebuildTalkConcepts` helper. 1038 - 1039 - - [ ] **Step 3: Remove annotation queries from `routes.ts`** 1040 - 1041 - In the `getTalk` route, remove the concepts query (lines 149-157) and the annotation overlay logic. The concepts data will return via layers.pub in Phase 2. 1042 - 1043 - - [ ] **Step 4: Remove annotation publishing from `publish.ts`** 1044 - 1045 - Remove step 6 (lines 158-177) that publishes `tv.ionosphere.annotation` records. 1046 - 1047 - - [ ] **Step 5: Verify the appview still starts and serves talks** 1048 - 1049 - ```bash 1050 - cd apps/ionosphere-appview && npx tsx src/appview.ts 1051 - ``` 1052 - Hit the `/xrpc/tv.ionosphere.getTalk?rkey=<some-rkey>` endpoint and verify it returns a talk with a document. 1053 - 1054 - - [ ] **Step 6: Commit** 1055 - 1056 - ```bash 1057 - git add -A apps/ionosphere-appview/src/ 1058 - git commit -m "chore: remove old enrichment system (enrich.ts, annotations, overlayAnnotations)" 1059 - ``` 1060 - 1061 - --- 1062 - 1063 - ## Chunk 7: End-to-end integration and verification 1064 - 1065 - **IMPORTANT:** Tasks 11-12 create the publish-time document assembly path, but existing talks in the appview DB will have NULL documents until a full re-publish is done. Task 14 performs this re-publish. Do NOT deploy Tasks 11-12 without running Task 14, or existing talks will lose concept overlays with no replacement. 1066 - 1067 - ### Task 14: Run the full pipeline end-to-end 1068 - 1069 - - [ ] **Step 1: Run the Python NLP pipeline on all transcripts** 1070 - 1071 - ```bash 1072 - cd pipeline && source .venv/bin/activate && python -m nlp.run 1073 - ``` 1074 - 1075 - Verify output files appear in `pipeline/data/nlp/` with sentence and paragraph data. 1076 - 1077 - - [ ] **Step 2: Spot-check 3-5 NLP output files** 1078 - 1079 - Open output JSON files for talks of different types (presentation, panel, lightning talk). 
Verify: 1080 - - Sentence count is reasonable (expect 50-300 for a 20-min talk) 1081 - - Paragraph count is reasonable (expect 5-30) 1082 - - Byte ranges are valid (byteStart < byteEnd, monotonically increasing) 1083 - - Paragraph boundaries fall at sentence boundaries 1084 - 1085 - - [ ] **Step 3: Run the TypeScript publish pipeline** 1086 - 1087 - ```bash 1088 - cd apps/ionosphere-appview && npx tsx src/publish.ts 1089 - ``` 1090 - 1091 - Verify talk records now include the `document` field with structural facets. 1092 - 1093 - - [ ] **Step 4: Start the appview and frontend, verify in browser** 1094 - 1095 - Start the dev environment and load several talk pages. Verify: 1096 - - Paragraphs have visible vertical spacing 1097 - - Sentences are grouped as inline spans 1098 - - Scroll-to-time and click-to-seek work correctly 1099 - - The playhead brightness gradient is smooth across paragraph breaks 1100 - - Comments and reactions still work 1101 - - Talks without NLP data still render correctly (graceful degradation) 1102 - 1103 - - [ ] **Step 5: Commit any fixes found during verification** 1104 - 1105 - ```bash 1106 - git add -A && git commit -m "fix: integration fixes from end-to-end verification" 1107 - ``` 1108 - 1109 - ### Task 15: Final cleanup 1110 - 1111 - - [ ] **Step 1: Run all tests** 1112 - 1113 - ```bash 1114 - # Python 1115 - cd pipeline && source .venv/bin/activate && pytest -v 1116 - 1117 - # TypeScript 1118 - cd ../.. && npx vitest run 1119 - ``` 1120 - 1121 - All tests should pass. 1122 - 1123 - - [ ] **Step 2: Update `.gitignore` for Python artifacts** 1124 - 1125 - Add to `.gitignore`: 1126 - ``` 1127 - pipeline/.venv/ 1128 - pipeline/data/ 1129 - __pycache__/ 1130 - *.pyc 1131 - ``` 1132 - 1133 - - [ ] **Step 3: Final commit** 1134 - 1135 - ```bash 1136 - git add -A 1137 - git commit -m "chore: final cleanup — tests passing, gitignore updated" 1138 - ```
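The byte-range criteria in Task 14, Step 2 (valid ranges, increasing order, paragraph boundaries on sentence boundaries) can also be checked mechanically instead of by eye. The following is a minimal sketch under stated assumptions: `checkNlpRanges` and `ByteRange` are hypothetical names, "monotonically increasing" is interpreted as "each range starts at or after the end of the previous one", and the sentence and paragraph count heuristics are left out because they depend on talk length.

```typescript
interface ByteRange { byteStart: number; byteEnd: number }

// Returns a list of human-readable problems; an empty list means all checks passed.
function checkNlpRanges(sentences: ByteRange[], paragraphs: ByteRange[]): string[] {
  const problems: string[] = [];

  // Each range must be non-empty and must not start before the previous one ends.
  const checkOrdering = (name: string, ranges: ByteRange[]): void => {
    let previousEnd = 0;
    ranges.forEach((range, i) => {
      if (range.byteStart >= range.byteEnd) {
        problems.push(`${name}[${i}]: byteStart must be less than byteEnd`);
      }
      if (i > 0 && range.byteStart < previousEnd) {
        problems.push(`${name}[${i}]: starts before the previous range ends`);
      }
      previousEnd = range.byteEnd;
    });
  };
  checkOrdering("sentences", sentences);
  checkOrdering("paragraphs", paragraphs);

  // Every paragraph must begin where some sentence begins and end where some sentence ends.
  const sentenceStarts = new Set(sentences.map((s) => s.byteStart));
  const sentenceEnds = new Set(sentences.map((s) => s.byteEnd));
  paragraphs.forEach((p, i) => {
    if (!sentenceStarts.has(p.byteStart)) {
      problems.push(`paragraphs[${i}]: byteStart is not a sentence start`);
    }
    if (!sentenceEnds.has(p.byteEnd)) {
      problems.push(`paragraphs[${i}]: byteEnd is not a sentence end`);
    }
  });

  return problems;
}
```

Running this over each parsed file in `pipeline/data/nlp/` and printing any non-empty result would narrow the manual spot-check down to the files that actually look suspicious.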
-751
docs/superpowers/plans/2026-04-13-layers-pub-publishing.md
··· 1 - # layers.pub Record Publishing via Panproto Lenses — Implementation Plan 2 - 3 - > **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking. 4 - 5 - **Goal:** Publish NLP enrichment data as layers.pub AT Protocol records, with panproto lenses as the authoritative transform pipeline, and index those records back into the appview's materialized document view. 6 - 7 - **Architecture:** Data flows through panproto lenses: compact transcript → expression + segmentation (Lens 1), NLP annotations → annotation layers (Lens 2). The appview indexes layers.pub records and rebuilds the materialized talk document using a reverse lens (Lens 3). No parallel TypeScript transform pipelines — lenses are the single source of truth. 8 - 9 - **Tech Stack:** @panproto/core v0.25.1 (WASM), AT Protocol lexicons, better-sqlite3, Hono, vitest 10 - 11 - **Spec:** `docs/superpowers/specs/2026-04-13-layers-pub-publishing-design.md` 12 - 13 - --- 14 - 15 - ## File Structure 16 - 17 - ### New files 18 - - `formats/tv.ionosphere/nlpAnnotations.lexicon.json` — NLP annotations schema (lens source, not published to PDS) 19 - - `formats/tv.ionosphere/lenses/transcript-to-expression.lens.json` — Lens 1 spec 20 - - `formats/tv.ionosphere/lenses/nlp-to-annotation-layers.lens.json` — Lens 2 spec 21 - - `formats/tv.ionosphere/lenses/layers-to-document.lens.json` — Lens 3 spec 22 - - `formats/tv.ionosphere/ts/layers-pub.ts` — layers.pub record builders (runs data through panproto lenses) 23 - - `apps/ionosphere-appview/src/layers-indexer.ts` — indexer for layers.pub records + document rebuild 24 - - `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts` — round-trip test: publish → index → verify 25 - 26 - ### Modified files 27 - - `apps/ionosphere-appview/src/publish.ts` — add Stage 6 (layers.pub publishing) 28 - - 
`apps/ionosphere-appview/src/db.ts` — add 3 new tables (layers_expressions, layers_segmentations, layers_annotations) 29 - - `apps/ionosphere-appview/src/indexer.ts` — add layers.pub collections to IONOSPHERE_COLLECTIONS + wire to layers-indexer 30 - 31 - --- 32 - 33 - ## Chunk 1: NLP Annotations Lexicon + Lens 1 (Transcript → Expression + Segmentation) 34 - 35 - ### Task 1: Define the NLP annotations lexicon 36 - 37 - This lexicon formalizes the JSON shape produced by the NLP pipeline so panproto can use it as a lens source schema. It is never published to PDS. 38 - 39 - **Files:** 40 - - Create: `formats/tv.ionosphere/nlpAnnotations.lexicon.json` 41 - 42 - - [ ] **Step 1: Write the lexicon** 43 - 44 - The NLP JSON has this shape (from `pipeline/data/nlp/*.json`): 45 - ```json 46 - { 47 - "talkRkey": "string", 48 - "sentences": [{ "byteStart": 0, "byteEnd": 214 }], 49 - "paragraphs": [{ "byteStart": 0, "byteEnd": 1729 }], 50 - "entities": [{ "byteStart": 15, "byteEnd": 19, "label": "Matt", "nerType": "PERSON", "conceptUri": "at://..." }], 51 - "topicBreaks": [{ "byteStart": 1596 }], 52 - "metadata": { "tool": "spacy/en_core_web_sm", "pauseThresholdMs": 2000, "proximityWords": 5 } 53 - } 54 - ``` 55 - 56 - Write `formats/tv.ionosphere/nlpAnnotations.lexicon.json` as an ATProto lexicon that models this exactly. 
Key types: 57 - - `nlpSentence`: `{ byteStart: integer, byteEnd: integer }` 58 - - `nlpParagraph`: `{ byteStart: integer, byteEnd: integer }` 59 - - `nlpEntity`: `{ byteStart: integer, byteEnd: integer, label: string, nerType: string, conceptUri?: string (format: at-uri) }` 60 - - `nlpTopicBreak`: `{ byteStart: integer }` 61 - - `nlpMetadata`: `{ tool: string, pauseThresholdMs?: integer, proximityWords?: integer }` 62 - - Main record: `{ talkRkey: string, sentences: nlpSentence[], paragraphs: nlpParagraph[], entities: nlpEntity[], topicBreaks: nlpTopicBreak[], metadata: nlpMetadata }` 63 - 64 - - [ ] **Step 2: Validate the lexicon parses with panproto** 65 - 66 - ```bash 67 - cd apps/ionosphere-appview 68 - npx tsx -e " 69 - import { loadSchema } from '../../formats/tv.ionosphere/ts/panproto.js'; 70 - import nlp from '../../formats/tv.ionosphere/nlpAnnotations.lexicon.json' assert { type: 'json' }; 71 - const schema = await loadSchema(nlp); 72 - console.log('NLP annotations schema loaded:', !!schema); 73 - " 74 - ``` 75 - 76 - Expected: `NLP annotations schema loaded: true` 77 - 78 - - [ ] **Step 3: Commit** 79 - 80 - ```bash 81 - git add formats/tv.ionosphere/nlpAnnotations.lexicon.json 82 - git commit -m "feat: define tv.ionosphere.nlpAnnotations lexicon for lens source schema" 83 - ``` 84 - 85 - ### Task 2: Define Lens 1 — compact transcript → expression + segmentation 86 - 87 - This lens transforms `tv.ionosphere.transcript` records into `pub.layers.expression.expression` + `pub.layers.segmentation.segmentation` records. It is the authoritative transform for the text and temporal mapping. 
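To make the temporal mapping concrete, here is a minimal sketch of the timings replay that Step 3 of this task describes: positive entries are word durations, negative entries are silence gaps that only advance the clock. It is an illustration under assumptions rather than the actual `decodeToDocument` code. `replayTimings`, `CompactTranscriptSketch`, and `TokenSketch` are hypothetical names, and the token shape is inferred from the test expectations in Step 1.

```typescript
interface CompactTranscriptSketch { text: string; startMs: number; timings: number[] }

interface TokenSketch {
  tokenIndex: number;
  text: string;
  textSpan: { byteStart: number; byteEnd: number };
  temporalSpan: { start: number; ending: number };
}

function replayTimings(t: CompactTranscriptSketch): TokenSketch[] {
  // First pass: fold each run of negative entries into the gap before the next word.
  const wordTimings: Array<{ gapBefore: number; duration: number }> = [];
  let pendingGap = 0;
  for (const value of t.timings) {
    if (value < 0) {
      pendingGap += -value;
    } else {
      wordTimings.push({ gapBefore: pendingGap, duration: value });
      pendingGap = 0;
    }
  }

  // Second pass: walk whitespace-separated words, tracking UTF-8 byte offsets.
  const encoder = new TextEncoder();
  const tokens: TokenSketch[] = [];
  const wordPattern = /\S+/g;
  let cursorMs = t.startMs;
  let charPos = 0;
  let bytePos = 0;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(t.text)) !== null) {
    const timing = wordTimings[tokens.length];
    if (!timing) break; // more words than durations: stop rather than guess

    const word = match[0];
    bytePos += encoder.encode(t.text.slice(charPos, match.index)).length; // skipped whitespace
    const byteStart = bytePos;
    bytePos += encoder.encode(word).length;
    charPos = match.index + word.length;

    cursorMs += timing.gapBefore;
    tokens.push({
      tokenIndex: tokens.length,
      text: word,
      textSpan: { byteStart, byteEnd: bytePos },
      temporalSpan: { start: cursorMs, ending: cursorMs + timing.duration },
    });
    cursorMs += timing.duration;
  }
  return tokens;
}
```

With the transcript from the Step 1 test (`startMs: 1000`, `timings: [200, 300, -100, 150, 250]`), the first token spans bytes 0 to 5 and 1000 to 1200 ms, and `foo` starts at 1600 ms because the 100 ms gap is added before it.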
88 - 89 - **Files:** 90 - - Create: `formats/tv.ionosphere/lenses/transcript-to-expression.lens.json` 91 - - Create: `formats/tv.ionosphere/ts/layers-pub.ts` 92 - - Create: `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts` 93 - 94 - - [ ] **Step 1: Write the failing test** 95 - 96 - Create `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts`: 97 - 98 - ```typescript 99 - import { describe, it, expect } from 'vitest'; 100 - import { transcriptToLayersPub } from '../../../formats/tv.ionosphere/ts/layers-pub.js'; 101 - 102 - describe('Lens 1: transcript → expression + segmentation', () => { 103 - const transcript = { 104 - $type: 'tv.ionosphere.transcript', 105 - talkUri: 'at://did:plc:test/tv.ionosphere.talk/test-talk', 106 - text: 'Hello world foo bar', 107 - startMs: 1000, 108 - // word durations: Hello=200ms, world=300ms, 100ms gap, foo=150ms, bar=250ms 109 - timings: [200, 300, -100, 150, 250], 110 - }; 111 - 112 - const did = 'did:plc:test'; 113 - const talkRkey = 'test-talk'; 114 - 115 - it('produces an expression record with correct fields', async () => { 116 - const { expression } = await transcriptToLayersPub(transcript, did, talkRkey); 117 - expect(expression.$type).toBe('pub.layers.expression.expression'); 118 - expect(expression.id).toBe('test-talk'); 119 - expect(expression.kind).toBe('transcript'); 120 - expect(expression.text).toBe('Hello world foo bar'); 121 - expect(expression.language).toBe('en'); 122 - expect(expression.sourceRef).toBe('at://did:plc:test/tv.ionosphere.transcript/test-talk-transcript'); 123 - expect(expression.metadata.tool).toBe('ionosphere-pipeline'); 124 - expect(expression.metadata.timestamp).toBeDefined(); 125 - expect(expression.createdAt).toBeDefined(); 126 - }); 127 - 128 - it('produces a segmentation record with word tokens', async () => { 129 - const { segmentation } = await transcriptToLayersPub(transcript, did, talkRkey); 130 - expect(segmentation.$type).toBe('pub.layers.segmentation.segmentation'); 131 - 
expect(segmentation.expression).toBe( 132 - 'at://did:plc:test/pub.layers.expression.expression/test-talk-expression' 133 - ); 134 - expect(segmentation.tokenizations).toHaveLength(1); 135 - 136 - const tok = segmentation.tokenizations[0]; 137 - expect(tok.kind).toBe('word'); 138 - expect(tok.tokens).toHaveLength(4); 139 - 140 - // Check first token 141 - expect(tok.tokens[0].tokenIndex).toBe(0); 142 - expect(tok.tokens[0].text).toBe('Hello'); 143 - expect(tok.tokens[0].textSpan.byteStart).toBe(0); 144 - expect(tok.tokens[0].textSpan.byteEnd).toBe(5); 145 - expect(tok.tokens[0].temporalSpan.start).toBe(1000); 146 - expect(tok.tokens[0].temporalSpan.ending).toBe(1200); 147 - 148 - // Check third token (after gap) 149 - expect(tok.tokens[2].text).toBe('foo'); 150 - expect(tok.tokens[2].temporalSpan.start).toBe(1600); // 1000+200+300+100gap 151 - expect(tok.tokens[2].temporalSpan.ending).toBe(1750); 152 - }); 153 - }); 154 - ``` 155 - 156 - - [ ] **Step 2: Run the test to verify it fails** 157 - 158 - ```bash 159 - cd apps/ionosphere-appview 160 - npx vitest run src/__tests__/layers-pub.test.ts 161 - ``` 162 - 163 - Expected: FAIL — `transcriptToLayersPub` does not exist yet. 164 - 165 - - [ ] **Step 3: Write the layers-pub module with Lens 1** 166 - 167 - Create `formats/tv.ionosphere/ts/layers-pub.ts`. This module initializes panproto, loads the transcript and layers.pub schemas, builds the lens, and exposes `transcriptToLayersPub()`. 168 - 169 - The function must: 170 - 1. Initialize panproto WASM (via existing `init()` from panproto.ts) 171 - 2. Load `tv.ionosphere.transcript` schema and `pub.layers.expression.expression` + `pub.layers.segmentation.segmentation` schemas 172 - 3. Use `autoGenerateWithHints()` to create a protolens chain with hints mapping transcript fields to layers.pub fields 173 - 4. Run the transcript record through the lens 174 - 5. Post-process: inject `$type`, `sourceRef`, `expression` URI (pre-computed from DID + rkey), `createdAt` 175 - 6. 
Return `{ expression, segmentation }` 176 - 177 - The timings replay algorithm is the same as `decodeToDocument()` in `transcript-encoding.ts`: 178 - - Split text by whitespace to get words 179 - - Use TextEncoder for UTF-8 byte offsets 180 - - Iterate timings: negative = silence gap (advance cursor), positive = word duration 181 - - Each word becomes a token with `textSpan: { byteStart, byteEnd }` and `temporalSpan: { start, ending }` in ms 182 - 183 - If panproto's auto-generated lens cannot handle the timings array → token list transform natively (likely — this is algorithmic, not structural), implement the timings replay in TypeScript and feed the pre-built token array into the lens as a morphism hint's computed field. The lens still owns the structural mapping; the timings replay is a computed input. 184 - 185 - - [ ] **Step 4: Run the test to verify it passes** 186 - 187 - ```bash 188 - cd apps/ionosphere-appview 189 - npx vitest run src/__tests__/layers-pub.test.ts 190 - ``` 191 - 192 - Expected: PASS 193 - 194 - - [ ] **Step 5: Commit** 195 - 196 - ```bash 197 - git add formats/tv.ionosphere/ts/layers-pub.ts formats/tv.ionosphere/lenses/transcript-to-expression.lens.json apps/ionosphere-appview/src/__tests__/layers-pub.test.ts 198 - git commit -m "feat: Lens 1 — compact transcript to layers.pub expression + segmentation" 199 - ``` 200 - 201 - --- 202 - 203 - ## Chunk 2: Lens 2 (NLP Annotations → Annotation Layers) 204 - 205 - ### Task 3: Define Lens 2 — NLP annotations → 4 annotation layers 206 - 207 - This lens transforms the NLP pipeline JSON into 4 `pub.layers.annotation.annotationLayer` records (sentences, paragraphs, entities, topics). 
208 - 209 - **Files:** 210 - - Create: `formats/tv.ionosphere/lenses/nlp-to-annotation-layers.lens.json` 211 - - Modify: `formats/tv.ionosphere/ts/layers-pub.ts` — add `nlpToAnnotationLayers()` 212 - - Modify: `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts` — add Lens 2 tests 213 - 214 - - [ ] **Step 1: Write the failing tests** 215 - 216 - Add to `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts`: 217 - 218 - ```typescript 219 - import { nlpToAnnotationLayers } from '../../../formats/tv.ionosphere/ts/layers-pub.js'; 220 - 221 - describe('Lens 2: NLP annotations → annotation layers', () => { 222 - const nlpAnnotations = { 223 - talkRkey: 'test-talk', 224 - sentences: [ 225 - { byteStart: 0, byteEnd: 11 }, 226 - { byteStart: 12, byteEnd: 19 }, 227 - ], 228 - paragraphs: [ 229 - { byteStart: 0, byteEnd: 19 }, 230 - ], 231 - entities: [ 232 - { byteStart: 0, byteEnd: 5, label: 'Hello', nerType: 'MISC' }, 233 - { byteStart: 12, byteEnd: 15, label: 'foo', nerType: 'ORG', conceptUri: 'at://did:plc:test/tv.ionosphere.concept/foo' }, 234 - ], 235 - topicBreaks: [ 236 - { byteStart: 12 }, 237 - ], 238 - metadata: { tool: 'spacy/en_core_web_sm' }, 239 - }; 240 - 241 - const did = 'did:plc:test'; 242 - const talkRkey = 'test-talk'; 243 - const expressionUri = 'at://did:plc:test/pub.layers.expression.expression/test-talk-expression'; 244 - 245 - it('produces 4 annotation layer records', async () => { 246 - const layers = await nlpToAnnotationLayers(nlpAnnotations, did, talkRkey, expressionUri); 247 - expect(Object.keys(layers)).toEqual(['sentences', 'paragraphs', 'entities', 'topics']); 248 - }); 249 - 250 - it('sentences layer has correct structure', async () => { 251 - const { sentences } = await nlpToAnnotationLayers(nlpAnnotations, did, talkRkey, expressionUri); 252 - expect(sentences.$type).toBe('pub.layers.annotation.annotationLayer'); 253 - expect(sentences.expression).toBe(expressionUri); 254 - expect(sentences.kind).toBe('span'); 255 - 
expect(sentences.subkind).toBe('sentence-boundary'); 256 - expect(sentences.sourceMethod).toBe('automatic'); 257 - expect(sentences.metadata.tool).toBe('ionosphere-nlp-pipeline'); 258 - expect(sentences.annotations).toHaveLength(2); 259 - expect(sentences.annotations[0].anchor.textSpan).toEqual({ byteStart: 0, byteEnd: 11 }); 260 - }); 261 - 262 - it('entities layer wraps features in featureMap', async () => { 263 - const { entities } = await nlpToAnnotationLayers(nlpAnnotations, did, talkRkey, expressionUri); 264 - expect(entities.annotations).toHaveLength(2); 265 - 266 - // Plain entity — nerType is always present 267 - const plain = entities.annotations[0]; 268 - expect(plain.label).toBe('Hello'); 269 - expect(plain.features.entries).toContainEqual({ key: 'nerType', value: 'MISC' }); 270 - 271 - // Entity with conceptUri — all known keys forwarded to features 272 - const withConcept = entities.annotations[1]; 273 - expect(withConcept.features.entries).toContainEqual({ 274 - key: 'conceptUri', 275 - value: 'at://did:plc:test/tv.ionosphere.concept/foo', 276 - }); 277 - expect(withConcept.features.entries).toContainEqual({ key: 'nerType', value: 'ORG' }); 278 - }); 279 - 280 - it('topics layer has correct subkind and uses zero-width spans', async () => { 281 - const { topics } = await nlpToAnnotationLayers(nlpAnnotations, did, talkRkey, expressionUri); 282 - expect(topics.subkind).toBe('topic-segment'); 283 - expect(topics.annotations).toHaveLength(1); 284 - expect(topics.annotations[0].anchor.textSpan).toEqual({ byteStart: 12, byteEnd: 12 }); 285 - }); 286 - }); 287 - ``` 288 - 289 - - [ ] **Step 2: Run tests to verify they fail** 290 - 291 - ```bash 292 - cd apps/ionosphere-appview 293 - npx vitest run src/__tests__/layers-pub.test.ts 294 - ``` 295 - 296 - Expected: FAIL — `nlpToAnnotationLayers` does not exist. 
297 - 298 - - [ ] **Step 3: Implement nlpToAnnotationLayers in layers-pub.ts** 299 - 300 - Add `nlpToAnnotationLayers()` to `formats/tv.ionosphere/ts/layers-pub.ts`. 301 - 302 - The function: 303 - 1. Loads the NLP annotations schema and annotation layer schema via panproto 304 - 2. Builds protolens with hints for each annotation type 305 - 3. For each of the 4 annotation types, maps the NLP data to a `pub.layers.annotation.annotationLayer` record: 306 - 307 - **Sentences layer** (`{talkRkey}-sentences`): 308 - - Each sentence `{ byteStart, byteEnd }` → annotation with `anchor: { textSpan: { byteStart, byteEnd } }`, `label`: truncated text (or "sentence") 309 - 310 - **Paragraphs layer** (`{talkRkey}-paragraphs`): 311 - - Each paragraph → annotation with `anchor: { textSpan }`, `label: "paragraph"` 312 - 313 - **Entities layer** (`{talkRkey}-entities`): 314 - - Each entity → annotation with `anchor: { textSpan }`, `label: entity.label` 315 - - `features: { entries: [] }` — forward all entity keys beyond byteStart/byteEnd/label into entries: `nerType` always, `conceptUri` if present, and any future keys (e.g., `speakerDid`) passthrough automatically 316 - 317 - **Topics layer** (`{talkRkey}-topics`): 318 - - Each topicBreak → annotation with `anchor: { textSpan: { byteStart, byteEnd: byteStart } }` (zero-width), `label: "topic-break"` 319 - 320 - All layers get: `$type`, `expression` URI, `kind: "span"`, `sourceMethod: "automatic"`, `metadata: { tool, timestamp }`, `createdAt`. 321 - 322 - Similar to Lens 1: if the structural fan-out (one source → four targets) is beyond what panproto's protolens can express natively, implement the fan-out in TypeScript and use the lens for each individual layer's structural mapping. The lens remains authoritative for the shape of each annotation layer record. 
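The entity and topic mappings described in Step 3 can be sketched roughly as follows. This is a minimal illustration, not the planned `nlpToAnnotationLayers()` implementation: the type names, `entityToAnnotation`, and `topicBreakToAnnotation` are hypothetical helpers, and stringifying every feature value is an assumption; the authoritative record shape comes from the vendored `pub.layers` lexicons.

```typescript
// Hypothetical shapes for illustration only; the real ones are defined by
// the vendored pub.layers lexicons and may differ.
interface TextSpan { byteStart: number; byteEnd: number }
interface FeatureEntry { key: string; value: string }
interface Annotation {
  anchor: { textSpan: TextSpan };
  label: string;
  features?: { entries: FeatureEntry[] };
}
interface NlpEntity {
  byteStart: number;
  byteEnd: number;
  label: string;
  [key: string]: unknown; // nerType, conceptUri, speakerDid, future keys
}

// Keys that map onto the annotation itself rather than into features.
const STRUCTURAL_KEYS = new Set(["byteStart", "byteEnd", "label"]);

// Forward every non-structural key into features.entries, so new keys
// pass through without code changes.
function entityToAnnotation(entity: NlpEntity): Annotation {
  const entries: FeatureEntry[] = Object.entries(entity)
    .filter(([key, value]) => !STRUCTURAL_KEYS.has(key) && value !== undefined && value !== null)
    .map(([key, value]) => ({ key, value: String(value) })); // assumption: values stored as strings
  return {
    anchor: { textSpan: { byteStart: entity.byteStart, byteEnd: entity.byteEnd } },
    label: entity.label,
    features: { entries },
  };
}

// Topic breaks are points, so the span is zero-width (byteEnd === byteStart).
function topicBreakToAnnotation(topicBreak: { byteStart: number }): Annotation {
  return {
    anchor: { textSpan: { byteStart: topicBreak.byteStart, byteEnd: topicBreak.byteStart } },
    label: "topic-break",
  };
}
```

Filtering on a small set of structural keys, rather than listing known feature keys, is what lets a later key such as `speakerDid` flow into `features.entries` without touching this code.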
323 - 324 - - [ ] **Step 4: Run tests to verify they pass** 325 - 326 - ```bash 327 - cd apps/ionosphere-appview 328 - npx vitest run src/__tests__/layers-pub.test.ts 329 - ``` 330 - 331 - Expected: All PASS 332 - 333 - - [ ] **Step 5: Commit** 334 - 335 - ```bash 336 - git add formats/tv.ionosphere/ts/layers-pub.ts formats/tv.ionosphere/lenses/nlp-to-annotation-layers.lens.json apps/ionosphere-appview/src/__tests__/layers-pub.test.ts 337 - git commit -m "feat: Lens 2 — NLP annotations to 4 annotation layer records" 338 - ``` 339 - 340 - --- 341 - 342 - ## Chunk 3: Publish Pipeline Stage 6 343 - 344 - ### Task 4: Add layers.pub publishing to publish.ts 345 - 346 - Wire the lens functions into the existing publish pipeline as a new Stage 6. 347 - 348 - **Files:** 349 - - Modify: `apps/ionosphere-appview/src/publish.ts` — add Stage 6 350 - - Modify: `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts` — add integration test 351 - 352 - - [ ] **Step 1: Write the failing integration test** 353 - 354 - Add to `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts`: 355 - 356 - ```typescript 357 - describe('Stage 6: layers.pub publish pipeline', () => { 358 - it('produces 6 records for a talk with transcript + NLP data', async () => { 359 - // Use real fixture data from pipeline/data/ 360 - // Read a compact transcript and NLP annotations for the same talk 361 - // Run both lenses, verify 6 records produced with correct rkeys and $type values 362 - }); 363 - }); 364 - ``` 365 - 366 - The test should use fixture data from `pipeline/data/nlp/ats26-keynote.json` and a corresponding transcript to verify end-to-end record production. It does NOT publish to a PDS — it verifies the lens output shape. 
367 - 368 - - [ ] **Step 2: Run test to verify it fails** 369 - 370 - ```bash 371 - cd apps/ionosphere-appview 372 - npx vitest run src/__tests__/layers-pub.test.ts 373 - ``` 374 - 375 - - [ ] **Step 3: Add Stage 6 to publish.ts** 376 - 377 - Add after the existing transcript publishing stage in `apps/ionosphere-appview/src/publish.ts`: 378 - 379 - ```typescript 380 - // ── Stage 6: layers.pub records ──────────────────────────────────────────── 381 - console.log("\n=== Stage 6: layers.pub records ==="); 382 - ``` 383 - 384 - For each talk that has both a transcript and NLP annotations: 385 - - Transcripts: `apps/data/transcripts/{rkey}.json` (same path as existing Stage 5, resolved via `../../data/transcripts` from `src/`) 386 - - NLP: `pipeline/data/nlp/{rkey}.json` (same path as existing Stage 4) 387 - 388 - 1. Load transcript JSON → `encode()` → CompactTranscript 389 - 2. Load NLP annotations JSON 390 - 3. Call `transcriptToLayersPub(transcriptRecord, did, rkey)` → expression + segmentation 391 - 4. Call `nlpToAnnotationLayers(nlpData, did, rkey, expressionUri)` → 4 annotation layers 392 - 5. Publish all 6 records via `pds.putRecord()` — parallel within each talk using `Promise.all()` 393 - 6. Log progress: `Published 6 layers.pub records for {rkey}` 394 - 395 - Record collections and rkeys: 396 - - `pub.layers.expression.expression` / `{rkey}-expression` 397 - - `pub.layers.segmentation.segmentation` / `{rkey}-segmentation` 398 - - `pub.layers.annotation.annotationLayer` / `{rkey}-sentences` 399 - - `pub.layers.annotation.annotationLayer` / `{rkey}-paragraphs` 400 - - `pub.layers.annotation.annotationLayer` / `{rkey}-entities` 401 - - `pub.layers.annotation.annotationLayer` / `{rkey}-topics` 402 - 403 - Also publish the 3 new lens files (transcript-to-expression, nlp-to-annotation-layers, layers-to-document) in Stage 1 alongside the existing 4 lenses. 
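One way to keep the six collection/rkey pairs from drifting is to derive them from a single table. The sketch below assumes a `putRecord(collection, rkey, record)` callback; that signature is hypothetical and stands in for whatever `pds.putRecord()` actually accepts. `layersPubTargets` and `publishLayersPub` are illustrative names, not existing functions in `publish.ts`.

```typescript
// Suffixes and collections copied from the record list above.
const LAYERS_PUB_TARGETS = [
  { suffix: "expression", collection: "pub.layers.expression.expression" },
  { suffix: "segmentation", collection: "pub.layers.segmentation.segmentation" },
  { suffix: "sentences", collection: "pub.layers.annotation.annotationLayer" },
  { suffix: "paragraphs", collection: "pub.layers.annotation.annotationLayer" },
  { suffix: "entities", collection: "pub.layers.annotation.annotationLayer" },
  { suffix: "topics", collection: "pub.layers.annotation.annotationLayer" },
] as const;

type Suffix = (typeof LAYERS_PUB_TARGETS)[number]["suffix"];

interface RecordTarget {
  collection: string;
  rkey: string;
  suffix: Suffix;
}

// Build the six (collection, rkey) pairs for one talk.
function layersPubTargets(talkRkey: string): RecordTarget[] {
  return LAYERS_PUB_TARGETS.map(({ suffix, collection }) => ({
    collection,
    rkey: `${talkRkey}-${suffix}`,
    suffix,
  }));
}

// ASSUMPTION: this callback shape is hypothetical; adapt it to the real
// pds.putRecord() signature.
type PutRecord = (collection: string, rkey: string, record: unknown) => Promise<unknown>;

// Publish all six records for a talk in parallel and log progress.
async function publishLayersPub(
  putRecord: PutRecord,
  talkRkey: string,
  records: Record<Suffix, unknown>,
): Promise<number> {
  const targets = layersPubTargets(talkRkey);
  await Promise.all(targets.map((t) => putRecord(t.collection, t.rkey, records[t.suffix])));
  console.log(`Published ${targets.length} layers.pub records for ${talkRkey}`);
  return targets.length;
}
```

Passing `putRecord` in as a parameter also lets the Stage 6 integration test check the output shape with a stub instead of a live PDS.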
404 - 405 - - [ ] **Step 4: Run test to verify it passes** 406 - 407 - ```bash 408 - cd apps/ionosphere-appview 409 - npx vitest run src/__tests__/layers-pub.test.ts 410 - ``` 411 - 412 - - [ ] **Step 5: Commit** 413 - 414 - ```bash 415 - git add apps/ionosphere-appview/src/publish.ts apps/ionosphere-appview/src/__tests__/layers-pub.test.ts 416 - git commit -m "feat: publish layers.pub records in Stage 6 of publish pipeline" 417 - ``` 418 - 419 - --- 420 - 421 - ## Chunk 4: Lens 3 (Reverse) + Appview Indexer 422 - 423 - ### Task 5: Define Lens 3 — layers.pub → ionosphere document facets 424 - 425 - This is the reverse lens: given layers.pub records, produce the materialized RelationalText document with ionosphere facets. Used by the appview indexer. 426 - 427 - **Files:** 428 - - Create: `formats/tv.ionosphere/lenses/layers-to-document.lens.json` 429 - - Modify: `formats/tv.ionosphere/ts/layers-pub.ts` — add `layersPubToDocument()` 430 - - Modify: `apps/ionosphere-appview/src/__tests__/layers-pub.test.ts` — add round-trip test 431 - 432 - - [ ] **Step 1: Write the failing round-trip test** 433 - 434 - This is the critical correctness test. Feed a real transcript + NLP annotations through Lens 1+2, then feed the output through Lens 3, and compare the result with what `decodeToDocumentWithStructure()` produces for the same input. 
435 - 436 - ```typescript 437 - import { decodeToDocumentWithStructure, encode } from '../../../formats/tv.ionosphere/ts/transcript-encoding.js'; 438 - import { transcriptToLayersPub, nlpToAnnotationLayers, layersPubToDocument } from '../../../formats/tv.ionosphere/ts/layers-pub.js'; 439 - 440 - describe('Lens 3: round-trip correctness', () => { 441 - it('layers.pub → document matches decodeToDocumentWithStructure output', async () => { 442 - // Load real fixture data 443 - // Transcripts: apps/data/transcripts/{rkey}.json (resolved from publish.ts: ../../data/transcripts) 444 - // NLP: pipeline/data/nlp/{rkey}.json 445 - const transcriptData = /* read from apps/data/transcripts/ats26-keynote.json */; 446 - const nlpData = /* read from pipeline/data/nlp/ats26-keynote.json */; 447 - 448 - // Path A: existing direct path 449 - const compact = encode(transcriptData); 450 - const directDoc = decodeToDocumentWithStructure(compact, nlpData); 451 - 452 - // Path B: through lenses 453 - const transcriptRecord = { text: compact.text, startMs: compact.startMs, timings: compact.timings, talkUri: 'at://test/tv.ionosphere.talk/test' }; 454 - const { expression, segmentation } = await transcriptToLayersPub(transcriptRecord, 'did:plc:test', 'test'); 455 - const annotationLayers = await nlpToAnnotationLayers(nlpData, 'did:plc:test', 'test', 'at://...'); 456 - const lensDoc = await layersPubToDocument(expression, segmentation, annotationLayers); 457 - 458 - // Compare 459 - expect(lensDoc.text).toBe(directDoc.text); 460 - expect(lensDoc.facets.length).toBe(directDoc.facets.length); 461 - // Facets may be in different order — sort by byteStart then compare 462 - }); 463 - }); 464 - ``` 465 - 466 - - [ ] **Step 2: Run test to verify it fails** 467 - 468 - ```bash 469 - cd apps/ionosphere-appview 470 - npx vitest run src/__tests__/layers-pub.test.ts 471 - ``` 472 - 473 - - [ ] **Step 3: Implement layersPubToDocument** 474 - 475 - Add `layersPubToDocument()` to 
`formats/tv.ionosphere/ts/layers-pub.ts`. 476 - 477 - Inputs: expression record, segmentation record, annotation layers (object with sentences/paragraphs/entities/topics). 478 - 479 - Output: `{ text: string, facets: DocumentFacet[] }` — same shape as `decodeToDocumentWithStructure()`. 480 - 481 - Transform: 482 - 1. `text` comes from the expression record 483 - 2. Timestamp facets: iterate segmentation tokens, for each token create a facet with `$type: "tv.ionosphere.facet#timestamp"`, `startTime` and `endTime` in nanoseconds (ms × 1_000_000), `byteStart`/`byteEnd` from token's textSpan 484 - 3. Sentence facets: from sentences annotation layer, each annotation → facet with `$type: "tv.ionosphere.facet#sentence"` 485 - 4. Paragraph facets: similar, `$type: "tv.ionosphere.facet#paragraph"` 486 - 5. Entity facets: route by features — if has `conceptUri` → `#concept-ref`, else → `#entity` 487 - 6. Topic facets: `$type: "tv.ionosphere.facet#topic-break"`, zero-width span 488 - 489 - - [ ] **Step 4: Run test to verify it passes** 490 - 491 - ```bash 492 - cd apps/ionosphere-appview 493 - npx vitest run src/__tests__/layers-pub.test.ts 494 - ``` 495 - 496 - - [ ] **Step 5: Commit** 497 - 498 - ```bash 499 - git add formats/tv.ionosphere/ts/layers-pub.ts formats/tv.ionosphere/lenses/layers-to-document.lens.json apps/ionosphere-appview/src/__tests__/layers-pub.test.ts 500 - git commit -m "feat: Lens 3 — layers.pub records to ionosphere document facets (round-trip verified)" 501 - ``` 502 - 503 - ### Task 6: Add layers.pub DB tables 504 - 505 - **Files:** 506 - - Modify: `apps/ionosphere-appview/src/db.ts` 507 - 508 - - [ ] **Step 1: Add 3 new tables to migrate()** 509 - 510 - Add after the existing `_cursor` table creation in `apps/ionosphere-appview/src/db.ts`: 511 - 512 - ```sql 513 - CREATE TABLE IF NOT EXISTS layers_expressions ( 514 - uri TEXT PRIMARY KEY, 515 - rkey TEXT NOT NULL, 516 - did TEXT NOT NULL, 517 - transcript_uri TEXT NOT NULL, 518 - text TEXT NOT NULL, 
519 - language TEXT NOT NULL DEFAULT 'en', 520 - created_at TEXT DEFAULT CURRENT_TIMESTAMP 521 - ); 522 - 523 - -- uri IS the expression URI for this table; other tables reference it via expression_uri 524 - CREATE INDEX IF NOT EXISTS idx_layers_expr_transcript ON layers_expressions(transcript_uri); 525 - 526 - CREATE TABLE IF NOT EXISTS layers_segmentations ( 527 - uri TEXT PRIMARY KEY, 528 - rkey TEXT NOT NULL, 529 - did TEXT NOT NULL, 530 - expression_uri TEXT NOT NULL, 531 - tokens_json TEXT NOT NULL, 532 - created_at TEXT DEFAULT CURRENT_TIMESTAMP 533 - ); 534 - 535 - CREATE INDEX IF NOT EXISTS idx_layers_seg_expression ON layers_segmentations(expression_uri); 536 - 537 - CREATE TABLE IF NOT EXISTS layers_annotations ( 538 - uri TEXT PRIMARY KEY, 539 - rkey TEXT NOT NULL, 540 - did TEXT NOT NULL, 541 - expression_uri TEXT NOT NULL, 542 - kind TEXT NOT NULL, 543 - subkind TEXT NOT NULL, 544 - annotations_json TEXT NOT NULL, 545 - created_at TEXT DEFAULT CURRENT_TIMESTAMP 546 - ); 547 - 548 - CREATE INDEX IF NOT EXISTS idx_layers_ann_expression ON layers_annotations(expression_uri); 549 - ``` 550 - 551 - - [ ] **Step 2: Verify DB migration runs clean** 552 - 553 - ```bash 554 - cd apps/ionosphere-appview 555 - npx tsx -e " 556 - import { openDb, migrate } from './src/db.js'; 557 - const db = openDb(); 558 - migrate(db); 559 - const tables = db.prepare(\"SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'layers_%'\").all(); 560 - console.log('New tables:', tables.map(t => t.name)); 561 - " 562 - ``` 563 - 564 - Expected: `New tables: ['layers_expressions', 'layers_segmentations', 'layers_annotations']` 565 - 566 - - [ ] **Step 3: Commit** 567 - 568 - ```bash 569 - git add apps/ionosphere-appview/src/db.ts 570 - git commit -m "feat: add layers.pub DB tables (expressions, segmentations, annotations)" 571 - ``` 572 - 573 - ### Task 7: Wire layers.pub indexer into appview 574 - 575 - **Files:** 576 - - Create: 
`apps/ionosphere-appview/src/layers-indexer.ts` 577 - - Modify: `apps/ionosphere-appview/src/indexer.ts` 578 - 579 - - [ ] **Step 1: Create the layers-indexer module** 580 - 581 - Create `apps/ionosphere-appview/src/layers-indexer.ts` with these functions: 582 - 583 - ```typescript 584 - export function indexExpression(db, did, rkey, uri, record): void 585 - export function indexSegmentation(db, did, rkey, uri, record): void 586 - export function indexAnnotationLayer(db, did, rkey, uri, record): void 587 - export function deleteExpression(db, uri): void 588 - export function deleteSegmentation(db, uri): void 589 - export function deleteAnnotationLayer(db, uri): void 590 - export async function rebuildDocument(db, expressionUri): Promise<void> 591 - ``` 592 - 593 - **indexExpression:** INSERT OR REPLACE into `layers_expressions`. The record's `uri` IS the expression URI (used by other tables' `expression_uri` FK). Extract `sourceRef` as `transcript_uri`. 594 - 595 - **indexSegmentation:** INSERT OR REPLACE into `layers_segmentations`. Store tokenizations as JSON. 596 - 597 - **indexAnnotationLayer:** INSERT OR REPLACE into `layers_annotations`. Store annotations array as JSON. 598 - 599 - **deleteExpression:** DELETE from `layers_expressions` WHERE uri. CASCADE: also delete from `layers_segmentations` and `layers_annotations` WHERE expression_uri matches. Clear the talk's document field in the talks table. 600 - 601 - **deleteSegmentation/deleteAnnotationLayer:** DELETE the specific row, then call `rebuildDocument`. 602 - 603 - **rebuildDocument:** 604 - 1. Look up expression by `uri` (= expression_uri) → get transcript_uri 605 - 2. Look up segmentation by expression_uri 606 - 3. Look up all annotation layers by expression_uri 607 - 4. If expression + segmentation exist, call `layersPubToDocument()` (Lens 3) 608 - 5. Find the talk_uri from the transcript table using transcript_uri 609 - 6. 
UPDATE the talk's `document` field with `JSON.stringify(document)` 610 - 611 - - [ ] **Step 2: Wire into indexer.ts** 612 - 613 - Add 3 new collections to `IONOSPHERE_COLLECTIONS`: 614 - 615 - ```typescript 616 - "pub.layers.expression.expression", 617 - "pub.layers.segmentation.segmentation", 618 - "pub.layers.annotation.annotationLayer", 619 - ``` 620 - 621 - Add DID filter in `processEvent()` — only process layers.pub records from the bot DID: 622 - 623 - ```typescript 624 - if (collection.startsWith("pub.layers.") && event.did !== BOT_DID) return; 625 - ``` 626 - 627 - The `BOT_DID` is already resolved in `appview.ts` — pass it to the indexer or make it available as a module-level constant. 628 - 629 - Add delete and create/update cases for the 3 new collections in the switch statements, calling the functions from `layers-indexer.ts`. 630 - 631 - After each create/update of a layers.pub record, call `rebuildDocument()` with the expression URI. 632 - 633 - - [ ] **Step 3: Test indexer locally** 634 - 635 - ```bash 636 - cd apps/ionosphere-appview 637 - # Start local environment 638 - docker compose up -d 639 - PORT=9401 npx tsx src/appview.ts & 640 - # Publish records 641 - PDS_URL=http://localhost:2690 BOT_HANDLE=ionosphere.test BOT_PASSWORD=ionosphere-dev-password npx tsx src/publish.ts 642 - # Verify layers.pub records were indexed (URL quoted so the shell does not glob on "?") 643 - curl -s "http://localhost:9401/xrpc/tv.ionosphere.getTalk?rkey=ats26-keynote" | python3 -c "import sys,json; d=json.load(sys.stdin); print('Has document:', bool(d.get('document'))); print('Facet count:', len(d['document']['facets']) if d.get('document') else 0)" 644 - ``` 645 - 646 - - [ ] **Step 4: Commit** 647 - 648 - ```bash 649 - git add apps/ionosphere-appview/src/layers-indexer.ts apps/ionosphere-appview/src/indexer.ts 650 - git commit -m "feat: index layers.pub records and rebuild materialized documents via Lens 3" 651 - ``` 652 - 653 - --- 654 - 655 - ## Chunk 5: Schema Versioning + Final Verification 656 - 657 - ### 
Task 8: Initialize panproto VCS 658 - 659 - **Files:** 660 - - Project root — panproto VCS state 661 - 662 - - [ ] **Step 1: Initialize and commit schemas** 663 - 664 - ```bash 665 - # From project root 666 - schema init 667 - schema add lexicons/pub/layers/ 668 - schema add formats/tv.ionosphere/ionosphere.lexicon.json 669 - schema add formats/tv.ionosphere/nlpAnnotations.lexicon.json 670 - schema add formats/tv.ionosphere/lenses/ 671 - schema commit -m "Initial schema commit: layers.pub v0.5.0, ionosphere facets, NLP annotations, 7 lenses" 672 - schema tag v0.5.0 673 - ``` 674 - 675 - - [ ] **Step 2: Verify VCS state** 676 - 677 - ```bash 678 - schema log 679 - schema status 680 - ``` 681 - 682 - Expected: clean state, one commit, tagged v0.5.0. 683 - 684 - - [ ] **Step 3: Commit VCS state to git** 685 - 686 - ```bash 687 - git add .panproto/ # or wherever schema VCS stores its state 688 - git commit -m "feat: initialize panproto VCS, tag layers.pub v0.5.0" 689 - ``` 690 - 691 - ### Task 9: End-to-end verification 692 - 693 - - [ ] **Step 1: Run full test suite** 694 - 695 - ```bash 696 - cd apps/ionosphere-appview 697 - npx vitest run 698 - ``` 699 - 700 - Expected: All tests pass. 701 - 702 - - [ ] **Step 2: Run publish against local PDS and verify round-trip** 703 - 704 - ```bash 705 - cd apps/ionosphere-appview 706 - docker compose up -d 707 - PORT=9401 npx tsx src/appview.ts & 708 - PDS_URL=http://localhost:2690 BOT_HANDLE=ionosphere.test BOT_PASSWORD=ionosphere-dev-password npx tsx src/publish.ts 709 - ``` 710 - 711 - Verify: 712 - 1. layers.pub records appear in PDS (check via `com.atproto.repo.listRecords`) 713 - 2. Appview indexes them and rebuilds documents 714 - 3. Documents served via API match previous output 715 - 4. 
Frontend renders correctly at http://127.0.0.1:9402/talks 716 - 717 - - [ ] **Step 3: Commit any fixes** 718 - 719 - ### Task 10: Deploy 720 - 721 - - [ ] **Step 1: Deploy appview** 722 - 723 - ```bash 724 - flyctl deploy --config fly.appview.toml --remote-only 725 - ``` 726 - 727 - - [ ] **Step 2: Publish to production PDS** 728 - 729 - ```bash 730 - cd apps/ionosphere-appview 731 - PDS_URL=https://jellybaby.us-east.host.bsky.network \ 732 - BOT_HANDLE=ionosphere.tv \ 733 - BOT_PASSWORD=<app-password> \ 734 - npx tsx src/publish.ts 735 - ``` 736 - 737 - - [ ] **Step 3: Invalidate caches** 738 - 739 - ```bash 740 - curl -X POST https://api.ionosphere.tv/xrpc/tv.ionosphere.invalidate 741 - ``` 742 - 743 - - [ ] **Step 4: Deploy frontend** 744 - 745 - ```bash 746 - flyctl deploy --config fly.web.toml --remote-only 747 - ``` 748 - 749 - - [ ] **Step 5: Verify production** 750 - 751 - Check https://ionosphere.tv/talks — documents should render with all enrichment annotations intact.
-173
docs/superpowers/specs/2026-04-12-enrichment-phases-2-3-design.md
··· 1 - # Enrichment Phases 2-3: NER + Entity Linking, Topic Segmentation 2 - 3 - **Date:** 2026-04-12 4 - **Status:** Approved 5 - **Depends on:** Phase 1 transcript formatting (complete) 6 - 7 - ## Goal 8 - 9 - Add named entity recognition with AT Protocol record linking (Phase 2) and topic segmentation with visual dividers (Phase 3) to the existing NLP enrichment pipeline. Achieve feature parity with the old concept system while adding speaker attribution and topic navigation, before deploying. 10 - 11 - ## Constraints 12 - 13 - - **Text is immutable.** Same as Phase 1 — annotations only, no word changes. 14 - - **Build-time processing.** New passes extend the existing Python pipeline. 15 - - **Leverage existing data.** Speaker records, diarization records, and concept records are already in the database. Use them for entity resolution. 16 - - **layers.pub annotation model.** Each pass produces a separate annotation layer, consistent with Phase 1's approach. 17 - 18 - ## Pipeline Passes 19 - 20 - ### Pass 3: Named Entity Recognition + Entity Linking 21 - 22 - **Input:** transcript text + speaker records (from SQLite) + diarization records (from SQLite) + concept records (from SQLite) 23 - 24 - **Steps:** 25 - 26 - 1. **Build speaker lookup.** Query `speakers` table, build a map of `{name, aliases, handle, did}` for all speakers. Include normalized variants (lowercase, first-name-only for disambiguation with diarization context). 27 - 28 - 2. **Load diarization.** Query `stream_diarizations` table for the talk's stream. Map diarization time ranges to speaker identities. Each segment tells us who is speaking when — this provides context for resolving ambiguous names. 29 - 30 - 3. **Run spaCy NER.** The existing `en_core_web_sm` model (already loaded for sentence detection) provides NER via `doc.ents`. Extract entities with types: PERSON, ORG, PRODUCT, WORK_OF_ART, GPE, EVENT. Compute byte ranges using the same char→byte conversion as sentence detection. 
31 - 32 - 4. **Resolve entities:** 33 - - **PERSON entities:** Match against speaker lookup by name similarity. Use diarization context for disambiguation — if a first name is mentioned while a known speaker with that first name is presenting or was just speaking, prefer that match. Resolved entities get a `speakerDid` linking to the Bluesky profile. 34 - - **ORG/PRODUCT entities:** Match against concept records by name and aliases. Resolved entities get a `conceptUri`. 35 - - **Unresolved entities:** Keep as labeled spans with NER type but no link target. Available for manual curation in Phase 4. 36 - 37 - 5. **Emit speaker attribution.** For each diarization segment, emit a speaker-segment annotation spanning the corresponding byte range in the transcript. Cross-reference diarization time ranges with word timestamps to find byte boundaries. 38 - 39 - **Output:** NLP JSON with `entities` array and `speakerSegments` array. 40 - 41 - ### Pass 4: Topic Segmentation 42 - 43 - **Input:** transcript text + sentence boundaries (from Pass 1) 44 - 45 - **Steps:** 46 - 47 - 1. **Embed sentences.** Run each sentence through `all-MiniLM-L6-v2` (384-dim sentence embeddings). The model is ~80MB, downloaded on first run. Embedding 300 sentences takes ~2 seconds on CPU. 48 - 49 - 2. **Compute similarity.** For each pair of adjacent sentence windows (window size N, default 3 sentences), compute cosine similarity between the mean embedding of the left window and the right window. 50 - 51 - 3. **Detect boundaries.** Similarity drops below a threshold (tunable, default 0.3) indicate topic shifts. Apply a minimum segment length (default 5 sentences) to avoid over-segmentation. 52 - 53 - 4. **Snap to structure.** Topic breaks are snapped to the nearest paragraph boundary where possible (since paragraphs already represent pause-based thought transitions). If no paragraph boundary is within 2 sentences of the detected break, snap to the nearest sentence boundary. 
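The windowed-similarity logic in steps 2 to 4 can be illustrated as below. The real pass is planned for the Python pipeline with `sentence-transformers`; this TypeScript sketch only shows the arithmetic on embeddings that are passed in already computed. The function names are hypothetical, and two details are assumptions the spec does not pin down: gaps closer than one window to either end of the transcript are skipped, and minimum segment length is measured from the previous accepted break.

```typescript
type Vec = number[];

// Mean of a non-empty list of equal-length vectors.
function mean(vectors: Vec[]): Vec {
  const out = new Array<number>(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let d = 0; d < v.length; d++) out[d] += v[d] / vectors.length;
  }
  return out;
}

// Cosine similarity; returns 0 when either vector has zero norm.
function cosine(a: Vec, b: Vec): number {
  let dot = 0, normA = 0, normB = 0;
  for (let d = 0; d < a.length; d++) {
    dot += a[d] * b[d];
    normA += a[d] * a[d];
    normB += b[d] * b[d];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface TopicOptions {
  windowSize?: number;          // sentences per window (spec default 3)
  similarityThreshold?: number; // break below this (spec default 0.3)
  minSegmentSentences?: number; // minimum segment length (spec default 5)
}

// Returns sentence indices at which a new topic starts.
function detectTopicBreaks(embeddings: Vec[], opts: TopicOptions = {}): number[] {
  const { windowSize = 3, similarityThreshold = 0.3, minSegmentSentences = 5 } = opts;
  const breaks: number[] = [];
  let segmentStart = 0;
  // ASSUMPTION: only gaps with a full window on both sides are scored.
  for (let i = windowSize; i + windowSize <= embeddings.length; i++) {
    const left = mean(embeddings.slice(i - windowSize, i));
    const right = mean(embeddings.slice(i, i + windowSize));
    const longEnough = i - segmentStart >= minSegmentSentences;
    if (longEnough && cosine(left, right) < similarityThreshold) {
      breaks.push(i);
      segmentStart = i;
    }
  }
  return breaks;
}

// Snap a break to the nearest paragraph start within maxDistance sentences;
// otherwise keep the sentence boundary that was detected.
function snapToParagraph(breakIdx: number, paragraphStarts: number[], maxDistance = 2): number {
  let best = breakIdx;
  let bestDist = maxDistance + 1;
  for (const p of paragraphStarts) {
    const dist = Math.abs(p - breakIdx);
    if (dist < bestDist) {
      best = p;
      bestDist = dist;
    }
  }
  return best;
}
```

Both functions work in sentence indices; converting a snapped index to the `byteStart` emitted in `topicBreaks` would be a lookup into the Pass 1 sentence boundaries.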
54 - 55 - **Output:** NLP JSON with `topicBreaks` array (byte positions of topic boundaries). 56 - 57 - **Parameters stored in metadata:** `embeddingModel`, `windowSize`, `similarityThreshold`, `minSegmentSentences`. 58 - 59 - ## Facet Schema 60 - 61 - **Existing facets now populated:** 62 - 63 - | Facet type | Class | Use | 64 - |---|---|---| 65 - | `tv.ionosphere.facet#speaker-segment` | `block` | Wraps diarization segment — attributes text to speaker | 66 - | `tv.ionosphere.facet#speaker-ref` | `inline` | Links person mention to speaker DID/profile | 67 - | `tv.ionosphere.facet#concept-ref` | `inline` | Links ORG/PRODUCT mention to concept record | 68 - 69 - **New facets to add to format lexicon:** 70 - 71 - | Facet type | Class | Use | 72 - |---|---|---| 73 - | `tv.ionosphere.facet#topic-break` | `block` | Topic boundary — renderer inserts divider | 74 - | `tv.ionosphere.facet#entity` | `inline` | Unresolved entity — has label + NER type, no linked record | 75 - 76 - ## Document Assembly 77 - 78 - The `NlpAnnotations` interface in `transcript-encoding.ts` extends to: 79 - 80 - ```typescript 81 - interface NlpAnnotations { 82 - sentences: Array<{ byteStart: number; byteEnd: number }>; 83 - paragraphs: Array<{ byteStart: number; byteEnd: number }>; 84 - entities: Array<{ 85 - byteStart: number; byteEnd: number; 86 - label: string; nerType: string; 87 - speakerDid?: string; conceptUri?: string; 88 - }>; 89 - speakerSegments: Array<{ 90 - byteStart: number; byteEnd: number; 91 - speakerDid: string; speakerName: string; 92 - }>; 93 - topicBreaks: Array<{ byteStart: number }>; 94 - } 95 - ``` 96 - 97 - `decodeToDocumentWithStructure` maps these to facets: 98 - - `entities` with `speakerDid` → `#speaker-ref` facets 99 - - `entities` with `conceptUri` → `#concept-ref` facets 100 - - `entities` with neither → `#entity` facets (unresolved) 101 - - `speakerSegments` → `#speaker-segment` facets 102 - - `topicBreaks` → `#topic-break` facets 103 - 104 - ## Renderer Changes 
105 - 106 - ### Entity spans 107 - 108 - `extractData` returns `entities: EntitySpan[]` with byte range, label, NER type, and optional link target. The renderer overlays these on word spans: 109 - 110 - - **`#speaker-ref`** — renders as a link styled with a subtle blue underline. Clicking navigates to the speaker page or Bluesky profile. 111 - - **`#concept-ref`** — renders as a link with amber underline (matching existing concept highlighting). Clicking navigates to the concept page. 112 - - **`#entity`** (unresolved) — renders as subtly styled text (dotted underline, slightly different color) to indicate a recognized entity without a link. 113 - 114 - These are inline facets that overlay on word spans. A word can have multiple facets (timestamp + entity). The existing `wordConcepts` pattern in `extractData` extends to handle all entity types. 115 - 116 - ### Speaker segments 117 - 118 - Not visually rendered in this phase. The data is stored in facets for future use (speaker-colored text, margin labels, etc.). Getting the attribution data right is the priority. 119 - 120 - ### Topic dividers 121 - 122 - A subtle `<hr>` between paragraphs where a topic break falls: 123 - 124 - ```html 125 - <div class="mb-4"><!-- paragraph --></div> 126 - <hr class="border-neutral-800 my-6" /> 127 - <div class="mb-4"><!-- paragraph --></div> 128 - ``` 129 - 130 - `extractData` returns `topicBreaks: Set<number>` — a set of paragraph indices where topic breaks occur. The renderer checks this set when iterating paragraphs and inserts dividers. 131 - 132 - ## Speaker Lookup Generation 133 - 134 - The Python pipeline reads speaker data directly from the SQLite database (Python's `sqlite3` is in the standard library). 
The lookup table is built at pipeline startup: 135 - 136 - ```python 137 - speakers = db.execute("SELECT name, handle, speaker_did FROM speakers").fetchall() 138 - lookup = {} 139 - for name, handle, did in speakers: 140 - lookup[name.lower()] = {"name": name, "handle": handle, "did": did} 141 - # Also index by first name for diarization-context matching 142 - first = name.split()[0].lower() 143 - if first not in lookup: 144 - lookup[first] = {"name": name, "handle": handle, "did": did} 145 - ``` 146 - 147 - This is ephemeral — rebuilt each pipeline run from the current speaker records. No separate file to maintain. 148 - 149 - ## Dependencies 150 - 151 - **New Python dependency:** `sentence-transformers>=2.0` (adds torch, transformers, tokenizers — ~2GB install). Build-time only, no runtime impact. 152 - 153 - **spaCy NER:** Zero-cost addition — `en_core_web_sm` already loaded for sentence detection. NER entities are read from `doc.ents` in the same pass. 154 - 155 - **SQLite access:** Python `sqlite3` standard library. Pipeline reads speaker, diarization, and concept records from the same database the appview uses. 
156 - 157 - ## Testing Strategy 158 - 159 - **Python pipeline (pytest):** 160 - - Unit tests for speaker lookup construction (name variants, first-name matching) 161 - - Unit tests for entity resolution (exact match, first-name match with diarization context, unresolved fallback) 162 - - Unit tests for topic segmentation (boundary detection, minimum segment length, snap-to-paragraph) 163 - - Integration test: full pipeline on a known transcript, verify entity and topic output 164 - 165 - **TypeScript (vitest):** 166 - - `decodeToDocumentWithStructure` with entity/speaker/topic annotations 167 - - `extractData` with entity facets and topic breaks 168 - - Renderer: entity links, topic dividers between paragraphs 169 - 170 - **Manual validation:** 171 - - Spot-check 5-10 talks: verify entity links point to correct profiles/concepts 172 - - Verify topic breaks land at natural transitions, not mid-thought 173 - - Check that speaker attribution aligns with diarization (correct speaker for each segment)
-299
docs/superpowers/specs/2026-04-12-transcript-formatting-design.md
··· 1 - # Transcript Formatting: NLP Enrichment Pipeline 2 - 3 - **Date:** 2026-04-12 4 - **Status:** Approved 5 - 6 - ## Problem 7 - 8 - Transcripts are currently rendered as an infinitely long, unbroken run of text. Word-level timing and concept facets exist, but there is no structural formatting — no sentences, no paragraphs, no visual hierarchy. The goal is for transcripts to read as though they were essays. 9 - 10 - ## Constraints 11 - 12 - - **Text is immutable.** The pipeline adds structural annotations only — no words are modified, added, or removed. Transcript editing is a separate future concern. 13 - - **Reliability over ambition.** Each enrichment pass must be dependable enough to run unsupervised across all transcripts. Noisy output is worse than no output. 14 - - **Build-time processing.** NLP runs once in the batch pipeline; results are published as AT Protocol records. Zero runtime cost. 15 - - **Python NLP stack.** spaCy for sentence detection and NER; sentence-transformers for topic segmentation. 16 - 17 - ## Schema Design: layers.pub Integration 18 - 19 - The enrichment pipeline uses [layers.pub](https://layers.pub) (`pub.layers.*`) lexicons — composable AT Protocol schemas for linguistic annotation. This gives us a standard, interoperable representation with built-in support for multiple annotation passes, provenance tracking, and manual overrides. 20 - 21 - **Vendoring strategy:** layers.pub is at v0.5.0 draft. We vendor the specific lexicon definitions we use into `lexicons/pub/layers/` in this repo. Panproto lenses provide forward-compatibility — when layers.pub evolves, we define migrations rather than rewriting our pipeline. This follows the project's principle of prioritizing the lens layer for forward-compat. 22 - 23 - ### Record Architecture 24 - 25 - #### 1. Source transcript (existing) 26 - 27 - `tv.ionosphere.transcript` — compact storage format with `text`, `startMs`, and `timings` array. Stays as-is. 
Source of truth for raw transcription output. 28 - 29 - #### 2. Expression record 30 - 31 - `pub.layers.expression.expression` (kind: `"transcript"`) — the transcript text published as a layers.pub expression. Links back to the ionosphere transcript via `sourceRef`. This is the anchoring point for all annotations. 32 - 33 - #### 3. Segmentation record 34 - 35 - `pub.layers.segmentation.segmentation` — word-level tokenization derived from the transcript's compact timing data. Each token carries: 36 - - `textSpan`: UTF-8 byte offsets (`byteStart`, `byteEnd`) 37 - - `temporalSpan`: timing in milliseconds (`start`, `ending`) 38 - 39 - This replaces the per-word timestamp facets with a standard representation. 40 - 41 - #### 4. Annotation layers 42 - 43 - `pub.layers.annotation.annotationLayer` — one record per enrichment pass: 44 - 45 - | Pass | `kind` | `subkind` | `sourceMethod` | 46 - |------|--------|-----------|----------------| 47 - | Sentence detection | `span` | `sentence-boundary` | `automatic` | 48 - | Paragraph segmentation | `span` | `paragraph-boundary` | `automatic` | 49 - | Topic segmentation (future) | `span` | `topic-segment` | `automatic` | 50 - | Named entity recognition (future) | `span` | `ner` | `automatic` | 51 - | Concept linking (future) | `span` | `concept` | `automatic` | 52 - | Speaker attribution (future) | `span` | `speaker` | `automatic` | 53 - | Manual corrections (future) | varies | varies | `manual-native` | 54 - 55 - Each layer includes `metadata` (agent, tool, confidence, timestamp) for provenance. Pipeline parameters (e.g., paragraph pause threshold) are stored in `metadata.features` so provenance is complete and results are reproducible. 56 - 57 - #### 5. Manual override layer (future) 58 - 59 - A separate annotation layer with `sourceMethod: "manual-native"` and higher `rank`. The merge step prefers higher-ranked layers. 
Example: correcting "Blue Sky" to link to the Bluesky concept record is an annotation in this layer that supersedes the auto-detected concept. Published as first-class AT Protocol records — auditable, attributable, and preservable across pipeline re-runs. 60 - 61 - ### Replacing `tv.ionosphere.annotation` 62 - 63 - The existing `tv.ionosphere.annotation` record type (concept mentions anchored to byte ranges) is replaced wholesale by layers.pub annotation layers. Since the entire pipeline rebuilds from raw transcripts, there is no migration burden — the next pipeline run produces layers.pub records instead of `tv.ionosphere.annotation` records, and the old records are deleted. 64 - 65 - The existing concept data is re-derived by the NLP pipeline as a concept annotation layer (Phase 2), which will produce better results than the current approach. The `tv.ionosphere.annotation` lexicon and related code (`enrich.ts`, `enrich-all.ts`, `overlayAnnotations` in the appview) are removed in Phase 1. 66 - 67 - ### Panproto Integration 68 - 69 - - **Lenses:** Transform between compact transcript format (`tv.ionosphere.transcript`) and layers.pub expression + segmentation format. Lens definitions live in `formats/tv.ionosphere/lenses/`. 70 - - **Schema validation:** Validate all layers.pub records before publishing to PDS. Runs in the TypeScript publish step (after the Python NLP pipeline outputs JSON). 71 - - **Migration support:** As layers.pub evolves from v0.5.0, panproto migrations keep ionosphere records compatible. Vendored lexicons in `lexicons/pub/layers/` are the pinned source of truth. 72 - - **Pipeline boundary:** The Python NLP pipeline outputs annotation layer JSON files. The TypeScript publish step validates them against panproto-parsed lexicons and publishes to PDS. This reuses the existing panproto TypeScript integration. 
73 - 74 - ## Pipeline Architecture 75 - 76 - ``` 77 - transcript record (text + timings) 78 - | 79 - v 80 - +-------------------+ 81 - | Pass 1: Sentences | <-- spaCy sentence boundary detection 82 - +---------+---------+ 83 - | 84 - v 85 - +--------------------+ 86 - | Pass 2: Paragraphs | <-- pause data + sentence boundaries 87 - +---------+----------+ 88 - | 89 - v 90 - +--------------------+ 91 - | Pass N: (future) | <-- topics, entities, speaker linking 92 - +---------+----------+ 93 - | 94 - v 95 - +--------------------+ 96 - | Override layer | <-- manual corrections (higher rank) 97 - +---------+----------+ 98 - | 99 - v 100 - +--------------------+ 101 - | Merge & publish | <-- assemble RelationalText document 102 - +--------------------+ 103 - ``` 104 - 105 - Properties: 106 - - **Each pass is a standalone Python module** with a consistent interface: takes transcript text + timings + prior layer output, returns a new annotation layer. 107 - - **Passes are additive** — they never modify text, only emit new annotations. 108 - - **Override layer applies last** — manual corrections supersede auto-generated annotations at matching byte ranges via `rank`. 109 - - **Idempotent** — re-running the pipeline produces the same output; manual overrides are preserved because they are separate records. 110 - 111 - ### Pass 1: Sentence Boundary Detection 112 - 113 - **Tool:** spaCy with `en_core_web_sm`. The small model is nearly as accurate as the transformer model for sentence boundary detection (its most battle-tested feature), and runs without GPU on a standard dev machine. If accuracy proves insufficient on speech transcripts, upgrade to `en_core_web_trf` in a later pass. 114 - 115 - spaCy's sentence segmenter uses dependency parsing, which is significantly more robust than punctuation-splitting for speech transcripts where Whisper's punctuation can be unreliable. 116 - 117 - **Output:** An annotation layer with one annotation per sentence, anchored by byte span. 
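Since annotations are anchored by UTF-8 byte span, any offsets obtained as string indices have to be converted before they are written to a layer. A minimal sketch of that conversion follows. `ByteSpan` and `toByteSpan` are illustrative names, not part of the existing codebase, and the code assumes the offsets are JavaScript string indices (UTF-16 code units) that do not split a surrogate pair. Offsets from a Python tool such as spaCy count code points instead, so the equivalent conversion would be needed on the Python side.

```typescript
// Hypothetical helper: converts a [charStart, charEnd) range of JS string
// indices into the UTF-8 byte offsets used by annotation anchors.
interface ByteSpan {
  byteStart: number;
  byteEnd: number;
}

const utf8Encoder = new TextEncoder();

function toByteSpan(text: string, charStart: number, charEnd: number): ByteSpan {
  // Bytes before the span give the start offset.
  const byteStart = utf8Encoder.encode(text.slice(0, charStart)).length;
  // Bytes inside the span give its length.
  const byteLength = utf8Encoder.encode(text.slice(charStart, charEnd)).length;
  return { byteStart, byteEnd: byteStart + byteLength };
}
```

For pure-ASCII text the two offset systems coincide, which is why a mismatch tends to surface only on transcripts containing accented names or typographic quotes.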
118 - 119 - **Reliability:** Very high (95%+ accuracy on messy speech text). 120 - 121 - ### Pass 2: Paragraph Segmentation 122 - 123 - **Tool:** Custom algorithm combining two signals. 124 - 125 - **Signal 1 — Pause duration:** The transcript's timing data encodes silence gaps as negative values. Pauses above a tunable threshold are paragraph boundary candidates. Default threshold: **2.0 seconds** (a conservative starting point — most speech pauses are under 1s; pauses over 2s reliably indicate topic transitions). 126 - 127 - **Signal 2 — Sentence alignment:** Paragraph breaks only occur at sentence boundaries (from Pass 1). A long pause mid-sentence is a speaker thinking, not a paragraph break. 128 - 129 - **Algorithm:** 130 - ``` 131 - for each silence gap > pause_threshold_ms (default: 2000): 132 - find the nearest sentence boundary (from Pass 1) 133 - if the sentence boundary is within proximity_words (default: 5) of the pause: 134 - emit paragraph break at that sentence boundary 135 - ``` 136 - 137 - The proximity constraint of 5 words allows for the common case where a speaker finishes a thought (pause), says a brief connective phrase ("so", "and then"), and starts the next topic — the paragraph break lands at the sentence boundary closest to the actual pause. 138 - 139 - Both `pause_threshold_ms` and `proximity_words` are stored in the annotation layer's `metadata.features` for reproducibility. 140 - 141 - **Reliability:** High. Pause duration is a genuine speech signal, and constraining to sentence boundaries eliminates false positives. 
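The paragraph algorithm above can be sketched as follows. The pass itself is specified as a Python module; TypeScript is used here only to match the other code in this document. The input shapes (`TimedWord`, a sorted list of sentence-start word indices) and the function name are assumptions made for illustration, and pauses are measured as the gap between one word's end and the next word's start rather than decoded from the compact `timings` array.

```typescript
// Illustrative input shape: one entry per word, in transcript order.
interface TimedWord {
  startMs: number;
  endMs: number;
}

interface ParagraphParams {
  pauseThresholdMs: number; // pause_threshold_ms in the spec
  proximityWords: number;   // proximity_words in the spec
}

// Returns the sorted word indices at which a paragraph break is emitted.
// sentenceStarts holds the word index at which each sentence begins.
function findParagraphBreaks(
  words: TimedWord[],
  sentenceStarts: number[],
  params: ParagraphParams = { pauseThresholdMs: 2000, proximityWords: 5 },
): number[] {
  const breaks: number[] = [];
  for (let i = 1; i < words.length; i++) {
    const prev = words[i - 1];
    const cur = words[i];
    if (!prev || !cur) continue;
    // The pause sits immediately before word i.
    const gapMs = cur.startMs - prev.endMs;
    if (gapMs <= params.pauseThresholdMs) continue;

    // Find the sentence boundary nearest to the pause.
    let nearest: number | undefined;
    for (const start of sentenceStarts) {
      if (start === 0) continue; // the transcript start is not a break
      if (nearest === undefined || Math.abs(start - i) < Math.abs(nearest - i)) {
        nearest = start;
      }
    }
    // Only emit when the boundary is within proximityWords of the pause.
    if (
      nearest !== undefined &&
      Math.abs(nearest - i) <= params.proximityWords &&
      breaks.indexOf(nearest) === -1
    ) {
      breaks.push(nearest);
    }
  }
  return breaks.sort((a, b) => a - b);
}
```

Passing the parameters as one object keeps them easy to copy into the layer's `metadata.features`, so a published layer records the thresholds that produced it.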
142 - 143 - ## Rendering 144 - 145 - ### Format Lexicon Updates 146 - 147 - Two new facet types added to `tv.ionosphere.facet`: 148 - 149 - | Facet type | `featureClass` | Description | 150 - |---|---|---| 151 - | `tv.ionosphere.facet#sentence` | `inline` | Wraps all words in a sentence as a contiguous inline span | 152 - | `tv.ionosphere.facet#paragraph` | `block` | Groups sentences into a block-level paragraph container | 153 - 154 - Note: the annotation _storage_ format is layers.pub annotation layers (on the PDS). The _rendering_ format is ionosphere facets in the RelationalText document. The document assembly step bridges these — it reads layers.pub annotations and emits ionosphere facets. This separation means the renderer does not need to know about layers.pub. 155 - 156 - ### DOM Structure 157 - 158 - The renderer groups words into sentence spans and sentences into paragraph blocks: 159 - 160 - ```html 161 - <div> <!-- paragraph (block) --> 162 - <span> <!-- sentence (inline) --> 163 - <span>word</span> <span>word</span> <span>word</span> 164 - </span> 165 - <span> <!-- sentence (inline) --> 166 - <span>word</span> <span>word</span> 167 - </span> 168 - </div> 169 - <div> <!-- paragraph (block) --> 170 - <span> <!-- sentence (inline) --> 171 - <span>word</span> <span>word</span> 172 - </span> 173 - </div> 174 - ``` 175 - 176 - This mirrors the layers.pub expression hierarchy (transcript > paragraph > sentence) and maps directly to the format lexicon's `featureClass` system (`block` for paragraphs, `inline` for sentences). 177 - 178 - Sentence spans provide styling hooks for hover, selection, and transitions at sentence granularity. Paragraph blocks provide natural vertical whitespace. 179 - 180 - ### Data Model Changes: `extractData` → Hierarchical Structure 181 - 182 - The current `extractData` function in `src/lib/transcript.ts` returns a flat `{ words: WordSpan[], concepts, wordConcepts }`. 
This must change to return a hierarchical structure: 183 - 184 - ```typescript 185 - interface ParagraphSpan { 186 - byteStart: number; 187 - byteEnd: number; 188 - sentences: SentenceSpan[]; 189 - } 190 - 191 - interface SentenceSpan { 192 - byteStart: number; 193 - byteEnd: number; 194 - words: WordSpan[]; // existing WordSpan type, unchanged 195 - } 196 - 197 - interface TranscriptStructure { 198 - paragraphs: ParagraphSpan[]; 199 - concepts: ConceptSpan[]; 200 - // wordConcepts lookup remains flat (indexed by global word index) 201 - wordConcepts: ConceptSpan[][]; 202 - } 203 - ``` 204 - 205 - `extractData` builds this hierarchy by: 206 - 1. Extracting all word spans from `#timestamp` facets (existing logic, unchanged). 207 - 2. Reading `#paragraph` facets to get paragraph byte ranges. Sorting by `byteStart`. 208 - 3. Reading `#sentence` facets to get sentence byte ranges. Sorting by `byteStart`. 209 - 4. Assigning each word to its containing sentence (by byte range overlap). 210 - 5. Assigning each sentence to its containing paragraph (by byte range overlap). 211 - 6. Words not covered by any sentence facet form singleton sentences. Sentences not covered by any paragraph facet form singleton paragraphs. This graceful degradation means the renderer works identically on transcripts that have not yet been enriched. 212 - 213 - The brightness gradient system (`boundaryStartTime`/`boundaryEndTime`) continues to use the global word ordering — paragraph visual gaps do not affect the temporal continuity of the gradient. The existing `WordSpanComponent` is reused unchanged inside the sentence/paragraph wrappers. 214 - 215 - ### Document Assembly 216 - 217 - Document assembly is a **build-time step** that runs after the NLP pipeline and before publishing. It: 218 - 1. Reads the compact transcript record (`tv.ionosphere.transcript`). 219 - 2. Reads all layers.pub annotation layer records for this transcript. 220 - 3. 
Converts layers.pub sentence/paragraph annotations into `#sentence` and `#paragraph` ionosphere facets. 221 - 4. Merges with existing `#timestamp` and `#concept-ref` facets from `decodeToDocument`. 222 - 5. Writes the assembled RelationalText document onto the `tv.ionosphere.talk` record's `document` field. 223 - 224 - This replaces the current runtime assembly in the appview serve path with a pre-computed document. The appview serves the pre-assembled document directly — zero runtime cost. 225 - 226 - Annotation layers of different `subkind` values naturally have overlapping byte ranges (a paragraph span contains sentence spans, which contain word spans). This is expected and correct — they represent different levels of the hierarchy, not conflicting annotations. 227 - 228 - ### Scroll/Time Mapping 229 - 230 - Both `TranscriptView` and `WindowedTranscriptView` must account for paragraph whitespace in their scroll-to-time and time-to-scroll mappings. 231 - 232 - **TranscriptView:** The line-map computation already handles variable-height content. Paragraph `<div>` elements with margin/padding become part of the natural layout — no special handling needed beyond the existing line grouping logic. 233 - 234 - **WindowedTranscriptView:** The `computeMonospaceLayout` function currently returns `LineEntry[]` with uniform `LINE_HEIGHT`. Changes: 235 - - Accept an additional `paragraphBreaks: Set<number>` parameter (set of word indices where a paragraph starts). 236 - - When a word is a paragraph start, insert a gap of `PARAGRAPH_GAP` pixels (default: `LINE_HEIGHT`, i.e., one blank line) before its line entry. 237 - - `LineEntry` gains `isParagraphStart: boolean` for rendering the gap spacer. 238 - - Gap entries have no time range — `timeToScrollY` and `scrollYToTime` skip gaps by treating them as extensions of the preceding line's time range (scrolling through a gap seeks to the end of the previous paragraph). 
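A rough sketch of the gap handling described above, under simplifying assumptions: lines are already broken, each paragraph start coincides with the first word of a line, and no gap is inserted before the very first line. `LineInput`, `layoutWithGaps`, the `LINE_HEIGHT` value, and every field name other than `isParagraphStart` are hypothetical; the real `computeMonospaceLayout` signature is not shown in this document.

```typescript
const LINE_HEIGHT = 24;            // px; assumed value for illustration
const PARAGRAPH_GAP = LINE_HEIGHT; // one blank line, the stated default

interface LineInput {
  firstWordIndex: number; // global index of the first word on the line
  startTime: number;
  endTime: number;
}

interface LineEntry extends LineInput {
  y: number;
  isParagraphStart: boolean;
}

// Assigns a vertical offset to each line, adding a gap before paragraph starts.
function layoutWithGaps(lines: LineInput[], paragraphBreaks: Set<number>): LineEntry[] {
  const entries: LineEntry[] = [];
  let y = 0;
  lines.forEach((line, i) => {
    // Assumption: the first line never gets a leading gap.
    const isParagraphStart = i > 0 && paragraphBreaks.has(line.firstWordIndex);
    if (isParagraphStart) y += PARAGRAPH_GAP;
    entries.push({ ...line, y, isParagraphStart });
    y += LINE_HEIGHT;
  });
  return entries;
}

// Maps a scroll offset to a time. A position inside a gap resolves to the
// end of the preceding line, i.e. the end of the previous paragraph.
function scrollYToTime(entries: LineEntry[], y: number): number {
  let current: LineEntry | undefined;
  for (const entry of entries) {
    if (entry.y > y) break;
    current = entry;
  }
  if (current === undefined) {
    const first = entries[0];
    return first ? first.startTime : 0;
  }
  const offset = y - current.y;
  if (offset >= LINE_HEIGHT) return current.endTime; // inside a gap or past the end
  return current.startTime + (offset / LINE_HEIGHT) * (current.endTime - current.startTime);
}
```

Storing the gap in the `y` offset rather than as a separate entry keeps every `LineEntry` tied to a time range, which is one way to satisfy the rule that gaps carry no time range of their own.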
239 - 240 - ## Testing Strategy 241 - 242 - **Python pipeline (pytest):** 243 - - Golden-file tests: run the sentence/paragraph pipeline on 2-3 known transcripts, compare output annotation layers to curated expected output. These transcripts should cover: a clean well-punctuated talk, a messy conversational panel, and a lightning talk with rapid transitions. 244 - - Unit tests for the paragraph algorithm: verify that paragraph breaks only land at sentence boundaries, that pauses below threshold produce no breaks, and that the proximity constraint works correctly. 245 - 246 - **TypeScript rendering (vitest):** 247 - - Unit tests for the updated `extractData`: verify hierarchical output from facets, and verify graceful degradation when sentence/paragraph facets are absent (flat word array wrapped in singleton sentence/paragraph). 248 - - Snapshot tests for `computeMonospaceLayout` with paragraph gaps. 249 - 250 - **Manual validation:** 251 - - After running the pipeline on all transcripts, spot-check 5-10 talks across different rooms/talk types. Verify paragraph breaks land at natural topic transitions, not mid-thought. Measure average sentences-per-paragraph (expect 3-8 for well-structured talks). 
252 - 253 - ## Phase Roadmap 254 - 255 - ### Phase 1 — Structural formatting + layers.pub migration (this work) 256 - 257 - - Vendor layers.pub lexicon definitions into `lexicons/pub/layers/` 258 - - Python NLP pipeline: sentence detection (spaCy) + paragraph segmentation (pause + sentence alignment) 259 - - layers.pub expression + segmentation records for each transcript 260 - - Sentence and paragraph annotation layers 261 - - Panproto lenses: compact transcript <-> layers.pub expression + segmentation 262 - - Document assembly reads annotation layers, emits structural facets 263 - - Renderer: sentences as inline spans, paragraphs as block elements 264 - - Remove `tv.ionosphere.annotation` records, `enrich.ts`/`enrich-all.ts`, `overlayAnnotations` — fully replaced by layers.pub 265 - - **Goal:** Transcripts read as paragraphed prose; all enrichment flows through layers.pub 266 - 267 - ### Phase 2 — Entity recognition + record linking 268 - 269 - - spaCy NER pass in the pipeline 270 - - AT Protocol record resolver: people -> Bluesky profiles (DID resolution via handle/display name lookup), projects -> `tv.ionosphere.concept` records 271 - - Concept annotation layer replaces the old `tv.ionosphere.annotation`-based concept system with richer, NLP-derived results 272 - - Entity annotation layer with `knowledgeRefs` to resolved records 273 - - Renderer: entity spans as links/tooltips to profiles and concept pages 274 - - **Goal:** People and projects mentioned in talks are clickable, linked to real AT Protocol identities 275 - 276 - ### Phase 3 — Topic segmentation 277 - 278 - - Sentence-transformer embedding pass (e.g., `all-MiniLM-L6-v2`) 279 - - Sliding-window cosine similarity topic boundary detection 280 - - Topic segment annotation layer 281 - - Renderer: section dividers or topic labels at major transitions 282 - - UI: topic-based navigation within a talk (jump to "Q&A", "Demo", etc.) 
283 - - **Goal:** Long talks become navigable by topic 284 - 285 - ### Phase 4 — Manual curation layer 286 - 287 - - UI for creating manual override annotations (correct a concept link, fix an entity, adjust a paragraph break) 288 - - Published as AT Protocol records with `sourceMethod: "manual-native"`, higher `rank` 289 - - Pipeline respects overrides on re-run 290 - - Multi-user: anyone with write access can contribute corrections 291 - - **Goal:** Community-curated enrichment that improves over time 292 - 293 - ### Phase 5 — Concept enrichment + cross-talk linking 294 - 295 - - Supersede auto-detected concepts with curated concept records 296 - - Cross-reference talks that mention the same entities/concepts 297 - - `tv.ionosphere.facet#talk-xref` links between related talks 298 - - Knowledge graph across the entire conference 299 - - **Goal:** The archive becomes a connected knowledge base, not just isolated transcripts
-113
docs/superpowers/specs/2026-04-13-layers-pub-panproto-preplan.md
··· 1 - # layers.pub Record Publishing + Panproto Lenses — Pre-Plan 2 - 3 - **Status:** Pre-plan for next session 4 - **Depends on:** Phases 1-3 enrichment (complete), concept deduplication (complete) 5 - 6 - ## Context 7 - 8 - The NLP enrichment pipeline produces sentence/paragraph/entity/topic annotations that are currently stored as ionosphere facets in pre-assembled RelationalText documents. The layers.pub lexicons are vendored (`lexicons/pub/layers/`) but no actual AT Protocol records are published. 9 - 10 - This work makes the enrichment data first-class AT Protocol records — publishable, indexable, and interoperable with the broader layers.pub ecosystem. 11 - 12 - ## What Exists 13 - 14 - ### Vendored lexicons (in `lexicons/pub/layers/`) 15 - - `defs.json` — shared types: span, temporalSpan, uuid, tokenRef, anchor, annotationMetadata, feature/featureMap 16 - - `expression/expression.json` — record: id, kind, text, language, sourceRef, parentRef, anchor, metadata 17 - - `segmentation/segmentation.json` — record: expression ref, tokenizations (with textSpan + temporalSpan per token) 18 - - `annotation/annotationLayer.json` — record: expression ref, kind/subkind, sourceMethod, annotations array 19 - 20 - ### Existing panproto integration (`formats/tv.ionosphere/ts/panproto.ts`) 21 - - WASM-based runtime (lazy singleton init) 22 - - `loadSchema()` — parse lexicon JSON into BuiltSchema 23 - - `buildMigration()` — explicit migration between schemas 24 - - `createLens()` — auto-generated lens between schemas 25 - - `autoGenerateWithHints()` — protolens chain with morphism hints 26 - - `createPipeline()` — PipelineBuilder for combinator transforms 27 - - `serializeChain()` / `serializeMigrationSpec()` — serialization for storage 28 - 29 - ### Existing lenses (`formats/tv.ionosphere/lenses/`) 30 - - `openai-whisper-to-transcript.lens.json` 31 - - `transcript-to-document.lens.json` 32 - - `schedule-to-talk.lens.json` 33 - - `vod-to-talk.lens.json` 34 - 35 - ## What 
Needs to Be Built 36 - 37 - ### 1. Publish layers.pub records to PDS 38 - 39 - For each transcript, publish: 40 - 41 - **Expression record** (`pub.layers.expression.expression`): 42 - - `kind: "transcript"` 43 - - `text`: the transcript text 44 - - `language: "en"` 45 - - `sourceRef`: AT URI of the `tv.ionosphere.transcript` record 46 - - `createdAt`: timestamp 47 - 48 - **Segmentation record** (`pub.layers.segmentation.segmentation`): 49 - - `expression`: AT URI of the expression record above 50 - - One tokenization with `kind: "whitespace"` 51 - - Each token has `textSpan` (byteStart/byteEnd) and `temporalSpan` (start/ending in ms) 52 - - Derived from the compact transcript's timing data 53 - 54 - **Annotation layers** (`pub.layers.annotation.annotationLayer`): 55 - - **Sentence layer**: `kind: "span"`, `subkind: "sentence-boundary"`, `sourceMethod: "automatic"` 56 - - **Paragraph layer**: `kind: "span"`, `subkind: "paragraph-boundary"`, `sourceMethod: "automatic"` 57 - - **Entity layer**: `kind: "span"`, `subkind: "ner"`, `sourceMethod: "automatic"`, with `knowledgeRefs` for resolved entities 58 - - **Topic layer**: `kind: "span"`, `subkind: "topic-segment"`, `sourceMethod: "automatic"` 59 - - Each layer references the expression record and includes `metadata` (tool, confidence, timestamp) 60 - 61 - ### 2. 
Panproto lenses 62 - 63 - **Lens: compact transcript → layers.pub expression + segmentation** 64 - - Source: `tv.ionosphere.transcript` (text, startMs, timings) 65 - - Target: `pub.layers.expression.expression` + `pub.layers.segmentation.segmentation` 66 - - This is the transform that `decodeToDocument` / `encode` already implement in code — the lens formalizes it 67 - 68 - **Lens: NLP annotations → layers.pub annotation layers** 69 - - Source: NLP pipeline JSON output (sentences, paragraphs, entities, topicBreaks) 70 - - Target: `pub.layers.annotation.annotationLayer` records 71 - - Mostly structural mapping — the NLP output already has byte ranges and labels 72 - 73 - **Lens: layers.pub → ionosphere document facets** 74 - - Source: layers.pub records (expression + segmentation + annotation layers) 75 - - Target: RelationalText document with ionosphere facets (#timestamp, #sentence, #paragraph, etc.) 76 - - This is the reverse of what `decodeToDocumentWithStructure` does — reading layers.pub records and emitting facets 77 - 78 - ### 3. Appview indexer updates 79 - 80 - The appview needs to index layers.pub records from Jetstream: 81 - - Add `pub.layers.expression.expression`, `pub.layers.segmentation.segmentation`, `pub.layers.annotation.annotationLayer` to `IONOSPHERE_COLLECTIONS` 82 - - Create DB tables for these records 83 - - On indexing, rebuild the pre-assembled document from the layers.pub records (using the lens) 84 - 85 - ### 4. Schema versioning 86 - 87 - - Initialize panproto VCS for the layers.pub schemas 88 - - Pin to layers.pub v0.5.0 (current vendored version) 89 - - Define migration strategy for when layers.pub evolves 90 - 91 - ## Architecture Decision: Build-Time vs Runtime 92 - 93 - Currently, document assembly happens at **build time** (publish.ts) and the assembled document is stored on the talk record. With layers.pub records, the assembly could move to **runtime** (appview reads layers.pub records and assembles on the fly). 
94 - 95 - **Recommendation:** Keep build-time assembly for the ionosphere document (fast serving), AND publish layers.pub records alongside (for interoperability). The layers.pub records are the canonical source; the ionosphere document is a materialized view. 96 - 97 - ## Suggested Task Order 98 - 99 - 1. Write the publish step for layers.pub expression + segmentation records 100 - 2. Write the publish step for annotation layer records 101 - 3. Define panproto lens: compact transcript → expression + segmentation 102 - 4. Define panproto lens: NLP annotations → annotation layers 103 - 5. Update appview indexer to handle layers.pub records 104 - 6. Initialize panproto VCS, tag v0.5.0 105 - 7. Test round-trip: publish → index → serve → verify in browser 106 - 8. Define panproto lens: layers.pub → ionosphere document facets (reverse lens) 107 - 108 - ## Questions for the Session 109 - 110 - - Should we publish layers.pub records under the ionosphere.tv DID, or a separate account? 111 - - How should the appview discover which annotation layers belong to a given transcript? (By expression URI reference? By convention?) 112 - - Do we want to support third-party annotation layers from other DIDs? (e.g., someone else annotating our transcripts) 113 - - Should the panproto lenses be published as `org.relationaltext.lens` records (like the existing ones)?
-211
docs/superpowers/specs/2026-04-13-layers-pub-publishing-design.md
··· 1 - # layers.pub Record Publishing via Panproto Lenses — Design 2 - 3 - **Status:** Approved 4 - **Date:** 2026-04-13 5 - **Depends on:** NLP enrichment pipeline (complete), concept deduplication (complete) 6 - 7 - ## Overview 8 - 9 - Publish enrichment data as first-class AT Protocol records using the layers.pub schema. Panproto lenses are the authoritative transforms — all data flows through lenses, no parallel TypeScript pipelines. The existing ionosphere document (text + facets embedded on the talk record) becomes a materialized view rebuilt from layers.pub records. 10 - 11 - ## Decisions 12 - 13 - - **DID:** Publish under ionosphere.tv 14 - - **Annotation layer references:** Point to the transcript's AT URI (`tv.ionosphere.transcript`) 15 - - **Third-party layers:** Not supported yet (no moderation). Comments and reactions are unaffected. 16 - - **Lens namespace:** `org.relationaltext.lens` (consistent with existing 4 lenses) 17 - - **Record keys:** Deterministic, derived from talk rkey (e.g., `{talk.rkey}-expression`) 18 - - **Expression kind:** `kind: "transcript"` (no `kindUri` — it requires AT URI format, and there's no meaningful record to reference) 19 - - **Segmentation tokenization:** `kind: "word"` — carries the temporal mapping; annotations use `anchor.textSpan` (byte offsets) directly 20 - - **Publish ordering:** Expression URI pre-computed from DID + rkey; all records publish in parallel 21 - - **Architecture:** Lenses-first (Approach B) — lenses are authoritative, publish step runs data through lenses 22 - 23 - ## Section 1: Record Model & Relationships 24 - 25 - For each transcript, 6 new records published under ionosphere.tv: 26 - 27 - ``` 28 - tv.ionosphere.transcript/{talk.rkey}-transcript (already exists) 29 - 30 - ▼ sourceRef 31 - pub.layers.expression.expression/{talk.rkey}-expression 32 - 33 - ├── pub.layers.segmentation.segmentation/{talk.rkey}-segmentation 34 - │ └── tokenization: kind "word", tokens with textSpan + temporalSpan 35 - 
36 - ├── pub.layers.annotation.annotationLayer/{talk.rkey}-sentences 37 - ├── pub.layers.annotation.annotationLayer/{talk.rkey}-paragraphs 38 - ├── pub.layers.annotation.annotationLayer/{talk.rkey}-entities 39 - └── pub.layers.annotation.annotationLayer/{talk.rkey}-topics 40 - ``` 41 - 42 - Example AT URI: `at://did:plc:xxxxxx/pub.layers.expression.expression/atproto-for-everyone-expression` 43 - 44 - > **Future work:** A `-speakers` annotation layer (`subkind: "speaker-segment"`) for diarization spans. 45 - > The NLP pipeline does not yet produce `speakerSegments` data, so this layer is deferred until the 46 - > diarization pipeline is integrated. 47 - 48 - ### Expression record 49 - 50 - | Field | Value | 51 - |---|---| 52 - | `id` | talk rkey | 53 - | `$type` | `"pub.layers.expression.expression"` | 54 - | `kind` | `"transcript"` | 55 - | `text` | full transcript text | 56 - | `language` | `"en"` | 57 - | `sourceRef` | AT URI of `tv.ionosphere.transcript` record | 58 - | `metadata` | `{ tool: "ionosphere-pipeline", timestamp: "<ISO 8601 datetime>" }` | 59 - | `createdAt` | ISO 8601 timestamp | 60 - 61 - ### Segmentation record 62 - 63 - | Field | Value | 64 - |---|---| 65 - | `expression` | AT URI of expression (pre-computed) | 66 - | `tokenizations` | Single tokenization, `kind: "word"` | 67 - | `createdAt` | ISO 8601 timestamp | 68 - 69 - Each token: `tokenIndex` (0-based), `text` (word), `textSpan` (UTF-8 byte offsets), `temporalSpan` (start/ending in ms, derived from compact transcript timings). 70 - 71 - ### Annotation layers 72 - 73 - All records include `$type: "pub.layers.annotation.annotationLayer"`. All reference the expression URI. All use `sourceMethod: "automatic"`. `createdAt` is optional in the lexicon but always included. 74 - 75 - | rkey suffix | kind | subkind | annotations | 76 - |---|---|---|---| 77 - | `-sentences` | `"span"` | `"sentence-boundary"` | One annotation per sentence. 
`anchor: { textSpan: { byteStart, byteEnd } }`, `label`: first ~50 chars of sentence text. | 78 - | `-paragraphs` | `"span"` | `"paragraph-boundary"` | One annotation per paragraph. `anchor: { textSpan: { byteStart, byteEnd } }`, `label`: `"paragraph"`. | 79 - | `-entities` | `"span"` | `"ner"` | One annotation per entity mention. `anchor: { textSpan: { byteStart, byteEnd } }`, `label`: entity name. `features: { entries: [{ key: "nerType", value: "PERSON" }, { key: "conceptUri", value: "at://..." }] }` — `nerType` always present, `conceptUri` if resolved, plus any future entity keys (e.g., `speakerDid`) forwarded as passthrough. | 80 - | `-topics` | `"span"` | `"topic-segment"` | One annotation per topic break. `anchor: { textSpan: { byteStart, byteEnd } }` where `byteEnd = byteStart` (zero-width span). `label`: `"topic-break"`. | 81 - 82 - `metadata` on all layers: `{ tool: "ionosphere-nlp-pipeline", timestamp: "<ISO 8601 datetime>" }`. 83 - 84 - ## Section 2: Panproto Lens Architecture 85 - 86 - Three lenses. All authoritative — data flows through them. 
87 - 88 - ### Lens 1: Compact Transcript → Expression + Segmentation 89 - 90 - - **Source:** `tv.ionosphere.transcript` (text, startMs, timings) 91 - - **Target:** `pub.layers.expression.expression` + `pub.layers.segmentation.segmentation` 92 - - **Transform:** maps text → text, injects kind/language, replays timings array to produce token list with textSpan + temporalSpan 93 - - **Fan-out:** Single source → two target records (protolens chain) 94 - - **Morphism hints:** `text → text`, `startMs + timings → tokenizations[0].tokens` 95 - 96 - ### Lens 2: NLP Annotations → Annotation Layers 97 - 98 - - **Source:** `tv.ionosphere.nlpAnnotations` (new lexicon, not published to PDS — exists as lens source schema) 99 - - **Target:** 4× `pub.layers.annotation.annotationLayer` 100 - - **Transform:** Byte ranges → `anchor.textSpan`, labels → `annotation.label`, entity metadata → `features` 101 - - **Fan-out:** Single source → four target records 102 - 103 - The `tv.ionosphere.nlpAnnotations` lexicon formalizes the NlpAnnotations TypeScript interface as a proper schema that panproto can parse and validate. 104 - 105 - ### Lens 3: Layers.pub → Ionosphere Document Facets (reverse) 106 - 107 - - **Source:** expression + segmentation + annotation layers 108 - - **Target:** RelationalText document with ionosphere facets (`{ text, facets }`) 109 - - **Purpose:** Materialized view builder, used by appview indexer 110 - - **Round-trip property:** Lens 1+2 followed by Lens 3 should reproduce the same document that `decodeToDocumentWithStructure` currently produces. This is the correctness test. 111 - 112 - ### Lens storage 113 - 114 - Published as `org.relationaltext.lens` records (consistent with existing 4 lenses). Rkeys: `transcript-to-expression`, `nlp-to-annotation-layers`, `layers-to-document`. 115 - 116 - ## Section 3: Publish Pipeline 117 - 118 - New Stage 6 in `publish.ts`, runs after transcript publishing. 
119 - 120 - For each talk with both transcript and NLP annotation files: 121 - 122 - 1. Load compact transcript (`transcripts/{rkey}.json`) 123 - 2. Load NLP annotations (`nlp/{rkey}.json`) 124 - 3. Run **Lens 1** via panproto WASM → expression + segmentation records 125 - 4. Run **Lens 2** via panproto WASM → 4 annotation layer records 126 - 5. Inject AT URIs (pre-computed from DID + rkey): `sourceRef` on expression, `expression` ref on segmentation and all annotation layers 127 - 6. Publish all 6 records via `PdsClient.putRecord()` (parallel within each talk, sequential across talks) 128 - 129 - **Idempotency:** Deterministic rkeys mean re-publish overwrites, no duplicates. 130 - 131 - **Rate limiting:** ~98 talks × 6 records = ~588 putRecord calls. Records within a single talk publish in parallel (6 concurrent). Talks are processed sequentially. The existing PdsClient writeDelay (100ms) and 429/backoff handling apply. 132 - 133 - **Skip condition:** If a talk has transcript but no NLP annotations, skip layers.pub entirely for that talk (keep as a unit). 134 - 135 - **WASM lifecycle:** Panproto runtime initializes once (lazy singleton), schemas load once at stage start, each talk runs data through compiled lenses. 136 - 137 - ## Section 4: Appview Indexer Updates 138 - 139 - ### New Jetstream subscriptions 140 - 141 - Add to `IONOSPHERE_COLLECTIONS`: 142 - - `pub.layers.expression.expression` 143 - - `pub.layers.segmentation.segmentation` 144 - - `pub.layers.annotation.annotationLayer` 145 - 146 - **DID filter:** Only index records from ionosphere.tv DID (no third-party layers). 

### New DB tables

```sql
-- uri IS the expression URI; other tables reference it via expression_uri
CREATE TABLE layers_expressions (
  uri TEXT PRIMARY KEY, rkey TEXT, did TEXT, transcript_uri TEXT,
  text TEXT, language TEXT, created_at TEXT
);

CREATE TABLE layers_segmentations (
  rkey TEXT, did TEXT, expression_uri TEXT,
  tokens_json TEXT, created_at TEXT
);

CREATE TABLE layers_annotations (
  rkey TEXT, did TEXT, expression_uri TEXT,
  kind TEXT, subkind TEXT, annotations_json TEXT, created_at TEXT
);
```

Key indexes: `expression_uri` (find all layers for expression), `transcript_uri` (find expression for transcript).

### Document rebuild on ingest

When any layers.pub record arrives or updates:

1. Look up the expression URI → transcript URI
2. Check for complete set: expression + segmentation + at least one annotation layer
3. If complete, run **Lens 3** (layers.pub → ionosphere document facets) to produce materialized `{ text, facets }`
4. Update the talk record's `document` field in DB

**Deletion:** If an annotation layer is deleted, rebuild with remaining layers (graceful degradation: fewer annotations). If the expression record is deleted, cascade-delete its segmentation and annotation rows from the DB and clear the talk's materialized document. Annotation/segmentation JSON is stored as TEXT blobs; all queries are by expression URI, not individual annotations, so this is sufficient.

**Backfill:** Add three new collections to existing startup backfill loop; records go through same ingest → rebuild path.
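
The rebuild and deletion rules above can be sketched against an in-memory stand-in for the three tables. Everything named here (`LayersStore`, the row interfaces, the `Lens3` function type) is hypothetical; the real Lens 3 runs through panproto and the real storage is the DB. The sketch also keys the materialized document by transcript URI rather than by talk record, purely to stay self-contained.

```typescript
// Illustrative in-memory stand-ins for the layers_* tables.
interface ExpressionRow { uri: string; transcriptUri: string; text: string }
interface SegmentationRow { expressionUri: string; tokensJson: string }
interface AnnotationRow { rkey: string; expressionUri: string; annotationsJson: string }
interface Doc { text: string; facets: unknown[] }

// Hypothetical signature for Lens 3 (layers.pub -> { text, facets }).
type Lens3 = (
  expr: ExpressionRow,
  seg: SegmentationRow,
  layers: AnnotationRow[],
) => Doc;

class LayersStore {
  expressions = new Map<string, ExpressionRow>(); // keyed by expression URI
  segmentations = new Map<string, SegmentationRow>(); // keyed by expression URI
  annotations: AnnotationRow[] = [];
  documents = new Map<string, Doc>(); // keyed by transcript URI (simplification)

  // Rebuild only when the set is complete:
  // expression + segmentation + at least one annotation layer.
  rebuild(expressionUri: string, lens3: Lens3): boolean {
    const expr = this.expressions.get(expressionUri);
    const seg = this.segmentations.get(expressionUri);
    const layers = this.annotations.filter((a) => a.expressionUri === expressionUri);
    if (!expr || !seg || layers.length === 0) return false;
    this.documents.set(expr.transcriptUri, lens3(expr, seg, layers));
    return true;
  }

  // Deleting one annotation layer degrades gracefully: rebuild with the rest.
  // If no layers remain, this sketch leaves the previous document in place;
  // the text above does not specify that case.
  deleteAnnotation(expressionUri: string, rkey: string, lens3: Lens3): void {
    this.annotations = this.annotations.filter(
      (a) => !(a.expressionUri === expressionUri && a.rkey === rkey),
    );
    this.rebuild(expressionUri, lens3);
  }

  // Deleting the expression cascades to its segmentation and annotation rows
  // and clears the materialized document.
  deleteExpression(expressionUri: string): void {
    const expr = this.expressions.get(expressionUri);
    this.expressions.delete(expressionUri);
    this.segmentations.delete(expressionUri);
    this.annotations = this.annotations.filter((a) => a.expressionUri !== expressionUri);
    if (expr) this.documents.delete(expr.transcriptUri);
  }
}
```

Because every lookup goes through the expression URI, the same `rebuild` call serves live ingest, deletion handling, and backfill.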

## Section 5: Schema Versioning

### Panproto VCS initialization

- `schema init` in project
- Add layers.pub lexicons + ionosphere lexicons + NLP annotations lexicon + all lens definitions
- `schema commit` initial state
- `schema tag v0.5.0`: pin to current vendored layers.pub version

### What's tracked

- `lexicons/pub/layers/*.json`
- `formats/tv.ionosphere/ionosphere.lexicon.json`
- `formats/tv.ionosphere/nlpAnnotations.lexicon.json` (new)
- `formats/tv.ionosphere/lenses/*.lens.json` (existing 4 + new 3)

### Migration strategy

When layers.pub evolves: vendor updated lexicons, `schema diff`, update lenses if needed. The materialized view insulates the frontend: layers.pub can change without affecting ionosphere document format until we choose to update.

## Task Order

1. Define `tv.ionosphere.nlpAnnotations` lexicon (lens source schema)
2. Define Lens 1: compact transcript → expression + segmentation
3. Define Lens 2: NLP annotations → 4 annotation layers
4. Add publish Stage 6: run lenses, publish 6 records per talk
5. Define Lens 3: layers.pub → ionosphere document facets (reverse)
6. Add appview DB tables + indexer for layers.pub collections (depends on 5)
7. Wire indexer rebuild: on layers.pub ingest, run Lens 3, update talk document
8. Initialize panproto VCS, tag v0.5.0
9. Test round-trip: publish → index → rebuild → verify document matches current output
10. Deploy
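
The round-trip check in step 9 amounts to comparing two `{ text, facets }` documents. One way to do that is sketched below, under two assumptions that the plan does not state: object key order inside a facet is irrelevant, and the order of facets in the array is irrelevant. The names `MaterializedDoc`, `canonical`, and `documentsMatch` are hypothetical.

```typescript
// Assumed shape of the materialized document; facet contents are left opaque.
interface MaterializedDoc {
  text: string;
  facets: unknown[];
}

// Serialize a value with object keys sorted, so key order does not affect equality.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonical(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Compare text exactly and facets as an unordered collection (assumption).
function documentsMatch(a: MaterializedDoc, b: MaterializedDoc): boolean {
  if (a.text !== b.text) return false;
  const fa = a.facets.map(canonical).sort();
  const fb = b.facets.map(canonical).sort();
  return fa.length === fb.length && fa.every((facet, i) => facet === fb[i]);
}
```

If facet order turns out to be significant in the ionosphere document format, dropping the two `.sort()` calls in `documentsMatch` gives a strict positional comparison instead.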