# Panproto Wishlist for Ionosphere

What we need from panproto to make the ionosphere lens layer fully declarative — no mechanical shims, all transforms expressed as protolens chains stored as AT Protocol records.

Filed upstream: panproto/panproto#15

---

## 1. Pipeline combinator API in the TypeScript SDK

**Priority: critical**

The tutorial (Chapter 6) shows:
```typescript
const lens = pipeline([
  renameField('displayName', 'name'),
  addField('bio', 'string', ''),
  removeField('legacyField'),
  hoistField('additionalData.room'),
]);
```

This is exactly what we need for cross-schema transforms (calendar event → talk, VOD → talk). The Rust engine has these combinators, but they're not exported from `@panproto/core`. The WASM exposes `apply_protolens_step` with elementary steps (`rename_sort`, `drop_sort`, `add_sort`, `rename_op`, `drop_op`, `add_op`), but:

- `rename_sort` renames the vertex, not the prop edge `name` (which controls the JSON key)
- `rename_op` renames the edge kind, not the edge name
- There's no hoist step (move nested field to top level)

**What we need:** `renameField(oldPropName, newPropName)` that renames the JSON property key — which means renaming the prop edge's `name` attribute, not just the vertex.

**Our workaround:** Existing `applyLens` field mapper with JSON lens specs.

## 2. Prop edge name rename step

**Priority: critical**

The elementary step vocabulary has 6 types: `{add,drop,rename}_{sort,op}`. None of these rename a prop edge's `name` attribute. In ATProto lexicons, the JSON property name comes from the prop edge name (e.g., `edge('body', 'body.name', 'prop', { name: 'name' })`). To rename a JSON key from `name` to `title`, we need to rename this edge attribute.

Either:
- A new step type: `rename_prop_name` (or `rename_edge_name`)
- Or make `rename_op` also handle the edge `name` attribute when the edge kind is `prop`

## 3. Hoist / restructure steps

**Priority: high**

ATmosphereConf calendar events store metadata in `additionalData.room`, `additionalData.category`, etc. Ionosphere talks have these as top-level fields: `room`, `category`, `talkType`. This is a hoist: move a property from a nested object to the parent.

The auto-lens generation (Chapter 17) has a "Restructuring Pass" that handles this, but it requires finding a morphism first — which fails when the schemas are too different.

We need either:
- A `hoist_field` elementary step
- Or the ability to express this as a composed `rename_sort` + edge rearrangement
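
For reference, the record-level effect we expect from a hoist can be sketched on plain JSON. This is only an illustration of the intended semantics in Python, not panproto's API: `hoist_field` here is a hypothetical helper operating on dictionaries, and leaving the remaining nested keys in place is an assumption.

```python
from copy import deepcopy


def hoist_field(record: dict, path: str) -> dict:
    """Return a copy of `record` with the value at dotted `path` moved to the top level.

    Hypothetical helper: illustrates the desired JSON-level result only.
    """
    *parents, leaf = path.split(".")
    result = deepcopy(record)
    container = result
    for key in parents:
        container = container.get(key)
        if not isinstance(container, dict):
            return result  # path is absent: leave the record unchanged
    if leaf in container:
        # Remove the value from the nested object and re-attach it at the root.
        result[leaf] = container.pop(leaf)
    return result


event = {"name": "Opening", "additionalData": {"room": "Main", "category": "keynote"}}
talk_like = hoist_field(event, "additionalData.room")
```

A declarative `hoist_field` step would need to produce the same record shape while remaining invertible, which is what the dictionary version above does not attempt.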

## 4. Morphism hints for cross-namespace schemas

**Priority: high**

`pp.lens(calendarSchema, talkSchema)` fails with "no morphism found" because the schemas have completely different NSID namespaces. The morphism search uses name similarity (edit distance) for scoring, but `community.lexicon.calendar.event:body.name` has no name similarity to `tv.ionosphere.talk:body.title`.

The overlap discovery (`discover_overlap`) also fails because `find_best_morphism` returns empty.

We need a way to provide explicit vertex correspondence hints:
```typescript
const lens = pp.lens(calendarSchema, talkSchema, {
  hints: {
    'community.lexicon.calendar.event:body.name': 'tv.ionosphere.talk:body.title',
    'community.lexicon.calendar.event:body.description': 'tv.ionosphere.talk:body.description',
  }
});
```

Or equivalently, the ability to seed the morphism search with known correspondences.

## 5. `lift_json` exposed in the TypeScript SDK

**Priority: medium**

The WASM has `lift_json(migration, json_bytes, root_vertex)` which takes JSON in and returns JSON out. The SDK's `CompiledMigration.lift()` requires an `Instance` object (from `pp.parseJson()`), which is less ergonomic for the common case of transforming plain JSON records.

Exposing `lift_json` in the SDK would eliminate the `parseJson` → `lift` → `toJson` round-trip.

## 6. `parseLexicon` metadata format fix

**Priority: medium**

The `schema_metadata` WASM function returns positional arrays (`[protocol, vertices[], edges[]]`), but the SDK's `parseLexicon` method expects named object keys (`meta.vertices.map(...)`). We work around this with a direct WASM call that handles both formats.

## 7. WASM binary in npm package

**Priority: medium**

`@panproto/core@0.22.0` ships the TypeScript SDK shell but not the WASM binary (`panproto_wasm.js` + `panproto_wasm_bg.wasm`). We build from source with `wasm-pack build crates/panproto-wasm --target web --release` and copy the output into `node_modules`.

Additionally, the web-target glue module uses `fetch()` to load the `.wasm` file, which doesn't work with `file://` URLs in Node.js. We work around this by pre-loading the binary with `readFileSync` and using `initSync({ module: wasmBytes })` via a wrapped glue module.

Options:
- Ship the WASM in the npm tarball (adds ~6MB)
- Publish a separate `@panproto/wasm` package
- Ship a Node.js-target build alongside the web-target build

## 8. Migration builder for partial cross-schema transforms

**Priority: low (covered by items 1-4)**

`pp.migration(src, tgt).map(...)` works for structurally similar schemas but fails with "root node was pruned during restriction" when the source schema has many unmapped vertices. This is the expected behavior for a morphism-based migration, but it means the migration builder can't be used for cross-schema transforms where most source fields are dropped.

If items 1-4 land, this becomes unnecessary — the pipeline combinator API is the right tool for cross-schema transforms.

---

## What works great today

- `parseLexicon()` — parses any ATProto lexicon into a schema graph
- `diff()` / `diffFull()` — accurate structural diffs between schema versions
- `validateSchema()` — validates schemas against protocol rules
- `pp.lens(v1, v2)` — auto-generates lenses between structurally similar schemas
- `migration().map().compile()` — explicit vertex mapping for version migration
- `protolensChain().toJson()` / `fromJson()` — serialization for storage as AT Protocol records
- `protocol('atproto')` + `SchemaBuilder` — manual schema construction
- Schema diffing with 20+ change categories

The algebraic foundations are excellent. These SDK gaps are the last mile to making cross-schema transforms fully declarative.

# Enrichment Phases 2-3: NER + Entity Linking, Topic Segmentation

**Date:** 2026-04-12
**Status:** Approved
**Depends on:** Phase 1 transcript formatting (complete)

## Goal

Add named entity recognition with AT Protocol record linking (Phase 2) and topic segmentation with visual dividers (Phase 3) to the existing NLP enrichment pipeline. Achieve feature parity with the old concept system while adding speaker attribution and topic navigation, before deploying.

## Constraints

- **Text is immutable.** Same as Phase 1 — annotations only, no word changes.
- **Build-time processing.** New passes extend the existing Python pipeline.
- **Leverage existing data.** Speaker records, diarization records, and concept records are already in the database. Use them for entity resolution.
- **layers.pub annotation model.** Each pass produces a separate annotation layer, consistent with Phase 1's approach.

## Pipeline Passes

### Pass 3: Named Entity Recognition + Entity Linking

**Input:** transcript text + speaker records (from SQLite) + diarization records (from SQLite) + concept records (from SQLite)

**Steps:**

1. **Build speaker lookup.** Query `speakers` table, build a map of `{name, aliases, handle, did}` for all speakers. Include normalized variants (lowercase, first-name-only for disambiguation with diarization context).

2. **Load diarization.** Query `stream_diarizations` table for the talk's stream. Map diarization time ranges to speaker identities. Each segment tells us who is speaking when — this provides context for resolving ambiguous names.

3. **Run spaCy NER.** The existing `en_core_web_sm` model (already loaded for sentence detection) provides NER via `doc.ents`. Extract entities with types: PERSON, ORG, PRODUCT, WORK_OF_ART, GPE, EVENT. Compute byte ranges using the same char→byte conversion as sentence detection.

4. **Resolve entities:**
   - **PERSON entities:** Match against speaker lookup by name similarity. Use diarization context for disambiguation — if a first name is mentioned while a known speaker with that first name is presenting or was just speaking, prefer that match. Resolved entities get a `speakerDid` linking to the Bluesky profile.
   - **ORG/PRODUCT entities:** Match against concept records by name and aliases. Resolved entities get a `conceptUri`.
   - **Unresolved entities:** Keep as labeled spans with NER type but no link target. Available for manual curation in Phase 4.

5. **Emit speaker attribution.** For each diarization segment, emit a speaker-segment annotation spanning the corresponding byte range in the transcript. Cross-reference diarization time ranges with word timestamps to find byte boundaries.

**Output:** NLP JSON with `entities` array and `speakerSegments` array.
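
The char→byte conversion in step 3 matters because spaCy reports entity positions as character offsets (`ent.start_char`, `ent.end_char`), while the annotations use UTF-8 byte offsets. A minimal sketch of that conversion, assuming a hypothetical helper name `char_span_to_byte_span` (the existing pipeline's own helper may look different):

```python
def char_span_to_byte_span(text: str, char_start: int, char_end: int) -> tuple[int, int]:
    """Convert a [char_start, char_end) character span into UTF-8 byte offsets."""
    # Bytes before the span = encoded length of the prefix.
    byte_start = len(text[:char_start].encode("utf-8"))
    # Bytes inside the span = encoded length of the covered substring.
    byte_end = byte_start + len(text[char_start:char_end].encode("utf-8"))
    return byte_start, byte_end


text = "Zoë met Anaïs at ATmosphereConf"
span = char_span_to_byte_span(text, 8, 13)  # character span of "Anaïs"
```

Re-encoding the prefix for every entity is quadratic in the worst case; for long transcripts a precomputed table of cumulative byte offsets would likely be the better choice.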

### Pass 4: Topic Segmentation

**Input:** transcript text + sentence boundaries (from Pass 1)

**Steps:**

1. **Embed sentences.** Run each sentence through `all-MiniLM-L6-v2` (384-dim sentence embeddings). The model is ~80MB, downloaded on first run. Embedding 300 sentences takes ~2 seconds on CPU.

2. **Compute similarity.** For each pair of adjacent sentence windows (window size N, default 3 sentences), compute cosine similarity between the mean embedding of the left window and the right window.

3. **Detect boundaries.** A similarity value below a threshold (tunable, default 0.3) indicates a topic shift. Apply a minimum segment length (default 5 sentences) to avoid over-segmentation.

4. **Snap to structure.** Topic breaks are snapped to the nearest paragraph boundary where possible (since paragraphs already represent pause-based thought transitions). If no paragraph boundary is within 2 sentences of the detected break, snap to the nearest sentence boundary.
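
Steps 2 and 3 can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's implementation: the function names are hypothetical, the embeddings are passed in as lists of floats (in the pipeline they would come from the sentence-transformers model), and the boundary rule greedily accepts the first position below the threshold rather than searching for a local minimum.

```python
import math


def _mean(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_topic_breaks(embeddings, window_size=3, threshold=0.3, min_segment=5):
    """Return sentence indices at which a new topic is assumed to start."""
    breaks = []
    last_break = 0
    for i in range(window_size, len(embeddings) - window_size + 1):
        left = _mean(embeddings[i - window_size:i])
        right = _mean(embeddings[i:i + window_size])
        # Accept a break only if the current segment is already long enough.
        if _cosine(left, right) < threshold and i - last_break >= min_segment:
            breaks.append(i)
            last_break = i
    return breaks


# Six sentences on one "topic" followed by six on an orthogonal one.
toy = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
toy_breaks = detect_topic_breaks(toy)
```

On the toy input the only window pair below the threshold is the one split exactly at sentence 6, so a single break is reported there.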

**Output:** NLP JSON with `topicBreaks` array (byte positions of topic boundaries).

**Parameters stored in metadata:** `embeddingModel`, `windowSize`, `similarityThreshold`, `minSegmentSentences`.
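
The snapping rule from step 4 is small enough to sketch directly. The helper below is hypothetical and works in sentence indices, assuming the caller converts the result to a byte position afterwards by looking up that sentence's `byteStart`:

```python
def snap_break(break_idx, paragraph_starts, max_distance=2):
    """Snap a detected break (a sentence index) to a nearby paragraph start.

    `paragraph_starts` holds the sentence indices at which paragraphs begin.
    If none lies within `max_distance` sentences, the break stays where it
    was, which is already a sentence boundary.
    """
    if not paragraph_starts:
        return break_idx
    nearest = min(paragraph_starts, key=lambda p: abs(p - break_idx))
    return nearest if abs(nearest - break_idx) <= max_distance else break_idx
```

When two paragraph starts are equally close, `min` returns whichever appears first in `paragraph_starts`; the spec does not say which side to prefer, so that tie-break is an assumption.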

## Facet Schema

**Existing facets now populated:**

| Facet type | Class | Use |
|---|---|---|
| `tv.ionosphere.facet#speaker-segment` | `block` | Wraps diarization segment — attributes text to speaker |
| `tv.ionosphere.facet#speaker-ref` | `inline` | Links person mention to speaker DID/profile |
| `tv.ionosphere.facet#concept-ref` | `inline` | Links ORG/PRODUCT mention to concept record |

**New facets to add to format lexicon:**

| Facet type | Class | Use |
|---|---|---|
| `tv.ionosphere.facet#topic-break` | `block` | Topic boundary — renderer inserts divider |
| `tv.ionosphere.facet#entity` | `inline` | Unresolved entity — has label + NER type, no linked record |

## Document Assembly

The `NlpAnnotations` interface in `transcript-encoding.ts` extends to:

```typescript
interface NlpAnnotations {
  sentences: Array<{ byteStart: number; byteEnd: number }>;
  paragraphs: Array<{ byteStart: number; byteEnd: number }>;
  entities: Array<{
    byteStart: number; byteEnd: number;
    label: string; nerType: string;
    speakerDid?: string; conceptUri?: string;
  }>;
  speakerSegments: Array<{
    byteStart: number; byteEnd: number;
    speakerDid: string; speakerName: string;
  }>;
  topicBreaks: Array<{ byteStart: number }>;
}
```

`decodeToDocumentWithStructure` maps these to facets:
- `entities` with `speakerDid` → `#speaker-ref` facets
- `entities` with `conceptUri` → `#concept-ref` facets
- `entities` with neither → `#entity` facets (unresolved)
- `speakerSegments` → `#speaker-segment` facets
- `topicBreaks` → `#topic-break` facets
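
The entity dispatch rule can be stated compactly. The sketch below is written in Python purely to illustrate the rule; the real mapping lives in the TypeScript `decodeToDocumentWithStructure`. The function name is hypothetical, and checking `speakerDid` before `conceptUri` when both are present is an assumption, since the mapping above does not define that case.

```python
FACET_NS = "tv.ionosphere.facet"


def entity_facet_type(entity: dict) -> str:
    """Pick the facet type for one entry of the `entities` array."""
    if entity.get("speakerDid"):
        return f"{FACET_NS}#speaker-ref"
    if entity.get("conceptUri"):
        return f"{FACET_NS}#concept-ref"
    # Neither link target present: keep the span as an unresolved entity.
    return f"{FACET_NS}#entity"
```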

## Renderer Changes

### Entity spans

`extractData` returns `entities: EntitySpan[]` with byte range, label, NER type, and optional link target. The renderer overlays these on word spans:

- **`#speaker-ref`** — renders as a link styled with a subtle blue underline. Clicking navigates to the speaker page or Bluesky profile.
- **`#concept-ref`** — renders as a link with amber underline (matching existing concept highlighting). Clicking navigates to the concept page.
- **`#entity`** (unresolved) — renders as subtly styled text (dotted underline, slightly different color) to indicate a recognized entity without a link.

These are inline facets that overlay on word spans. A word can have multiple facets (timestamp + entity). The existing `wordConcepts` pattern in `extractData` extends to handle all entity types.

### Speaker segments

Not visually rendered in this phase. The data is stored in facets for future use (speaker-colored text, margin labels, etc.). Getting the attribution data right is the priority.

### Topic dividers

A subtle `<hr>` between paragraphs where a topic break falls:

```html
<div class="mb-4"><!-- paragraph --></div>
<hr class="border-neutral-800 my-6" />
<div class="mb-4"><!-- paragraph --></div>
```

`extractData` returns `topicBreaks: Set<number>` — a set of paragraph indices where topic breaks occur. The renderer checks this set when iterating paragraphs and inserts dividers.

## Speaker Lookup Generation

The Python pipeline reads speaker data directly from the SQLite database (Python's `sqlite3` is in the standard library). The lookup table is built at pipeline startup:

```python
speakers = db.execute("SELECT name, handle, speaker_did FROM speakers").fetchall()
lookup = {}
for name, handle, did in speakers:
    if not name or not name.strip():
        continue  # skip rows without a usable name (split()[0] would fail)
    entry = {"name": name, "handle": handle, "did": did}
    lookup[name.lower()] = entry
    # Also index by first name for diarization-context matching.
    # The first speaker seen with a given first name keeps that key.
    first = name.split()[0].lower()
    if first not in lookup:
        lookup[first] = entry
```

This is ephemeral — rebuilt each pipeline run from the current speaker records. No separate file to maintain.
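
Given a lookup of this shape, PERSON resolution with diarization context could look roughly like the sketch below. It makes several simplifying assumptions: `resolve_person` and `active_speakers` are hypothetical names, matching is exact rather than similarity-based, and a bare first name is only accepted when the matching speaker is currently (or was just) speaking, which is stricter than the "prefer that match" wording in Pass 3.

```python
def resolve_person(mention, lookup, active_speakers=()):
    """Resolve a PERSON mention to a speaker entry, or return None.

    `lookup` is the name -> entry map built at pipeline startup.
    `active_speakers` holds full names of speakers who are presenting or
    were just speaking according to the diarization data.
    """
    key = mention.strip().lower()
    if not key:
        return None
    candidate = lookup.get(key)
    if candidate is None:
        return None
    # The mention is the speaker's complete name: accept it directly.
    if candidate["name"].lower() == key:
        return candidate
    # Otherwise the key was a first-name alias: require diarization support.
    active = {name.lower() for name in active_speakers}
    return candidate if candidate["name"].lower() in active else None


# Placeholder data for illustration only.
alice = {"name": "Alice Smith", "handle": "alice.example", "did": "did:plc:alice"}
demo_lookup = {"alice smith": alice, "alice": alice}
```

Mentions that come back as `None` would fall through to the unresolved `#entity` path.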

## Dependencies

**New Python dependency:** `sentence-transformers>=2.0` (adds torch, transformers, tokenizers — ~2GB install). Build-time only, no runtime impact.

**spaCy NER:** Zero-cost addition — `en_core_web_sm` already loaded for sentence detection. NER entities are read from `doc.ents` in the same pass.

**SQLite access:** Python `sqlite3` standard library. Pipeline reads speaker, diarization, and concept records from the same database the appview uses.

## Testing Strategy

**Python pipeline (pytest):**
- Unit tests for speaker lookup construction (name variants, first-name matching)
- Unit tests for entity resolution (exact match, first-name match with diarization context, unresolved fallback)
- Unit tests for topic segmentation (boundary detection, minimum segment length, snap-to-paragraph)
- Integration test: full pipeline on a known transcript, verify entity and topic output

**TypeScript (vitest):**
- `decodeToDocumentWithStructure` with entity/speaker/topic annotations
- `extractData` with entity facets and topic breaks
- Renderer: entity links, topic dividers between paragraphs

**Manual validation:**
- Spot-check 5-10 talks: verify entity links point to correct profiles/concepts
- Verify topic breaks land at natural transitions, not mid-thought
- Check that speaker attribution aligns with diarization (correct speaker for each segment)

# Transcript Formatting: NLP Enrichment Pipeline

**Date:** 2026-04-12
**Status:** Approved

## Problem

Transcripts are currently rendered as an infinitely long, unbroken run of text. Word-level timing and concept facets exist, but there is no structural formatting — no sentences, no paragraphs, no visual hierarchy. The goal is for transcripts to read as though they were essays.

## Constraints

- **Text is immutable.** The pipeline adds structural annotations only — no words are modified, added, or removed. Transcript editing is a separate future concern.
- **Reliability over ambition.** Each enrichment pass must be dependable enough to run unsupervised across all transcripts. Noisy output is worse than no output.
- **Build-time processing.** NLP runs once in the batch pipeline; results are published as AT Protocol records. Zero runtime cost.
- **Python NLP stack.** spaCy for sentence detection and NER; sentence-transformers for topic segmentation.

## Schema Design: layers.pub Integration

The enrichment pipeline uses [layers.pub](https://layers.pub) (`pub.layers.*`) lexicons — composable AT Protocol schemas for linguistic annotation. This gives us a standard, interoperable representation with built-in support for multiple annotation passes, provenance tracking, and manual overrides.

**Vendoring strategy:** layers.pub is at v0.5.0 draft. We vendor the specific lexicon definitions we use into `lexicons/pub/layers/` in this repo. Panproto lenses provide forward-compatibility — when layers.pub evolves, we define migrations rather than rewriting our pipeline. This follows the project's principle of prioritizing the lens layer for forward-compat.

### Record Architecture

#### 1. Source transcript (existing)

`tv.ionosphere.transcript` — compact storage format with `text`, `startMs`, and `timings` array. Stays as-is. Source of truth for raw transcription output.

#### 2. Expression record

`pub.layers.expression.expression` (kind: `"transcript"`) — the transcript text published as a layers.pub expression. Links back to the ionosphere transcript via `sourceRef`. This is the anchoring point for all annotations.

#### 3. Segmentation record

`pub.layers.segmentation.segmentation` — word-level tokenization derived from the transcript's compact timing data. Each token carries:
- `textSpan`: UTF-8 byte offsets (`byteStart`, `byteEnd`)
- `temporalSpan`: timing in milliseconds (`start`, `ending`)

This replaces the per-word timestamp facets with a standard representation.
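
To make the `textSpan` side concrete, here is a minimal sketch of computing UTF-8 byte offsets for each word. It assumes whitespace-delimited tokens and a hypothetical helper name, and it leaves out `temporalSpan` because decoding the compact `timings` array is specific to the transcript format:

```python
import re


def word_text_spans(text: str) -> list[dict]:
    """Return one `textSpan` (UTF-8 byte offsets) per whitespace-delimited word."""
    spans = []
    byte_pos = 0  # running byte offset into the encoded text
    char_pos = 0  # character position already accounted for
    for match in re.finditer(r"\S+", text):
        # Advance past the whitespace between the previous word and this one.
        byte_pos += len(text[char_pos:match.start()].encode("utf-8"))
        byte_start = byte_pos
        byte_pos += len(match.group().encode("utf-8"))
        char_pos = match.end()
        spans.append({"textSpan": {"byteStart": byte_start, "byteEnd": byte_pos}})
    return spans


demo_spans = word_text_spans("héllo wörld")
```

Tracking a running byte offset keeps the pass linear in the transcript length, and non-ASCII characters such as `é` correctly count as two bytes.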

#### 4. Annotation layers

`pub.layers.annotation.annotationLayer` — one record per enrichment pass:

| Pass | `kind` | `subkind` | `sourceMethod` |
|------|--------|-----------|----------------|
| Sentence detection | `span` | `sentence-boundary` | `automatic` |
| Paragraph segmentation | `span` | `paragraph-boundary` | `automatic` |
| Topic segmentation (future) | `span` | `topic-segment` | `automatic` |
| Named entity recognition (future) | `span` | `ner` | `automatic` |
| Concept linking (future) | `span` | `concept` | `automatic` |
| Speaker attribution (future) | `span` | `speaker` | `automatic` |
| Manual corrections (future) | varies | varies | `manual-native` |

Each layer includes `metadata` (agent, tool, confidence, timestamp) for provenance. Pipeline parameters (e.g., paragraph pause threshold) are stored in `metadata.features` so provenance is complete and results are reproducible.

#### 5. Manual override layer (future)

A separate annotation layer with `sourceMethod: "manual-native"` and higher `rank`. The merge step prefers higher-ranked layers. Example: correcting "Blue Sky" to link to the Bluesky concept record is an annotation in this layer that supersedes the auto-detected concept. Published as first-class AT Protocol records — auditable, attributable, and preservable across pipeline re-runs.
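
The rank-based merge could be sketched as follows. The layer shape used here (a `rank` number plus an `annotations` list carrying `byteStart` and `byteEnd`) is a simplification for illustration rather than the exact layers.pub record structure, and the function is meant to be applied to layers of the same `subkind` only:

```python
def merge_layers(layers):
    """Merge annotation layers, keeping the highest-ranked annotation per byte range."""
    best = {}  # (byteStart, byteEnd) -> (rank, annotation)
    for layer in layers:
        for ann in layer["annotations"]:
            key = (ann["byteStart"], ann["byteEnd"])
            if key not in best or layer["rank"] > best[key][0]:
                best[key] = (layer["rank"], ann)
    return [best[key][1] for key in sorted(best)]


auto_layer = {"rank": 0, "annotations": [
    {"byteStart": 0, "byteEnd": 8, "label": "Blue Sky"},
    {"byteStart": 20, "byteEnd": 23, "label": "PDS"},
]}
manual_layer = {"rank": 10, "annotations": [
    {"byteStart": 0, "byteEnd": 8, "label": "Bluesky"},
]}
merged = merge_layers([auto_layer, manual_layer])
```

Because the comparison is on `rank` rather than on input order, re-running the automatic passes cannot displace a manual correction at the same byte range.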

### Replacing `tv.ionosphere.annotation`

The existing `tv.ionosphere.annotation` record type (concept mentions anchored to byte ranges) is replaced wholesale by layers.pub annotation layers. Since the entire pipeline rebuilds from raw transcripts, there is no migration burden — the next pipeline run produces layers.pub records instead of `tv.ionosphere.annotation` records, and the old records are deleted.

The existing concept data is re-derived by the NLP pipeline as a concept annotation layer (Phase 2), which will produce better results than the current approach. The `tv.ionosphere.annotation` lexicon and related code (`enrich.ts`, `enrich-all.ts`, `overlayAnnotations` in the appview) are removed in Phase 1.

### Panproto Integration

- **Lenses:** Transform between compact transcript format (`tv.ionosphere.transcript`) and layers.pub expression + segmentation format. Lens definitions live in `formats/tv.ionosphere/lenses/`.
- **Schema validation:** Validate all layers.pub records before publishing to PDS. Runs in the TypeScript publish step (after the Python NLP pipeline outputs JSON).
- **Migration support:** As layers.pub evolves from v0.5.0, panproto migrations keep ionosphere records compatible. Vendored lexicons in `lexicons/pub/layers/` are the pinned source of truth.
- **Pipeline boundary:** The Python NLP pipeline outputs annotation layer JSON files. The TypeScript publish step validates them against panproto-parsed lexicons and publishes to PDS. This reuses the existing panproto TypeScript integration.

## Pipeline Architecture

```
transcript record (text + timings)
          |
          v
+-------------------+
| Pass 1: Sentences |  <-- spaCy sentence boundary detection
+---------+---------+
          |
          v
+--------------------+
| Pass 2: Paragraphs |  <-- pause data + sentence boundaries
+---------+----------+
          |
          v
+--------------------+
| Pass N: (future)   |  <-- topics, entities, speaker linking
+---------+----------+
          |
          v
+--------------------+
| Override layer     |  <-- manual corrections (higher rank)
+---------+----------+
          |
          v
+--------------------+
| Merge & publish    |  <-- assemble RelationalText document
+--------------------+
```

Properties:
- **Each pass is a standalone Python module** with a consistent interface: takes transcript text + timings + prior layer output, returns a new annotation layer.
- **Passes are additive** — they never modify text, only emit new annotations.
- **Override layer applies last** — manual corrections supersede auto-generated annotations at matching byte ranges via `rank`.
- **Idempotent** — re-running the pipeline produces the same output; manual overrides are preserved because they are separate records.

### Pass 1: Sentence Boundary Detection

**Tool:** spaCy with `en_core_web_sm`. The small model is nearly as accurate as the transformer model for sentence boundary detection (its most battle-tested feature), and runs without GPU on a standard dev machine. If accuracy proves insufficient on speech transcripts, upgrade to `en_core_web_trf` in a later pass.

spaCy's sentence segmenter uses dependency parsing, which is significantly more robust than punctuation-splitting for speech transcripts where Whisper's punctuation can be unreliable.

**Output:** An annotation layer with one annotation per sentence, anchored by byte span.

**Reliability:** Very high (95%+ accuracy on messy speech text).

### Pass 2: Paragraph Segmentation

**Tool:** Custom algorithm combining two signals.

**Signal 1 — Pause duration:** The transcript's timing data encodes silence gaps as negative values. Pauses above a tunable threshold are paragraph boundary candidates. Default threshold: **2.0 seconds** (a conservative starting point — most speech pauses are under 1s; pauses over 2s reliably indicate topic transitions).

**Signal 2 — Sentence alignment:** Paragraph breaks only occur at sentence boundaries (from Pass 1). A long pause mid-sentence is a speaker thinking, not a paragraph break.

**Algorithm:**
```
for each silence gap > pause_threshold_ms (default: 2000):
    find the nearest sentence boundary (from Pass 1)
    if the sentence boundary is within proximity_words (default: 5) of the pause:
        emit paragraph break at that sentence boundary
```

The proximity constraint of 5 words allows for the common case where a speaker finishes a thought (pause), says a brief connective phrase ("so", "and then"), and starts the next topic — the paragraph break lands at the sentence boundary closest to the actual pause.

Both `pause_threshold_ms` and `proximity_words` are stored in the annotation layer's `metadata.features` for reproducibility.
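
A runnable rendering of the algorithm above, under stated assumptions: pauses have already been decoded from the timing data into `(word_index, gap_ms)` pairs, where `word_index` is the word that follows the silence, and sentence boundaries are given as the word indices at which sentences start. The function name and input shapes are hypothetical.

```python
def paragraph_breaks(pauses, sentence_starts, pause_threshold_ms=2000, proximity_words=5):
    """Return sorted word indices (all sentence starts) where paragraphs begin."""
    breaks = set()
    if not sentence_starts:
        return []
    for word_index, gap_ms in pauses:
        if gap_ms <= pause_threshold_ms:
            continue  # too short to count as a paragraph-level pause
        nearest = min(sentence_starts, key=lambda s: abs(s - word_index))
        if abs(nearest - word_index) <= proximity_words:
            breaks.add(nearest)  # a set, so two nearby pauses yield one break
    return sorted(breaks)


demo_breaks = paragraph_breaks(
    pauses=[(10, 2500), (30, 500), (50, 3000)],
    sentence_starts=[0, 12, 29, 60],
)
```

In the example, the 2.5 s pause before word 10 snaps to the sentence starting at word 12, the 0.5 s pause is ignored, and the 3 s pause before word 50 is dropped because the nearest sentence start is 10 words away.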

**Reliability:** High. Pause duration is a genuine speech signal, and constraining to sentence boundaries eliminates false positives.

## Rendering

### Format Lexicon Updates

Two new facet types added to `tv.ionosphere.facet`:

| Facet type | `featureClass` | Description |
|---|---|---|
| `tv.ionosphere.facet#sentence` | `inline` | Wraps all words in a sentence as a contiguous inline span |
| `tv.ionosphere.facet#paragraph` | `block` | Groups sentences into a block-level paragraph container |

Note: the annotation _storage_ format is layers.pub annotation layers (on the PDS). The _rendering_ format is ionosphere facets in the RelationalText document. The document assembly step bridges these — it reads layers.pub annotations and emits ionosphere facets. This separation means the renderer does not need to know about layers.pub.

### DOM Structure

The renderer groups words into sentence spans and sentences into paragraph blocks:

```html
<div>                          <!-- paragraph (block) -->
  <span>                       <!-- sentence (inline) -->
    <span>word</span> <span>word</span> <span>word</span>
  </span>
  <span>                       <!-- sentence (inline) -->
    <span>word</span> <span>word</span>
  </span>
</div>
<div>                          <!-- paragraph (block) -->
  <span>                       <!-- sentence (inline) -->
    <span>word</span> <span>word</span>
  </span>
</div>
```

This mirrors the layers.pub expression hierarchy (transcript > paragraph > sentence) and maps directly to the format lexicon's `featureClass` system (`block` for paragraphs, `inline` for sentences).

Sentence spans provide styling hooks for hover, selection, and transitions at sentence granularity. Paragraph blocks provide natural vertical whitespace.

### Data Model Changes: `extractData` → Hierarchical Structure

The current `extractData` function in `src/lib/transcript.ts` returns a flat `{ words: WordSpan[], concepts, wordConcepts }`. This must change to return a hierarchical structure:

```typescript
interface ParagraphSpan {
  byteStart: number;
  byteEnd: number;
  sentences: SentenceSpan[];
}

interface SentenceSpan {
  byteStart: number;
  byteEnd: number;
  words: WordSpan[]; // existing WordSpan type, unchanged
}

interface TranscriptStructure {
  paragraphs: ParagraphSpan[];
  concepts: ConceptSpan[];
  // wordConcepts lookup remains flat (indexed by global word index)
  wordConcepts: ConceptSpan[][];
}
```

`extractData` builds this hierarchy by:
1. Extracting all word spans from `#timestamp` facets (existing logic, unchanged).
2. Reading `#paragraph` facets to get paragraph byte ranges. Sorting by `byteStart`.
3. Reading `#sentence` facets to get sentence byte ranges. Sorting by `byteStart`.
4. Assigning each word to its containing sentence (by byte range overlap).
5. Assigning each sentence to its containing paragraph (by byte range overlap).
6. Words not covered by any sentence facet form singleton sentences. Sentences not covered by any paragraph facet form singleton paragraphs. This graceful degradation means the renderer works identically on transcripts that have not yet been enriched.
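
Steps 4 to 6 amount to one grouping routine applied twice. The sketch below shows the idea in Python for illustration only (the real `extractData` is TypeScript). It uses a generic `children` key instead of `words`/`sentences`, a hypothetical function name, and decides containment by each item's start offset, which is one possible reading of "byte range overlap".

```python
def group_into(items, containers):
    """Assign each item to the container covering its start; wrap the rest as singletons."""
    groups = [dict(c, children=[]) for c in sorted(containers, key=lambda c: c["byteStart"])]
    singletons = []
    for item in sorted(items, key=lambda i: i["byteStart"]):
        owner = next(
            (g for g in groups if g["byteStart"] <= item["byteStart"] < g["byteEnd"]),
            None,
        )
        if owner is not None:
            owner["children"].append(item)
        else:
            # Graceful degradation: an uncovered item becomes its own group.
            singletons.append(
                {"byteStart": item["byteStart"], "byteEnd": item["byteEnd"], "children": [item]}
            )
    non_empty = [g for g in groups if g["children"]]
    return sorted(non_empty + singletons, key=lambda g: g["byteStart"])


words = [
    {"byteStart": 0, "byteEnd": 5},
    {"byteStart": 6, "byteEnd": 11},
    {"byteStart": 12, "byteEnd": 17},
]
sentences = group_into(words, [{"byteStart": 0, "byteEnd": 11}])  # third word is uncovered
paragraphs = group_into(sentences, [])  # no paragraph facets at all
```

With no paragraph facets, every sentence becomes its own paragraph, which matches the unenriched-transcript behavior described in step 6.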

The brightness gradient system (`boundaryStartTime`/`boundaryEndTime`) continues to use the global word ordering — paragraph visual gaps do not affect the temporal continuity of the gradient. The existing `WordSpanComponent` is reused unchanged inside the sentence/paragraph wrappers.

### Document Assembly

Document assembly is a **build-time step** that runs after the NLP pipeline and before publishing. It:
1. Reads the compact transcript record (`tv.ionosphere.transcript`).
2. Reads all layers.pub annotation layer records for this transcript.
3. Converts layers.pub sentence/paragraph annotations into `#sentence` and `#paragraph` ionosphere facets.
4. Merges with existing `#timestamp` and `#concept-ref` facets from `decodeToDocument`.
5. Writes the assembled RelationalText document onto the `tv.ionosphere.talk` record's `document` field.

This replaces the current runtime assembly in the appview serve path with a pre-computed document. The appview serves the pre-assembled document directly — zero runtime cost.

Annotation layers of different `subkind` values naturally have overlapping byte ranges (a paragraph span contains sentence spans, which contain word spans). This is expected and correct — they represent different levels of the hierarchy, not conflicting annotations.

### Scroll/Time Mapping

Both `TranscriptView` and `WindowedTranscriptView` must account for paragraph whitespace in their scroll-to-time and time-to-scroll mappings.

**TranscriptView:** The line-map computation already handles variable-height content. Paragraph `<div>` elements with margin/padding become part of the natural layout — no special handling needed beyond the existing line grouping logic.

**WindowedTranscriptView:** The `computeMonospaceLayout` function currently returns `LineEntry[]` with uniform `LINE_HEIGHT`. Changes:
- Accept an additional `paragraphBreaks: Set<number>` parameter (set of word indices where a paragraph starts).
- When a word is a paragraph start, insert a gap of `PARAGRAPH_GAP` pixels (default: `LINE_HEIGHT`, i.e., one blank line) before its line entry.
- `LineEntry` gains `isParagraphStart: boolean` for rendering the gap spacer.
- Gap entries have no time range — `timeToScrollY` and `scrollYToTime` skip gaps by treating them as extensions of the preceding line's time range (scrolling through a gap seeks to the end of the previous paragraph).

## Testing Strategy

**Python pipeline (pytest):**
- Golden-file tests: run the sentence/paragraph pipeline on 2-3 known transcripts, compare output annotation layers to curated expected output. These transcripts should cover: a clean well-punctuated talk, a messy conversational panel, and a lightning talk with rapid transitions.
- Unit tests for the paragraph algorithm: verify that paragraph breaks only land at sentence boundaries, that pauses below threshold produce no breaks, and that the proximity constraint works correctly.

**TypeScript rendering (vitest):**
- Unit tests for the updated `extractData`: verify hierarchical output from facets, and verify graceful degradation when sentence/paragraph facets are absent (flat word array wrapped in singleton sentence/paragraph).
- Snapshot tests for `computeMonospaceLayout` with paragraph gaps.
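
The graceful-degradation case is easiest to see in code. The following is a minimal sketch of the grouping that `extractData` would need, not its actual implementation; the `Span`, `WordSpan`, and `buildHierarchy` names are hypothetical.

```typescript
interface Span { byteStart: number; byteEnd: number }
interface WordSpan extends Span { text: string }
interface Sentence { words: WordSpan[] }
interface Paragraph { sentences: Sentence[] }

// Fallback container that covers the whole transcript.
const EVERYTHING: Span = { byteStart: 0, byteEnd: Infinity };

function contains(outer: Span, inner: Span): boolean {
  return inner.byteStart >= outer.byteStart && inner.byteEnd <= outer.byteEnd;
}

// Group words into sentences and sentences into paragraphs.
// Missing paragraph facets fall back to one paragraph for the whole text;
// missing sentence facets fall back to one sentence per paragraph.
function buildHierarchy(words: WordSpan[], sentenceSpans: Span[], paragraphSpans: Span[]): Paragraph[] {
  const paragraphs = paragraphSpans.length > 0 ? paragraphSpans : [EVERYTHING];
  const sentences = sentenceSpans.length > 0 ? sentenceSpans : paragraphs;
  return paragraphs.map(p => ({
    sentences: sentences
      .filter(s => contains(p, s))
      .map(s => ({ words: words.filter(w => contains(s, w)) })),
  }));
}
```

With both facet lists empty, the result is a single paragraph holding a single sentence holding every word, which is the flat-array case the unit test should pin down.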

**Manual validation:**
- After running the pipeline on all transcripts, spot-check 5-10 talks across different rooms/talk types. Verify paragraph breaks land at natural topic transitions, not mid-thought. Measure average sentences-per-paragraph (expect 3-8 for well-structured talks).

## Phase Roadmap

### Phase 1 — Structural formatting + layers.pub migration (this work)

- Vendor layers.pub lexicon definitions into `lexicons/pub/layers/`
- Python NLP pipeline: sentence detection (spaCy) + paragraph segmentation (pause + sentence alignment)
- layers.pub expression + segmentation records for each transcript
- Sentence and paragraph annotation layers
- Panproto lenses: compact transcript <-> layers.pub expression + segmentation
- Document assembly reads annotation layers, emits structural facets
- Renderer: sentences as inline spans, paragraphs as block elements
- Remove `tv.ionosphere.annotation` records, `enrich.ts`/`enrich-all.ts`, `overlayAnnotations` — fully replaced by layers.pub
- **Goal:** Transcripts read as paragraphed prose; all enrichment flows through layers.pub

### Phase 2 — Entity recognition + record linking

- spaCy NER pass in the pipeline
- AT Protocol record resolver: people -> Bluesky profiles (DID resolution via handle/display name lookup), projects -> `tv.ionosphere.concept` records
- Concept annotation layer replaces the old `tv.ionosphere.annotation`-based concept system with richer, NLP-derived results
- Entity annotation layer with `knowledgeRefs` to resolved records
- Renderer: entity spans as links/tooltips to profiles and concept pages
- **Goal:** People and projects mentioned in talks are clickable, linked to real AT Protocol identities

### Phase 3 — Topic segmentation

- Sentence-transformer embedding pass (e.g., `all-MiniLM-L6-v2`)
- Sliding-window cosine similarity topic boundary detection
- Topic segment annotation layer
- Renderer: section dividers or topic labels at major transitions
- UI: topic-based navigation within a talk (jump to "Q&A", "Demo", etc.)
- **Goal:** Long talks become navigable by topic
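
One possible reading of "sliding-window cosine similarity" is sketched below: compare the mean embedding of the sentences just before a position with the mean of those just after it, and flag positions where similarity drops under a threshold. The function names, window size, and threshold are assumptions for illustration, and the toy vectors stand in for real sentence-transformer output.

```typescript
// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Element-wise mean of a non-empty list of equal-length vectors.
function meanVector(vectors: number[][]): number[] {
  const out: number[] = [];
  for (let d = 0; d < vectors[0].length; d++) {
    let sum = 0;
    for (const v of vectors) sum += v[d];
    out.push(sum / vectors.length);
  }
  return out;
}

// Returns sentence indices i where the `window` sentences before i and the
// `window` sentences starting at i have similarity below `threshold`.
function topicBoundaries(embeddings: number[][], window: number, threshold: number): number[] {
  const boundaries: number[] = [];
  for (let i = window; i + window <= embeddings.length; i++) {
    const left = meanVector(embeddings.slice(i - window, i));
    const right = meanVector(embeddings.slice(i, i + window));
    if (cosine(left, right) < threshold) boundaries.push(i);
  }
  return boundaries;
}
```

A production version would probably pick local minima rather than every position under the threshold, so that a single transition does not yield several adjacent boundaries.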

### Phase 4 — Manual curation layer

- UI for creating manual override annotations (correct a concept link, fix an entity, adjust a paragraph break)
- Published as AT Protocol records with `sourceMethod: "manual-native"`, higher `rank`
- Pipeline respects overrides on re-run
- Multi-user: anyone with write access can contribute corrections
- **Goal:** Community-curated enrichment that improves over time
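
How overrides win is not specified above, so the sketch below assumes the simplest rule: within one layer, the annotation with the higher `rank` is kept and any lower-ranked annotation overlapping it is dropped. The `RankedAnnotation` shape and the numeric `rank` values are hypothetical.

```typescript
interface RankedAnnotation {
  byteStart: number;
  byteEnd: number;
  label: string;
  sourceMethod: string; // e.g. "automatic" or "manual-native"
  rank: number;         // assumed: larger number means higher priority
}

// Keep the highest-ranked annotation for each region; drop overlapping lower-ranked ones.
function resolveOverrides(annotations: RankedAnnotation[]): RankedAnnotation[] {
  const byRank = annotations.slice().sort((a, b) => b.rank - a.rank);
  const kept: RankedAnnotation[] = [];
  for (const candidate of byRank) {
    const overlapsKept = kept.some(
      k => candidate.byteStart < k.byteEnd && k.byteStart < candidate.byteEnd,
    );
    if (!overlapsKept) kept.push(candidate);
  }
  return kept.sort((a, b) => a.byteStart - b.byteStart);
}
```

Applying this after every pipeline re-run would let regenerated automatic annotations coexist with manual corrections without overwriting them.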

### Phase 5 — Concept enrichment + cross-talk linking

- Supersede auto-detected concepts with curated concept records
- Cross-reference talks that mention the same entities/concepts
- `tv.ionosphere.facet#talk-xref` links between related talks
- Knowledge graph across the entire conference
- **Goal:** The archive becomes a connected knowledge base, not just isolated transcripts

···

# layers.pub Record Publishing + Panproto Lenses — Pre-Plan

**Status:** Pre-plan for next session
**Depends on:** Phases 1-3 enrichment (complete), concept deduplication (complete)

## Context

The NLP enrichment pipeline produces sentence/paragraph/entity/topic annotations that are currently stored as ionosphere facets in pre-assembled RelationalText documents. The layers.pub lexicons are vendored (`lexicons/pub/layers/`) but no actual AT Protocol records are published.

This work makes the enrichment data first-class AT Protocol records — publishable, indexable, and interoperable with the broader layers.pub ecosystem.

## What Exists

### Vendored lexicons (in `lexicons/pub/layers/`)
- `defs.json` — shared types: span, temporalSpan, uuid, tokenRef, anchor, annotationMetadata, feature/featureMap
- `expression/expression.json` — record: id, kind, text, language, sourceRef, parentRef, anchor, metadata
- `segmentation/segmentation.json` — record: expression ref, tokenizations (with textSpan + temporalSpan per token)
- `annotation/annotationLayer.json` — record: expression ref, kind/subkind, sourceMethod, annotations array

### Existing panproto integration (`formats/tv.ionosphere/ts/panproto.ts`)
- WASM-based runtime (lazy singleton init)
- `loadSchema()` — parse lexicon JSON into BuiltSchema
- `buildMigration()` — explicit migration between schemas
- `createLens()` — auto-generated lens between schemas
- `autoGenerateWithHints()` — protolens chain with morphism hints
- `createPipeline()` — PipelineBuilder for combinator transforms
- `serializeChain()` / `serializeMigrationSpec()` — serialization for storage

### Existing lenses (`formats/tv.ionosphere/lenses/`)
- `openai-whisper-to-transcript.lens.json`
- `transcript-to-document.lens.json`
- `schedule-to-talk.lens.json`
- `vod-to-talk.lens.json`

## What Needs to Be Built

### 1. Publish layers.pub records to PDS

For each transcript, publish:

**Expression record** (`pub.layers.expression.expression`):
- `kind: "transcript"`
- `text`: the transcript text
- `language: "en"`
- `sourceRef`: AT URI of the `tv.ionosphere.transcript` record
- `createdAt`: timestamp

**Segmentation record** (`pub.layers.segmentation.segmentation`):
- `expression`: AT URI of the expression record above
- One tokenization with `kind: "whitespace"`
- Each token has `textSpan` (byteStart/byteEnd) and `temporalSpan` (start/ending in ms)
- Derived from the compact transcript's timing data

**Annotation layers** (`pub.layers.annotation.annotationLayer`):
- **Sentence layer**: `kind: "span"`, `subkind: "sentence-boundary"`, `sourceMethod: "automatic"`
- **Paragraph layer**: `kind: "span"`, `subkind: "paragraph-boundary"`, `sourceMethod: "automatic"`
- **Entity layer**: `kind: "span"`, `subkind: "ner"`, `sourceMethod: "automatic"`, with `knowledgeRefs` for resolved entities
- **Topic layer**: `kind: "span"`, `subkind: "topic-segment"`, `sourceMethod: "automatic"`
- Each layer references the expression record and includes `metadata` (tool, confidence, timestamp)

### 2. Panproto lenses

**Lens: compact transcript → layers.pub expression + segmentation**
- Source: `tv.ionosphere.transcript` (text, startMs, timings)
- Target: `pub.layers.expression.expression` + `pub.layers.segmentation.segmentation`
- This is the transform that `decodeToDocument` / `encode` already implement in code — the lens formalizes it

**Lens: NLP annotations → layers.pub annotation layers**
- Source: NLP pipeline JSON output (sentences, paragraphs, entities, topicBreaks)
- Target: `pub.layers.annotation.annotationLayer` records
- Mostly structural mapping — the NLP output already has byte ranges and labels

**Lens: layers.pub → ionosphere document facets**
- Source: layers.pub records (expression + segmentation + annotation layers)
- Target: RelationalText document with ionosphere facets (#timestamp, #sentence, #paragraph, etc.)
- This runs in the opposite direction to the two lenses above: it reads layers.pub records and emits the facets that `decodeToDocumentWithStructure` currently produces in code

### 3. Appview indexer updates

The appview needs to index layers.pub records from Jetstream:
- Add `pub.layers.expression.expression`, `pub.layers.segmentation.segmentation`, `pub.layers.annotation.annotationLayer` to `IONOSPHERE_COLLECTIONS`
- Create DB tables for these records
- On indexing, rebuild the pre-assembled document from the layers.pub records (using the lens)

### 4. Schema versioning

- Initialize panproto VCS for the layers.pub schemas
- Pin to layers.pub v0.5.0 (current vendored version)
- Define migration strategy for when layers.pub evolves

## Architecture Decision: Build-Time vs Runtime

Currently, document assembly happens at **build time** (publish.ts) and the assembled document is stored on the talk record. With layers.pub records, the assembly could move to **runtime** (appview reads layers.pub records and assembles on the fly).

**Recommendation:** Keep build-time assembly for the ionosphere document (fast serving), AND publish layers.pub records alongside (for interoperability). The layers.pub records are the canonical source; the ionosphere document is a materialized view.

## Suggested Task Order

1. Write the publish step for layers.pub expression + segmentation records
2. Write the publish step for annotation layer records
3. Define panproto lens: compact transcript → expression + segmentation
4. Define panproto lens: NLP annotations → annotation layers
5. Update appview indexer to handle layers.pub records
6. Initialize panproto VCS, tag v0.5.0
7. Test round-trip: publish → index → serve → verify in browser
8. Define panproto lens: layers.pub → ionosphere document facets (reverse lens)

## Questions for the Session

- Should we publish layers.pub records under the ionosphere.tv DID, or a separate account?
- How should the appview discover which annotation layers belong to a given transcript? (By expression URI reference? By convention?)
- Do we want to support third-party annotation layers from other DIDs? (e.g., someone else annotating our transcripts)
- Should the panproto lenses be published as `org.relationaltext.lens` records (like the existing ones)?

···

# layers.pub Record Publishing via Panproto Lenses — Design

**Status:** Approved
**Date:** 2026-04-13
**Depends on:** NLP enrichment pipeline (complete), concept deduplication (complete)

## Overview

Publish enrichment data as first-class AT Protocol records using the layers.pub schema. Panproto lenses are the authoritative transforms — all data flows through lenses, no parallel TypeScript pipelines. The existing ionosphere document (text + facets embedded on the talk record) becomes a materialized view rebuilt from layers.pub records.

## Decisions

- **DID:** Publish under ionosphere.tv
- **Annotation layer references:** Point to the transcript's AT URI (`tv.ionosphere.transcript`)
- **Third-party layers:** Not supported yet (no moderation). Comments and reactions are unaffected.
- **Lens namespace:** `org.relationaltext.lens` (consistent with existing 4 lenses)
- **Record keys:** Deterministic, derived from talk rkey (e.g., `{talk.rkey}-expression`)
- **Expression kind:** `kind: "transcript"` (no `kindUri` — it requires AT URI format, and there's no meaningful record to reference)
- **Segmentation tokenization:** `kind: "word"` — carries the temporal mapping; annotations use `anchor.textSpan` (byte offsets) directly
- **Publish ordering:** Expression URI pre-computed from DID + rkey; all records publish in parallel
- **Architecture:** Lenses-first (Approach B) — lenses are authoritative, publish step runs data through lenses

## Section 1: Record Model & Relationships

For each transcript, 6 new records published under ionosphere.tv:

```
tv.ionosphere.transcript/{talk.rkey}-transcript  (already exists)
    │
    ▼ sourceRef
pub.layers.expression.expression/{talk.rkey}-expression
    │
    ├── pub.layers.segmentation.segmentation/{talk.rkey}-segmentation
    │     └── tokenization: kind "word", tokens with textSpan + temporalSpan
    │
    ├── pub.layers.annotation.annotationLayer/{talk.rkey}-sentences
    ├── pub.layers.annotation.annotationLayer/{talk.rkey}-paragraphs
    ├── pub.layers.annotation.annotationLayer/{talk.rkey}-entities
    └── pub.layers.annotation.annotationLayer/{talk.rkey}-topics
```

Example AT URI: `at://did:plc:xxxxxx/pub.layers.expression.expression/atproto-for-everyone-expression`
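
Because the rkeys are deterministic, every URI can be computed before anything is published. A minimal sketch, assuming hypothetical helper names `atUri` and `recordUrisForTalk`:

```typescript
// The six records per talk, keyed by rkey suffix (taken from the diagram above).
const LAYER_RECORDS: { suffix: string; collection: string }[] = [
  { suffix: "expression", collection: "pub.layers.expression.expression" },
  { suffix: "segmentation", collection: "pub.layers.segmentation.segmentation" },
  { suffix: "sentences", collection: "pub.layers.annotation.annotationLayer" },
  { suffix: "paragraphs", collection: "pub.layers.annotation.annotationLayer" },
  { suffix: "entities", collection: "pub.layers.annotation.annotationLayer" },
  { suffix: "topics", collection: "pub.layers.annotation.annotationLayer" },
];

// at://{did}/{collection}/{rkey}
function atUri(did: string, collection: string, rkey: string): string {
  return `at://${did}/${collection}/${rkey}`;
}

// Pre-compute the AT URI of every layers.pub record for one talk.
function recordUrisForTalk(did: string, talkRkey: string): { [suffix: string]: string } {
  const uris: { [suffix: string]: string } = {};
  for (const record of LAYER_RECORDS) {
    uris[record.suffix] = atUri(did, record.collection, `${talkRkey}-${record.suffix}`);
  }
  return uris;
}
```

Since the expression URI is known up front, the segmentation and annotation layers can reference it without waiting for the expression write to complete.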

> **Future work:** A `-speakers` annotation layer (`subkind: "speaker-segment"`) for diarization spans.
> The NLP pipeline does not yet produce `speakerSegments` data, so this layer is deferred until the
> diarization pipeline is integrated.

### Expression record

| Field | Value |
|---|---|
| `id` | talk rkey |
| `$type` | `"pub.layers.expression.expression"` |
| `kind` | `"transcript"` |
| `text` | full transcript text |
| `language` | `"en"` |
| `sourceRef` | AT URI of `tv.ionosphere.transcript` record |
| `metadata` | `{ tool: "ionosphere-pipeline", timestamp: "<ISO 8601 datetime>" }` |
| `createdAt` | ISO 8601 timestamp |

### Segmentation record

| Field | Value |
|---|---|
| `expression` | AT URI of expression (pre-computed) |
| `tokenizations` | Single tokenization, `kind: "word"` |
| `createdAt` | ISO 8601 timestamp |

Each token: `tokenIndex` (0-based), `text` (word), `textSpan` (UTF-8 byte offsets), `temporalSpan` (start/ending in ms, derived from compact transcript timings).
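
The token list could be derived along the following lines. The encoding of the compact transcript's `timings` array is not described here, so this sketch assumes one duration in milliseconds per whitespace-delimited word, played back contiguously from `startMs`; the real format may differ, and `buildTokens` is a hypothetical name.

```typescript
interface Token {
  tokenIndex: number;
  text: string;
  textSpan: { byteStart: number; byteEnd: number };
  temporalSpan: { start: number; ending: number };
}

// UTF-8 byte length of a JavaScript (UTF-16) string.
function utf8Length(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

// ASSUMPTION: durationsMs[i] is the duration of the i-th word, with no pauses.
function buildTokens(text: string, startMs: number, durationsMs: number[]): Token[] {
  const tokens: Token[] = [];
  const wordPattern = /\S+/g;
  let match: RegExpExecArray | null;
  let charCursor = 0;
  let byteCursor = 0;
  let clock = startMs;
  while ((match = wordPattern.exec(text)) !== null) {
    byteCursor += utf8Length(text.slice(charCursor, match.index)); // skip whitespace
    const byteStart = byteCursor;
    byteCursor += utf8Length(match[0]);
    charCursor = match.index + match[0].length;
    const ending = clock + (durationsMs[tokens.length] ?? 0);
    tokens.push({
      tokenIndex: tokens.length,
      text: match[0],
      textSpan: { byteStart, byteEnd: byteCursor },
      temporalSpan: { start: clock, ending },
    });
    clock = ending;
  }
  return tokens;
}
```

Offsets are tracked in bytes rather than string indices because `textSpan` is defined over UTF-8, so any non-ASCII word shifts every span after it.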

### Annotation layers

All records include `$type: "pub.layers.annotation.annotationLayer"`. All reference the expression URI. All use `sourceMethod: "automatic"`. `createdAt` is optional in the lexicon but always included.

| rkey suffix | kind | subkind | annotations |
|---|---|---|---|
| `-sentences` | `"span"` | `"sentence-boundary"` | One annotation per sentence. `anchor: { textSpan: { byteStart, byteEnd } }`, `label`: first ~50 chars of sentence text. |
| `-paragraphs` | `"span"` | `"paragraph-boundary"` | One annotation per paragraph. `anchor: { textSpan: { byteStart, byteEnd } }`, `label`: `"paragraph"`. |
| `-entities` | `"span"` | `"ner"` | One annotation per entity mention. `anchor: { textSpan: { byteStart, byteEnd } }`, `label`: entity name. `features: { entries: [{ key: "nerType", value: "PERSON" }, { key: "conceptUri", value: "at://..." }] }` — `nerType` always present, `conceptUri` if resolved, plus any future entity keys (e.g., `speakerDid`) forwarded as passthrough. |
| `-topics` | `"span"` | `"topic-segment"` | One annotation per topic break. `anchor: { textSpan: { byteStart, byteEnd } }` where `byteEnd = byteStart` (zero-width span). `label`: `"topic-break"`. |

`metadata` on all layers: `{ tool: "ionosphere-nlp-pipeline", timestamp: "<ISO 8601 datetime>" }`.
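
To make the `-entities` row concrete, here is a sketch that builds the entity layer as a plain object. The output fields follow the table above; the input `EntityMention` shape and the `buildEntityLayer` name are hypothetical, and in the design itself this mapping is performed by Lens 2 rather than hand-written code.

```typescript
interface EntityMention {
  byteStart: number;
  byteEnd: number;
  name: string;
  nerType: string;
  conceptUri?: string; // present only when the entity was resolved
}

function buildEntityLayer(expressionUri: string, entities: EntityMention[], timestamp: string) {
  return {
    $type: "pub.layers.annotation.annotationLayer",
    expression: expressionUri,
    kind: "span",
    subkind: "ner",
    sourceMethod: "automatic",
    metadata: { tool: "ionosphere-nlp-pipeline", timestamp },
    createdAt: timestamp,
    annotations: entities.map(entity => {
      // nerType is always present; conceptUri only if resolved.
      const entries: { key: string; value: string }[] = [{ key: "nerType", value: entity.nerType }];
      if (entity.conceptUri !== undefined) entries.push({ key: "conceptUri", value: entity.conceptUri });
      return {
        anchor: { textSpan: { byteStart: entity.byteStart, byteEnd: entity.byteEnd } },
        label: entity.name,
        features: { entries },
      };
    }),
  };
}
```

The other three layers differ only in `subkind`, `label`, and the absence of `features`, so the same structure applies.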

## Section 2: Panproto Lens Architecture

Three lenses. All authoritative — data flows through them.

### Lens 1: Compact Transcript → Expression + Segmentation

- **Source:** `tv.ionosphere.transcript` (text, startMs, timings)
- **Target:** `pub.layers.expression.expression` + `pub.layers.segmentation.segmentation`
- **Transform:** maps text → text, injects kind/language, replays timings array to produce token list with textSpan + temporalSpan
- **Fan-out:** Single source → two target records (protolens chain)
- **Morphism hints:** `text → text`, `startMs + timings → tokenizations[0].tokens`

### Lens 2: NLP Annotations → Annotation Layers

- **Source:** `tv.ionosphere.nlpAnnotations` (new lexicon, not published to PDS — exists as lens source schema)
- **Target:** 4× `pub.layers.annotation.annotationLayer`
- **Transform:** Byte ranges → `anchor.textSpan`, labels → `annotation.label`, entity metadata → `features`
- **Fan-out:** Single source → four target records

The `tv.ionosphere.nlpAnnotations` lexicon formalizes the NlpAnnotations TypeScript interface as a proper schema that panproto can parse and validate.

### Lens 3: Layers.pub → Ionosphere Document Facets (reverse)

- **Source:** expression + segmentation + annotation layers
- **Target:** RelationalText document with ionosphere facets (`{ text, facets }`)
- **Purpose:** Materialized view builder, used by appview indexer
- **Round-trip property:** Lens 1+2 followed by Lens 3 should reproduce the same document that `decodeToDocumentWithStructure` currently produces. This is the correctness test.

### Lens storage

Published as `org.relationaltext.lens` records (consistent with existing 4 lenses). Rkeys: `transcript-to-expression`, `nlp-to-annotation-layers`, `layers-to-document`.

## Section 3: Publish Pipeline

New Stage 6 in `publish.ts`, runs after transcript publishing.

For each talk with both transcript and NLP annotation files:

1. Load compact transcript (`transcripts/{rkey}.json`)
2. Load NLP annotations (`nlp/{rkey}.json`)
3. Run **Lens 1** via panproto WASM → expression + segmentation records
4. Run **Lens 2** via panproto WASM → 4 annotation layer records
5. Inject AT URIs (pre-computed from DID + rkey): `sourceRef` on expression, `expression` ref on segmentation and all annotation layers
6. Publish all 6 records via `PdsClient.putRecord()` (parallel within each talk, sequential across talks)

**Idempotency:** Deterministic rkeys mean re-publish overwrites, no duplicates.

**Rate limiting:** ~98 talks × 6 records = ~588 putRecord calls. Records within a single talk publish in parallel (6 concurrent). Talks are processed sequentially. The existing PdsClient writeDelay (100ms) and 429/backoff handling apply.

**Skip condition:** If a talk has transcript but no NLP annotations, skip layers.pub entirely for that talk (keep as a unit).

**WASM lifecycle:** Panproto runtime initializes once (lazy singleton), schemas load once at stage start, each talk runs data through compiled lenses.
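
The skip condition, the call count, and the idempotency claim can be checked with a small in-memory model. This sketch replaces the PDS with a plain object keyed by rkey and leaves out the lenses and the network entirely; `TalkFiles`, `recordKeysToPublish`, and `publishAll` are hypothetical names.

```typescript
interface TalkFiles { rkey: string; hasTranscript: boolean; hasNlp: boolean }

const RECORD_SUFFIXES = ["expression", "segmentation", "sentences", "paragraphs", "entities", "topics"];

// A talk is published as a unit: all six records, or none.
function recordKeysToPublish(talk: TalkFiles): string[] {
  if (!talk.hasTranscript || !talk.hasNlp) return [];
  return RECORD_SUFFIXES.map(suffix => `${talk.rkey}-${suffix}`);
}

// Stand-in for the publish loop: `store` maps rkey to the number of times it was written.
// Returns the number of putRecord-style writes performed.
function publishAll(talks: TalkFiles[], store: { [rkey: string]: number }): number {
  let calls = 0;
  for (const talk of talks) {
    for (const key of recordKeysToPublish(talk)) {
      store[key] = (store[key] || 0) + 1; // same key on re-run: overwrite, not a new record
      calls++;
    }
  }
  return calls;
}
```

Running the loop twice doubles the write count but leaves the number of distinct keys unchanged, which is the property the deterministic rkeys are meant to guarantee.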

## Section 4: Appview Indexer Updates

### New Jetstream subscriptions

Add to `IONOSPHERE_COLLECTIONS`:
- `pub.layers.expression.expression`
- `pub.layers.segmentation.segmentation`
- `pub.layers.annotation.annotationLayer`

**DID filter:** Only index records from ionosphere.tv DID (no third-party layers).

### New DB tables

```sql
CREATE TABLE layers_expressions (
  uri TEXT PRIMARY KEY, rkey TEXT, did TEXT, transcript_uri TEXT,
  text TEXT, language TEXT, created_at TEXT
  -- uri IS the expression URI; other tables reference it via expression_uri
);

CREATE TABLE layers_segmentations (
  rkey TEXT, did TEXT, expression_uri TEXT,
  tokens_json TEXT, created_at TEXT
);

CREATE TABLE layers_annotations (
  rkey TEXT, did TEXT, expression_uri TEXT,
  kind TEXT, subkind TEXT, annotations_json TEXT, created_at TEXT
);
```

Key indexes: `expression_uri` (find all layers for expression), `transcript_uri` (find expression for transcript).

### Document rebuild on ingest

When any layers.pub record arrives or updates:

1. Look up the expression URI → transcript URI
2. Check for complete set: expression + segmentation + at least one annotation layer
3. If complete, run **Lens 3** (layers.pub → ionosphere document facets) to produce materialized `{ text, facets }`
4. Update the talk record's `document` field in DB

**Deletion:** If an annotation layer is deleted, rebuild with remaining layers (graceful degradation — fewer annotations). If the expression record is deleted, cascade-delete its segmentation and annotation rows from the DB and clear the talk's materialized document. Annotation/segmentation JSON is stored as TEXT blobs — all queries are by expression URI, not individual annotations, so this is sufficient.
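
The completeness check and the expression cascade might be modelled as below. This is an in-memory sketch under assumed names (`LayerIndex`, `isComplete`, `deleteExpression`); the real indexer would run the equivalent queries against the three tables.

```typescript
// In-memory stand-in for the three tables, keyed by expression URI.
interface LayerIndex {
  expressions: { [expressionUri: string]: { transcriptUri: string } };
  segmentations: { [expressionUri: string]: true };
  annotationSubkinds: { [expressionUri: string]: string[] };
}

// Step 2: expression + segmentation + at least one annotation layer.
function isComplete(index: LayerIndex, expressionUri: string): boolean {
  const hasExpression = index.expressions[expressionUri] !== undefined;
  const hasSegmentation = index.segmentations[expressionUri] === true;
  const layers = index.annotationSubkinds[expressionUri] || [];
  return hasExpression && hasSegmentation && layers.length > 0;
}

// Expression deleted: cascade to dependent rows and report which talk's
// materialized document the caller should clear.
function deleteExpression(index: LayerIndex, expressionUri: string): string | undefined {
  const expression = index.expressions[expressionUri];
  delete index.expressions[expressionUri];
  delete index.segmentations[expressionUri];
  delete index.annotationSubkinds[expressionUri];
  return expression ? expression.transcriptUri : undefined;
}
```

Because records for one talk publish in parallel, they can arrive in any order; re-running `isComplete` on every ingest means the rebuild fires as soon as the last required record lands.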

**Backfill:** Add three new collections to existing startup backfill loop; records go through same ingest → rebuild path.

## Section 5: Schema Versioning

### Panproto VCS initialization

- `schema init` in project
- Add layers.pub lexicons + ionosphere lexicons + NLP annotations lexicon + all lens definitions
- `schema commit` initial state
- `schema tag v0.5.0` — pin to current vendored layers.pub version

### What's tracked

- `lexicons/pub/layers/*.json`
- `formats/tv.ionosphere/ionosphere.lexicon.json`
- `formats/tv.ionosphere/nlpAnnotations.lexicon.json` (new)
- `formats/tv.ionosphere/lenses/*.lens.json` (existing 4 + new 3)

### Migration strategy

When layers.pub evolves: vendor updated lexicons, `schema diff`, update lenses if needed. The materialized view insulates the frontend — layers.pub can change without affecting ionosphere document format until we choose to update.

## Task Order

1. Define `tv.ionosphere.nlpAnnotations` lexicon (lens source schema)
2. Define Lens 1: compact transcript → expression + segmentation
3. Define Lens 2: NLP annotations → 4 annotation layers
4. Add publish Stage 6: run lenses, publish 6 records per talk
5. Define Lens 3: layers.pub → ionosphere document facets (reverse)
6. Add appview DB tables + indexer for layers.pub collections (depends on 5)
7. Wire indexer rebuild: on layers.pub ingest, run Lens 3, update talk document
8. Initialize panproto VCS, tag v0.5.0
9. Test round-trip: publish → index → rebuild → verify document matches current output
10. Deploy