docs: sharpen content extraction notes · zzstoatzz.io/leaflet-search@c2113e0

+33 -51

1 changed file

docs/content-extraction.md

···
 
 > The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.
 
-short answer: yes, and it's messier than the spec suggests.
+short answer: yes. but once you handle `content.pages` extraction, it's straightforward.
 
-## what the spec promises
+## textContent: platform-dependent
 
-`site.standard.document` has a `textContent` field that should contain pre-flattened plaintext:
+`site.standard.document` has a `textContent` field for pre-flattened plaintext:
 
 ```json
 {
···
 }
 ```
 
-in theory, you just use `textContent` and ignore the nested `content` structure.
+**pckt, offprint, greengale** populate `textContent`. extraction is trivial.
+
+**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.
 
-## what we actually found
+## extraction strategy
 
-### pckt, offprint, greengale: textContent works
+priority order (in `extractor.zig`):
 
-these platforms properly populate `textContent`. extraction is simple:
+1. `textContent` - use if present
+2. `pages` - top-level blocks (pub.leaflet.document)
+3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)
 
 ```zig
+// try textContent first
 if (zat.json.getString(record, "textContent")) |text| {
-    return text; // done
+    return text;
 }
-```
 
-### leaflet: textContent is often empty
-
-when `site.standard.document` has `content.$type: "pub.leaflet.content"`, the `textContent` field is frequently empty. the actual content lives in:
-
-```
-content.pages[].blocks[].block.plaintext
-```
-
-our extraction priority (in `extractor.zig`):
-
-1. `textContent` - use if present (ideal case)
-2. `pages` - parse blocks at top level (pub.leaflet.document)
-3. `content.pages` - parse blocks nested in content (site.standard.document with pub.leaflet.content)
-
-### sometimes you need a PDS fetch
-
-when we receive a `site.standard.document` with embedded `pub.leaflet.content` but no parseable content, we fetch the corresponding `pub.leaflet.document` from the user's PDS:
-
-```zig
-// in tap.zig
-if (doc.content.len == 0 and collection == STANDARD_DOCUMENT) {
-    if (content_type == "pub.leaflet.content") {
-        // fetch pub.leaflet.document from PDS to get actual content
-        const leaflet_content = fetchLeafletContent(allocator, did, rkey);
-        // ...
-    }
-}
+// fall back to block parsing
+const pages = zat.json.getArray(record, "pages") orelse
+    zat.json.getArray(record, "content.pages");
 ```
 
-this adds latency but ensures we don't miss content.
+the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.
 
 ## deduplication
 
-the same document can appear in both collections with identical `(did, rkey)`:
+documents can appear in both collections with identical `(did, rkey)`:
 - `site.standard.document`
 - `pub.leaflet.document`
 
-we dedupe on insert - if a record with the same `(did, rkey)` exists under a different URI, we delete the old one first.
+handle with `ON CONFLICT`:
+
+```sql
+INSERT INTO documents (uri, ...)
+ON CONFLICT(uri) DO UPDATE SET ...
+```
+
+note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.
 
 ## platform detection
 
-collection name alone doesn't tell you the platform for `site.standard.*` records. we infer from publication `basePath`:
+collection name doesn't indicate platform for `site.standard.*` records. infer from publication `basePath`:
 
 | basePath contains | platform |
 |-------------------|----------|
···
 | `pckt.blog` | pckt |
 | `offprint.app` | offprint |
 | `greengale.app` | greengale |
-| (none of the above) | other |
+| (none) | other |
 
-## implications for implementers
+## summary
 
-**if you're building a renderer**: you need to understand each platform's block structure anyway, so indexing platform-specific collections (`pub.leaflet.document`, etc.) might be simpler.
-
-**if you're building search/backlinks/RSS**: `site.standard.document` gives you a unified envelope, but:
-- check `textContent` first
-- fall back to parsing `content.pages` for leaflet
-- consider fetching from PDS as last resort
-
-**the wrapper pattern is useful** for cross-platform tools, but the reality is messier than "just use textContent."
+- **pckt/offprint/greengale**: use `textContent` directly
+- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
+- **deduplication**: `ON CONFLICT` on `(did, rkey)` or `uri`
+- **platform**: infer from publication basePath, not collection name
 
 ## code references
 
 - `backend/src/extractor.zig` - content extraction logic
-- `backend/src/tap.zig:232-244` - PDS fetch fallback
 - `backend/src/indexer.zig:99-112` - platform detection from basePath
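
The three-step extraction priority in the updated doc (`textContent`, then top-level `pages`, then `content.pages`) can be sketched end to end. This is a rough illustration, not the repository's `extractor.zig`: it uses `std.json` from the Zig standard library instead of the project's `zat.json` helpers, and it assumes Zig 0.14 or later for the `.empty` initializer. The names `field`, `asString`, `asArray`, and `extractText` are hypothetical. Treating an empty `textContent` string the same as a missing one, and joining blocks with newlines, are assumptions of this sketch.

```zig
const std = @import("std");
const Value = std.json.Value;

/// Look up `key` when `v` is a JSON object; null for any other value kind.
fn field(v: Value, key: []const u8) ?Value {
    return switch (v) {
        .object => |obj| obj.get(key),
        else => null,
    };
}

/// Unwrap a JSON string, or null if the value is absent or not a string.
fn asString(v: ?Value) ?[]const u8 {
    const val = v orelse return null;
    return switch (val) {
        .string => |s| s,
        else => null,
    };
}

/// Unwrap a JSON array, or null if the value is absent or not an array.
fn asArray(v: ?Value) ?[]const Value {
    const val = v orelse return null;
    return switch (val) {
        .array => |a| a.items,
        else => null,
    };
}

/// Hypothetical extractor following the documented priority order.
/// Returns caller-owned text, or null when nothing could be extracted.
fn extractText(allocator: std.mem.Allocator, record: Value) !?[]u8 {
    // 1. textContent (assumption: an empty string counts as absent)
    if (asString(field(record, "textContent"))) |text| {
        if (text.len > 0) return try allocator.dupe(u8, text);
    }

    // 2. top-level pages, else 3. content.pages
    const pages = asArray(field(record, "pages")) orelse blk: {
        const content = field(record, "content") orelse return null;
        break :blk asArray(field(content, "pages")) orelse return null;
    };

    var out: std.ArrayListUnmanaged(u8) = .empty;
    defer out.deinit(allocator); // safe: toOwnedSlice leaves the list empty

    // walk pages[].blocks[].block.plaintext, joining blocks with newlines
    for (pages) |page| {
        const blocks = asArray(field(page, "blocks")) orelse continue;
        for (blocks) |wrapper| {
            const block = field(wrapper, "block") orelse continue;
            const text = asString(field(block, "plaintext")) orelse continue;
            if (out.items.len > 0) try out.append(allocator, '\n');
            try out.appendSlice(allocator, text);
        }
    }

    if (out.items.len == 0) return null;
    return try out.toOwnedSlice(allocator);
}

test "leaflet-style record without textContent" {
    const a = std.testing.allocator;
    const src =
        \\{"content":{"$type":"pub.leaflet.content","pages":[{"blocks":[{"block":{"plaintext":"hello"}},{"block":{"plaintext":"world"}}]}]}}
    ;
    const parsed = try std.json.parseFromSlice(Value, a, src, .{});
    defer parsed.deinit();

    const text = (try extractText(a, parsed.value)) orelse return error.NoText;
    defer a.free(text);
    try std.testing.expectEqualStrings("hello\nworld", text);
}
```

Saved to a file, the sketch can be exercised with `zig test`. Blocks without a `plaintext` field are skipped rather than treated as errors, on the assumption that not every leaflet block type carries text.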
