search for standard sites pub-search.waow.tech
search zig blog atproto
11
fork

Configure Feed

Select the types of activity you want to include in your feed.

docs: add content extraction learnings, reorganize

- add content-extraction.md documenting site.standard.document gotchas
- links to eli mallon's question about the content field wrapper pattern
- move old planning docs to docs/scratch/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

zzstoatzz 0469d2f4 e1202722

+108
+108
docs/content-extraction.md
··· 1 + # content extraction for site.standard.document 2 + 3 + lessons learned from implementing cross-platform content extraction. 4 + 5 + ## the problem 6 + 7 + [eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y): 8 + 9 + > The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint. 10 + 11 + short answer: yes, and it's messier than the spec suggests. 12 + 13 + ## what the spec promises 14 + 15 + `site.standard.document` has a `textContent` field that should contain pre-flattened plaintext: 16 + 17 + ```json 18 + { 19 + "title": "my post", 20 + "textContent": "the full text content, ready for indexing...", 21 + "content": { 22 + "$type": "blog.pckt.content", 23 + "items": [ /* platform-specific blocks */ ] 24 + } 25 + } 26 + ``` 27 + 28 + in theory, you just use `textContent` and ignore the nested `content` structure. 29 + 30 + ## what we actually found 31 + 32 + ### pckt, offprint, greengale: textContent works 33 + 34 + these platforms properly populate `textContent`. extraction is simple: 35 + 36 + ```zig 37 + if (zat.json.getString(record, "textContent")) |text| { 38 + return text; // done 39 + } 40 + ``` 41 + 42 + ### leaflet: textContent is often empty 43 + 44 + when `site.standard.document` has `content.$type: "pub.leaflet.content"`, the `textContent` field is frequently empty. the actual content lives in: 45 + 46 + ``` 47 + content.pages[].blocks[].block.plaintext 48 + ``` 49 + 50 + our extraction priority (in `extractor.zig`): 51 + 52 + 1. `textContent` - use if present (ideal case) 53 + 2. `pages` - parse blocks at top level (pub.leaflet.document) 54 + 3. `content.pages` - parse blocks nested in content (site.standard.document with pub.leaflet.content) 55 + 56 + ### sometimes you need a PDS fetch 57 + 58 + when we receive a `site.standard.document` with embedded `pub.leaflet.content` but no parseable content, we fetch the corresponding `pub.leaflet.document` from the user's PDS: 59 + 60 + ```zig 61 + // in tap.zig 62 + if (doc.content.len == 0 and collection == STANDARD_DOCUMENT) { 63 + if (content_type == "pub.leaflet.content") { 64 + // fetch pub.leaflet.document from PDS to get actual content 65 + const leaflet_content = fetchLeafletContent(allocator, did, rkey); 66 + // ... 67 + } 68 + } 69 + ``` 70 + 71 + this adds latency but ensures we don't miss content. 72 + 73 + ## deduplication 74 + 75 + the same document can appear in both collections with identical `(did, rkey)`: 76 + - `site.standard.document` 77 + - `pub.leaflet.document` 78 + 79 + we dedupe on insert - if a record with the same `(did, rkey)` exists under a different URI, we delete the old one first. 80 + 81 + ## platform detection 82 + 83 + collection name alone doesn't tell you the platform for `site.standard.*` records. we infer from publication `basePath`: 84 + 85 + | basePath contains | platform | 86 + |-------------------|----------| 87 + | `leaflet.pub` | leaflet | 88 + | `pckt.blog` | pckt | 89 + | `offprint.app` | offprint | 90 + | `greengale.app` | greengale | 91 + | (none of the above) | other | 92 + 93 + ## implications for implementers 94 + 95 + **if you're building a renderer**: you need to understand each platform's block structure anyway, so indexing platform-specific collections (`pub.leaflet.document`, etc.) might be simpler. 96 + 97 + **if you're building search/backlinks/RSS**: `site.standard.document` gives you a unified envelope, but: 98 + - check `textContent` first 99 + - fall back to parsing `content.pages` for leaflet 100 + - consider fetching from PDS as last resort 101 + 102 + **the wrapper pattern is useful** for cross-platform tools, but the reality is messier than "just use textContent." 103 + 104 + ## code references 105 + 106 + - `backend/src/extractor.zig` - content extraction logic 107 + - `backend/src/tap.zig:232-244` - PDS fetch fallback 108 + - `backend/src/indexer.zig:99-112` - platform detection from basePath
docs/leaflet-publishing-plan.md docs/scratch/leaflet-publishing-plan.md
docs/standard-search-planning.md docs/scratch/standard-search-planning.md