# content extraction for site.standard.document

lessons learned from implementing cross-platform content extraction.

## the problem

[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):

> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.

short answer: yes. but once you handle `content.pages` extraction, it's straightforward.

## textContent: platform-dependent

`site.standard.document` has a `textContent` field for pre-flattened plaintext:

```json
{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}
```

**pckt, offprint, greengale** populate `textContent`. extraction is trivial.

**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.

## extraction strategy

priority order (in `extractor.zig`):

1. `textContent` - use if present
2. `pages` - top-level blocks (pub.leaflet.document)
3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)

```zig
// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");
```

the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.
## collection deduplication

documents can appear in both collections with identical `(did, rkey)`:
- `site.standard.document`
- `pub.leaflet.document`

handle with `ON CONFLICT`:

```sql
INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...
```

note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.

## platform detection

collection name doesn't indicate platform for `site.standard.*` records. detection order:

1. **basePath** - infer from publication basePath:

| basePath contains | platform |
|-------------------|----------|
| `leaflet.pub` | leaflet |
| `pckt.blog` | pckt |
| `offprint.app` | offprint |
| `greengale.app` | greengale |

2. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`):

| content.$type starts with | platform |
|---------------------------|----------|
| `pub.leaflet.` | leaflet |

3. if neither matches → `other`

## whitewind

[WhiteWind](https://whtwnd.com) (`com.whtwnd.blog.entry`) stores content as markdown in the `content` field (a string, not a blocks structure). extraction is trivial: just use the string directly. author-only posts (`visibility: "author"`) are skipped.

## result deduplication

two layers prevent duplicate results:

1. **ingestion-time**: content hash (wyhash of `title + \x00 + content`) per author. if the same author publishes identical content across platforms (different rkeys), only the first is indexed.
2. **search-time**: `(did, title)` dedup collapses any remaining duplicates in results (e.g. records indexed before content hash was added).
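a minimal sketch of the two layers in python (the real code is zig and uses wyhash; `hashlib.sha256` below is a stand-in hash, and both helper names are hypothetical):

```python
import hashlib

# (did, content_hash) pairs seen at ingestion time
seen_hashes: set[tuple[str, str]] = set()

def should_index(did: str, title: str, content: str) -> bool:
    # ingestion-time dedup: hash title + NUL + content, scoped per author.
    # assumption: sha256 stands in for wyhash here.
    h = hashlib.sha256((title + "\x00" + content).encode()).hexdigest()
    key = (did, h)
    if key in seen_hashes:
        return False
    seen_hashes.add(key)
    return True

def dedup_results(results: list[dict]) -> list[dict]:
    # search-time dedup: collapse duplicate (did, title) pairs, keep the first
    seen: set[tuple[str, str]] = set()
    out = []
    for r in results:
        key = (r["did"], r["title"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```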
## summary

- **pckt/offprint/greengale**: use `textContent` directly
- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
- **whitewind**: use `content` string directly (markdown)
- **deduplication**: content hash at ingestion + `(did, title)` at search time
- **platform**: infer from basePath, fallback to content.$type for custom domains

## code references

- `backend/src/ingest/extractor.zig` - content extraction logic, content_type field
- `backend/src/ingest/indexer.zig` - platform detection from basePath + content_type, content hash dedup