add architecture docs: data flow, storage, moderation, scripts

+88

1 changed file

docs/architecture.md

# architecture

community actor search for atproto. indexes the network via jetstream
and serves FTS5 prefix search from cloudflare's edge.

## data flow

```
jetstream (JSON over WebSocket)
    |
    v
ingester (zig, fly.io)
    | batches of {did, handle, display_name, avatar_cid, hidden}
    v
worker (cloudflare)
    |
    +---> D1 (actors table + FTS5 index)
    +---> KV (cursor, config flags, mod_cursor)
    +---> cache API (60s edge cache for search)
```

### write paths

1. **ingester**: streams jetstream, buffers actor events, POSTs batches to
   `/admin/ingest`. sends avatar CIDs (not full URLs; the CDN URL is
   deterministic from DID + CID). detects `!no-unauthenticated` self-labels
   and sets `hidden` flag.

2. **backfill**: when a search has gaps (missing avatars, few results), the
   worker calls bluesky's `searchActorsTypeahead` API and upserts results.
   extracts CIDs from the full avatar URLs returned by the API.

3. **request-indexing**: POST `/request-indexing?handle=...` resolves via
   slingshot, fetches profile from bluesky public API, extracts avatar CID,
   checks moderation labels, upserts.

4. **hourly cron**:
   - records actor count snapshot (total, with handles, with avatars)
   - refreshes moderation labels (walks 1000 actors/run via `mod_cursor`)
   - resolves missing handles via slingshot (up to 1000/run)

### read path

search query -> cache API (hit?)
-> FTS5 prefix match -> reconstruct avatar
URLs from DID + CID -> return `{did, handle, displayName?, avatar?}`

## storage

the actors table stores one row per DID:

    did           TEXT PRIMARY KEY  -- ~32 bytes
    handle        TEXT              -- ~25 bytes
    display_name  TEXT              -- ~20 bytes
    avatar_url    TEXT              -- ~59 bytes (CID only, URL reconstructed at query time)
    hidden        INTEGER           -- 1 byte
    updated_at    INTEGER           -- 8 bytes

plus FTS5 index overhead. roughly 280 bytes/row total.

D1 has a 10GB hard limit (non-negotiable). at current row size that's ~35M
actors. D1 is designed for per-tenant databases, not single large datasets:
sharding a global search index across multiple D1s degrades FTS5 ranking
(scores aren't comparable across shards) and fans out every query.

when we outgrow D1, the natural move is **Turso** (hosted libSQL). our sibling
project [leaflet-search](https://tangled.org/zzstoatzz.io/leaflet-search)
already runs Turso + local SQLite read replica in production for FTS5 search.
the schema and queries would port with minimal changes.

## moderation

actors with certain labels are hidden from search results (`hidden = 1`):

- bluesky moderation service (`did:plc:ar7c4by46qjdydhdevvrndac`) issued
  `!hide`, `!takedown`, or `spam`
- any source (including self-labels): `!no-unauthenticated`

this is an unauthenticated service, so we respect user opt-outs. the ingester
catches self-labels on ingest; the hourly cron refreshes moderation labels
by walking the full index over multiple runs.
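
as a rough illustration of the label rules above, the hide decision could be sketched in python like this. it is not the project's actual code: `should_hide` and the constant names are hypothetical, and each label is assumed to be a dict with `src` (issuer DID) and `val` fields. negated or expired labels are ignored for brevity.

```python
# DID of the bluesky moderation service, as listed in the rules above.
BSKY_MOD_DID = "did:plc:ar7c4by46qjdydhdevvrndac"

# Labels that hide an actor only when the moderation service issued them.
MOD_SERVICE_HIDE = {"!hide", "!takedown", "spam"}

# Labels that hide an actor no matter who issued them (self-labels included).
ANY_SOURCE_HIDE = {"!no-unauthenticated"}


def should_hide(labels):
    """Return True if any label matches the hiding rules.

    Assumption: each label is a dict with "src" (issuer DID) and "val" keys.
    """
    for label in labels:
        val = label.get("val")
        if val in ANY_SOURCE_HIDE:
            return True
        if label.get("src") == BSKY_MOD_DID and val in MOD_SERVICE_HIDE:
            return True
    return False
```

under these assumptions, a `spam` label from an arbitrary account does not hide anyone, while a `!no-unauthenticated` self-label always does. the result would map to the `hidden` column as `1` or `0`.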

## scripts

- `scripts/smoke.py`: end-to-end smoke tests against a live deployment
- `scripts/backfill-moderation.py`: one-shot sweep to set hidden flags on
  existing actors (run once after adding moderation support)
- `scripts/migrate-avatar-cid.sql`: one-shot migration from full avatar
  URLs to bare CIDs (already applied to production)
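
the conversion between full avatar URLs and bare CIDs, which the migration above and the backfill path depend on, might look roughly like the python sketch below. the URL template is an assumption (this doc only says the CDN URL is deterministic from DID + CID), and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Assumed CDN URL pattern; not confirmed by this doc.
AVATAR_URL_TEMPLATE = "https://cdn.bsky.app/img/avatar/plain/{did}/{cid}@jpeg"


def extract_avatar_cid(avatar_url):
    """Return the bare CID from a full avatar URL, or None if none is found."""
    path = urlparse(avatar_url).path
    last_segment = path.rsplit("/", 1)[-1]
    # Drop a trailing format suffix such as "@jpeg" if present.
    cid = last_segment.split("@", 1)[0]
    return cid or None


def build_avatar_url(did, cid):
    """Rebuild the full avatar URL from a DID and a stored CID."""
    return AVATAR_URL_TEMPLATE.format(did=did, cid=cid)
```

storing only the CID keeps each row small, and rebuilding the URL at query time is a single string format per result.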
