# architecture

community actor search for atproto. indexes the network via jetstream
and serves FTS5 prefix search from cloudflare's edge.

## data flow

```
  jetstream (JSON over WebSocket)
       |
       v
  ingester (zig, fly.io)
       |  batches of {did, handle, display_name, avatar_cid, hidden}
       v
  worker (cloudflare)
       |
       +---> D1 (actors table + FTS5 index)
       +---> KV (cursor, config flags, mod_cursor)
       +---> cache API (60s edge cache for search)
```
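
the ingester-to-worker hop in the diagram can be sketched roughly as below. this is an illustration in python, not the zig ingester itself: the `Batcher` class, the `post` callback, the `max_size` threshold, and the "latest event per DID wins" dedup are all assumptions made for the example. only the field names come from the diagram.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ActorEvent:
    # field names mirror the batch shape shown in the diagram
    did: str
    handle: Optional[str] = None
    display_name: Optional[str] = None
    avatar_cid: Optional[str] = None
    hidden: bool = False


@dataclass
class Batcher:
    # `post` is a hypothetical transport, e.g. an HTTP POST to /admin/ingest
    post: Callable[[str], None]
    max_size: int = 100  # assumed flush threshold, not taken from the source
    _buf: Dict[str, ActorEvent] = field(default_factory=dict, init=False)

    def add(self, event: ActorEvent) -> None:
        # assumption: only the most recent event per DID is worth sending
        self._buf[event.did] = event
        if len(self._buf) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        if not self._buf:
            return
        body = json.dumps([asdict(e) for e in self._buf.values()])
        self._buf.clear()
        self.post(body)
```

in a real ingester a time-based flush would probably sit alongside the size threshold so quiet periods still drain the buffer; that is left out here to keep the sketch short.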

### write paths

1. **ingester**: streams jetstream, buffers actor events, POSTs batches to
   `/admin/ingest`. sends avatar CIDs, not full URLs, because the CDN URL is
   deterministic from DID + CID. detects `!no-unauthenticated` self-labels
   and sets the `hidden` flag.

2. **backfill**: when a search has gaps (missing avatars, few results), the
   worker calls bluesky's `searchActorsTypeahead` API and upserts results.
   extracts CIDs from the full avatar URLs returned by the API.

3. **request-indexing**: POST `/request-indexing?handle=...` resolves via
   slingshot, fetches the profile from the bluesky public API, extracts the
   avatar CID, checks moderation labels, upserts.

4. **hourly cron**:
   - records actor count snapshot (total, with handles, with avatars)
   - refreshes moderation labels (walks 1000 actors/run via `mod_cursor`)
   - resolves missing handles via slingshot (up to 1000/run)
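
the cron's `mod_cursor` walk could look something like the sketch below. it assumes the cursor stores the last DID processed and that pages are read in DID order (keyset pagination); the source does not say what `mod_cursor` actually holds. KV is modeled as a plain dict, D1 as an in-memory sqlite database, and `next_moderation_page` is a hypothetical helper name.

```python
import sqlite3
from typing import Dict, List


def next_moderation_page(
    db: sqlite3.Connection, kv: Dict[str, str], page_size: int = 1000
) -> List[str]:
    """Return the next page of DIDs to re-check and advance the cursor."""
    cursor = kv.get("mod_cursor", "")  # empty string sorts before every DID
    rows = db.execute(
        "SELECT did FROM actors WHERE did > ? ORDER BY did LIMIT ?",
        (cursor, page_size),
    ).fetchall()
    dids = [row[0] for row in rows]
    # a short page means the end of the table was reached, so the next run
    # starts over from the beginning
    kv["mod_cursor"] = dids[-1] if len(dids) == page_size else ""
    return dids
```

keyset pagination on the primary key keeps each run cheap regardless of how far into the table the walk is, which an `OFFSET`-based walk would not.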

### read path

search query -> cache API (hit?) -> FTS5 prefix match -> reconstruct avatar
URLs from DID + CID -> return `{did, handle, displayName?, avatar?}`
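
the pure parts of the read path can be sketched as below. several things here are assumptions rather than the worker's actual code: the helper names, the choice to quote every token and make only the last one a prefix, and the CDN URL shape (`cdn.bsky.app/img/avatar/plain/<did>/<cid>@jpeg`), which matches what bluesky's API returns today but is not specified in this doc.

```python
from typing import Any, Dict

# assumed CDN URL shape; not defined anywhere in this doc
CDN_AVATAR_BASE = "https://cdn.bsky.app/img/avatar/plain"


def fts5_prefix_query(q: str) -> str:
    """Turn user input into an FTS5 MATCH expression with a trailing prefix token."""
    tokens = q.split()
    if not tokens:
        return ""
    # quoting each token keeps FTS5 operators in user input inert;
    # embedded double quotes are escaped by doubling them
    quoted = ['"' + t.replace('"', '""') + '"' for t in tokens]
    quoted[-1] += "*"  # only the last token is treated as a prefix
    return " ".join(quoted)


def avatar_url(did: str, cid: str) -> str:
    """Rebuild the full avatar URL from the stored DID and CID."""
    return f"{CDN_AVATAR_BASE}/{did}/{cid}@jpeg"


def to_result(row: Dict[str, Any]) -> Dict[str, str]:
    """Shape a row into {did, handle, displayName?, avatar?}."""
    out = {"did": row["did"], "handle": row["handle"]}
    if row.get("display_name"):
        out["displayName"] = row["display_name"]
    if row.get("avatar_url"):  # the column holds the bare CID
        out["avatar"] = avatar_url(row["did"], row["avatar_url"])
    return out
```

the expression from `fts5_prefix_query` would be bound as a parameter to a `MATCH` clause, never interpolated into the SQL string.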

## storage

the actors table stores one row per DID:

    did           TEXT PRIMARY KEY  -- ~32 bytes
    handle        TEXT              -- ~25 bytes
    display_name  TEXT              -- ~20 bytes
    avatar_url    TEXT              -- ~59 bytes (CID only, URL reconstructed at query time)
    hidden        INTEGER           -- 1 byte
    updated_at    INTEGER           -- 8 bytes

plus FTS5 index overhead. roughly 280 bytes/row total.

D1 has a 10GB hard limit (non-negotiable). at current row size that's ~35M
actors. D1 is designed for per-tenant databases, not single large datasets:
sharding a global search index across multiple D1s degrades FTS5 ranking
(scores aren't comparable across shards) and fans out every query.
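
the capacity estimate above works out as follows. the sketch reads "10GB" as decimal gigabytes, which is an assumption; the per-column sizes and the 280 bytes/row figure are the approximations from the table.

```python
# approximate per-column sizes from the table above, in bytes
COLUMN_BYTES = {
    "did": 32,
    "handle": 25,
    "display_name": 20,
    "avatar_url": 59,
    "hidden": 1,
    "updated_at": 8,
}

BYTES_PER_ROW = 280            # row data plus FTS5 index overhead (estimate)
D1_LIMIT_BYTES = 10 * 10**9    # 10GB, assumed to mean decimal gigabytes

raw_row_bytes = sum(COLUMN_BYTES.values())      # 145
index_overhead = BYTES_PER_ROW - raw_row_bytes  # 135, i.e. the implied FTS5 share
max_actors = D1_LIMIT_BYTES // BYTES_PER_ROW    # 35_714_285

print(f"raw row: {raw_row_bytes} B, index overhead: {index_overhead} B")
print(f"capacity: ~{max_actors / 1e6:.1f}M actors")
```

so the index is assumed to cost about as much as the row data itself, and the ~35M ceiling moves in direct proportion to any change in that overhead.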

when we outgrow D1, the natural move is **Turso** (hosted libSQL). our sibling
project [leaflet-search](https://tangled.org/zzstoatzz.io/leaflet-search)
already runs Turso with a local SQLite read replica in production for FTS5
search. the schema and queries would port with minimal changes.

## moderation

actors with certain labels are hidden from search results (`hidden = 1`):

- issued by the bluesky moderation service
  (`did:plc:ar7c4by46qjdydhdevvrndac`): `!hide`, `!takedown`, or `spam`
- from any source (including self-labels): `!no-unauthenticated`

this is an unauthenticated service, so we respect user opt-outs. the ingester
catches self-labels on ingest; the hourly cron refreshes moderation labels
by walking the full index over multiple runs.
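
the hiding rule above can be written as a small predicate. this is a sketch, assuming labels arrive as dicts with `val` and (for non-self labels) `src` keys, as in atproto label objects; `should_hide` is a hypothetical name, and negated or expired labels are deliberately not handled here.

```python
from typing import Any, Dict, Iterable

BSKY_MOD_DID = "did:plc:ar7c4by46qjdydhdevvrndac"

# labels that hide an actor only when issued by the bluesky moderation service
MOD_SERVICE_HIDE = {"!hide", "!takedown", "spam"}
# labels that hide an actor regardless of who issued them
ANY_SOURCE_HIDE = {"!no-unauthenticated"}


def should_hide(labels: Iterable[Dict[str, Any]]) -> bool:
    """Return True if any label means the actor must not appear in search."""
    for label in labels:
        val = label.get("val")
        src = label.get("src")  # self-labels may carry no src at all
        if val in ANY_SOURCE_HIDE:
            return True
        if src == BSKY_MOD_DID and val in MOD_SERVICE_HIDE:
            return True
    return False
```

for example, `should_hide([{"src": "did:plc:other", "val": "spam"}])` is false, because `spam` only counts when it comes from the moderation service.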

## scripts

- `scripts/smoke.py`: end-to-end smoke tests against a live deployment
- `scripts/backfill-moderation.py`: one-shot sweep to set hidden flags on
  existing actors (run once after adding moderation support)
- `scripts/migrate-avatar-cid.sql`: one-shot migration from full avatar
  URLs to bare CIDs (already applied to production)