docs: add reconciliation documentation · zzstoatzz.io/leaflet-search@73b8e1f

+108 -1

2 changed files


docs

README.md

+2 -1

docs/README.md

···
20 20 static frontend (cloudflare pages)
21 21 ```
22 22
23 - content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) filters for publishing-related records, and the backend indexes them.
23 + content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) filters for publishing-related records, and the backend indexes them. a [reconciler](reconciliation.md) periodically verifies documents still exist at their source, catching deletions missed while the tap was down.
24 24
25 25 ## how searching works
26 26
···
74 74 - [content-extraction.md](content-extraction.md) — how content is extracted from each platform
75 75 - [api.md](api.md) — API endpoint reference
76 76 - [tap.md](tap.md) — firehose consumer setup, debugging, memory tuning
77 + - [reconciliation.md](reconciliation.md) — stale document detection and cleanup
77 78 - [turso-hrana.md](turso-hrana.md) — Turso's HTTP protocol for database queries
78 79 - [performance-saga.md](performance-saga.md) — a debugging story about latency spikes

+106

docs/reconciliation.md

# reconciliation (stale document cleanup)

the firehose is our only way to learn about ATProto record deletions. it's ephemeral — if the tap is down when a delete event comes through, the record becomes a ghost in turso (and turbopuffer) forever. the reconciler fixes this by periodically verifying documents still exist at their source PDS.

## the problem

tap's resync only re-sends records that *exist* — it never emits delete events for records that disappeared. even forcing a full repo re-crawl (remove + re-add) only adds current records; it doesn't clean up ghosts. we confirmed this by reading indigo/tap source (`resyncer.go`, `firehose.go`).

additionally, `deleteDocument()` in indexer.zig only cleaned turso — it never deleted the corresponding turbopuffer vector. so even when deletes *were* processed via the firehose, vectors accumulated forever.

a real user reported this: they deleted and re-published blog posts weeks ago, but our index still had the old versions with broken URLs.

## how it works

```
reconciler (background thread, every 30 min)
  ↓
fetch 50 docs from turso (oldest verified_at first, NULLs = never checked)
  ↓ for each doc
parse AT-URI → (did, collection, rkey)
  ↓
resolve DID → PDS endpoint via plc.directory (cached across cycles)
  ↓
GET {pds}/xrpc/com.atproto.repo.getRecord?repo={did}&collection={collection}&rkey={rkey}
  ↓
200 → update verified_at (record still exists)
400/404 → delete from turso + tpuf (record is gone)
5xx/timeout → skip (PDS might be temporarily down)
```

at ~12k documents, 50 per cycle every 30 minutes, the full index is verified in ~5 days. documents last verified more than 7 days ago are re-verified.

## what it fixes

**historical drift (the main problem):** documents deleted while the tap was down are detected and cleaned up.
this is the only mechanism that catches these — tap resync can't.

**forward-looking vector leak:** the tap.zig delete handler now also calls `tpuf.delete()`, so future firehose deletes clean both turso and turbopuffer.

## files

| file | role |
|------|------|
| `backend/src/reconcile.zig` | background worker (~250 lines) |
| `backend/src/main.zig` | wires up `reconcile.start(allocator)` after `tpuf.init()` |
| `backend/src/db/schema.zig` | `verified_at TEXT` column migration |
| `backend/src/ingest/tap.zig` | `tpuf.delete()` after `indexer.deleteDocument()` |

## configuration

all env vars with sensible defaults — no configuration required for normal operation.

| variable | default | description |
|----------|---------|-------------|
| `RECONCILE_ENABLED` | `true` | kill switch — set to `false` to disable entirely |
| `RECONCILE_INTERVAL_SECS` | `1800` | seconds between cycles (30 min) |
| `RECONCILE_BATCH_SIZE` | `50` | documents checked per cycle |
| `RECONCILE_REVERIFY_DAYS` | `7` | re-verify documents last verified more than N days ago |

## failure modes

the reconciler is designed to degrade gracefully — it can never break search or indexing.

| scenario | behavior |
|----------|----------|
| turso down | `error.NoClient` → logged, exponential backoff |
| plc.directory down | all PDS lookups return null → entire batch skipped, no deletes |
| PDS down (5xx/timeout) | `error_skip` → doc not deleted, not verified, retried next cycle |
| turbopuffer down | `tpuf.delete` errors caught → turso deletes still happen |
| reconciler thread panics | isolated thread — search/indexing/embedding unaffected |

the reconciler never deletes on ambiguity. only a definitive 400 or 404 from the PDS triggers deletion. any error or timeout means "skip and retry later."
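
the "never delete on ambiguity" rule is small enough to sketch in a few lines. this is an illustrative python sketch, not the code in `reconcile.zig`: the names `Action`, `parse_at_uri`, and `decide` are hypothetical, and a `None` status is assumed to stand for a timeout or network error.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    VERIFY = "verify"  # record still exists: bump verified_at
    DELETE = "delete"  # record is definitively gone: remove from turso + tpuf
    SKIP = "skip"      # ambiguous: leave untouched, retry next cycle


def parse_at_uri(uri: str) -> tuple[str, str, str]:
    """Split at://{did}/{collection}/{rkey} into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected did/collection/rkey: {uri!r}")
    return parts[0], parts[1], parts[2]


def decide(status: Optional[int]) -> Action:
    """Map a getRecord outcome to an action (None = timeout/network error)."""
    if status == 200:
        return Action.VERIFY
    if status in (400, 404):
        return Action.DELETE
    # 5xx, rate limits, redirects, timeouts: anything not listed above is ambiguous
    return Action.SKIP
```

the point of the final fall-through is that deletion is opt-in: a status the sketch has never seen (say, 429) lands in `SKIP`, so an unexpected PDS response can delay verification but cannot remove a document.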

## race conditions

**tap creates doc while reconciler deletes it:** safe. `insertDocument`'s `ON CONFLICT` handles re-creation — the document comes right back on the next tap event.

**reconciler and tap both delete the same doc:** safe. `deleteDocument` and `tpuf.delete` are both idempotent.

## observability

- **fly logs:** `reconcile: background worker started` on boot, `reconcile: verified N documents, deleted M` after each cycle with activity
- **logfire:** `reconcile.cycle` span covers each full cycle. `reconcile: deleted stale document: {uri}` logged for each deletion.
- **turso:** `verified_at` column shows when each document was last verified. `NULL` = never checked.

### checking reconciler status

```bash
# verify it started
fly logs -a leaflet-search-backend --no-tail | grep reconcile

# check verified_at coverage (how many docs have been checked)
# via turso shell or dashboard query:
# SELECT COUNT(*) as total, COUNT(verified_at) as verified FROM documents
```

## design decisions

**why not use tap resync?** tap resync only sends records that exist. it never sends delete events for records that disappeared. even removing and re-adding a repo only backfills current records — it doesn't identify what was deleted since the last sync.

**why check the PDS directly?** the PDS is the authoritative source. `com.atproto.repo.getRecord` returns the record if it exists, or 400/404 if it doesn't. no middleman, no ambiguity.

**why cache PDS endpoints?** many documents share the same author (DID). resolving the PDS once per DID and caching it avoids redundant plc.directory lookups. the cache persists for the lifetime of the worker thread.

**why 200ms rate limiting?** PDSs are shared infrastructure. we check 50 documents per cycle at most — aggressive polling would be antisocial.
200ms between requests is conservative.

**why compute timestamps in zig?** turso's handling of `strftime` with parameterized modifiers is untested in this codebase. computing timestamps in zig (same approach as the embedder) eliminates that risk.
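
to make the timestamp decision and the "~5 days" figure concrete, here is a rough python sketch of the same idea: compute the re-verify cutoff in application code and pass it to the database as a plain string, instead of asking SQL to do date arithmetic. the function names, the ISO-8601 UTC format, and the example query (including its `uri` column) are assumptions for illustration; the doc only states that `verified_at` is a `TEXT` column.

```python
from datetime import datetime, timedelta, timezone

# hypothetical query shape: the cutoff is bound as an ordinary text parameter,
# so the database only compares strings and does no date math of its own
EXAMPLE_QUERY = (
    "SELECT uri FROM documents "
    "WHERE verified_at IS NULL OR verified_at < ? "
    "ORDER BY verified_at ASC LIMIT ?"
)


def reverify_cutoff(now: datetime, reverify_days: int = 7) -> str:
    """Return the timestamp before which a document counts as due for re-verification."""
    cutoff = now.astimezone(timezone.utc) - timedelta(days=reverify_days)
    # fixed-width ISO-8601 strings sort lexicographically in time order
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


def full_pass_days(total_docs: int, batch_size: int = 50, interval_secs: int = 1800) -> float:
    """Days needed to verify every document once at the given batch size and interval."""
    cycles = -(-total_docs // batch_size)  # ceiling division
    return cycles * interval_secs / 86400
```

with the defaults from the configuration table, `full_pass_days(12_000)` gives 5.0, which matches the estimate in "how it works". since that is shorter than the 7-day re-verify window, a full pass should complete before the earliest-verified documents come due again.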
