···
static frontend (cloudflare pages)
```

-content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) filters for publishing-related records, and the backend indexes them.
+content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) filters for publishing-related records, and the backend indexes them. a [reconciler](reconciliation.md) periodically verifies documents still exist at their source, catching deletions missed while the tap was down.

## how searching works

···
- [content-extraction.md](content-extraction.md) — how content is extracted from each platform
- [api.md](api.md) — API endpoint reference
- [tap.md](tap.md) — firehose consumer setup, debugging, memory tuning
+- [reconciliation.md](reconciliation.md) — stale document detection and cleanup
- [turso-hrana.md](turso-hrana.md) — Turso's HTTP protocol for database queries
- [performance-saga.md](performance-saga.md) — a debugging story about latency spikes
+106
docs/reconciliation.md
# reconciliation (stale document cleanup)

the firehose is our only way to learn about ATProto record deletions. it's ephemeral — if the tap is down when a delete event comes through, the record becomes a ghost in turso (and turbopuffer) forever. the reconciler fixes this by periodically verifying documents still exist at their source PDS.

## the problem

tap's resync only re-sends records that *exist* — it never emits delete events for records that disappeared. even forcing a full repo re-crawl (remove + re-add) only adds current records; it doesn't clean up ghosts. we confirmed this by reading indigo/tap source (`resyncer.go`, `firehose.go`).

additionally, `deleteDocument()` in indexer.zig only cleaned turso — it never deleted the corresponding turbopuffer vector. so even when deletes *were* processed via the firehose, vectors accumulated forever.

a real user reported this: they deleted and re-published blog posts weeks ago, but our index still had the old versions with broken URLs.

## how it works

```
reconciler (background thread, every 30 min)
  ↓
fetch 50 docs from turso (oldest verified_at first, NULLs = never checked)
  ↓ for each doc
parse AT-URI → (did, collection, rkey)
  ↓
resolve DID → PDS endpoint via plc.directory (cached across cycles)
  ↓
GET {pds}/xrpc/com.atproto.repo.getRecord?repo={did}&collection={collection}&rkey={rkey}
  ↓
200 → update verified_at (record still exists)
400/404 → delete from turso + tpuf (record is gone)
5xx/timeout → skip (PDS might be temporarily down)
```
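
the parse and request steps in the flow above can be sketched in a few lines. this is an illustrative python sketch, not the zig code in `reconcile.zig`; the function names `parse_at_uri` and `get_record_url`, the example DID, the `com.example.document` collection, and the PDS host are all hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode


def parse_at_uri(uri: str) -> Optional[Tuple[str, str, str]]:
    """Split at://{did}/{collection}/{rkey} into its parts, or return None."""
    prefix = "at://"
    if not uri.startswith(prefix):
        return None
    parts = uri[len(prefix):].split("/")
    # exactly three non-empty segments are expected for a record URI
    if len(parts) != 3 or not all(parts):
        return None
    did, collection, rkey = parts
    return did, collection, rkey


def get_record_url(pds: str, did: str, collection: str, rkey: str) -> str:
    """Build the com.atproto.repo.getRecord URL used for the existence check."""
    query = urlencode({"repo": did, "collection": collection, "rkey": rkey})
    return f"{pds.rstrip('/')}/xrpc/com.atproto.repo.getRecord?{query}"


# hypothetical example values
parts = parse_at_uri("at://did:plc:abc123/com.example.document/3kexample")
if parts is not None:
    print(get_record_url("https://pds.example.com", *parts))
```

a URI that does not parse returns `None` here; what the real worker does with an unparseable URI is not covered by this doc.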

at ~12k documents, 50 per cycle every 30 minutes, the full index is verified in ~5 days (240 cycles × 30 min). documents whose last verification is more than 7 days old are re-verified.
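
the sweep-time estimate is plain arithmetic, shown here as a small python helper (the name `full_sweep_days` is made up for illustration). it assumes every cycle checks a full batch and ignores skipped documents and request time.

```python
def full_sweep_days(total_docs: int, batch_size: int, interval_secs: int) -> float:
    """Days needed to verify every document once at a fixed batch size and interval."""
    cycles = -(-total_docs // batch_size)  # ceiling division
    return cycles * interval_secs / 86_400  # 86,400 seconds per day


print(full_sweep_days(12_000, 50, 1800))   # 5.0
print(full_sweep_days(12_000, 100, 1800))  # 2.5
```

doubling `RECONCILE_BATCH_SIZE` halves the sweep time, at the cost of twice as many PDS requests per cycle.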

## what it fixes

**historical drift (the main problem):** documents deleted while the tap was down are detected and cleaned up. this is the only mechanism that catches these — tap resync can't.

**forward-looking vector leak:** the tap.zig delete handler now also calls `tpuf.delete()`, so future firehose deletes clean both turso and turbopuffer.

## files

| file | role |
|------|------|
| `backend/src/reconcile.zig` | background worker (~250 lines) |
| `backend/src/main.zig` | wires up `reconcile.start(allocator)` after `tpuf.init()` |
| `backend/src/db/schema.zig` | `verified_at TEXT` column migration |
| `backend/src/ingest/tap.zig` | `tpuf.delete()` after `indexer.deleteDocument()` |

## configuration

all env vars with sensible defaults — no configuration required for normal operation.

| variable | default | description |
|----------|---------|-------------|
| `RECONCILE_ENABLED` | `true` | kill switch — set to `false` to disable entirely |
| `RECONCILE_INTERVAL_SECS` | `1800` | seconds between cycles (30 min) |
| `RECONCILE_BATCH_SIZE` | `50` | documents checked per cycle |
| `RECONCILE_REVERIFY_DAYS` | `7` | re-verify documents older than N days |
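
as a rough illustration of how these variables and defaults fit together, here is a python sketch of loading them. the `ReconcileConfig` type and `load_config` function are hypothetical, and treating any value other than `false` as "enabled" is an assumption; the zig worker may parse the kill switch differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ReconcileConfig:
    enabled: bool
    interval_secs: int
    batch_size: int
    reverify_days: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ReconcileConfig:
    """Read the RECONCILE_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ReconcileConfig(
        # assumption: only the literal "false" (any case) disables the worker
        enabled=env.get("RECONCILE_ENABLED", "true").strip().lower() != "false",
        interval_secs=int(env.get("RECONCILE_INTERVAL_SECS", "1800")),
        batch_size=int(env.get("RECONCILE_BATCH_SIZE", "50")),
        reverify_days=int(env.get("RECONCILE_REVERIFY_DAYS", "7")),
    )


print(load_config({}))  # all defaults
print(load_config({"RECONCILE_ENABLED": "false"}).enabled)  # False
```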

## failure modes

the reconciler is designed to degrade gracefully — it can never break search or indexing.

| scenario | behavior |
|----------|----------|
| turso down | `error.NoClient` → logged, exponential backoff |
| plc.directory down | all PDS lookups return null → entire batch skipped, no deletes |
| PDS down (5xx/timeout) | `error_skip` → doc not deleted, not verified, retried next cycle |
| turbopuffer down | `tpuf.delete` errors caught → turso deletes still happen |
| reconciler thread panics | isolated thread — search/indexing/embedding unaffected |

the reconciler never deletes on ambiguity. only a definitive 400 or 404 from the PDS triggers deletion. any error or timeout means "skip and retry later."
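
the "never delete on ambiguity" rule boils down to a three-way decision on the PDS response. a minimal python sketch follows; the `Outcome` enum and `classify` function are illustrative names, not identifiers from `reconcile.zig`, and `None` stands in for a timeout or network error.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    VERIFIED = "verified"  # record exists: update verified_at
    DELETED = "deleted"    # record is gone: delete from turso + tpuf
    SKIPPED = "skipped"    # ambiguous: leave untouched, retry next cycle


def classify(status: Optional[int]) -> Outcome:
    """Map a getRecord HTTP status (None = timeout/network error) to an action."""
    if status == 200:
        return Outcome.VERIFIED
    if status in (400, 404):
        return Outcome.DELETED
    # 5xx, timeouts, and anything else unexpected are never treated as deletions
    return Outcome.SKIPPED


for status in (200, 404, 503, None):
    print(status, classify(status).value)
```

the important property is the default branch: anything that is not an explicit 200, 400, or 404 falls through to "skip".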

## race conditions

**tap creates doc while reconciler deletes it:** safe. `insertDocument`'s `ON CONFLICT` handles re-creation — the document comes right back on the next tap event.

**reconciler and tap both delete the same doc:** safe. `deleteDocument` and `tpuf.delete` are both idempotent.
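
both properties can be demonstrated with python's built-in `sqlite3` as a stand-in for turso. the table layout, the `title` column, and the python function names below are assumptions for illustration; only the `ON CONFLICT` upsert and the plain `DELETE` mirror what the text describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# hypothetical, simplified schema
conn.execute("CREATE TABLE documents (uri TEXT PRIMARY KEY, title TEXT, verified_at TEXT)")


def insert_document(uri: str, title: str) -> None:
    """Upsert: inserting an existing URI updates it instead of failing."""
    conn.execute(
        "INSERT INTO documents (uri, title) VALUES (?, ?) "
        "ON CONFLICT(uri) DO UPDATE SET title = excluded.title",
        (uri, title),
    )


def delete_document(uri: str) -> None:
    """Idempotent: deleting a missing row affects zero rows and raises nothing."""
    conn.execute("DELETE FROM documents WHERE uri = ?", (uri,))


uri = "at://did:plc:abc123/com.example.document/3kexample"
insert_document(uri, "v1")
delete_document(uri)        # reconciler deletes the doc
delete_document(uri)        # tap deletes the same doc: no error
insert_document(uri, "v2")  # a later tap event re-creates it
print(conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0])  # 1
```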

## observability

- **fly logs:** `reconcile: background worker started` on boot, `reconcile: verified N documents, deleted M` after each cycle with activity
- **logfire:** `reconcile.cycle` span covers each full cycle. `reconcile: deleted stale document: {uri}` logged for each deletion.
- **turso:** `verified_at` column shows when each document was last verified. `NULL` = never checked.

### checking reconciler status

```bash
# verify it started
fly logs -a leaflet-search-backend --no-tail | grep reconcile

# check verified_at coverage (how many docs have been checked)
# via turso shell or dashboard query:
# SELECT COUNT(*) as total, COUNT(verified_at) as verified FROM documents
```
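
the coverage query works because `COUNT(column)` skips `NULL` values while `COUNT(*)` counts every row. a small python demonstration using `sqlite3` as a stand-in for turso; the rows and the timestamp format are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (uri TEXT PRIMARY KEY, verified_at TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("at://did:plc:a/com.example.document/1", "2025-01-01T00:00:00Z"),
        ("at://did:plc:a/com.example.document/2", None),  # never checked
        ("at://did:plc:b/com.example.document/3", None),  # never checked
    ],
)

total, verified = conn.execute(
    "SELECT COUNT(*) AS total, COUNT(verified_at) AS verified FROM documents"
).fetchone()
print(total, verified)  # 3 1
```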

## design decisions

**why not use tap resync?** tap resync only sends records that exist. it never sends delete events for records that disappeared. even removing and re-adding a repo only backfills current records — it doesn't identify what was deleted since the last sync.

**why check the PDS directly?** the PDS is the authoritative source. `com.atproto.repo.getRecord` returns the record if it exists, or 400/404 if it doesn't. no middleman, no ambiguity.

**why cache PDS endpoints?** many documents share the same author (DID). resolving the PDS once per DID and caching it avoids redundant plc.directory lookups. the cache persists for the lifetime of the worker thread.
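
a per-DID cache of this kind might look like the python sketch below. the `PdsCache` class and the injected `fetch_did_doc` callable are hypothetical; the sketch assumes the DID document lists the PDS as a service entry whose id ends in `#atproto_pds`, and it chooses not to cache failed lookups, which may differ from the zig worker.

```python
from typing import Callable, Dict, Optional


def pds_from_did_doc(doc: dict) -> Optional[str]:
    """Pull the PDS endpoint out of a DID document's service list."""
    for service in doc.get("service", []):
        if service.get("id", "").endswith("#atproto_pds"):
            return service.get("serviceEndpoint")
    return None


class PdsCache:
    def __init__(self, fetch_did_doc: Callable[[str], Optional[dict]]):
        self._fetch = fetch_did_doc       # e.g. an HTTP GET against plc.directory
        self._cache: Dict[str, str] = {}
        self.lookups = 0                  # counts real (uncached) lookups

    def resolve(self, did: str) -> Optional[str]:
        if did in self._cache:
            return self._cache[did]
        self.lookups += 1
        doc = self._fetch(did)
        endpoint = pds_from_did_doc(doc) if doc else None
        if endpoint is not None:
            self._cache[did] = endpoint   # assumption: failures are not cached
        return endpoint


# stub fetcher standing in for plc.directory
def fake_fetch(did: str) -> Optional[dict]:
    return {"service": [{"id": "#atproto_pds", "serviceEndpoint": "https://pds.example.com"}]}


cache = PdsCache(fake_fetch)
cache.resolve("did:plc:abc123")
cache.resolve("did:plc:abc123")
print(cache.lookups)  # 1
```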

**why 200ms rate limiting?** PDSs are shared infrastructure. we check 50 documents per cycle at most — aggressive polling would be antisocial. 200ms between requests is conservative.
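
a fixed delay between requests can be sketched as below. `check_batch` and its parameters are illustrative names, and whether the real worker sleeps before or after each request is not stated here; this version sleeps between consecutive requests only.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def check_batch(
    docs: Iterable[T],
    check_one: Callable[[T], R],
    delay_secs: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Check documents one at a time, pausing delay_secs between requests."""
    results = []
    for i, doc in enumerate(docs):
        if i > 0:
            sleep(delay_secs)  # spacing between consecutive PDS requests
        results.append(check_one(doc))
    return results


# record the sleeps instead of actually waiting
slept: List[float] = []
check_batch(range(50), lambda doc: "ok", sleep=slept.append)
print(len(slept))  # 49
```

at the default batch size that is 49 pauses, so the delay adds just under 10 seconds to a cycle that runs every 30 minutes.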

**why compute timestamps in zig?** turso's handling of `strftime` with parameterized modifiers is untested in this codebase. computing timestamps in zig (same approach as the embedder) eliminates that risk.
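
the idea is to compute the cutoff in application code and bind it as an ordinary string parameter, so the SQL never calls `strftime`. a python sketch with `sqlite3` standing in for turso; the query shape, the ISO-8601 `...Z` timestamp format, and the sample rows are assumptions, not the actual statements in `reconcile.zig`.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cutoff_iso(now: datetime, reverify_days: int) -> str:
    """Timestamp before which a verified_at value counts as stale."""
    return (now - timedelta(days=reverify_days)).strftime("%Y-%m-%dT%H:%M:%SZ")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (uri TEXT PRIMARY KEY, verified_at TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("doc-fresh", "2025-01-14T00:00:00Z"),
        ("doc-stale", "2025-01-01T00:00:00Z"),
        ("doc-never", None),
    ],
)

now = datetime(2025, 1, 15, tzinfo=timezone.utc)
rows = conn.execute(
    "SELECT uri FROM documents "
    "WHERE verified_at IS NULL OR verified_at < ? "
    "ORDER BY verified_at ASC LIMIT ?",  # SQLite sorts NULLs first in ASC order
    (cutoff_iso(now, 7), 50),
).fetchall()
print([row[0] for row in rows])  # ['doc-never', 'doc-stale']
```

this relies on every timestamp being stored in the same fixed-width format, so that string comparison matches chronological order.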