tidy docs: move OOM postmortem to notes, update architecture

+76 -13

docs/architecture.md

··· 10 10 | 11 11 v 12 12 ingester (zig, fly.io) 13 - | batches of {did, handle, display_name, avatar_cid, hidden} 13 + | bloom filter dedup (~1.2MB fixed) 14 + | two-tier writes: 15 + | - bare DID: INSERT OR IGNORE (0 Turso writes for known actors) 16 + | - profile/identity event: full UPSERT with COALESCE 14 17 v 15 18 worker (cloudflare) 16 19 | 17 20 +---> Turso (actors table + FTS5 index) 18 - +---> KV (cursor, mod_cursor) 21 + +---> KV (cursor, mod_cursor, enrich_lock) 19 22 +---> cache API (60s edge cache for search) 20 23 ``` 21 24 22 25 ### write paths 23 26 24 - 1. **ingester**: streams jetstream, buffers actor events, POSTs batches to 25 - `/admin/ingest`. sends avatar CIDs (not full URLs — the CDN URL is 26 - deterministic from DID + CID). 27 + 1. **ingester**: streams jetstream (profiles, posts, likes, follows), buffers 28 + actor events, POSTs batches to `/admin/ingest`. sends avatar CIDs (not full 29 + URLs — the CDN URL is deterministic from DID + CID). bloom filter dedup 30 + prevents re-sending already-known bare DIDs. 31 + 32 + 2. **enrichment pipeline** (dispatched from ingest hot path via `waitUntil`): 33 + - **phase 1 — identity**: slingshot resolves DID → handle + PDS endpoint. 34 + 100 DIDs/run, 20 concurrent. attempt tracking (`identity_checked_at`) 35 + backs off failures for 1 hour. 36 + - **phase 2 — profile**: PDS-native `getRecord` fetches avatar CID + 37 + display_name directly from the actor's PDS. 20/run. attempt tracking 38 + (`profile_checked_at`) backs off failures for 1 hour. 39 + - lease-coordinated via KV (`enrich_lock`, 30s TTL) — no stampede from 40 + overlapping ingest batches. 41 + - converges to zero work as gaps fill (gap-driven queries return fewer 42 + results over time). 27 43 28 - 2. **backfill**: when a search has gaps (missing avatars, few results), the 44 + 3. **backfill**: when a search has gaps (missing avatars, few results), the 29 45 worker calls bluesky's `searchActorsTypeahead` API and upserts results. 30 46 extracts CIDs from the full avatar URLs returned by the API. 31 47 32 - 3. **request-indexing**: POST `/request-indexing?handle=...` resolves via 48 + 4. **request-indexing**: POST `/request-indexing?handle=...` resolves via 33 49 slingshot, fetches profile from bluesky public API, extracts avatar CID, 34 50 checks moderation labels, upserts. 35 51 36 - 4. **hourly cron**: 37 - - records actor count snapshot (total, with handles, with avatars) 38 - - refreshes moderation labels (walks 1000 actors/run via `mod_cursor`) 39 - - resolves missing handles via slingshot (up to 1000/run) 52 + 5. **hourly cron**: 53 + - records actor count snapshot (total, with handles, with avatars) — 54 + serves as reconciliation anchor for delta-based tracking 55 + - refreshes moderation labels (walks 1000 actors/run via `mod_cursor`); 56 + piggybacks enrichment — saves handle, display_name, avatar from the 57 + getProfiles response at zero additional API cost 58 + - runs `enrichActors` as a catch-up for idle periods (same function, 59 + same lease — no overlap risk) 60 + 61 + ### metrics 62 + 63 + - **search metrics**: 5-min buckets in `metrics` table (searches + total_ms) 64 + - **actor deltas**: 5-min buckets in `actor_deltas` table tracking incremental 65 + changes to actors, handles, and avatars counts. recorded at write time 66 + (ingest, delete, enrichment) via fire-and-forget `waitUntil`. 67 + - **hourly snapshots**: absolute counts from `COUNT(*)` queries, used as 68 + reconciliation anchors. the trend chart stitches snapshots + deltas for 69 + sub-hourly resolution between anchor points. 70 + - **traffic sources**: cumulative hit counters per client identity (X-Client > 71 + Origin > Referer). loopback/localhost normalized to "unknown". 72 + 73 + ### enrichment convergence 74 + 75 + the enrichment pipeline is gap-driven: 76 + 77 + - identity phase queries actors with `handle = ''` and 78 + `identity_checked_at < now - 1hr` 79 + - profile phase queries actors with `handle != ''`, `avatar_url = ''`, 80 + `pds != ''`, and `profile_checked_at < now - 1hr` 81 + - as actors get enriched, these queries return fewer rows 82 + - system quiesces when all actors are resolved or backed off 83 + 84 + AppView (public.api.bsky.app) is used only for moderation label checks. 85 + profile data comes from PDS-native fetches. 40 86 41 87 ### read path 42 88 ··· 53 99 avatar_url TEXT -- ~59 bytes (CID only, URL reconstructed at query time) 54 100 hidden INTEGER -- 1 byte 55 101 updated_at INTEGER -- 8 bytes 102 + pds TEXT -- ~40 bytes (PDS endpoint URL) 103 + identity_checked_at INTEGER -- 8 bytes (last slingshot attempt) 104 + profile_checked_at INTEGER -- 8 bytes (last PDS profile attempt) 56 105 57 - plus FTS5 index overhead. roughly ~280 bytes/row total. 106 + plus FTS5 index overhead. roughly ~320 bytes/row total. 58 107 59 108 storage is [Turso](https://turso.tech) (hosted libSQL). previously used 60 109 Cloudflare D1 but migrated to Turso to avoid D1's 10GB hard limit and ··· 72 121 accounts, and so do we. 73 122 74 123 the hourly cron refreshes moderation labels by walking the full index over 75 - multiple runs (1000 actors/run via `mod_cursor` in KV). 124 + multiple runs (1000 actors/run via `mod_cursor` in KV). it piggybacks 125 + enrichment data (handle, display_name, avatar) from the same getProfiles 126 + response at no additional API cost. 76 127 77 128 ## scripts 78 129 79 130 - `scripts/smoke.py` — end-to-end smoke tests against a live deployment 80 131 - `scripts/backfill-moderation.py` — one-shot sweep to set hidden flags on 81 132 existing actors (run once after adding moderation support) 133 + - `scripts/add-enrichment-columns.sql` — one-shot migration adding pds, 134 + identity_checked_at, profile_checked_at columns (run once before deploying 135 + enrichment pipeline) 136 + - `scripts/add-actor-deltas.sql` — one-shot migration adding actor_deltas 137 + table for 5-min granularity delta tracking 82 138 - `scripts/migrate-avatar-cid.sql` — one-shot migration from full avatar 83 139 URLs to bare CIDs (already applied to production) 84 140 - `scripts/migrate-to-turso.py` — one-shot D1-to-Turso data migration 85 141 (already applied to production) 142 + 143 + ## notes 144 + 145 + - `docs/notes/ingester-oom-firehose-widening.md` — postmortem from widening 146 + jetstream subscriptions to posts/likes/follows, causing ingester OOM on 147 + 256MB fly.io VM. root cause: in-memory dedup set with retain-capacity 148 + semantics. resolved with bloom filter approach.

+142

docs/notes/ingester-oom-firehose-widening.md

··· 1 + # incident: ingester OOM after widening firehose 2 + 3 + ## what changed 4 + 5 + we widened the jetstream ingester from subscribing to just 6 + `app.bsky.actor.profile` to also include `app.bsky.feed.post`, 7 + `app.bsky.feed.like`, and `app.bsky.graph.follow`. the goal was to 8 + discover more actors — previously we only found people who 9 + created/updated their profile (~92k indexed). now we pick up anyone 10 + who posts, likes, or follows. 11 + 12 + commit: `5822553` on main 13 + 14 + files changed: 15 + - `ingester/src/main.zig` — added collections, added in-memory dedup set 16 + - `src/index.ts` — bumped `resolveHandles` from 1000→5000/hr with concurrency 17 + 18 + ## what nate is seeing 19 + 20 + on the fly.io dashboard (`fly.io/apps/typeahead-ingester/monitoring`), 21 + the firecracker memory usage shows a **sawtooth pattern** — memory climbs 22 + steadily until it hits the 256MB VM limit, the process OOMs and restarts, 23 + then climbs again. this may be happening every few hours. 24 + 25 + nate's exact words: "the firecracker memory usage is in a sawtooth pattern 26 + where it's just ooming every couple hours or every five hours or so." 27 + 28 + he also questioned whether the whole approach is sound: "can't we just 29 + design the database intelligently so that it's just always an upsert of 30 + new things? are we doing this ingestion process intelligently or are we 31 + being silly?" 32 + 33 + ## what i tried (and why it didn't work) 34 + 35 + i kept trying to grep fly logs for OOM/restart signals: 36 + 37 + ```bash 38 + # these all either timed out, returned nothing, or got stuck on streaming 39 + fly logs --app typeahead-ingester 2>&1 | grep -iE "oom|killed|reboot|restart" 40 + fly logs --app typeahead-ingester --no-tail 2>&1 | grep -iE "reboot|connected" 41 + fly logs --app typeahead-ingester 2>&1 | grep -m 20 -iE "reboot|connect|pruning|oom|signal|error" 42 + ``` 43 + 44 + `fly logs` streams indefinitely and the grep patterns weren't matching 45 + anything in the visible window. i should have been looking at the **fly.io 46 + dashboard metrics** directly (memory graph) or used the prometheus/metrics 47 + endpoint, not grepping log lines for kernel OOM messages that may not 48 + appear in application logs. 49 + 50 + the one useful command was: 51 + ```bash 52 + fly machine status 78465ddb202278 --app typeahead-ingester 53 + ``` 54 + which showed the VM is 256MB shared-cpu, state `started`, but doesn't 55 + show memory usage history. 56 + 57 + ## useful commands for debugging this 58 + 59 + ```bash 60 + # check current ingestion rate and whether it's keeping up 61 + fly logs --app typeahead-ingester 2>&1 | head -20 62 + 63 + # check machine status, restarts, event history 64 + fly machine status 78465ddb202278 --app typeahead-ingester 65 + 66 + # check VM config (memory, cpu) 67 + fly machine status 78465ddb202278 --app typeahead-ingester --display-config 68 + 69 + # check app-level status (machine count, versions) 70 + fly status --app typeahead-ingester 71 + 72 + # rebuild ingester locally 73 + cd ingester && zig build 74 + 75 + # deploy ingester 76 + cd ingester && fly deploy 77 + 78 + # deploy worker 79 + npx wrangler deploy 80 + 81 + # check turso DB actor count 82 + turso db shell typeahead 'SELECT COUNT(*) as total FROM actors;' 83 + 84 + # check turso write budget 85 + turso plan show 86 + 87 + # smoke test search endpoint 88 + curl -s 'https://typeahead.waow.tech/xrpc/app.bsky.actor.searchActorsTypeahead?q=nate&limit=3' 89 + ``` 90 + 91 + ## my best understanding of the problem 92 + 93 + the ingester runs on a 256MB fly.io VM. we added a dedup set 94 + (`std.StringHashMapUnmanaged` + `std.heap.ArenaAllocator`) that grows as 95 + it accumulates unique DIDs from the firehose. the set prunes at 500k 96 + entries by calling `clearRetainingCapacity` and `arena.reset(.retain_capacity)` 97 + — but **"retain capacity" means the memory stays allocated**. the RSS 98 + never actually shrinks. the zig `GeneralPurposeAllocator` also doesn't 99 + eagerly return pages to the OS. 100 + 101 + so the pattern is: 102 + 1. seen set grows → arena pages allocated, hash map backing array grows 103 + 2. at 500k, we "prune" — but retain all the allocated pages 104 + 3. new entries reuse the retained pages for a while 105 + 4. but hash map resizing, arena fragmentation, and GPA overhead 106 + accumulate over time 107 + 5. eventually RSS hits 256MB → OOM kill → restart → repeat 108 + 109 + it's also possible the OOM predates our changes — the old ingester on 110 + 256MB may have had a slower version of the same problem with the main 111 + arena. but our changes made it much worse by processing 100x more events. 112 + 113 + ## the deeper design question 114 + 115 + nate is right to question the approach. right now we: 116 + 1. receive every post/like/follow on the entire bluesky network 117 + 2. extract the DID 118 + 3. check an in-memory set (which itself causes OOM) 119 + 4. send a bare upsert to the worker 120 + 5. the worker does `INSERT ... ON CONFLICT DO UPDATE SET handle = COALESCE(NULLIF('', ''), actors.handle)` — which is a **no-op** for the vast majority of DIDs that are already fully indexed 121 + 122 + so we're burning memory, CPU, and turso writes on redundant no-op upserts. 123 + 124 + possible fixes: 125 + - **bump VM to 512MB** — buys time but doesn't fix the design 126 + - **drop the seen set, use INSERT OR IGNORE on the worker** — eliminates 127 + ingester memory growth entirely, worker DB handles dedup naturally 128 + - **bloom filter** — fixed ~1MB memory, no growth, 1% false positive rate 129 + - **worker-side filtering** — have `/admin/ingest` check which DIDs are 130 + new before writing (read-before-write tradeoff) 131 + - **don't retain capacity on prune** — use `deinit` + re-init instead of 132 + `clearRetainingCapacity`, so memory actually returns to OS 133 + 134 + ## current state 135 + 136 + the ingester is running and ingesting (~100 actors/batch, ~2 batches/sec). 137 + it went from 92k to 133k actors in ~6 minutes after deploy. it has not 138 + (yet) OOMed since the latest deploy, but the sawtooth pattern nate sees 139 + on the dashboard suggests it will. 140 + 141 + DB is at 133k actors. turso writes went from 1.2M to 1.4M (of 25M monthly 142 + budget). search endpoint is working fine.

+6

scripts/add-enrichment-columns.sql

··· 1 + -- one-shot migration: add enrichment pipeline columns to actors table 2 + -- run once against Turso before deploying the updated worker 3 + 4 + ALTER TABLE actors ADD COLUMN pds TEXT DEFAULT ''; 5 + ALTER TABLE actors ADD COLUMN identity_checked_at INTEGER DEFAULT 0; 6 + ALTER TABLE actors ADD COLUMN profile_checked_at INTEGER DEFAULT 0;

Configure Feed

Configure Feed