ops changelog: march 17 incident + resync + malloc + eval fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+177
docs/ops-changelog.md
## 2026-03-17

### incident: zlay coverage collapse (99% → ~5%)

**timeline**:
- 2026-03-16 05:37 UTC: deploy of a785f8c (cleaner.zig) triggers pod restart.
  coverage drops from 99.16% to near-zero within ~1 hour.
- 2026-03-17 ~06:00 UTC: investigation begins after noticing zlay at 0.99%
  on Pulsar and ~5% on the relay-eval leaderboard.

**root cause**: bsky.network (a relay, not a PDS) was in the `host` table as
host_id=1 with status=active since 2026-03-01. on every pod restart, the
slurper loaded it from the DB and spawned a subscriber for it. bsky.network's
firehose pumped the entire network's events into zlay, but host authority
checks rejected them all (DID docs point to actual PDS hosts, not
bsky.network). this consumed ~37.5% of processing capacity on rejected events.

**how it got there**: `addHost()` in slurper.zig calls `getOrCreateHost()`
(which inserts) BEFORE `checkHost()` (which validates). someone sent
`requestCrawl` for bsky.network on March 1 — it got inserted into the host
table, then `checkHost` rejected it with `IsARelay`, but the row persisted.

**why it only became visible on March 16**: unclear. the March 13 deploy
(d70f6e5) also would have subscribed to bsky.network with the same code.
possible that the impact was present but less visible, or that shard cursor
state differed between restarts. the investigation didn't produce a definitive
answer for why this specific restart was worse.

**fix applied**:
1. blocked bsky.network via `POST /admin/hosts/block`
2. restarted pod — host authority failures dropped to zero
3. discovered bsky PDS shard cursors were extremely stale (some at 9M vs
   current ~878M) — shards were replaying millions of old events, 97%+ of
   which were for now-inactive accounts
4. reset 78 stale shard cursors to 0 (`UPDATE host SET last_seq = 0`) so
   subscribers reconnect from live tail instead of replaying
5. restarted pod again — workers ramped back to 2,800+, validated throughput
   recovered to ~300-950/s

**bugs to fix**:
- `addHost()` should validate before inserting (move `checkHost()` before
  `getOrCreateHost()`; see the sketch after this list)
- `spawnWorkers()` at startup should skip or validate hosts that look like
  relays, not blindly subscribe to every active host
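a minimal sketch of the intended ordering. the error name `IsARelay` comes
from the incident above, but the signatures and stub bodies are hypothetical,
not the actual slurper.zig code:

```zig
const std = @import("std");

// hypothetical error set for illustration
const HostCheckError = error{IsARelay};

const Slurper = struct {
    fn addHost(self: *Slurper, hostname: []const u8) !void {
        // validate BEFORE inserting. previously getOrCreateHost() ran
        // first, so bsky.network got a persistent host row even though
        // checkHost() then rejected it with error.IsARelay.
        try self.checkHost(hostname);

        // only reached for hosts that passed validation
        const host_id = try self.getOrCreateHost(hostname);
        try self.spawnSubscriber(host_id);
    }

    fn checkHost(self: *Slurper, hostname: []const u8) HostCheckError!void {
        _ = self;
        // stand-in: the real check resolves the host and rejects relays
        if (std.mem.eql(u8, hostname, "bsky.network")) return error.IsARelay;
    }

    fn getOrCreateHost(self: *Slurper, hostname: []const u8) !u64 {
        _ = self;
        _ = hostname;
        return 1; // stand-in for the DB upsert
    }

    fn spawnSubscriber(self: *Slurper, host_id: u64) !void {
        _ = self;
        _ = host_id;
    }
};
```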
### feat: resync collection index on #sync events (5ec37cb, not yet deployed)

`#sync` means a repo discontinuity (migration, rebase, bulk import). the
collection index was not updated on these events, leaving it stale until
future commits self-corrected.

new `resync.zig`: background worker with bounded queue (4096 items). on
`#sync`, enqueues the DID for resync. worker fetches `describeRepo` from
the PDS, does `removeAll(did)` + `addCollection(did, nsid)` for each
collection returned. 50ms delay between items, no retry on failure.

admin API: `GET /admin/resync` (status), `POST /admin/resync` (manual trigger).

files: resync.zig (new), frame_worker.zig, subscriber.zig, slurper.zig,
main.zig, api/router.zig, api/admin.zig

### fix: guard malloc_trim behind comptime linux check (4fc14ec, not yet deployed)

`@cImport(@cInclude("malloc.h"))` at module level caused build failures on
macOS (malloc.h is glibc-specific). moved it inside a
`comptime builtin.os.tag == .linux` branch so the import is pruned on
non-Linux targets.
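a minimal sketch of the pattern, assuming the trim call lives in a standalone
helper (`trimHeap` is an illustrative name, not the actual zlay symbol):

```zig
const builtin = @import("builtin");

// the @cImport exists only in the Linux branch; on other targets the
// comptime-known condition prunes it, so malloc.h is never required
const c = if (builtin.os.tag == .linux)
    @cImport(@cInclude("malloc.h"))
else
    struct {};

/// ask glibc to return unused heap pages to the OS; no-op elsewhere
pub fn trimHeap() void {
    if (comptime builtin.os.tag == .linux) {
        _ = c.malloc_trim(0);
    }
}
```

because `builtin.os.tag` is comptime-known, the untaken branch is never
analyzed, and callers don't need their own platform checks.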
### fix: relay-eval outlier detection didn't adjust coverage (822a179, deployed)

**problem**: when a relay replays events after a restart (e.g. zlay with 97K
events vs ~66K for others), its inflated DID set bloats the union, dragging
all other relays' coverage percentages down to ~44%.

existing outlier detection flagged the warning icon but didn't adjust the
coverage calculation. also, the 1.5x threshold was too generous — zlay's
1.47x ratio slipped under it.

**fix**: lowered threshold to 1.3x median. when outliers are detected,
coverage is computed against `max(non-outlier unique_dids)` instead of the
inflated union. applied to both the dashboard frontend and the OG SVG image.

files: relay-eval/src/static/index.html, relay-eval/src/server.zig

---

## 2026-03-16

### fix: collection index stale entries (~10% → ~2%) (a785f8c, deployed)

**problem**: ~10% of zlay's `listReposByCollection` results were stale —
suspended/taken-down/deactivated accounts. confirmed by sampling DIDs and
checking against `public.api.bsky.app`: 9 of 90 sampled were
`AccountTakedown`/suspended.

**root causes**:
1. `removeAll` trigger only fired on `"deleted"` and `"takendown"` string
   matches, missing `"suspended"` and `"deactivated"`. the `active` boolean
   is the canonical signal.
2. admin ban path (`handleBan`) didn't remove collection index entries.
3. no mechanism to purge existing stale data from the ~13.6M entry index.

**fix**:
- replaced string-matching `removeAll` trigger with `!is_active` in both
  `frame_worker.zig` and `subscriber.zig` — covers all non-active statuses
  including future unknown ones
- added `removeAll` to admin ban path in `admin.zig`
- new `cleaner.zig`: admin-triggered bulk cleanup job that pages through
  postgres for inactive accounts and removes them from the collection index.
  `POST /admin/cleanup-collection-index` to trigger, `GET` for status.

**result**: cleanup removed 110,239 inactive accounts. stale rate dropped
from ~10% to ~2.2%. remaining stale entries are backfill-imported DIDs not
in the `account` table — they'll self-correct as `#account` events arrive.

files: frame_worker.zig, subscriber.zig, api/admin.zig, api/router.zig,
main.zig, cleaner.zig (new)

### note: bluesky backfill infrastructure context

paul frazee [posted](https://bsky.app/profile/pfrazee.com/post/3mh7izdvu222n)
about bluesky's internal backfill work: 14-18M repos enumerated from their
most recent relay, 24 hours to backfill into clickhouse (records + backlinks
tables). their relay only enumerates repos active since it started — same
subset problem.

relevant to our work: anyone consuming relay enumeration endpoints for
backfills (like paul's team) gets bitten by stale entries. the cleanup we
just shipped means downstream consumers pulling `listReposByCollection`
from zlay get accurate data. the `!is_active` fix ensures it stays clean
going forward.

paul also noted they're bypassing Tap (bluesky's official backfill tool)
due to throughput issues, writing custom code instead. zlay's collection
index approach — lightweight `(DID, collection)` mappings without pulling
full repo content — serves a different use case but complements full
backfills for consumers that just need to know which DIDs have records in
a given collection.

### feat: allow disabling bootstrap (disable-bootstrap branch)

**context**: [issue #2](https://tangled.org/zzstoatzz.io/zlay/issues/2) —
mia (mia.omg.lol) wants to run zlay on a local network without connecting
to upstream relays.

**change**: `RELAY_UPSTREAM=""` or `RELAY_UPSTREAM="none"` skips `pullHosts`
at startup. the slurper still starts, loads hosts from DB, and accepts
`requestCrawl` submissions — just no initial seed from an external relay.
default behavior (`bsky.network`) unchanged.

files: main.zig, slurper.zig (branch: `disable-bootstrap`, PR open)

---

## 2026-03-13

### known: `since: ""` passed through by both relays

**issue**: [bluesky-social/indigo#1357](https://github.com/bluesky-social/indigo/issues/1357).
a PDS sent `since: ""` (empty string) instead of `since: null` for
`did:plc:jq3zvrb5ewg2qgup73qsouze`. per the atproto sync spec, `since` must
be a valid TID or null.

**impact**: both relay.waow.tech (indigo) and zlay.waow.tech pass the malformed
field through to downstream consumers. strict parsers (like dawn's) break on
the empty string — they expect a 13-char TID or null.

**zlay behavior**: `getString("since")` returns a non-null zero-length slice.
chain continuity check logs a "chain break" (empty string != stored rev) but
doesn't drop the frame. broadcaster re-emits `since: ""` as-is.

**status**: documenting for now. fix would be to normalize `since: ""` → null
in broadcaster before re-emitting, or reject in validator. waiting on upstream
indigo fix first.

---

## 2026-03-12

### tabled: strict validation on cache miss (branch: nate/strict-validation-on-cache-miss)