docs: zlay handoff — stale-cursor + optimistic-validation asks

+198

1 changed file



docs/zlay-handoff-2026-04-17-attack-residue.md

# zlay handoff: 2026-04-17 post-attack residue

two specific zlay fixes the operator needs. both surfaced during recovery
from an attack earlier today ("expensive HTTP requests, then a flood of
normal ones" per `why.bsky.team`). both are concrete, localized changes.

## tl;dr

1. `listActiveHostsImpl` should zero `last_seq` for stale rows, so
   reactivating a long-dormant host doesn't trigger a multi-week replay
   storm.
2. `host_authority` needs to gate broadcast on cache miss, not
   post-hoc. today, zlay emits ~3-5% of its firehose as
   unresolvable/forged events that bsky and indigo correctly drop.

## what happened (briefly)

- attack pushed ~585 legitimate PDSes into `exhausted`/`dormant` status.
  zlay's reconciliation only touches `active` rows, so these got
  stranded. coverage dropped to 60%.
- operator ran `UPDATE host SET status='active', failed_attempts=0
  WHERE status IN ('exhausted','dormant','idle') AND last_seq > 0` to
  unstick them. confirmed 585 rows updated.
- reconcile fiber turned out to be silently dead (zero
  `reconcile: respawned N host(s)` log lines in 20k-line tail over
  multi-day history). flagged but not blocking.
- pod was restarted to force startup spawn to pick up the reactivated
  rows. cold start took ~55 min to fully ramp from 0 → 2751 workers.
  that's slow for zlay (see ask #1).
- after recovery, zlay sits at 98.19% coverage on relay-eval, but
  ~3.5% of that is forged traffic that strict relays reject (ask #2).

## ask #1: stale cursor replay storm

**symptom observed:** during the cold-start spawn, the first ~25 min
ran at ~27 workers/min instead of the normal ~120/min. `persist_order_spins_total`
sat at **3.79M/s** (alarm threshold is 1M/s per operator skill).
`relay_frames_received_total` climbed to **3000 fps** (normal ~400 fps).
`relay_pool_queued_bytes` grew from 0 → 118 MB.

**root cause:** the 585 reactivated hosts had `last_seq` values from
weeks ago, some as high as `1.77 × 10¹²` (for `pds.social.clipsymphony.com`,
`blueshaft.sour.coffee`, etc.; a cluster of trillion-scale values that
are either attack-injected or legitimately-deep cursors). on reconnect
each worker passes `last_seq` as the upstream `subscribeRepos` cursor,
so each PDS replays everything since that point. 585 PDSes replaying
weeks of history simultaneously = the fps + contention storm.

meanwhile, `spawnWorker` in `src/slurper.zig:545-607` does a synchronous
`getEffectiveAccountCount` via the shared `DbRequestQueue`. that queue
is serialized behind frame persistence. frame persistence is 100% busy
under replay contention. spawn dispatch stalls behind N queued frame
writes per host. negative-feedback loop; it self-healed only because
the initial batch eventually got past enough replay to unblock persist.

**specific fix (one-line SQL change in `src/event_log.zig:714`):**

```sql
SELECT id, hostname, status,
       -- rows with no frame persisted in the last hour restart at cursor 0
       CASE
         WHEN now() - updated_at > interval '1 hour' THEN 0
         ELSE last_seq
       END AS last_seq,
       failed_attempts, account_limit
FROM host
WHERE status = 'active'
ORDER BY id ASC;
```

a running host has `updated_at` refreshed per persisted frame, so
`updated_at` is always fresh for currently-delivering workers, and they
keep their cursor. a reactivated dormant/exhausted host by definition
has a stale `updated_at`, so it gets `last_seq = 0` = "start from
current head, no replay".

on normal restart this is a no-op (every active host touched seconds
ago). on attack-recovery or any bulk reactivation, this converts the
replay storm into a cold subscribe that catches up organically.
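
the staleness rule in the CASE expression can also be modelled outside
the database, which makes the boundary behaviour easy to unit-test. this
is only a sketch of the same rule in python, not zlay code:
`effective_cursor` and `STALE_AFTER` are names made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# same threshold as the CASE expression; hypothetical constant name
STALE_AFTER = timedelta(hours=1)


def effective_cursor(last_seq: int, updated_at: datetime, now: datetime) -> int:
    """Return the cursor a worker would resume from under the proposed rule.

    A row whose updated_at is more than STALE_AFTER old gets cursor 0;
    anything fresher keeps its stored last_seq.
    """
    if now - updated_at > STALE_AFTER:
        return 0
    return last_seq


now = datetime(2026, 4, 17, 12, 0, tzinfo=timezone.utc)
fresh = effective_cursor(123_456, now - timedelta(seconds=5), now)
stale = effective_cursor(1_770_000_000_000, now - timedelta(weeks=3), now)
```

note the comparison is strict, like the SQL: a row touched exactly one
hour ago still keeps its cursor.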

**testing hook:** `kubectl exec` into postgres, verify the CASE
returns 0 for rows with `updated_at < now() - interval '1 hour'`
before shipping.

## ask #2: host_authority as a gate on cache miss, not a flag

**symptom observed:** in the latest relay-eval run
(`2026-04-17T06:36:30Z`, 5-min window), zlay reported 21,758 unique DIDs
vs bsky.network's 21,005, so zlay is "ahead" by 753 DIDs. but:

```
relay                          coverage_gap  unresolvable
zlay.waow.tech                 11            336
bsky.network                   22            1078
relay1.us-east.bsky.network    23            1082
relay.waow.tech (indigo)       23            1069
northamerica.firehose.network  21            1055
```

every strict relay classifies ~1078 DIDs in this window as
`unresolvable` (classifier couldn't resolve the DID via PLC, and the
hosting PDS doesn't match). zlay classifies only 336 as unresolvable.
the ~740 delta ≈ the 753 DID lead. **the lead is forged traffic
pollution.**

`relay_validation_failed{reason="host_authority"}` sits at 27.6k
cumulative, rate ~4.4/sec. but those rejects fire *after* the first
commit from a new DID has already shipped to downstream consumers,
because of the optimistic "cache miss = pass through + background
resolve" model. an attacker who spawns N fake DIDs gets N free
broadcasts before zlay learns to reject.

sample unresolvable DIDs zlay broadcast (bsky dropped):
```
did:plc:epirykax5cky5p76pmvktyao
did:plc:kxkwkbwwjqz7lo3a5zp3xj2i
did:plc:ebenh2s2jekhmhv37mqt52jx
did:plc:ymogigvmlplmhqxe4k3h5pdv
did:plc:us7obet6wrhpybtlvre56wzm
did:plc:3sawpk7qdwqakmvhc7xxybn7
did:plc:kcvsmhua7j3wg7adf3djq5zy
did:plc:i5c7p35bysykkbjwjbrcr426
```

well-formed did:plc syntax, unresolvable via PLC. given attack
context, mostly forged.
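
for anyone re-checking the numbers, the arithmetic behind "the delta ≈
the lead" and the ~3.5% figure is just this (values copied from the
relay-eval figures above, using bsky.network as the strict baseline):

```python
# figures from the 2026-04-17T06:36:30Z relay-eval window
zlay_dids = 21_758
bsky_dids = 21_005
strict_unresolvable = 1_078  # bsky.network row
zlay_unresolvable = 336      # zlay.waow.tech row

lead = zlay_dids - bsky_dids                     # extra DIDs zlay reports
delta = strict_unresolvable - zlay_unresolvable  # DIDs only strict relays flag
share = lead / zlay_dids                         # fraction of zlay's DIDs in the lead

print(lead, delta, f"{share:.1%}")  # prints: 753 742 3.5%
```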

**specific fix: make the cache-miss path gate rather than flag.**
current flow (conceptual, in the broadcaster's pre-broadcast validation
chain that ends at `subscriber.zig:602-606` where `failed_host_authority`
increments and returns):

```
on commit frame:
  check host_authority cache
  if hit + accept: broadcast
  if hit + reject: drop, increment counter
  if miss: kick background resolver, BROADCAST IMMEDIATELY
```

proposed:

```
on commit frame:
  check host_authority cache
  if hit + accept: broadcast
  if hit + reject: drop, increment counter
  if miss:
    enqueue frame (keyed by DID) in a pending-resolution buffer
    kick resolver with a deadline (e.g., 2s)
    on resolver accept: drain queued frames for this DID → broadcast
    on resolver reject / timeout: drop queued frames, increment
      validation_failed{reason="host_authority",sub="miss_timeout"}
```

cost: +500ms p50 latency on the *first* commit from any brand-new
DID (PLC roundtrip). every subsequent commit hits the cache and is
unaffected. benefit: zero forged broadcasts. matches indigo's strict
Sync 1.1 semantics.

memory / bounded-queue considerations:
- pending buffer keyed by DID, LRU-evicted above N entries
- drop oldest on evict (increment a dedicated counter)
- this should naturally be tiny under normal load (most DIDs hit
  cache) and stay bounded under attack, since eviction caps the buffer

as with ask #1, this is most likely a one-file change (the subscriber
or frame_worker branch that handles pre-broadcast validation). the
resolver and cache already exist.
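
to make the buffer semantics concrete, here is a minimal python sketch of
the pending-resolution buffer described above. it is an illustration
under assumptions, not zlay's implementation: `PendingBuffer`, its method
names, and the counter keys are all hypothetical, time is passed in
explicitly so the logic is testable, and eviction is by oldest first-seen
DID (a simplification of the LRU idea).

```python
from collections import OrderedDict


class PendingBuffer:
    """Holds commit frames for DIDs whose host_authority lookup is in flight."""

    def __init__(self, max_dids=1024, timeout_s=2.0):
        self.max_dids = max_dids
        self.timeout_s = timeout_s
        # did -> (deadline, [frames]); insertion order == deadline order
        # because every entry gets the same timeout
        self._pending = OrderedDict()
        self.dropped = {"rejected": 0, "miss_timeout": 0, "evicted": 0}

    def enqueue(self, did, frame, now):
        """Queue a frame; return True if the caller should kick the resolver."""
        entry = self._pending.get(did)
        if entry is not None:
            entry[1].append(frame)  # resolve already in flight for this DID
            return False
        self._pending[did] = (now + self.timeout_s, [frame])
        if len(self._pending) > self.max_dids:
            # over capacity: drop the oldest DID and count its frames
            _, (_, frames) = self._pending.popitem(last=False)
            self.dropped["evicted"] += len(frames)
        return True

    def resolve(self, did, accepted):
        """Resolver finished: return frames to broadcast (empty on reject)."""
        entry = self._pending.pop(did, None)
        if entry is None:
            return []  # already expired or evicted
        frames = entry[1]
        if accepted:
            return frames
        self.dropped["rejected"] += len(frames)
        return []

    def expire(self, now):
        """Drop every DID whose deadline has passed."""
        while self._pending:
            did, (deadline, frames) = next(iter(self._pending.items()))
            if deadline > now:
                break  # everything behind this entry is newer
            del self._pending[did]
            self.dropped["miss_timeout"] += len(frames)
```

the caller would run `expire(now)` on a timer and bump the real
`validation_failed` counter from the `dropped` tallies. a resolver answer
that arrives after expiry or eviction simply finds nothing to drain.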

## unrelated observability nit (while you're in there)

during the replay storm, `chain break uid=N prevData mismatch` log
lines were emitted at ~3000/sec, producing **19,799 of 20,000 lines**
in the last kubectl-logs tail. anything useful (startup progress,
reconcile events, spawn failures) got rotated out within seconds.

two small options:
- sample: emit 1 of N chain breaks per uid per minute
- dedupe: maintain a small TTL'd set of "recently logged" uids and
  skip repeats

this doesn't gate either of the two asks above, but made diagnosis
harder during the incident.

## operator follow-up (for reference, not asking)

- `.claude/skills/zlay-diagnose` + `scripts/zlay-admin` are the tools
  we used. both safe to re-invoke in future incidents.
- 746 hosts remain in `exhausted` status after this round. i'm
  leaving them until ask #1 ships; re-running the SQL reactivation
  today would re-trigger the replay storm on any with stale `last_seq`.
  once `listActiveHostsImpl` zero-on-stale is merged, one follow-up
  `UPDATE host SET status='active'` sweep clears them cleanly.
- reconcile fiber being silent in 4+ days of logs is the third
  concern we didn't chase during this incident. separate debug
  session. may want a liveness metric like
  `relay_reconcile_last_run_timestamp_seconds` so we can tell from
  outside whether the fiber is alive.
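
if that liveness gauge existed, the outside check could be as small as
the sketch below. everything here is an assumption: the metric is only
proposed above, `reconcile_is_stale` is a made-up helper, the 10-minute
threshold is a placeholder, and a value of 0 is taken to mean "never
ran since startup".

```python
import time
from typing import Optional


def reconcile_is_stale(
    last_run_ts: float,
    max_age_s: float = 600.0,  # placeholder threshold, tune to the reconcile interval
    now: Optional[float] = None,
) -> bool:
    """True if the reconcile fiber has not reported a run within max_age_s.

    last_run_ts is the scraped gauge value (unix seconds); 0 is treated
    as "the fiber has never completed a run".
    """
    if now is None:
        now = time.time()
    return last_run_ts == 0 or now - last_run_ts > max_age_s
```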
