---

## 2026-03-17

### incident: zlay coverage collapse (99% → ~5%)

**timeline**:
- 2026-03-16 05:37 UTC: deploy of a785f8c (cleaner.zig) triggers pod restart.
  coverage drops from 99.16% to near-zero within ~1 hour.
- 2026-03-17 ~06:00 UTC: investigation begins after noticing zlay at 0.99%
  on Pulsar and ~5% on the relay-eval leaderboard.

**root cause**: bsky.network (a relay, not a PDS) was in the `host` table as
host_id=1 with status=active since 2026-03-01. on every pod restart, the
slurper loaded it from the DB and spawned a subscriber for it. bsky.network's
firehose pumped the entire network's events into zlay, but host authority
checks rejected them all (DID docs point to actual PDS hosts, not
bsky.network). this consumed ~37.5% of processing capacity on rejected events.

**how it got there**: `addHost()` in slurper.zig calls `getOrCreateHost()`
(which inserts) BEFORE `checkHost()` (which validates). someone sent
`requestCrawl` for bsky.network on March 1 — it got inserted into the host
table, then `checkHost` rejected it with `IsARelay`, but the row persisted.

**why it only became visible on March 16**: unclear. the March 13 deploy
(d70f6e5) also would have subscribed to bsky.network with the same code.
possible that the impact was present but less visible, or that shard cursor
state differed between restarts. the investigation didn't produce a definitive
answer for why this specific restart was worse.

**fix applied**:
1. blocked bsky.network via `POST /admin/hosts/block`
2. restarted pod — host authority failures dropped to zero
3. discovered bsky PDS shard cursors were extremely stale (some at 9M vs
   current ~878M) — shards were replaying millions of old events, 97%+ of
   which were for now-inactive accounts
4. reset 78 stale shard cursors to 0 (`UPDATE host SET last_seq = 0`) so
   subscribers reconnect from live tail instead of replaying
5. restarted pod again — workers ramped back to 2,800+, validated throughput
   recovered to ~300-950/s

**bugs to fix**:
- `addHost()` should validate before inserting (move `checkHost()` before
  `getOrCreateHost()`; see the sketch after this list)
- `spawnWorkers()` at startup should skip or validate hosts that look like
  relays, not blindly subscribe to every active host
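
a minimal sketch of the first fix; the type, error names, and signatures here are assumptions, not copied from slurper.zig — the point is the ordering:

```zig
// illustrative only: validate first so a rejected requestCrawl can no longer
// leave a persistent host row behind. `self` stands in for the real slurper.
pub fn addHost(self: anytype, hostname: []const u8) !void {
    // rejects relays (e.g. an IsARelay error for bsky.network) before any
    // row exists in the host table
    try self.checkHost(hostname);
    // only reached for hosts that passed validation
    _ = try self.getOrCreateHost(hostname);
}
```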

### feat: resync collection index on #sync events (5ec37cb, not yet deployed)

`#sync` means a repo discontinuity (migration, rebase, bulk import). the
collection index was not updated on these events, leaving it stale until
future commits happened to self-correct it.

new `resync.zig`: background worker with bounded queue (4096 items). on
`#sync`, enqueues the DID for resync. worker fetches `describeRepo` from
the PDS, does `removeAll(did)` + `addCollection(did, nsid)` for each
collection returned. 50ms delay between items, no retry on failure.
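
roughly the shape of that worker loop; `queue`, `index`, and `pds` are stand-ins for the real resync.zig types, which aren't spelled out in this log:

```zig
const std = @import("std");

// sketch only: describeRepo/removeAll/addCollection mirror the calls
// described above; the surrounding types are placeholders.
fn resyncWorker(queue: anytype, index: anytype, pds: anytype) void {
    while (queue.pop()) |did| {
        // on failure, drop the item: no retry, a later #sync or commit fixes it
        const repo = pds.describeRepo(did) catch continue;
        index.removeAll(did) catch continue;
        for (repo.collections) |nsid| {
            index.addCollection(did, nsid) catch {};
        }
        std.Thread.sleep(50 * std.time.ns_per_ms); // pace requests to the PDS
    }
}
```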

admin API: `GET /admin/resync` (status), `POST /admin/resync` (manual trigger).

files: resync.zig (new), frame_worker.zig, subscriber.zig, slurper.zig,
main.zig, api/router.zig, api/admin.zig

### fix: guard malloc_trim behind comptime linux check (4fc14ec, not yet deployed)

`@cImport(@cInclude("malloc.h"))` at module level caused build failures on
macOS (malloc.h is glibc-specific). moved it inside a
`comptime builtin.os.tag == .linux` branch so the import is pruned on
non-Linux targets.
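
a minimal sketch of the pattern (not the exact code from 4fc14ec): with a comptime-known condition, the dead branch is never analyzed, so non-Linux builds never see malloc.h:

```zig
const builtin = @import("builtin");

// glibc-only import; the else branch keeps `c` defined on other targets
const c = if (builtin.os.tag == .linux)
    @cImport(@cInclude("malloc.h"))
else
    struct {};

fn trimHeap() void {
    if (comptime builtin.os.tag == .linux) {
        // return unused heap pages to the OS; compiled out elsewhere
        _ = c.malloc_trim(0);
    }
}
```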

### fix: relay-eval outlier detection didn't adjust coverage (822a179, deployed)

**problem**: when a relay replays events after a restart (e.g. zlay with 97K
events vs ~66K for others), its inflated DID set bloats the union, dragging
all other relays' coverage percentages down to ~44%.

the existing outlier detection flagged the relay with a warning icon but
didn't adjust the coverage calculation. also, the 1.5x threshold was too
generous — zlay's 1.47x ratio slipped under it.

**fix**: lowered threshold to 1.3x median. when outliers are detected,
coverage is computed against `max(non-outlier unique_dids)` instead of the
inflated union. applied to both the dashboard frontend and the OG SVG image.
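
a hedged sketch of the adjusted math; function and field names are illustrative rather than relay-eval's actual code, and `median` is an assumed helper:

```zig
// returns the denominator each relay's unique-DID count is divided by.
// unique_dids holds one count per relay; union_size is the union of all sets.
fn coverageDenominator(unique_dids: []const u64, union_size: u64) u64 {
    const med = median(unique_dids); // assumed helper: midpoint of a sorted copy
    var any_outlier = false;
    var max_normal: u64 = 0;
    for (unique_dids) |n| {
        if (@as(f64, @floatFromInt(n)) > 1.3 * @as(f64, @floatFromInt(med))) {
            // a relay replaying old events after a restart lands here
            any_outlier = true;
        } else if (n > max_normal) {
            max_normal = n;
        }
    }
    // with an outlier present, measure everyone against the largest
    // non-outlier set instead of the union the outlier bloated
    return if (any_outlier) max_normal else union_size;
}
```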

files: relay-eval/src/static/index.html, relay-eval/src/server.zig

---

## 2026-03-16

### fix: collection index stale entries (~10% → ~2%) (a785f8c, deployed)

**problem**: ~10% of zlay's `listReposByCollection` results were stale —
suspended/taken-down/deactivated accounts. confirmed by sampling DIDs and
checking against `public.api.bsky.app`: 9 of 90 sampled were
`AccountTakedown`/suspended.

**root causes**:
1. `removeAll` trigger only fired on `"deleted"` and `"takendown"` string
   matches, missing `"suspended"` and `"deactivated"`. the `active` boolean
   is the canonical signal.
2. admin ban path (`handleBan`) didn't remove collection index entries.
3. no mechanism to purge existing stale data from the ~13.6M entry index.

**fix**:
- replaced string-matching `removeAll` trigger with `!is_active` in both
  `frame_worker.zig` and `subscriber.zig` — covers all non-active statuses
  including future unknown ones (see the sketch after this list)
- added `removeAll` to admin ban path in `admin.zig`
- new `cleaner.zig`: admin-triggered bulk cleanup job that pages through
  postgres for inactive accounts and removes them from the collection index.
  `POST /admin/cleanup-collection-index` to trigger, `GET` for status.
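
a minimal sketch of the trigger change, with stand-in `index`/`evt` types rather than the real frame_worker/subscriber handlers:

```zig
// before: matched status strings ("deleted", "takendown") and missed the rest.
// after: key off the canonical `active` boolean from the #account event.
fn onAccountEvent(index: anytype, evt: anytype) !void {
    if (!evt.active) {
        // covers takendown, suspended, deactivated, deleted, and any status
        // added later, without enumerating strings
        try index.removeAll(evt.did);
    }
}
```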

**result**: cleanup removed 110,239 inactive accounts. stale rate dropped
from ~10% to ~2.2%. remaining stale entries are backfill-imported DIDs not
in the `account` table — they'll self-correct as `#account` events arrive.

files: frame_worker.zig, subscriber.zig, api/admin.zig, api/router.zig,
main.zig, cleaner.zig (new)

### note: bluesky backfill infrastructure context

paul frazee [posted](https://bsky.app/profile/pfrazee.com/post/3mh7izdvu222n)
about bluesky's internal backfill work: 14-18M repos enumerated from their
most recent relay, 24 hours to backfill into clickhouse (records + backlinks
tables). their relay only enumerates repos active since it started — same
subset problem.

relevant to our work: anyone consuming relay enumeration endpoints for
backfills (like paul's team) gets bitten by stale entries. the cleanup we
just shipped means downstream consumers pulling `listReposByCollection`
from zlay get accurate data. the `!is_active` fix ensures it stays clean
going forward.

paul also noted they're bypassing Tap (bluesky's official backfill tool)
due to throughput issues, writing custom code instead. zlay's collection
index approach — lightweight `(DID, collection)` mappings without pulling
full repo content — serves a different use case but complements full
backfills for consumers that just need to know which DIDs have records in
a given collection.

### feat: allow disabling bootstrap (disable-bootstrap branch)

**context**: [issue #2](https://tangled.org/zzstoatzz.io/zlay/issues/2) —
mia (mia.omg.lol) wants to run zlay on a local network without connecting
to upstream relays.

**change**: `RELAY_UPSTREAM=""` or `RELAY_UPSTREAM="none"` skips `pullHosts`
at startup. the slurper still starts, loads hosts from DB, and accepts
`requestCrawl` submissions — just no initial seed from an external relay.
default behavior (`bsky.network`) unchanged.
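
roughly the shape of the startup check, assuming the value arrives via a `RELAY_UPSTREAM` env var; the env lookup and default here are assumptions about how main.zig wires it up, `pullHosts` is the call named above:

```zig
const std = @import("std");

// sketch only: skip the upstream seed when RELAY_UPSTREAM is "" or "none"
fn maybeBootstrap(slurper: anytype) !void {
    const upstream = std.posix.getenv("RELAY_UPSTREAM") orelse "bsky.network";
    if (upstream.len == 0 or std.mem.eql(u8, upstream, "none")) {
        // no initial seed: the slurper still loads hosts from the DB and
        // accepts requestCrawl submissions
        return;
    }
    try slurper.pullHosts(upstream);
}
```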

files: main.zig, slurper.zig (branch: `disable-bootstrap`, PR open)

---

## 2026-03-13

### known: `since: ""` passed through by both relays

**issue**: [bluesky-social/indigo#1357](https://github.com/bluesky-social/indigo/issues/1357).
a PDS sent `since: ""` (empty string) instead of `since: null` for
`did:plc:jq3zvrb5ewg2qgup73qsouze`. per the atproto sync spec, `since` must
be a valid TID or null.

**impact**: both relay.waow.tech (indigo) and zlay.waow.tech pass the malformed
field through to downstream consumers. strict parsers (like dawn's) break on
the empty string — they expect a 13-char TID or null.

**zlay behavior**: `getString("since")` returns a non-null zero-length slice.
chain continuity check logs a "chain break" (empty string != stored rev) but
doesn't drop the frame. broadcaster re-emits `since: ""` as-is.

**status**: documenting for now. fix would be to normalize `since: ""` → null
in broadcaster before re-emitting, or reject in validator. waiting on upstream
indigo fix first.
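
if we go the normalize route, it's roughly this; a sketch only, and where it hooks into the broadcaster is an assumption:

```zig
// treat an empty `since` as absent so downstream parsers see a valid TID or
// null, per the sync spec. not implemented yet; noted here for the record.
fn normalizeSince(since: ?[]const u8) ?[]const u8 {
    if (since) |s| {
        if (s.len == 0) return null; // `since: ""` -> null
    }
    return since;
}
```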

---

## 2026-03-12

### tabled: strict validation on cache miss (branch: nate/strict-validation-on-cache-miss)