atproto relay implementation in zig zlay.waow.tech

revert to Io.Threaded: shelve Evented backend until upstream stabilizes

Io.Evented (io_uring fibers) has a probabilistic SIGSEGV in
std.Io.fiber.contextSwitch that crashes every 30-90 min under load.
After 28 commits fixing cross-Io crashes, heap corruption, and mutex
incompatibilities, this stdlib bug is the remaining blocker — and
upstream fiber.zig is unchanged as of dev.3091 with no fix in sight.

Switching to Io.Threaded restores the thread-per-PDS model (~2,800
threads, stable) and lets us use ReleaseSafe again. The Evented work
is preserved in patches/, scripts/repro_evented.zig, and the new
docs/evented-attempt.md for when upstream catches up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

+102 -15
+4 -2
CLAUDE.md
@@ -1,7 +1,8 @@
 # zlay

-AT Protocol relay in zig 0.15.2. reader thread per PDS + shared frame
-processing pool. ReleaseSafe in production.
+AT Protocol relay in zig 0.16. reader thread per PDS + shared frame
+processing pool. ReleaseSafe in production. Io.Threaded backend (Evented
+attempt shelved — see docs/evented-attempt.md).

 ## before pushing

@@ -21,3 +22,4 @@
 - [docs/deployment.md](docs/deployment.md) — build flags, infra, resource usage
 - [docs/gotchas.md](docs/gotchas.md) — zig/pg.zig/rocksdb-zig/deploy traps
 - [docs/incident-2026-03-04.md](docs/incident-2026-03-04.md) — ReleaseSafe RSS analysis
+- [docs/evented-attempt.md](docs/evented-attempt.md) — Evented backend attempt and why we reverted
+4 -6
Dockerfile
@@ -8,9 +8,8 @@
 ENV PATH=/opt/zig-x86_64-linux-0.16.0-dev.3059+42e33db9d:$PATH
 WORKDIR /build

-# patch Io.Uring networking (stubbed as Unavailable upstream, zig#31723)
-COPY patches/ patches/
-RUN patch /opt/zig-x86_64-linux-0.16.0-dev.3059+42e33db9d/lib/std/Io/Uring.zig < patches/uring-networking.patch
+# Uring networking patch preserved in patches/ for when upstream stabilizes
+# (see docs/evented-attempt.md). Not applied — using Io.Threaded backend.

 # fetch dependencies first (cacheable — only changes when build.zig.zon changes)
 COPY build.zig build.zig.zon ./
@@ -18,9 +17,8 @@

 # then copy source and build
 COPY src/ src/
-# ReleaseFast (not ReleaseSafe): Io.Uring fiber context-switch GPFs under ReleaseSafe
-# due to a zig 0.16-dev optimizer+safety interaction bug. See scripts/repro_evented.zig.
-RUN zig build -Doptimize=ReleaseFast -Dcpu=baseline -Dtarget=x86_64-linux-gnu
+# ReleaseSafe: Evented GPF was the reason for ReleaseFast — back to Threaded now.
+RUN zig build -Doptimize=ReleaseSafe -Dcpu=baseline -Dtarget=x86_64-linux-gnu

 FROM --platform=linux/amd64 debian:bookworm-slim
 RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates && rm -rf /var/lib/apt/lists/*
+83
docs/evented-attempt.md
@@ -0,0 +1,83 @@
+# Evented backend attempt (2026-03 to 2026-04)
+
+we spent roughly a month trying to run zlay on `Io.Evented` (io_uring fibers)
+instead of `Io.Threaded` (one OS thread per task). the goal was to drop from
+~2,800 threads to ~35. it didn't work out — the upstream fiber machinery is
+still experimental and unstable. this doc records what we did so it's not lost.
+
+## what we built
+
+- **uring networking patch** (`patches/uring-networking.patch`): implemented 6
+networking operations that are stubbed as `error.NetworkDown` upstream —
+listen, accept, connect, send, read, write. all use proper io_uring opcodes
+except bind/listen (sync syscalls, kernel <6.11). tracked upstream as zig#31723.
+
+- **dual-Io architecture**: Evented for network I/O (subscribers, broadcaster,
+metrics server), Threaded for blocking work (frame pool, DB, GC, resyncer).
+careful segregation to avoid cross-Io mutex/futex incompatibilities.
+
+- **DNS fallback**: `Io.Uring` doesn't implement `netLookup`. subscribers
+resolve hostnames through `pool_io` (Threaded), then use the socket with
+Evented I/O for data transfer.
+
+- **DbRequestQueue**: MPSC queue bridging Evented fibers to Threaded DB
+workers, replacing attempts at running pg.Pool from fiber context.
+
+## what went wrong
+
+28 commits between `39134d1` (first evented backend) and `b8ef148` (last fix),
+most of them crash fixes:
+
+| commit | issue |
+|--------|-------|
+| `6674812` | SIGSEGV: plain threads calling Evented Io.Mutex → NULL Thread.current() |
+| `439c678` | startup deadlock: resyncer on Evented fiber blocked Uring thread |
+| `2156d08` | heap corruption: GC using Threaded pg.Pool from Evented fiber |
+| `3533416` | heap corruption: subscriber pg.Pool access from Evented fiber |
+| `949e9a7` | heap corruption: remaining cross-Io pg.Pool paths |
+| `72ba680` | use-after-free: broadcaster destroying consumers still referenced |
+| `4a23671` | ReleaseSafe GPF: optimizer bug in fiber context-switch |
+| `c3bc3be` | replace ev_db with DbRequestQueue (final cross-Io DB fix) |
+
+the fundamental problem: `Io.Mutex`, `Io.Condition`, and `Io.Event` use
+backend-specific futex operations. calling a Threaded mutex from an Evented
+fiber (or vice versa) dereferences a NULL threadlocal → SIGSEGV or silent
+heap corruption. this means any shared resource (pg.Pool, DiskPersist mutex)
+must be carefully routed to the right Io backend. we fixed all of these, but
+the architecture became fragile.
+
+the final remaining bug: a probabilistic SIGSEGV in `std.Io.fiber.contextSwitch`
+(fiber.zig:29) that manifests every 30-90 minutes under sustained load with
+~2,800 fibers. under ReleaseSafe it's an immediate GPF; under ReleaseFast
+it's probabilistic. the repro (`scripts/repro_evented.zig`) demonstrates the
+ReleaseSafe case. this is a zig stdlib bug — the inline asm that swaps
+rsp/rbp/rip between fiber contexts corrupts state under certain optimizer
+configurations.
+
+## upstream status (checked 2026-04-05)
+
+- `fiber.zig`: **unchanged** between dev.3059 (our pin) and dev.3091 (latest)
+- uring networking: **still fully stubbed** — all six `*Unavailable` functions
+- zig team position: Evented is "experimental" with "important followup work
+to be done before they can be used reliably" (HN, feb 2026)
+- `Io.Threaded` is the recommended production backend
+
+## what we kept
+
+- `patches/uring-networking.patch` — the networking implementation, for
+reference and to resume from when upstream catches up
+- `scripts/repro_evented.zig` — minimal reproduction of the fiber GPF
+- `docs/stdlib-patches.md` — catalog of all 6 workarounds we needed
+- this document
+
+## when to try again
+
+revisit when:
+1. upstream implements uring networking (zig#31723 closed)
+2. `fiber.zig` gets meaningful changes (context switch rewrite or fix)
+3. a zig release (not dev build) ships with Evented marked as stable
+
+the dual-Io architecture, DbRequestQueue, and cross-Io segregation patterns
+are all still in the codebase (dead code paths behind `Backend == Io.Evented`
+checks). switching back is a one-line change in `src/main.zig:58` plus
+re-enabling the patch in the Dockerfile.
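Reviewer note: the DbRequestQueue named in docs/evented-attempt.md is not part of this diff, so its real shape is not visible here. As a rough illustration of the general idea only (a multi-producer, single-consumer hand-off built purely on atomics, so producers never call a backend-specific `Io.Mutex`), a minimal sketch might look like the following. `Request`, `RequestQueue`, `push`, and `drain` are hypothetical names, the consumer wake-up path is omitted, and nothing here is claimed to match zlay's implementation.

```zig
const std = @import("std");

// Hypothetical request record; a real one would carry a query and a reply slot.
const Request = struct {
    id: u32,
    next: ?*Request = null,
};

// Hypothetical MPSC queue: many producers push, exactly one consumer drains.
const RequestQueue = struct {
    head: std.atomic.Value(?*Request) = std.atomic.Value(?*Request).init(null),

    // Producer side: link the request in front of the current head with a CAS loop.
    fn push(self: *RequestQueue, req: *Request) void {
        var old = self.head.load(.monotonic);
        while (true) {
            req.next = old;
            // cmpxchgWeak returns null on success, or the observed head on failure.
            old = self.head.cmpxchgWeak(old, req, .release, .monotonic) orelse return;
        }
    }

    // Consumer side: take the whole list at once, then reverse it so the
    // caller sees requests in the order they were pushed.
    fn drain(self: *RequestQueue) ?*Request {
        var node = self.head.swap(null, .acquire);
        var reversed: ?*Request = null;
        while (node) |n| {
            node = n.next;
            n.next = reversed;
            reversed = n;
        }
        return reversed;
    }
};

test "drain returns requests in push order" {
    var q = RequestQueue{};
    var a = Request{ .id = 1 };
    var b = Request{ .id = 2 };
    var c = Request{ .id = 3 };
    q.push(&a);
    q.push(&b);
    q.push(&c);

    var expected: u32 = 1;
    var it = q.drain();
    while (it) |r| : (it = r.next) {
        try std.testing.expectEqual(expected, r.id);
        expected += 1;
    }
    try std.testing.expectEqual(@as(u32, 4), expected);
    try std.testing.expect(q.drain() == null);
}
```

The sketch deliberately leaves out how the Threaded worker is woken when work arrives; per the doc above, that is exactly where the backend-specific futex primitives become a problem, so any real bridge has to pick one backend's wait mechanism and stay on it.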
+11 -7
src/main.zig
@@ -48,14 +48,18 @@
 pub const default_stack_size = 8 * 1024 * 1024;

 // -- Io backend selection --
-// Io.Evented (fibers on io_uring): ~35 threads instead of ~2,800 (one per PDS).
-// Networking via patched Uring.zig (patches/uring-networking.patch) — implements
-// listen, accept, connect, read, write, send via io_uring opcodes. DNS (netLookup)
-// is NOT patched; subscribers resolve hostnames through pool_io (Threaded) instead.
+// Io.Threaded: one OS thread per concurrent task (~2,800 subscribers + services).
+// not what we want long-term — Evented (fibers on io_uring) drops thread count
+// from ~2,800 to ~35 — but the upstream fiber machinery is not ready.
 //
-// Known issue: Io.Uring GPFs under ReleaseSafe (optimizer + safety interaction).
-// Build with ReleaseFast. See scripts/repro_evented.zig for minimal reproduction.
-const Backend = Io.Evented;
+// We built and shipped a full Evented backend (see docs/evented-attempt.md and
+// patches/uring-networking.patch). After 28 commits fixing cross-Io crashes,
+// heap corruption, mutex incompatibilities, and a ReleaseSafe GPF in fiber
+// context-switch, the remaining bug is a probabilistic SIGSEGV under ReleaseFast
+// (~every 30-90 min under load). Root cause is in std.Io.fiber.contextSwitch
+// (zig 0.16-dev stdlib). As of dev.3091, fiber.zig is unchanged and Uring
+// networking is still fully stubbed upstream. Switch back when upstream stabilizes.
+const Backend = Io.Threaded;

 var backend: Backend = undefined;
 var debug_threaded_io: Io.Threaded = undefined;
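
Reviewer note: the `Backend` constant is what the `Backend == Io.Evented` checks mentioned in docs/evented-attempt.md key off. A minimal sketch of that compile-time pattern follows, assuming stand-in types in place of the real `std.Io.Threaded` and `std.Io.Evented` (whose 0.16-dev API is not reproduced here). `describe` and `needsUringPatch` are hypothetical helpers, not functions from zlay.

```zig
const std = @import("std");

// Stand-ins for std.Io.Threaded / std.Io.Evented so the sketch compiles
// without depending on the unstable 0.16-dev std.Io API.
const Io = struct {
    pub const Threaded = struct {};
    pub const Evented = struct {};
};

// The one-line switch: change this to Io.Evented to select the other paths.
const Backend = Io.Threaded;

// Hypothetical helper: would the build need the patched Uring.zig?
fn needsUringPatch(comptime B: type) bool {
    return B == Io.Evented;
}

// Because B is comptime-known, the branch that is not taken is never
// analyzed, which is what lets Evented-only code sit in the tree as dead code.
fn describe(comptime B: type) []const u8 {
    if (B == Io.Evented) {
        return "evented: fibers on io_uring, needs patched Uring.zig";
    } else {
        return "threaded: one OS thread per task";
    }
}

test "Threaded is selected and the Evented path is compiled out" {
    try std.testing.expect(!needsUringPatch(Backend));
    try std.testing.expectEqualStrings("threaded: one OS thread per task", describe(Backend));
}
```

Comparing types at compile time keeps both code paths in one source tree without runtime cost, at the price that the unselected path is not type-checked until someone flips the constant.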