atproto relay implementation in zig zlay.waow.tech
re-enable Evented backend: production SIGSEGV was websocket bug, not fibers

the SIGSEGV that prompted the Threaded revert was a TCP split mid-CRLF
in the websocket handshake reader (fixed in 9ac64da), not a fiber
context-switch issue. re-enabling Evented + ReleaseSafe to see it through.
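
The failure mode described above can be illustrated with a small sketch. It assumes the handshake reader is looking for the blank line (`\r\n\r\n`) that ends the HTTP upgrade headers, and that a single read can stop partway through that sequence. `HeaderEnd` and `feed` are hypothetical names for illustration; this is not the code from 9ac64da. The matcher keeps its position between reads, so a terminator split across two TCP segments is still found.

```zig
const std = @import("std");

/// Incremental matcher for the "\r\n\r\n" that ends a handshake header block.
/// Hypothetical helper for illustration only, not zlay's or websocket.zig's code.
const HeaderEnd = struct {
    /// How many bytes of "\r\n\r\n" have been matched so far (0 to 4).
    matched: u8 = 0,

    /// Feed one chunk, as returned by a single read. Returns how many bytes
    /// of `chunk` belong to the header block (terminator included) once the
    /// terminator is complete, or null if more data is needed.
    fn feed(self: *HeaderEnd, chunk: []const u8) ?usize {
        for (chunk, 0..) |byte, i| {
            switch (byte) {
                // "\r" continues the match after "\r\n", otherwise restarts it.
                '\r' => self.matched = if (self.matched == 2) 3 else 1,
                // "\n" only counts directly after a "\r".
                '\n' => self.matched = if (self.matched == 1 or self.matched == 3) self.matched + 1 else 0,
                else => self.matched = 0,
            }
            if (self.matched == 4) return i + 1;
        }
        return null;
    }
};

test "terminator split mid-CRLF across two reads" {
    var m: HeaderEnd = .{};
    // First segment ends after the "\r" of the final CRLF.
    try std.testing.expect(m.feed("GET / HTTP/1.1\r\nUpgrade: websocket\r\n\r") == null);
    // Second segment starts with the missing "\n", followed by frame bytes.
    try std.testing.expect(m.feed("\n\x81\x00").? == 1);
}
```

Because the state lives in the struct rather than in a per-read scan, the reader does not need to re-scan or special-case the boundary between reads. Any bytes after the returned offset belong to the first websocket frame and would be handed to the frame parser.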

if the repro GPF (scripts/repro_evented.zig) hits production code paths,
fallback is ReleaseFast or one-line flip back to Threaded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

+34 -28
+5 -3
Dockerfile
···
  ENV PATH=/opt/zig-x86_64-linux-0.16.0-dev.3059+42e33db9d:$PATH
  WORKDIR /build

- # Uring networking patch preserved in patches/ for when upstream stabilizes
- # (see docs/evented-attempt.md). Not applied — using Io.Threaded backend.
+ # patch Io.Uring networking (stubbed as Unavailable upstream, zig#31723)
+ COPY patches/ patches/
+ RUN patch /opt/zig-x86_64-linux-0.16.0-dev.3059+42e33db9d/lib/std/Io/Uring.zig < patches/uring-networking.patch

  # fetch dependencies first (cacheable — only changes when build.zig.zon changes)
  COPY build.zig build.zig.zon ./
···

  # then copy source and build
  COPY src/ src/
- # ReleaseSafe: Evented GPF was the reason for ReleaseFast — back to Threaded now.
+ # ReleaseSafe: the production SIGSEGV was a websocket bug, not a fiber issue.
+ # trying Evented + ReleaseSafe — if the repro GPF hits production, fall back to ReleaseFast.
  RUN zig build -Doptimize=ReleaseSafe -Dcpu=baseline -Dtarget=x86_64-linux-gnu

  FROM --platform=linux/amd64 debian:bookworm-slim
+19 -14
docs/evented-attempt.md
···
- # Evented backend attempt (2026-03 to 2026-04)
+ # Evented backend (2026-03 to present)
+
+ log of getting zlay running on `Io.Evented` (io_uring fibers) instead of
+ `Io.Threaded` (one OS thread per task). goal: drop from ~2,800 threads to ~35.

- we spent roughly a month trying to run zlay on `Io.Evented` (io_uring fibers)
- instead of `Io.Threaded` (one OS thread per task). the goal was to drop from
- ~2,800 threads to ~35. it didn't work out — the upstream fiber machinery is
- still experimental and unstable. this doc records what we did so it's not lost.
+ after 28 commits fixing cross-Io issues and a brief revert to Threaded, we
+ discovered the production SIGSEGV was a websocket handshake bug (TCP split
+ mid-CRLF, fixed in websocket.zig `9ac64da`), not a fiber context-switch
+ issue. Evented backend is back, now running ReleaseSafe.

  ## what we built

···
  - `docs/stdlib-patches.md` — catalog of all 6 workarounds we needed
  - this document

- ## when to try again
+ ## timeline

- revisit when:
- 1. upstream implements uring networking (zig#31723 closed)
- 2. `fiber.zig` gets meaningful changes (context switch rewrite or fix)
- 3. a zig release (not dev build) ships with Evented marked as stable
+ - 2026-03-08: first Evented deploy (`39134d1`)
+ - 2026-03-08 to 2026-04-04: 28 commits fixing cross-Io crashes
+ - 2026-04-05: reverted to Threaded after SIGSEGV blamed on fiber machinery
+ - 2026-04-05: discovered real cause — websocket handshake TCP split bug
+ - 2026-04-05: fixed websocket.zig (`9ac64da`), re-enabled Evented + ReleaseSafe

- the dual-Io architecture, DbRequestQueue, and cross-Io segregation patterns
- are all still in the codebase (dead code paths behind `Backend == Io.Evented`
- checks). switching back is a one-line change in `src/main.zig:58` plus
- re-enabling the patch in the Dockerfile.
+ ## fallback
+
+ if Evented + ReleaseSafe hits the repro GPF in production, switch to
+ ReleaseFast in the Dockerfile. if that also crashes, flip `Backend` back
+ to `Io.Threaded` in `src/main.zig:58` — one-line change.
+10 -11
src/main.zig
···
  pub const default_stack_size = 8 * 1024 * 1024;

  // -- Io backend selection --
- // Io.Threaded: one OS thread per concurrent task (~2,800 subscribers + services).
- // not what we want long-term — Evented (fibers on io_uring) drops thread count
- // from ~2,800 to ~35 — but the upstream fiber machinery is not ready.
+ // Io.Evented (fibers on io_uring): ~35 threads instead of ~2,800 (one per PDS).
+ // Networking via patched Uring.zig (patches/uring-networking.patch) — implements
+ // listen, accept, connect, read, write, send via io_uring opcodes. DNS (netLookup)
+ // is NOT patched; subscribers resolve hostnames through pool_io (Threaded) instead.
  //
- // We built and shipped a full Evented backend (see docs/evented-attempt.md and
- // patches/uring-networking.patch). After 28 commits fixing cross-Io crashes,
- // heap corruption, mutex incompatibilities, and a ReleaseSafe GPF in fiber
- // context-switch, the remaining bug is a probabilistic SIGSEGV under ReleaseFast
- // (~every 30-90 min under load). Root cause is in std.Io.fiber.contextSwitch
- // (zig 0.16-dev stdlib). As of dev.3091, fiber.zig is unchanged and Uring
- // networking is still fully stubbed upstream. Switch back when upstream stabilizes.
- const Backend = Io.Threaded;
+ // History: 28 commits of cross-Io fixes (see docs/evented-attempt.md). The
+ // production SIGSEGV that prompted a Threaded revert turned out to be a websocket
+ // handshake bug (TCP split mid-CRLF), not a fiber issue. Fixed in websocket.zig
+ // 9ac64da. The repro (scripts/repro_evented.zig) still GPFs under ReleaseSafe
+ // but production code paths may differ — deploying ReleaseSafe to find out.
+ const Backend = Io.Evented;

  var backend: Backend = undefined;
  var debug_threaded_io: Io.Threaded = undefined;
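
The "one-line flip" works because `Backend` is a compile-time type alias, and code that depends on the backend can compare it against a known type. The sketch below shows that pattern with stand-in types so it compiles on its own. `Threaded`, `Evented`, and `needsDnsFallback` are hypothetical placeholders here, not the `std.Io` types or any function in zlay.

```zig
const std = @import("std");

// Stand-ins for the two backends (assumption: the real ones are
// Io.Threaded and Io.Evented from the zig 0.16-dev stdlib).
const Threaded = struct {
    pub const name = "threaded";
};
const Evented = struct {
    pub const name = "evented";
};

// The one-line flip: point this alias at the other type to switch backends.
const Backend = Evented;

/// Types are comptime values, so this comparison is resolved at compile
/// time. Hypothetical check: with the Evented backend, DNS lookups are
/// routed through a separate Threaded pool.
fn needsDnsFallback() bool {
    return Backend == Evented;
}

test "backend choice is a compile-time constant" {
    try std.testing.expect(needsDnsFallback());
    try std.testing.expectEqualStrings("evented", Backend.name);
}
```

Changing the alias to `Threaded` makes `needsDnsFallback` return false with no other source changes, which is the property the fallback plan relies on.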