# Evented backend (2026-03 to present)
log of getting zlay running on `Io.Evented` (io_uring fibers) instead of
`Io.Threaded` (one OS thread per task). goal: drop from ~2,800 threads to ~35.
after 28 commits fixing cross-Io issues and a brief revert to Threaded, we
discovered the production SIGSEGV was a websocket handshake bug (a TCP packet
split mid-CRLF, fixed in websocket.zig `9ac64da`), not a fiber context-switch
issue. the Evented backend is back, now running ReleaseSafe.
## what we built
- uring networking patch (`patches/uring-networking.patch`): implemented the six networking operations that are stubbed as `error.NetworkDown` upstream — listen, accept, connect, send, read, write. all use proper io_uring opcodes except bind/listen (synchronous syscalls; the io_uring opcodes need kernel ≥ 6.11). tracked upstream as zig#31723.
- dual-Io architecture: Evented for network I/O (subscribers, broadcaster, metrics server), Threaded for blocking work (frame pool, DB, GC, resyncer). careful segregation to avoid cross-Io mutex/futex incompatibilities.
- DNS fallback: `Io.Uring` doesn't implement `netLookup`. subscribers resolve hostnames through `pool_io` (Threaded), then use the socket with Evented I/O for data transfer.
- DbRequestQueue: MPSC queue bridging Evented fibers to Threaded DB workers, replacing attempts at running `pg.Pool` from fiber context. a sketch of the bridge follows this list.
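a minimal sketch of the DbRequestQueue idea, not zlay's production code: a Vyukov-style intrusive MPSC queue built on plain atomics (no `Io.Mutex`, so both Evented fibers and OS threads can touch it) plus an eventfd so the Threaded worker can sleep without going through either backend's futex path. the `Request` and `DbRequestQueue` names and the elided completion path are illustrative assumptions:

```zig
const std = @import("std");

// SKETCH, not the actual DbRequestQueue. Vyukov intrusive MPSC: producers
// (Evented fibers, or any thread) are lock-free; the single consumer is the
// Threaded DB worker. no Io primitive is involved, so the cross-Io futex
// hazard described later cannot occur here.
const Request = struct {
    next: std.atomic.Value(?*Request) = std.atomic.Value(?*Request).init(null),
    sql: []const u8 = "",
    // the real queue also carries a completion handle so the fiber can
    // await the result; elided in this sketch.
};

const DbRequestQueue = struct {
    head: std.atomic.Value(*Request), // producers swap this
    tail: *Request, // consumer-owned
    stub: Request, // dummy node, Vyukov-style
    event_fd: std.posix.fd_t, // backend-neutral wakeup for the worker

    fn init(q: *DbRequestQueue) !void {
        q.stub = .{};
        q.head = std.atomic.Value(*Request).init(&q.stub);
        q.tail = &q.stub;
        q.event_fd = try std.posix.eventfd(0, 0);
    }

    fn pushNode(q: *DbRequestQueue, node: *Request) void {
        node.next.store(null, .release);
        const prev = q.head.swap(node, .acq_rel);
        prev.next.store(node, .release);
    }

    // producer side: wait-free except for the 8-byte eventfd write.
    fn push(q: *DbRequestQueue, req: *Request) void {
        q.pushNode(req);
        const one = std.mem.toBytes(@as(u64, 1));
        _ = std.posix.write(q.event_fd, &one) catch {};
    }

    // consumer side: single-threaded. returns null when empty or when a
    // producer is mid-push (the eventfd wakes us again in that case).
    fn pop(q: *DbRequestQueue) ?*Request {
        var tail = q.tail;
        var next = tail.next.load(.acquire);
        if (tail == &q.stub) {
            const n = next orelse return null; // queue empty
            q.tail = n;
            tail = n;
            next = tail.next.load(.acquire);
        }
        if (next) |n| {
            q.tail = n;
            return tail;
        }
        if (tail != q.head.load(.acquire)) return null; // push in flight
        q.pushNode(&q.stub); // re-insert stub so the last node is reachable
        next = tail.next.load(.acquire);
        const n = next orelse return null;
        q.tail = n;
        return tail;
    }
};

// the Threaded worker: sleeps on the eventfd, drains the queue, runs each
// request against the pg.Pool it owns, then completes the fiber's pending
// result (elided).
fn dbWorker(q: *DbRequestQueue) void {
    var buf: [8]u8 = undefined;
    while (std.posix.read(q.event_fd, &buf)) |_| {
        while (q.pop()) |req| {
            _ = req; // execute on this thread's pg.Pool here
        }
    } else |_| return;
}
```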
## what went wrong
28 commits between `39134d1` (first evented backend) and `b8ef148` (last fix),
most of them crash fixes:
| commit | issue |
|---|---|
| `6674812` | SIGSEGV: plain threads calling Evented `Io.Mutex` → NULL `Thread.current()` |
| `439c678` | startup deadlock: resyncer on Evented fiber blocked the Uring thread |
| `2156d08` | heap corruption: GC using Threaded `pg.Pool` from an Evented fiber |
| `3533416` | heap corruption: subscriber `pg.Pool` access from an Evented fiber |
| `949e9a7` | heap corruption: remaining cross-Io `pg.Pool` paths |
| `72ba680` | use-after-free: broadcaster destroying consumers still referenced |
| `4a23671` | ReleaseSafe GPF: optimizer bug in fiber context switch |
| `c3bc3be` | replace `ev_db` with `DbRequestQueue` (final cross-Io DB fix) |
the fundamental problem: `Io.Mutex`, `Io.Condition`, and `Io.Event` use
backend-specific futex operations. calling a Threaded mutex from an Evented
fiber (or vice versa) dereferences a NULL threadlocal → SIGSEGV or silent
heap corruption. this means every shared resource (`pg.Pool`, the DiskPersist
mutex) must be carefully routed to the right Io backend. we fixed all of
these, but the architecture became fragile.
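a compressed illustration of the trap. this assumes the dev.3059 shape of `std.Io` that we pinned: the `.init`/`.io()`/`lock(io)` names are approximations of an unstable interface, not a documented API:

```zig
const std = @import("std");

// HYPOTHETICAL SKETCH: std.Io is unstable; the Threaded/Evented init and
// Mutex.lock(io) shapes approximate our pinned dev.3059, not a released Zig.
pub fn main() !void {
    var threaded: std.Io.Threaded = .init(std.heap.page_allocator);
    defer threaded.deinit();
    var evented: std.Io.Evented = .init(std.heap.page_allocator);
    defer evented.deinit();

    var m: std.Io.Mutex = .{};

    // fine: every operation on this mutex goes through one backend, so a
    // contended waiter always parks on that backend's futex implementation.
    m.lock(threaded.io());
    m.unlock(threaded.io());

    // the 6674812 bug class: the same mutex touched through the other
    // backend. under contention the Evented wait path reads a fiber-local
    // Thread.current(), which is NULL on a plain OS thread, so it either
    // SIGSEGVs or, if the bogus pointer happens to be mapped, silently
    // corrupts the heap. never let one primitive see two Ios:
    // m.lock(evented.io());
}
```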
the final remaining bug: a probabilistic SIGSEGV in `std.Io.fiber.contextSwitch`
(fiber.zig:29) that manifests every 30-90 minutes under sustained load with
~2,800 fibers. under ReleaseSafe it's an immediate GPF; under ReleaseFast
it's probabilistic. the repro (`scripts/repro_evented.zig`) demonstrates the
ReleaseSafe case. this is a zig stdlib bug — the inline asm that swaps
rsp/rbp/rip between fiber contexts corrupts state under certain optimizer
configurations.
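for flavor, the shape of the load that triggers it. this is not the repro (the real one is `scripts/repro_evented.zig`), and the `io.async`/`Future`/`yield` names below are assumptions against our dev.3059 pin:

```zig
const std = @import("std");

// HYPOTHETICAL SKETCH of the load pattern only; see scripts/repro_evented.zig
// for the real reproduction. API names approximate our pinned dev.3059.

fn churn(io: std.Io) void {
    // each fiber re-enters the context-switch path constantly, so the
    // inline asm in fiber.zig executes millions of times per minute.
    for (0..1_000_000) |_| io.yield(); // assumed cooperative-yield primitive
}

pub fn main() !void {
    var evented: std.Io.Evented = .init(std.heap.page_allocator);
    defer evented.deinit();
    const io = evented.io();

    // ~2,800 fibers, matching the production subscriber count.
    var futures: [2800]std.Io.Future(void) = undefined;
    for (&futures) |*f| f.* = io.async(churn, .{io});
    for (&futures) |*f| f.await(io);
}
```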
## upstream status (checked 2026-04-05)
- fiber.zig: unchanged between dev.3059 (our pin) and dev.3091 (latest)
- uring networking: still fully stubbed — all six `*Unavailable` functions
- zig team position: Evented is "experimental" with "important followup work to be done before they can be used reliably" (HN, feb 2026)
- `Io.Threaded` is the recommended production backend
## what we kept
- `patches/uring-networking.patch` — the networking implementation, for reference and to resume from when upstream catches up
- `scripts/repro_evented.zig` — minimal reproduction of the fiber GPF
- `docs/stdlib-patches.md` — catalog of all 6 workarounds we needed
- this document
## timeline
- 2026-03-08: first Evented deploy (`39134d1`)
- 2026-03-08 to 2026-04-04: 28 commits fixing cross-Io crashes
- 2026-04-05: reverted to Threaded after the SIGSEGV was blamed on fiber machinery
- 2026-04-05: discovered the real cause — the websocket handshake TCP split bug
- 2026-04-05: fixed websocket.zig (`9ac64da`), re-enabled Evented + ReleaseSafe
## fallback
if Evented + ReleaseSafe hits the repro GPF in production, switch to
ReleaseFast in the Dockerfile. if that also crashes, flip `Backend` back
to `Io.Threaded` in src/main.zig:58 — a one-line change. a guess at the
shape of that line is below.
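we haven't quoted src/main.zig in this log, so the `Backend` declaration here is an assumption about its shape, not the actual source:

```zig
// src/main.zig:58 (hypothetical shape): comptime-select the Io backend.
const Backend = std.Io.Evented; // fallback: swap to std.Io.Threaded
```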