atproto relay implementation in zig (zlay.waow.tech)

Evented backend (2026-03 to present)

log of getting zlay running on Io.Evented (io_uring fibers) instead of Io.Threaded (one OS thread per task). goal: drop from ~2,800 threads to ~35.

after 28 commits fixing cross-Io issues and a brief revert to Threaded, we discovered that the production SIGSEGV was a websocket handshake bug (a TCP segment boundary splitting the terminating CRLF, fixed in websocket.zig 9ac64da), not a fiber context-switch issue. the Evented backend is back, now running ReleaseSafe.
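the bug class is worth a sketch. a handshake parser that scans only the newest read() chunk for the `\r\n\r\n` terminator misses it when a TCP segment boundary lands mid-sequence; scanning the accumulated buffer, backing up a few bytes past the previous end, handles the split. this is an illustrative reconstruction, not the actual websocket.zig code:

```zig
const std = @import("std");

// Hypothetical sketch of the handshake-termination scan. A naive parser
// that searches only the bytes from the latest read() misses "\r\n\r\n"
// when TCP delivers it split across two segments (e.g. "...\r\n\r" then
// "\n"). Scanning the accumulated buffer from a few bytes before the
// previous end catches a terminator straddling the boundary.
fn handshakeComplete(accumulated: []const u8, prev_len: usize) ?usize {
    const start = prev_len -| 3; // saturating: back up over a partial CRLF pair
    const idx = std.mem.indexOfPos(u8, accumulated, start, "\r\n\r\n") orelse return null;
    return idx + 4; // offset where the handshake ends and frames begin
}

test "terminator split across two reads" {
    const part1 = "GET / HTTP/1.1\r\nHost: x\r\n\r";
    const full = part1 ++ "\n";
    try std.testing.expect(handshakeComplete(part1, 0) == null);
    try std.testing.expectEqual(@as(?usize, full.len), handshakeComplete(full, part1.len));
}
```

the naive version, equivalent to calling `indexOfPos` with `start = prev_len`, returns null on the second read in the test above, which is exactly the hang/desync class of bug described.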

what we built

  • uring networking patch (patches/uring-networking.patch): implemented the six networking operations that are stubbed as error.NetworkDown upstream — listen, accept, connect, send, read, write. all go through proper io_uring opcodes except bind/listen, which stay sync syscalls (their io_uring opcodes require kernel ≥ 6.11). tracked upstream as zig#31723.
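for flavor, the accept path in the patch has roughly this shape. it leans on std.os.linux.IoUring's helper methods, whose exact names and signatures shift between zig versions, so treat this as a sketch rather than the patch itself:

```zig
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;

// Sketch of the patch's split: bind/listen stay synchronous syscalls
// (their io_uring opcodes need kernel >= 6.11), while accept goes
// through the ring. Helper names follow std.os.linux.IoUring as of our
// pin; they are not guaranteed stable.
fn acceptViaRing(ring: *linux.IoUring, listen_fd: posix.fd_t) !posix.fd_t {
    var addr: posix.sockaddr = undefined;
    var addr_len: posix.socklen_t = @sizeOf(posix.sockaddr);
    // queue an IORING_OP_ACCEPT SQE; user_data tags the completion
    _ = try ring.accept(0xACCE, listen_fd, &addr, &addr_len, 0);
    _ = try ring.submit();
    const cqe = try ring.copy_cqe(); // blocks until a CQE arrives
    if (cqe.res < 0) return error.AcceptFailed;
    return @intCast(cqe.res); // the accepted connection's fd
}
```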

  • dual-Io architecture: Evented for network I/O (subscribers, broadcaster, metrics server), Threaded for blocking work (frame pool, DB, GC, resyncer). careful segregation to avoid cross-Io mutex/futex incompatibilities.
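conceptually the wiring is two Io instances living side by side, with each subsystem handed exactly one of them. field names here are illustrative, not the real src/main.zig:

```zig
const std = @import("std");

// Illustrative dual-Io wiring (hypothetical names). The rule: anything
// that suspends on network I/O gets the Evented Io; anything that may
// block an OS thread gets the Threaded Io. A subsystem must never touch
// the other backend's Mutex/Condition/Event.
const Runtime = struct {
    net_io: std.Io, // Evented/io_uring: subscribers, broadcaster, metrics server
    pool_io: std.Io, // Threaded: frame pool, DB workers, GC, resyncer
};
```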

  • DNS fallback: Io.Uring doesn't implement netLookup. subscribers resolve hostnames through pool_io (Threaded), then use the socket with Evented I/O for data transfer.

  • DbRequestQueue: MPSC queue bridging Evented fibers to Threaded DB workers, replacing attempts at running pg.Pool from fiber context.
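the pattern, stripped down (our sketch with made-up names, not the real DbRequestQueue): fibers take a plain std.Thread.Mutex only for the brief push, and only Threaded DB workers ever park on the condition variable, so neither side enters the other backend's blocking primitives:

```zig
const std = @import("std");

// Sketch of the bridging queue (illustrative, not the real code): a
// bounded ring buffer guarded by an OS-level std.Thread.Mutex. Evented
// fibers call push (short critical section, never waits on the
// condition); Threaded workers call pop and may park in wait(), which
// is fine because they are real OS threads.
fn Queue(comptime T: type, comptime capacity: usize) type {
    return struct {
        const Self = @This();
        mutex: std.Thread.Mutex = .{},
        not_empty: std.Thread.Condition = .{},
        buf: [capacity]T = undefined,
        head: usize = 0,
        len: usize = 0,

        fn push(self: *Self, item: T) void {
            self.mutex.lock();
            defer self.mutex.unlock();
            std.debug.assert(self.len < capacity); // sketch: no backpressure
            self.buf[(self.head + self.len) % capacity] = item;
            self.len += 1;
            self.not_empty.signal(); // wake one DB worker
        }

        fn pop(self: *Self) T {
            self.mutex.lock();
            defer self.mutex.unlock();
            while (self.len == 0) self.not_empty.wait(&self.mutex);
            const item = self.buf[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            return item;
        }
    };
}
```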

what went wrong

28 commits between 39134d1 (first evented backend) and b8ef148 (last fix), most of them crash fixes:

commit   issue
6674812  SIGSEGV: plain threads calling Evented Io.Mutex → NULL Thread.current()
439c678  startup deadlock: resyncer on Evented fiber blocked Uring thread
2156d08  heap corruption: GC using Threaded pg.Pool from Evented fiber
3533416  heap corruption: subscriber pg.Pool access from Evented fiber
949e9a7  heap corruption: remaining cross-Io pg.Pool paths
72ba680  use-after-free: broadcaster destroying consumers still referenced
4a23671  ReleaseSafe GPF: optimizer bug in fiber context-switch
c3bc3be  replace ev_db with DbRequestQueue (final cross-Io DB fix)

the fundamental problem: Io.Mutex, Io.Condition, and Io.Event use backend-specific futex operations. calling a Threaded mutex from an Evented fiber (or vice versa) dereferences a NULL threadlocal → SIGSEGV or silent heap corruption. this means any shared resource (pg.Pool, DiskPersist mutex) must be carefully routed to the right Io backend. we fixed all of these, but the architecture became fragile.
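the crash mechanism can be modeled in a few lines. this is a toy, not zig's actual Io internals: the Evented blocking path assumes a per-thread fiber pointer that only Evented worker threads ever initialize.

```zig
const std = @import("std");

const Fiber = struct { id: u32 };

// Toy model of the failure mode (not the real std.Io internals).
// Evented worker threads set this before running fibers; plain OS
// threads and Threaded-backend threads never do.
threadlocal var current_fiber: ?*Fiber = null;

fn eventedLockSlowPath() void {
    // the Evented mutex parks "the current fiber" when contended. called
    // from a thread that never set the threadlocal, the unwrap below is
    // a null dereference: a checked panic in Debug/ReleaseSafe, SIGSEGV
    // or silent corruption in ReleaseFast.
    const fiber = current_fiber.?;
    park(fiber);
}

fn park(fiber: *Fiber) void {
    _ = fiber; // suspend machinery elided in this sketch
}
```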

the final remaining bug: a SIGSEGV in std.Io.fiber.contextSwitch (fiber.zig:29). under ReleaseSafe it's an immediate GPF; under ReleaseFast it's probabilistic, manifesting every 30-90 minutes under sustained load with ~2,800 fibers. the repro (scripts/repro_evented.zig) demonstrates the ReleaseSafe case. this is a zig stdlib bug: the inline asm that swaps rsp/rbp/rip between fiber contexts corrupts state under certain optimizer configurations.

upstream status (checked 2026-04-05)

  • fiber.zig: unchanged between dev.3059 (our pin) and dev.3091 (latest)
  • uring networking: still fully stubbed — all six *Unavailable functions
  • zig team position: Evented is "experimental" with "important followup work to be done before they can be used reliably" (HN, feb 2026)
  • Io.Threaded is the recommended production backend

what we kept

  • patches/uring-networking.patch — the networking implementation, for reference and to resume from when upstream catches up
  • scripts/repro_evented.zig — minimal reproduction of the fiber GPF
  • docs/stdlib-patches.md — catalog of all 6 workarounds we needed
  • this document

timeline

  • 2026-03-08: first Evented deploy (39134d1)
  • 2026-03-08 to 2026-04-04: 28 commits fixing cross-Io crashes
  • 2026-04-05: reverted to Threaded after SIGSEGV blamed on fiber machinery
  • 2026-04-05: discovered real cause — websocket handshake TCP split bug
  • 2026-04-05: fixed websocket.zig (9ac64da), re-enabled Evented + ReleaseSafe

fallback

if Evented + ReleaseSafe hits the repro GPF in production, switch to ReleaseFast in the Dockerfile. if that also crashes, flip Backend back to Io.Threaded in src/main.zig:58 — one-line change.