atproto relay implementation in zig (zlay.waow.tech)

Evented backend (2026-03 to present)

log of getting zlay running on Io.Evented (io_uring fibers) instead of Io.Threaded (one OS thread per task). goal: drop from ~2,800 threads to ~35.

after 28 commits fixing cross-Io issues and a brief revert to Threaded, we discovered that the production SIGSEGV was a websocket handshake bug (a TCP segment boundary splitting the terminating CRLF, fixed in websocket.zig 9ac64da), not a fiber context-switch issue. the Evented backend is back, now running ReleaseSafe.
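the bug class is worth a sketch. a handshake parser that scans only the newest read() chunk for the `\r\n\r\n` terminator misses it when a TCP segment boundary lands mid-sequence; scanning the accumulated buffer, backing up a few bytes past the previous end, handles the split. this is an illustrative reconstruction, not the actual websocket.zig code:

```zig
const std = @import("std");

// Hypothetical sketch of the handshake-termination scan. A naive parser
// that searches only the bytes from the latest read() misses "\r\n\r\n"
// when TCP delivers it split across two segments (e.g. "...\r\n\r" then
// "\n"). Scanning the accumulated buffer from a few bytes before the
// previous end catches a terminator straddling the boundary.
fn handshakeComplete(accumulated: []const u8, prev_len: usize) ?usize {
    const start = prev_len -| 3; // saturating: back up over a partial CRLF pair
    const idx = std.mem.indexOfPos(u8, accumulated, start, "\r\n\r\n") orelse return null;
    return idx + 4; // offset where the handshake ends and frames begin
}

test "terminator split across two reads" {
    const part1 = "GET / HTTP/1.1\r\nHost: x\r\n\r";
    const full = part1 ++ "\n";
    try std.testing.expect(handshakeComplete(part1, 0) == null);
    try std.testing.expectEqual(@as(?usize, full.len), handshakeComplete(full, part1.len));
}
```

the naive version, equivalent to calling `indexOfPos` with `start = prev_len`, returns null on the second read in the test above, which is exactly the hang/desync class of bug described.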

what we built

  • uring networking patch (patches/uring-networking.patch): implemented the six networking operations that are stubbed as error.NetworkDown upstream — listen, accept, connect, send, read, write. all go through proper io_uring opcodes except bind/listen, which stay sync syscalls (their io_uring opcodes require kernel ≥ 6.11). tracked upstream as zig#31723.
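for flavor, the accept path in the patch has roughly this shape. it leans on std.os.linux.IoUring's helper methods, whose exact names and signatures shift between zig versions, so treat this as a sketch rather than the patch itself:

```zig
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;

// Sketch of the patch's split: bind/listen stay synchronous syscalls
// (their io_uring opcodes need kernel >= 6.11), while accept goes
// through the ring. Helper names follow std.os.linux.IoUring as of our
// pin; they are not guaranteed stable.
fn acceptViaRing(ring: *linux.IoUring, listen_fd: posix.fd_t) !posix.fd_t {
    var addr: posix.sockaddr = undefined;
    var addr_len: posix.socklen_t = @sizeOf(posix.sockaddr);
    // queue an IORING_OP_ACCEPT SQE; user_data tags the completion
    _ = try ring.accept(0xACCE, listen_fd, &addr, &addr_len, 0);
    _ = try ring.submit();
    const cqe = try ring.copy_cqe(); // blocks until a CQE arrives
    if (cqe.res < 0) return error.AcceptFailed;
    return @intCast(cqe.res); // the accepted connection's fd
}
```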

  • dual-Io architecture: Evented for network I/O (subscribers, broadcaster, metrics server), Threaded for blocking work (frame pool, DB, GC, resyncer). careful segregation to avoid cross-Io mutex/futex incompatibilities.
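conceptually the wiring is two Io instances living side by side, with each subsystem handed exactly one of them. field names here are illustrative, not the real src/main.zig:

```zig
const std = @import("std");

// Illustrative dual-Io wiring (hypothetical names). The rule: anything
// that suspends on network I/O gets the Evented Io; anything that may
// block an OS thread gets the Threaded Io. A subsystem must never touch
// the other backend's Mutex/Condition/Event.
const Runtime = struct {
    net_io: std.Io, // Evented/io_uring: subscribers, broadcaster, metrics server
    pool_io: std.Io, // Threaded: frame pool, DB workers, GC, resyncer
};
```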

  • DNS fallback: Io.Uring doesn't implement netLookup. subscribers resolve hostnames through pool_io (Threaded), then use the socket with Evented I/O for data transfer.

  • DbRequestQueue: MPSC queue bridging Evented fibers to Threaded DB workers, replacing attempts at running pg.Pool from fiber context.
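the pattern, stripped down (our sketch with made-up names, not the real DbRequestQueue): fibers take a plain std.Thread.Mutex only for the brief push, and only Threaded DB workers ever park on the condition variable, so neither side enters the other backend's blocking primitives:

```zig
const std = @import("std");

// Sketch of the bridging queue (illustrative, not the real code): a
// bounded ring buffer guarded by an OS-level std.Thread.Mutex. Evented
// fibers call push (short critical section, never waits on the
// condition); Threaded workers call pop and may park in wait(), which
// is fine because they are real OS threads.
fn Queue(comptime T: type, comptime capacity: usize) type {
    return struct {
        const Self = @This();
        mutex: std.Thread.Mutex = .{},
        not_empty: std.Thread.Condition = .{},
        buf: [capacity]T = undefined,
        head: usize = 0,
        len: usize = 0,

        fn push(self: *Self, item: T) void {
            self.mutex.lock();
            defer self.mutex.unlock();
            std.debug.assert(self.len < capacity); // sketch: no backpressure
            self.buf[(self.head + self.len) % capacity] = item;
            self.len += 1;
            self.not_empty.signal(); // wake one DB worker
        }

        fn pop(self: *Self) T {
            self.mutex.lock();
            defer self.mutex.unlock();
            while (self.len == 0) self.not_empty.wait(&self.mutex);
            const item = self.buf[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            return item;
        }
    };
}
```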

what went wrong

28 commits between 39134d1 (first evented backend) and b8ef148 (last fix), most of them crash fixes:

commit   issue
6674812  SIGSEGV: plain threads calling Evented Io.Mutex → NULL Thread.current()
439c678  startup deadlock: resyncer on Evented fiber blocked Uring thread
2156d08  heap corruption: GC using Threaded pg.Pool from Evented fiber
3533416  heap corruption: subscriber pg.Pool access from Evented fiber
949e9a7  heap corruption: remaining cross-Io pg.Pool paths
72ba680  use-after-free: broadcaster destroying consumers still referenced
4a23671  ReleaseSafe GPF: optimizer bug in fiber context-switch
c3bc3be  replace ev_db with DbRequestQueue (final cross-Io DB fix)

the fundamental problem: Io.Mutex, Io.Condition, and Io.Event use backend-specific futex operations. calling a Threaded mutex from an Evented fiber (or vice versa) dereferences a NULL threadlocal → SIGSEGV or silent heap corruption. this means any shared resource (pg.Pool, DiskPersist mutex) must be carefully routed to the right Io backend. we fixed all of these, but the architecture became fragile.
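the crash mechanism can be modeled in a few lines. this is a toy, not zig's actual Io internals: the Evented blocking path assumes a per-thread fiber pointer that only Evented worker threads ever initialize.

```zig
const std = @import("std");

const Fiber = struct { id: u32 };

// Toy model of the failure mode (not the real std.Io internals).
// Evented worker threads set this before running fibers; plain OS
// threads and Threaded-backend threads never do.
threadlocal var current_fiber: ?*Fiber = null;

fn eventedLockSlowPath() void {
    // the Evented mutex parks "the current fiber" when contended. called
    // from a thread that never set the threadlocal, the unwrap below is
    // a null dereference: a checked panic in Debug/ReleaseSafe, SIGSEGV
    // or silent corruption in ReleaseFast.
    const fiber = current_fiber.?;
    park(fiber);
}

fn park(fiber: *Fiber) void {
    _ = fiber; // suspend machinery elided in this sketch
}
```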

the final remaining bug: a SIGSEGV in std.Io.fiber.contextSwitch (fiber.zig:29). under ReleaseSafe it's an immediate GPF; under ReleaseFast it's probabilistic, manifesting every 30-90 minutes under sustained load with ~2,800 fibers. the repro (scripts/repro_evented.zig) demonstrates the ReleaseSafe case. this is a zig stdlib bug: the inline asm that swaps rsp/rbp/rip between fiber contexts corrupts state under certain optimizer configurations.

upstream status (checked 2026-04-05)

  • fiber.zig: unchanged between dev.3059 (our pin) and dev.3091 (latest)
  • uring networking: still fully stubbed — all six *Unavailable functions
  • zig team position: Evented is "experimental" with "important followup work to be done before they can be used reliably" (HN, feb 2026)
  • Io.Threaded is the recommended production backend

what we kept

  • patches/uring-networking.patch — the networking implementation, for reference and to resume from when upstream catches up
  • scripts/repro_evented.zig — minimal reproduction of the fiber GPF
  • docs/stdlib-patches.md — catalog of all 6 workarounds we needed
  • this document

timeline

  • 2026-03-08: first Evented deploy (39134d1)
  • 2026-03-08 to 2026-04-04: 28 commits fixing cross-Io crashes
  • 2026-04-05: reverted to Threaded after SIGSEGV blamed on fiber machinery
  • 2026-04-05: discovered real cause — websocket handshake TCP split bug
  • 2026-04-05: fixed websocket.zig (9ac64da), re-enabled Evented + ReleaseSafe

fallback

if Evented + ReleaseSafe hits the repro GPF in production, switch to ReleaseFast in the Dockerfile. if that also crashes, flip Backend back to Io.Threaded in src/main.zig:58 — one-line change.