# Evented backend (2026-03 to present)

log of getting zlay running on `Io.Evented` (io_uring fibers) instead of
`Io.Threaded` (one OS thread per task). goal: drop from ~2,800 threads to ~35.

after 28 commits fixing cross-Io issues and a brief revert to Threaded, we
discovered the production SIGSEGV was a websocket handshake bug (TCP split
mid-CRLF, fixed in websocket.zig `9ac64da`), not a fiber context-switch
issue. Evented backend is back, now running ReleaseSafe.
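
a minimal sketch of that class of bug, not the actual code in websocket.zig.
it assumes the parser was looking for the `\r\n\r\n` header terminator inside
each chunk as it arrived, so a TCP segment boundary falling inside the
terminator hid it. the hypothetical `HandshakeBuffer` below accumulates bytes
and rescans the last 3 bytes it already held, so a split terminator is still
found. every name and the 4096-byte cap are assumptions for illustration.

```zig
const std = @import("std");

/// Accumulates handshake bytes across reads and reports where the header
/// block ends. Hypothetical helper, not the code in websocket.zig.
const HandshakeBuffer = struct {
    buf: [4096]u8 = undefined,
    len: usize = 0,

    /// Append one chunk as delivered by a single read. Returns the index just
    /// past "\r\n\r\n" once the full terminator has arrived, else null.
    fn feed(self: *HandshakeBuffer, chunk: []const u8) error{HeaderTooLarge}!?usize {
        if (chunk.len > self.buf.len - self.len) return error.HeaderTooLarge;
        @memcpy(self.buf[self.len..][0..chunk.len], chunk);
        // Rescan from up to 3 bytes before the new data so a terminator that
        // straddles the chunk boundary is still found.
        const start = if (self.len >= 3) self.len - 3 else 0;
        self.len += chunk.len;
        if (std.mem.indexOf(u8, self.buf[start..self.len], "\r\n\r\n")) |i| {
            return start + i + 4;
        }
        return null;
    }
};

test "terminator split mid-CRLF" {
    var hb = HandshakeBuffer{};
    // First segment ends between '\r' and '\n' of the final CRLF.
    try std.testing.expect((try hb.feed("HTTP/1.1 101 Switching Protocols\r\n\r")) == null);
    // The lone '\n' completes the terminator; the header ends at the buffer end.
    const end = (try hb.feed("\n")).?;
    try std.testing.expectEqual(hb.len, end);
}
```

run with `zig test` on the file. the point of the design is that the scan
window depends on the accumulated buffer, never on how the kernel happened to
slice the stream.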

## what we built

- **uring networking patch** (`patches/uring-networking.patch`): implemented 6
  networking operations that are stubbed as `error.NetworkDown` upstream:
  listen, accept, connect, send, read, write. all use proper io_uring opcodes
  except bind/listen, which stay synchronous syscalls (their io_uring opcodes
  need kernel 6.11+). tracked upstream as zig#31723.

- **dual-Io architecture**: Evented for network I/O (subscribers, broadcaster,
  metrics server), Threaded for blocking work (frame pool, DB, GC, resyncer).
  careful segregation to avoid cross-Io mutex/futex incompatibilities.

- **DNS fallback**: `Io.Uring` doesn't implement `netLookup`. subscribers
  resolve hostnames through `pool_io` (Threaded), then use the socket with
  Evented I/O for data transfer.

- **DbRequestQueue**: MPSC queue bridging Evented fibers to Threaded DB
  workers, replacing attempts at running pg.Pool from fiber context.
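
to illustrate the DbRequestQueue idea, here is a rough sketch of a lock-free
MPSC queue. it is not zlay's implementation: `Request`, `RequestQueue`, `push`
and `drain` are hypothetical names, and the wake-up of the DB worker is left
out because that is exactly the backend-specific part. producers only touch
one atomic pointer, so no `Io.Mutex` from either backend is involved.

```zig
const std = @import("std");

/// One queued DB request. Fields are hypothetical placeholders.
const Request = struct {
    id: u32,
    next: ?*Request = null,
};

/// Intrusive multi-producer, single-consumer queue built on one atomic head.
const RequestQueue = struct {
    head: std.atomic.Value(?*Request) = std.atomic.Value(?*Request).init(null),

    /// Safe to call from any number of producers concurrently.
    fn push(self: *RequestQueue, req: *Request) void {
        var old = self.head.load(.monotonic);
        while (true) {
            req.next = old;
            // On failure cmpxchgWeak returns the current head; retry with it.
            old = self.head.cmpxchgWeak(old, req, .release, .monotonic) orelse return;
        }
    }

    /// Single consumer only: detach everything pushed so far and return it
    /// as a linked list, oldest request first.
    fn drain(self: *RequestQueue) ?*Request {
        var node = self.head.swap(null, .acquire);
        var fifo: ?*Request = null;
        // The detached list is newest-first; reverse it in place.
        while (node) |n| {
            node = n.next;
            n.next = fifo;
            fifo = n;
        }
        return fifo;
    }
};
```

the caller owns each `Request` until the consumer has finished with it, so the
queue itself never allocates.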

## what went wrong

28 commits between `39134d1` (first evented backend) and `b8ef148` (last fix),
most of them crash fixes:

| commit | issue |
|--------|-------|
| `6674812` | SIGSEGV: plain threads calling Evented Io.Mutex → NULL Thread.current() |
| `439c678` | startup deadlock: resyncer on Evented fiber blocked Uring thread |
| `2156d08` | heap corruption: GC using Threaded pg.Pool from Evented fiber |
| `3533416` | heap corruption: subscriber pg.Pool access from Evented fiber |
| `949e9a7` | heap corruption: remaining cross-Io pg.Pool paths |
| `72ba680` | use-after-free: broadcaster destroying consumers still referenced |
| `4a23671` | ReleaseSafe GPF: optimizer bug in fiber context-switch |
| `c3bc3be` | replace ev_db with DbRequestQueue (final cross-Io DB fix) |

the fundamental problem: `Io.Mutex`, `Io.Condition`, and `Io.Event` use
backend-specific futex operations. calling a Threaded mutex from an Evented
fiber (or vice versa) dereferences a NULL threadlocal → SIGSEGV or silent
heap corruption. this means any shared resource (pg.Pool, DiskPersist mutex)
must be carefully routed to the right Io backend. we fixed all of these, but
the architecture became fragile.
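
one way to make this routing rule fail loudly instead of corrupting the heap
is to tag each shared resource with its owning backend and check the tag on
access. the sketch below is purely hypothetical: `Backend`, `current_backend`
and `Owned` are illustrative names, nothing here is provided by `std.Io`, and
it assumes every task entry point sets `current_backend` for its OS thread.

```zig
const std = @import("std");

/// Which Io backend owns a resource or is running the current task.
const Backend = enum { evented, threaded };

/// Set at the top of every task entry point (assumption of this sketch).
threadlocal var current_backend: ?Backend = null;

/// A shared resource tagged with the backend whose primitives guard it.
fn Owned(comptime T: type) type {
    return struct {
        owner: Backend,
        value: T,

        const Self = @This();

        /// Returns the value only when called from the owning backend;
        /// an untagged thread is treated as the wrong backend.
        fn get(self: *Self) error{WrongBackend}!*T {
            const cur = current_backend orelse return error.WrongBackend;
            if (cur != self.owner) return error.WrongBackend;
            return &self.value;
        }
    };
}

test "access is refused from the wrong backend" {
    var pool = Owned(u32){ .owner = .threaded, .value = 7 };
    current_backend = .evented;
    try std.testing.expectError(error.WrongBackend, pool.get());
    current_backend = .threaded;
    try std.testing.expectEqual(@as(u32, 7), (try pool.get()).*);
}
```

a check like this costs one threadlocal read per access and turns a cross-Io
call into an ordinary error rather than a NULL dereference.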

the last bug we chased: a probabilistic SIGSEGV that appeared to be in
`std.Io.fiber.contextSwitch` (fiber.zig:29), showing up every 30-90 minutes
under sustained load with ~2,800 fibers. under ReleaseSafe it was an immediate
GPF; under ReleaseFast it was probabilistic. the repro
(`scripts/repro_evented.zig`) demonstrates the ReleaseSafe case. at the time
we attributed this to a zig stdlib bug: the inline asm that swaps rsp/rbp/rip
between fiber contexts corrupting state under certain optimizer
configurations. the production SIGSEGV was later traced to the websocket
handshake bug (`9ac64da`) instead.

## upstream status (checked 2026-04-05)

- `fiber.zig`: **unchanged** between dev.3059 (our pin) and dev.3091 (latest)
- uring networking: **still fully stubbed** (all six `*Unavailable` functions)
- zig team position: Evented is "experimental" with "important followup work
  to be done before they can be used reliably" (HN, feb 2026)
- `Io.Threaded` is the recommended production backend

## what we kept

- `patches/uring-networking.patch`: the networking implementation, for
  reference and to resume from when upstream catches up
- `scripts/repro_evented.zig`: minimal reproduction of the fiber GPF
- `docs/stdlib-patches.md`: catalog of all 6 workarounds we needed
- this document

## timeline

- 2026-03-08: first Evented deploy (`39134d1`)
- 2026-03-08 to 2026-04-04: 28 commits fixing cross-Io crashes
- 2026-04-05: reverted to Threaded after SIGSEGV blamed on fiber machinery
- 2026-04-05: discovered real cause: websocket handshake TCP split bug
- 2026-04-05: fixed websocket.zig (`9ac64da`), re-enabled Evented + ReleaseSafe

## fallback

if Evented + ReleaseSafe hits the repro GPF in production, switch to
ReleaseFast in the Dockerfile. if that also crashes, flip `Backend` back
to `Io.Threaded` in `src/main.zig:58` (a one-line change).