atproto relay implementation in zig zlay.waow.tech

# Evented backend (2026-03 to present)

log of getting zlay running on `Io.Evented` (io_uring fibers) instead of
`Io.Threaded` (one OS thread per task). goal: drop from ~2,800 threads to ~35.

after 28 commits fixing cross-Io issues and a brief revert to Threaded, we
discovered the production SIGSEGV was a websocket handshake bug (TCP split
mid-CRLF, fixed in websocket.zig `9ac64da`), not a fiber context-switch
issue. Evented backend is back, now running ReleaseSafe.

## what we built

- **uring networking patch** (`patches/uring-networking.patch`): implemented 6
  networking operations that are stubbed as `error.NetworkDown` upstream:
  listen, accept, connect, send, read, write. all use proper io_uring opcodes
  except bind/listen, which stay synchronous syscalls because their io_uring
  opcodes need kernel 6.11+. tracked upstream as zig#31723.

- **dual-Io architecture**: Evented for network I/O (subscribers, broadcaster,
  metrics server), Threaded for blocking work (frame pool, DB, GC, resyncer).
  careful segregation to avoid cross-Io mutex/futex incompatibilities.

- **DNS fallback**: `Io.Uring` doesn't implement `netLookup`. subscribers
  resolve hostnames through `pool_io` (Threaded), then use the socket with
  Evented I/O for data transfer.

- **DbRequestQueue**: MPSC queue bridging Evented fibers to Threaded DB
  workers, replacing attempts at running pg.Pool from fiber context.
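the DbRequestQueue idea can be sketched outside zig. the block below is a Python analogy, not zlay's code: asyncio tasks stand in for Evented fibers, a single plain thread stands in for the Threaded DB worker, and `run_query`, `submit`, and `close` are hypothetical names. the point it illustrates is that the queue is the only structure both sides touch, so the blocking call never runs in event-loop context.

```python
import asyncio
import queue
import threading


class DbRequestQueue:
    """Hypothetical bridge: event-loop tasks enqueue, one worker thread executes."""

    def __init__(self, run_query):
        self._requests = queue.Queue()  # thread-safe; many producers, one consumer
        self._run_query = run_query     # blocking callable, only called on the worker
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    async def submit(self, sql):
        """Called from event-loop tasks; suspends the task until the worker replies."""
        loop = asyncio.get_running_loop()
        reply = loop.create_future()
        self._requests.put((sql, loop, reply))
        return await reply

    def _worker(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                return
            sql, loop, reply = item
            try:
                result, error = self._run_query(sql), None
            except Exception as exc:
                result, error = None, exc
            # Futures belong to the loop thread, so hand the result back to it
            # instead of resolving the future from the worker thread.
            loop.call_soon_threadsafe(self._resolve, reply, result, error)

    @staticmethod
    def _resolve(reply, result, error):
        if reply.cancelled():
            return
        if error is not None:
            reply.set_exception(error)
        else:
            reply.set_result(result)

    def close(self):
        self._requests.put(None)
        self._thread.join()
```

replies travel back through the loop's own thread-safe scheduling call rather than through a shared lock, which mirrors the rule in this doc: neither side ever waits on the other side's synchronization primitive.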

## what went wrong

28 commits between `39134d1` (first evented backend) and `b8ef148` (last fix),
most of them crash fixes:

| commit | issue |
|--------|-------|
| `6674812` | SIGSEGV: plain threads calling Evented Io.Mutex → NULL Thread.current() |
| `439c678` | startup deadlock: resyncer on Evented fiber blocked Uring thread |
| `2156d08` | heap corruption: GC using Threaded pg.Pool from Evented fiber |
| `3533416` | heap corruption: subscriber pg.Pool access from Evented fiber |
| `949e9a7` | heap corruption: remaining cross-Io pg.Pool paths |
| `72ba680` | use-after-free: broadcaster destroying consumers still referenced |
| `4a23671` | ReleaseSafe GPF: optimizer bug in fiber context-switch |
| `c3bc3be` | replace ev_db with DbRequestQueue (final cross-Io DB fix) |

the fundamental problem: `Io.Mutex`, `Io.Condition`, and `Io.Event` use
backend-specific futex operations. calling a Threaded mutex from an Evented
fiber (or vice versa) dereferences a NULL threadlocal → SIGSEGV or silent
heap corruption. this means any shared resource (pg.Pool, DiskPersist mutex)
must be carefully routed to the right Io backend. we fixed all of these, but
the architecture became fragile.

the final remaining bug: a probabilistic SIGSEGV that surfaced in
`std.Io.fiber.contextSwitch` (fiber.zig:29) every 30-90 minutes under
sustained load with ~2,800 fibers. under ReleaseSafe it's an immediate GPF;
under ReleaseFast it's probabilistic. the repro (`scripts/repro_evented.zig`)
demonstrates the ReleaseSafe case. at the time we attributed this to a zig
stdlib bug: the inline asm that swaps rsp/rbp/rip between fiber contexts
appeared to corrupt state under certain optimizer configurations. the
production crash was later traced to the websocket handshake bug instead
(see timeline).

## upstream status (checked 2026-04-05)

- `fiber.zig`: **unchanged** between dev.3059 (our pin) and dev.3091 (latest)
- uring networking: **still fully stubbed**, all six `*Unavailable` functions
- zig team position: Evented is "experimental" with "important followup work
  to be done before they can be used reliably" (HN, feb 2026)
- `Io.Threaded` is the recommended production backend

## what we kept

- `patches/uring-networking.patch`: the networking implementation, for
  reference and to resume from when upstream catches up
- `scripts/repro_evented.zig`: minimal reproduction of the fiber GPF
- `docs/stdlib-patches.md`: catalog of all 6 workarounds we needed
- this document

## timeline

- 2026-03-08: first Evented deploy (`39134d1`)
- 2026-03-08 to 2026-04-04: 28 commits fixing cross-Io crashes
- 2026-04-05: reverted to Threaded after SIGSEGV blamed on fiber machinery
- 2026-04-05: discovered real cause: websocket handshake TCP split bug
- 2026-04-05: fixed websocket.zig (`9ac64da`), re-enabled Evented + ReleaseSafe

## fallback

if Evented + ReleaseSafe hits the repro GPF in production, switch to
ReleaseFast in the Dockerfile. if that also crashes, flip `Backend` back
to `Io.Threaded` in `src/main.zig:58` (a one-line change).
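for reference, the handshake bug from the timeline (a TCP segment boundary landing mid-CRLF) is the kind of failure an accumulating reader avoids. the block below is a generic Python sketch under that assumption, not the actual fix in `9ac64da`; `HandshakeBuffer`, `feed`, and `max_size` are hypothetical names. it searches the accumulated bytes rather than only the newest read, so a `\r\n\r\n` terminator that straddles two reads is still found.

```python
HEADER_END = b"\r\n\r\n"


class HandshakeBuffer:
    """Accumulates reads until a complete HTTP header block has arrived."""

    def __init__(self, max_size=8192):
        self._buf = bytearray()
        self._max_size = max_size

    def feed(self, chunk):
        """Return (headers, leftover) once the terminator is seen, else None.

        Meant to be called until it first returns a result; leftover holds any
        bytes that arrived after the headers (e.g. the first websocket frame).
        """
        # The terminator may begin up to 3 bytes before the new chunk, so
        # resume the search slightly before the old end of the buffer.
        start = max(0, len(self._buf) - (len(HEADER_END) - 1))
        self._buf += chunk
        end = self._buf.find(HEADER_END, start)
        if end == -1:
            if len(self._buf) > self._max_size:
                raise ValueError("handshake headers too large")
            return None
        split = end + len(HEADER_END)
        return bytes(self._buf[:split]), bytes(self._buf[split:])
```

a reader that checks each read in isolation works on loopback, where the response usually arrives in one piece, and fails only when the network splits it, which would fit a crash that shows up only under production load.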