# zig stdlib patches for the Evented backend

zlay runs on `Io.Evented` (io_uring fiber scheduler) for network I/O. the
upstream zig 0.16-dev stdlib (`0.16.0-dev.3059+42e33db9d`) ships several
Uring networking operations as stubs that return `error.NetworkDown`. zlay
patches these at build time and works around other stdlib limitations.

this document tracks what we had to change, why, and what upstream work
would let us drop each workaround.

## patch 1: Uring networking (`patches/uring-networking.patch`)

**applied in**: `Dockerfile` line 13, patches `lib/std/Io/Uring.zig`

the upstream stdlib has these functions stubbed as `*Unavailable`:

```
netListenIpUnavailable  → return error.NetworkDown
netAcceptUnavailable    → return error.NetworkDown
netConnectIpUnavailable → return error.NetworkDown
netSendUnavailable      → return error.NetworkDown
netReadUnavailable      → return error.NetworkDown
netWriteUnavailable     → return error.NetworkDown
```

without these, `Io.Evented` can init but any TCP operation fails immediately.
the patch replaces all six with working implementations:

| function | io_uring opcode | notes |
|---|---|---|
| `netListenIp` | sync `bind()` + `listen()` | IORING_OP_BIND/LISTEN need kernel 6.11+, so we use sync syscalls |
| `netAccept` | `IORING_OP_ACCEPT` | fiber yields until a connection arrives |
| `netConnectIp` | `IORING_OP_CONNECT` | plus socket creation via the existing `ev.socket()` |
| `netSend` | `IORING_OP_SENDMSG` | iterates the message array, one SENDMSG per message |
| `netRead` | `IORING_OP_READV` or `IORING_OP_READ` | scatter read; single-buffer fast path |
| `netWrite` | `IORING_OP_SENDMSG` | gather write with iovec assembly + splat pattern handling |

the patch also adds two helpers:
- `connect()` — submits an `IORING_OP_CONNECT` SQE, handles retry on `EINTR`/`ECANCELED`
- `netSendOne()` — sends a single `OutgoingMessage` via `IORING_OP_SENDMSG`

### why sync bind/listen

`IORING_OP_BIND` and `IORING_OP_LISTEN` were added in linux 6.11. production
runs on bookworm (kernel 6.1). `bind()` and `listen()` are fast synchronous
calls anyway — no benefit from async submission. the rest of the networking
stack (accept, connect, read, write) uses proper io_uring async ops.
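for illustration, this is roughly the shape of the sync path in plain
`std.posix` terms. it's a sketch only, not the patch's actual code (the real
`netListenIp` lives inside `Uring.zig` and uses its own address and error
types):

```zig
const std = @import("std");
const posix = std.posix;

// illustrative only: a listening socket built with synchronous syscalls.
// bind() and listen() return immediately, so there is nothing to gain
// from queuing them as io_uring SQEs, and IORING_OP_BIND/LISTEN would
// demand a 6.11+ kernel anyway.
pub fn listenSync(addr: *const posix.sockaddr, addr_len: posix.socklen_t) !posix.socket_t {
    const fd = try posix.socket(posix.AF.INET, posix.SOCK.STREAM | posix.SOCK.CLOEXEC, 0);
    errdefer posix.close(fd);
    try posix.bind(fd, addr, addr_len);
    try posix.listen(fd, 128);
    // from here on, accept/read/write on this fd go through the async
    // io_uring ops that the patch implements.
    return fd;
}
```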
### why not upstream yet

tracked as zig issue #31723. the Uring networking layer is under active
development. our patch makes pragmatic choices (sync bind/listen, specific
error mappings) that may not match upstream's desired API shape. we'd want
to align with whatever design decisions the zig team makes before submitting.

### regenerating the patch

the patch is pinned to zig `0.16.0-dev.3059+42e33db9d`. any zig version
bump requires checking whether the Uring.zig source changed and regenerating.

```bash
# to regenerate after a zig update:
diff -u /path/to/old-zig/lib/std/Io/Uring.zig /path/to/patched/Uring.zig > patches/uring-networking.patch
```

## workaround 2: DNS resolution via Threaded fallback

**not patched** — worked around in application code.

`Io.Uring` does not implement `netLookup` (DNS resolution). instead of
patching it, subscribers route DNS through `pool_io` (Threaded):

```zig
// subscriber.zig:326-330
// DNS + TCP connect through pool_io (Threaded — has working netLookup).
const dns_io = self.pool_io orelse self.io;
const net_stream = try host_name.connect(dns_io, 443, .{ .mode = .stream });
```

this works because `Io.Threaded.netLookup` uses `getaddrinfo` on a worker
thread. the resulting socket handle is then used with Evented I/O for
the actual data transfer (reads/writes go through the patched Uring ops).

**upstream fix**: implement `netLookup` in Uring.zig, probably by submitting
the blocking `getaddrinfo` call on an io_uring worker thread
(`IORING_OP_POLL_ADD` + thread pool, or the newer `IORING_OP_GETXATTR`
pattern). not blocking — the Threaded fallback is fine.

## workaround 3: ReleaseSafe GPF in Uring fiber context

**not patched** — worked around by building with `ReleaseFast`.

```dockerfile
# Dockerfile line 21-23
# ReleaseFast (not ReleaseSafe): Io.Uring fiber context-switch GPFs under ReleaseSafe
RUN zig build -Doptimize=ReleaseFast ...
```

under `ReleaseSafe`, the optimizer's inlining interacts badly with Uring's
fiber context-switch machinery. the result is a general protection fault
during normal fiber yield/resume. `ReleaseFast` and `Debug` both work.
`scripts/repro_evented.zig` reproduces this — three simple fiber tests
(no-sleep, yield, sleep) that pass under Debug and ReleaseFast but GPF
under ReleaseSafe.

this is likely a zig codegen/optimizer bug. we haven't filed it yet because
the reproduction is minimal but the root cause analysis is incomplete —
could be a safety check reading a stale fiber stack, or an inlining decision
that breaks the stack-swap assumptions.

**upstream fix**: file a bug with the repro. probably a zig compiler issue,
not an Uring.zig issue.

## workaround 4: `std_options.debug_io` single-threaded default

**not patched** — worked around in `src/main.zig`.

```zig
// main.zig:62-64
var debug_threaded_io: Io.Threaded = undefined;
pub const std_options_debug_threaded_io: ?*Io.Threaded = &debug_threaded_io;
```

`std.debug.print` internally uses an `Io`-managed lock for output
serialization. the default (`debug_io = null`) assumes single-threaded
execution. zlay has multiple OS threads (frame worker pool, GC thread,
resyncer thread) that all call `std.debug.print` / `log.*`. without this
override, concurrent debug prints corrupt each other or deadlock.

**upstream fix**: arguably the default should be safe for multi-threaded
programs. but explicit opt-in is reasonable — it requires initializing an
`Io.Threaded` instance at startup, which has a cost.

## workaround 5: `Io.Event.reset()` single-waiter assumption

**not patched** — worked around in pg.zig fork.

`Io.Event` has a `reset()` method with a stdlib invariant (Io.zig:1857):
it assumes no pending call to `wait`. when multiple threads contend for
a pooled resource (pg.Pool connections), `set()` wakes all waiters, one
calls `reset()`, and the others hit `unreachable`.

the pg.zig fork (`5ce2355`, dev branch) replaced `Io.Event` with a
monotonic `u32` futex counter:
- `release()` increments the counter + `futexWake(1)` (wake one)
- `acquire()` snapshots the counter under a mutex + `futexWaitTimeout()` with the snapshot
- no `reset()`, no single-waiter constraint

**upstream fix**: `Io.Event` could support multi-waiter reset, or provide a
semaphore/condvar primitive. the futex counter pattern is well-known and
could be upstreamed to pg.zig proper.
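a minimal sketch of that counter pattern, written against `std.Thread.Futex`
rather than the fork's `Io` futex calls, with hypothetical names (the real
code is in the pg.zig fork at `5ce2355`):

```zig
const std = @import("std");

// illustrative stand-in for the pooled-connection wait path. the counter
// only ever increments, so a waiter can never miss a wake: it either
// observes a newer value or sleeps until one arrives.
pub const ReleaseSignal = struct {
    counter: std.atomic.Value(u32) = std.atomic.Value(u32).init(0),

    // release() side: bump the counter, then wake exactly one waiter.
    pub fn notifyOne(self: *ReleaseSignal) void {
        _ = self.counter.fetchAdd(1, .release);
        std.Thread.Futex.wake(&self.counter, 1);
    }

    // acquire() side: take a snapshot before re-checking the pool...
    pub fn snapshot(self: *const ReleaseSignal) u32 {
        return self.counter.load(.acquire);
    }

    // ...then sleep only while the counter still equals that snapshot.
    // there is no reset() and so no single-waiter invariant to violate.
    pub fn waitForRelease(self: *ReleaseSignal, snap: u32) void {
        while (self.counter.load(.acquire) == snap) {
            std.Thread.Futex.wait(&self.counter, snap);
        }
    }
};
```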
## workaround 6: cross-Io Mutex/futex incompatibility

**not patched** — worked around by careful Io segregation.

`Io.Mutex` and `Io.Condition` use futex operations that are tied to their
Io backend. calling `mutex.lockUncancelable(threaded_io)` from an Evented
fiber dereferences `Thread.current()` — a threadlocal only set on
Uring-managed threads. on Evented fibers it's NULL → SIGSEGV or heap
corruption.

this caused three separate crashes during the migration (crashes 1, 6, 8 in
docs/notes.md). the fix pattern is always the same: components that use
Threaded resources (mutexes initialized with `pool_io`, pg.Pool) must run on
plain `std.Thread`, not as Evented `io.concurrent()` fibers (a minimal
sketch of this split follows the summary table).

current segregation:

| component | runs on | why |
|---|---|---|
| GC loop | `std.Thread` + `pool_io` | uses DiskPersist mutex + pg.Pool |
| resyncer | `std.Thread` + `pool_io` | uses DiskPersist + HTTP client |
| frame workers | `std.Thread` + `pool_io` | uses Io.Mutex/Condition for queue sync |
| subscribers | `io.concurrent` (Evented) | pure network I/O, no shared mutexes |
| broadcast loop | `io.concurrent` (Evented) | lock-free ring buffer + atomics |
| health checks | Evented handlers | use atomic `last_db_success`, not pg.Pool |

**upstream fix**: there's no obvious stdlib fix here — this is architectural.
either Mutex/Condition need to detect and handle cross-Io calls, or the docs
need to clearly state the constraint. a `pg.Pool` that accepts an `Io`
parameter per call (rather than at init) would also help.

## summary table

| # | issue | fix type | status | drops when |
|---|---|---|---|---|
| 1 | Uring networking stubs | patch | `patches/uring-networking.patch` | upstream implements (zig#31723) |
| 2 | DNS resolution missing | app workaround | Threaded fallback in subscriber | upstream implements netLookup |
| 3 | ReleaseSafe GPF | build flag | `-Doptimize=ReleaseFast` | upstream fixes codegen bug |
| 4 | debug_io single-threaded | app workaround | `std_options_debug_threaded_io` | upstream changes default or n/a |
| 5 | Io.Event single-waiter | dep fork | pg.zig futex counter | upstream adds multi-waiter Event |
| 6 | cross-Io Mutex | app architecture | Io segregation | upstream makes Mutex cross-Io safe |
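to make the workaround 6 rule concrete, here is a minimal sketch of the
split. the names are hypothetical; the real components live in zlay's own
modules:

```zig
const std = @import("std");

// hypothetical stand-in for a component that holds Threaded resources
// (an Io.Mutex initialized with pool_io, a pg.Pool handle, ...).
const ThreadedComponent = struct {
    fn run(self: *ThreadedComponent) void {
        _ = self;
        // safe here: this is a plain OS thread, so the Threaded backend's
        // thread-local state exists and its mutexes/futexes behave.
    }
};

pub fn spawnComponents(component: *ThreadedComponent) !void {
    // the rule: Threaded-resource users get std.Thread.spawn, never
    // io.concurrent. an Evented fiber taking a Threaded mutex is exactly
    // the NULL-threadlocal crash described in workaround 6.
    const t = try std.Thread.spawn(.{}, ThreadedComponent.run, .{component});
    t.detach();
    // pure network I/O (subscribers, broadcast loop) stays on Evented
    // fibers via io.concurrent, per the segregation table above.
}
```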