# zig stdlib patches for the Evented backend

zlay runs on `Io.Evented` (the io_uring fiber scheduler) for network I/O. the
upstream zig 0.16-dev stdlib (`0.16.0-dev.3059+42e33db9d`) ships several
Uring networking operations as stubs that return `error.NetworkDown`. zlay
patches these at build time and works around other stdlib limitations.

this document tracks what we had to change, why, and what upstream work
would let us drop each workaround.
## patch 1: Uring networking (`patches/uring-networking.patch`)

**applied in**: `Dockerfile` line 13, patches `lib/std/Io/Uring.zig`

the upstream stdlib has these functions stubbed as `*Unavailable`:

```
netListenIpUnavailable → return error.NetworkDown
netAcceptUnavailable → return error.NetworkDown
netConnectIpUnavailable → return error.NetworkDown
netSendUnavailable → return error.NetworkDown
netReadUnavailable → return error.NetworkDown
netWriteUnavailable → return error.NetworkDown
```

without these, `Io.Evented` can init but any TCP operation fails immediately.
the patch replaces all six with working implementations:
| function | io_uring opcode | notes |
|---|---|---|
| `netListenIp` | sync `bind()` + `listen()` | IORING_OP_BIND/LISTEN need kernel 6.11+, so we use sync syscalls |
| `netAccept` | `IORING_OP_ACCEPT` | fiber yields until connection arrives |
| `netConnectIp` | `IORING_OP_CONNECT` | + socket creation via existing `ev.socket()` |
| `netSend` | `IORING_OP_SENDMSG` | iterates message array, one SENDMSG per message |
| `netRead` | `IORING_OP_READV` or `IORING_OP_READ` | scatter read; single-buffer fast path |
| `netWrite` | `IORING_OP_SENDMSG` | gather write with iovec assembly + splat pattern handling |

the patch also adds two helpers:
- `connect()` — submits `IORING_OP_CONNECT` SQE, handles retry on `EINTR`/`ECANCELED`
- `netSendOne()` — sends a single `OutgoingMessage` via `IORING_OP_SENDMSG`

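to make the accept path concrete, here is a minimal standalone sketch of the same shape: synchronous `bind()`/`listen()` followed by one `IORING_OP_ACCEPT` round trip. it is written against the long-standing `std.os.linux.IoUring` wrapper rather than the patched `Io.Uring` internals, and the port and the `0xACCE` tag are arbitrary. in the real patch the fiber yields at the CQE wait; this sketch just blocks on `copy_cqe()`:

```zig
const std = @import("std");
const posix = std.posix;
const linux = std.os.linux;

pub fn main() !void {
    const addr = try std.net.Address.parseIp4("127.0.0.1", 8080);

    // sync socket/bind/listen, as in the patched netListenIp.
    const listener = try posix.socket(posix.AF.INET, posix.SOCK.STREAM, 0);
    defer posix.close(listener);
    try posix.setsockopt(listener, posix.SOL.SOCKET, posix.SO.REUSEADDR, &std.mem.toBytes(@as(c_int, 1)));
    try posix.bind(listener, &addr.any, addr.getOsSockLen());
    try posix.listen(listener, 128);

    var ring = try linux.IoUring.init(8, 0);
    defer ring.deinit();

    // queue one IORING_OP_ACCEPT SQE and submit it; user_data tags the CQE.
    var peer: posix.sockaddr = undefined;
    var peer_len: posix.socklen_t = @sizeOf(posix.sockaddr);
    _ = try ring.accept(0xACCE, listener, &peer, &peer_len, 0);
    _ = try ring.submit();

    // block until the completion arrives; cqe.res is the accepted fd or -errno.
    const cqe = try ring.copy_cqe();
    if (cqe.res < 0) return error.AcceptFailed;
    posix.close(@intCast(cqe.res));
}
```
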
### why sync bind/listen

`IORING_OP_BIND` and `IORING_OP_LISTEN` were added in linux 6.11. production
runs on Debian bookworm (kernel 6.1). `bind()` and `listen()` are fast synchronous
calls anyway — no benefit from async submission. the rest of the networking
stack (accept, connect, read, write) uses proper io_uring async ops.

### why not upstream yet

tracked as zig issue #31723. the Uring networking layer is under active
development. our patch makes pragmatic choices (sync bind/listen, specific
error mappings) that may not match upstream's desired API shape. we'd want
to align with whatever design decisions the zig team makes before submitting.
### regenerating the patch

the patch is pinned to zig `0.16.0-dev.3059+42e33db9d`. any zig version
bump requires checking if the Uring.zig source changed and regenerating.

```bash
# to regenerate after a zig update:
diff -u /path/to/old-zig/lib/std/Io/Uring.zig /path/to/patched/Uring.zig > patches/uring-networking.patch
```

## workaround 2: DNS resolution via Threaded fallback

**not patched** — worked around in application code.

`Io.Uring` does not implement `netLookup` (DNS resolution). instead of
patching it, subscribers route DNS through `pool_io` (Threaded):

```zig
// subscriber.zig:326-330
// DNS + TCP connect through pool_io (Threaded — has working netLookup).
const dns_io = self.pool_io orelse self.io;
const net_stream = try host_name.connect(dns_io, 443, .{ .mode = .stream });
```

this works because `Io.Threaded.netLookup` uses `getaddrinfo` on a worker
thread. the resulting socket handle is then used with Evented I/O for
the actual data transfer (reads/writes go through the patched Uring ops).

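for reference, the blocking lookup that `Io.Threaded.netLookup` wraps looks like this in the stable stdlib (a sketch only; the pinned 0.16-dev tree routes this through the `Io` interface, and `example.com` is a placeholder host):

```zig
const std = @import("std");

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    // blocking getaddrinfo under the hood; on Threaded this runs on a
    // worker thread so Evented fibers never block on DNS.
    const list = try std.net.getAddressList(gpa, "example.com", 443);
    defer list.deinit();

    std.debug.print("resolved {d} addresses\n", .{list.addrs.len});
}
```
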
**upstream fix**: implement `netLookup` in Uring.zig, probably by submitting
the blocking `getaddrinfo` call on an io_uring worker thread
(`IORING_OP_POLL_ADD` + thread pool, or the newer `IORING_OP_GETXATTR`
pattern). not blocking us — the Threaded fallback is fine.

## workaround 3: ReleaseSafe GPF in Uring fiber context

**not patched** — worked around by building with `ReleaseFast`.

```dockerfile
# Dockerfile line 21-23
# ReleaseFast (not ReleaseSafe): Io.Uring fiber context-switch GPFs under ReleaseSafe
RUN zig build -Doptimize=ReleaseFast ...
```

under `ReleaseSafe`, the optimizer's inlining interacts badly with Uring's
fiber context-switch machinery. the result is a general protection fault
during normal fiber yield/resume. `ReleaseFast` and `Debug` both work.
`scripts/repro_evented.zig` reproduces this — three simple fiber
tests (no-sleep, yield, sleep) that pass under Debug and ReleaseFast but
GPF under ReleaseSafe.

this is likely a zig codegen/optimizer bug. we haven't filed it yet because
the reproduction is minimal but the root-cause analysis is incomplete —
it could be a safety check reading a stale fiber stack, or an inlining
decision that breaks the stack-swap assumptions.

**upstream fix**: file a bug with the repro. probably a zig compiler issue,
not an Uring.zig issue.

## workaround 4: `std_options.debug_io` single-threaded default

**not patched** — worked around in `src/main.zig`.

```zig
// main.zig:62-64
var debug_threaded_io: Io.Threaded = undefined;
pub const std_options_debug_threaded_io: ?*Io.Threaded = &debug_threaded_io;
```

`std.debug.print` internally uses an `Io`-managed lock for output
serialization. the default (`debug_io = null`) assumes single-threaded
execution. zlay has multiple OS threads (frame worker pool, GC thread,
resyncer thread) that all call `std.debug.print` / `log.*`. without this
override, concurrent debug prints corrupt each other or deadlock.

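the hazard pattern is just plain OS threads printing concurrently. a minimal illustration (not zlay code; under the pinned dev stdlib this is the shape that needs the override above):

```zig
const std = @import("std");

// two OS threads interleaving std.debug.print calls. with the default
// single-threaded debug_io in the pinned dev stdlib, output like this can
// corrupt or deadlock; the Io.Threaded override serializes it safely.
fn chatter(tag: u8) void {
    for (0..100) |i| {
        std.debug.print("[{c}] line {d}\n", .{ tag, i });
    }
}

pub fn main() !void {
    const a = try std.Thread.spawn(.{}, chatter, .{@as(u8, 'a')});
    const b = try std.Thread.spawn(.{}, chatter, .{@as(u8, 'b')});
    a.join();
    b.join();
}
```
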
**upstream fix**: arguably the default should be safe for multi-threaded
programs. but explicit opt-in is reasonable — it requires initializing an
`Io.Threaded` instance at startup, which has a cost.

## workaround 5: `Io.Event.reset()` single-waiter assumption

**not patched** — worked around in pg.zig fork.

`Io.Event` has a `reset()` method with a stdlib invariant (Io.zig:1857):
it assumes no pending call to `wait`. when multiple threads contend for
a pooled resource (pg.Pool connections), `set()` wakes all waiters, one
calls `reset()`, and the others hit `unreachable`.

the pg.zig fork (`5ce2355`, dev branch) replaced `Io.Event` with a
monotonic `u32` futex counter:
- `release()` increments the counter + `futexWake(1)` (wake one)
- `acquire()` snapshots the counter under the mutex + `futexWaitTimeout()` with the snapshot
- no `reset()`, no single-waiter constraint

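the same pattern, sketched with `std.Thread.Futex` and `std.atomic` instead of the Io-level futex calls the fork actually uses (`WakeCounter` and its method names are illustrative, not pg.zig's):

```zig
const std = @import("std");

const WakeCounter = struct {
    counter: std.atomic.Value(u32) = std.atomic.Value(u32).init(0),

    // release(): bump the monotonic counter, then wake exactly one waiter.
    fn release(self: *WakeCounter) void {
        _ = self.counter.fetchAdd(1, .release);
        std.Thread.Futex.wake(&self.counter, 1);
    }

    // acquire() side: take a snapshot (in pg.zig, under the pool mutex),
    // then wait until the counter moves past it. if a release already
    // happened, the counter no longer matches the snapshot and the wait
    // returns immediately: no lost wakeup, and no reset() step that
    // assumes a single waiter.
    fn snapshot(self: *WakeCounter) u32 {
        return self.counter.load(.acquire);
    }

    fn waitForRelease(self: *WakeCounter, snap: u32, timeout_ns: u64) error{Timeout}!void {
        try std.Thread.Futex.timedWait(&self.counter, snap, timeout_ns);
    }
};

test "release before wait does not lose the wakeup" {
    var wc = WakeCounter{};
    const snap = wc.snapshot();
    wc.release();
    try wc.waitForRelease(snap, std.time.ns_per_s);
}
```
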
**upstream fix**: `Io.Event` could support multi-waiter reset, or provide a
semaphore/condvar primitive. the futex counter pattern is well-known and
could be upstreamed to pg.zig proper.

## workaround 6: cross-Io Mutex/futex incompatibility

**not patched** — worked around by careful Io segregation.

`Io.Mutex` and `Io.Condition` use futex operations that are tied to their
Io backend. calling `mutex.lockUncancelable(threaded_io)` from an Evented
fiber dereferences `Thread.current()` — a threadlocal only set on
Uring-managed threads. on Evented fibers it's NULL → SIGSEGV or heap corruption.

this caused three separate crashes during the migration (crashes 1, 6, 8 in
docs/notes.md). the fix pattern is always the same: components that use Threaded
resources (mutexes initialized with `pool_io`, pg.Pool) must run on plain
`std.Thread`, not as Evented `io.concurrent()` fibers.

current segregation:

| component | runs on | why |
|---|---|---|
| GC loop | `std.Thread` + `pool_io` | uses DiskPersist mutex + pg.Pool |
| resyncer | `std.Thread` + `pool_io` | uses DiskPersist + HTTP client |
| frame workers | `std.Thread` + `pool_io` | uses Io.Mutex/Condition for queue sync |
| subscribers | `io.concurrent` (Evented) | pure network I/O, no shared mutexes |
| broadcast loop | `io.concurrent` (Evented) | lock-free ring buffer + atomics |
| health checks | Evented handlers | use atomic `last_db_success`, not pg.Pool |

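the rule in code form: a component that touches Threaded resources gets a dedicated OS thread. a runnable sketch (the `GcContext`/`gcLoop` names are illustrative stand-ins, not zlay's actual types):

```zig
const std = @import("std");

// shape of the segregation rule: anything that touches Threaded-backed
// resources runs on a plain OS thread, never as an Evented fiber.
const GcContext = struct {
    shutdown: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
};

fn gcLoop(ctx: *GcContext) void {
    while (!ctx.shutdown.load(.acquire)) {
        // real loop: take the DiskPersist mutex, hit pg.Pool via pool_io —
        // Threaded resources, so this must stay off the Evented scheduler.
        std.Thread.sleep(100 * std.time.ns_per_ms);
    }
}

pub fn main() !void {
    var ctx = GcContext{};
    // a dedicated OS thread, not io.concurrent():
    const gc = try std.Thread.spawn(.{}, gcLoop, .{&ctx});
    std.Thread.sleep(300 * std.time.ns_per_ms);
    ctx.shutdown.store(true, .release);
    gc.join();
}
```
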
**upstream fix**: there's no obvious stdlib fix here — this is architectural.
either Mutex/Condition need to detect and handle cross-Io calls, or the docs
need to clearly state the constraint. a `pg.Pool` that accepts an `Io`
parameter per-call (rather than at init) would also help.

## summary table

| # | issue | fix type | status | drops when |
|---|---|---|---|---|
| 1 | Uring networking stubs | patch | `patches/uring-networking.patch` | upstream implements (zig#31723) |
| 2 | DNS resolution missing | app workaround | Threaded fallback in subscriber | upstream implements netLookup |
| 3 | ReleaseSafe GPF | build flag | `-Doptimize=ReleaseFast` | upstream fixes codegen bug |
| 4 | debug_io single-threaded | app workaround | `std_options_debug_threaded_io` | upstream changes default or n/a |
| 5 | Io.Event single-waiter | dep fork | pg.zig futex counter | upstream adds multi-waiter Event |
| 6 | cross-Io Mutex | app architecture | Io segregation | upstream makes Mutex cross-Io safe |