# zig stdlib patches for the Evented backend
zlay runs on `Io.Evented` (the io_uring fiber scheduler) for network I/O. the
upstream zig 0.16-dev stdlib (0.16.0-dev.3059+42e33db9d) ships several
Uring networking operations as stubs that return `error.NetworkDown`. zlay
patches these at build time and works around other stdlib limitations.
this document tracks what we had to change, why, and what upstream work
would let us drop each workaround.
## patch 1: Uring networking (patches/uring-networking.patch)
applied in Dockerfile line 13; patches `lib/std/Io/Uring.zig`.

the upstream stdlib has these functions stubbed as `*Unavailable`:

- `netListenIpUnavailable` → `return error.NetworkDown`
- `netAcceptUnavailable` → `return error.NetworkDown`
- `netConnectIpUnavailable` → `return error.NetworkDown`
- `netSendUnavailable` → `return error.NetworkDown`
- `netReadUnavailable` → `return error.NetworkDown`
- `netWriteUnavailable` → `return error.NetworkDown`
without these, Io.Evented can init but any TCP operation fails immediately.
the patch replaces all six with working implementations:
| function | io_uring opcode | notes |
|---|---|---|
| `netListenIp` | sync `bind()` + `listen()` | IORING_OP_BIND/LISTEN need kernel 6.11+, so we use sync syscalls |
| `netAccept` | IORING_OP_ACCEPT | fiber yields until a connection arrives |
| `netConnectIp` | IORING_OP_CONNECT | plus socket creation via the existing `ev.socket()` |
| `netSend` | IORING_OP_SENDMSG | iterates the message array, one SENDMSG per message |
| `netRead` | IORING_OP_READV or IORING_OP_READ | scatter read; single-buffer fast path |
| `netWrite` | IORING_OP_SENDMSG | gather write with iovec assembly + splat pattern handling |
the patch also adds two helpers:

- `connect()` — submits an IORING_OP_CONNECT SQE, handles retry on `EINTR`/`ECANCELED`
- `netSendOne()` — sends a single `OutgoingMessage` via IORING_OP_SENDMSG
### why sync bind/listen
IORING_OP_BIND and IORING_OP_LISTEN were added in linux 6.11. production
runs on bookworm (kernel 6.1). bind() and listen() are fast synchronous
calls anyway — no benefit from async submission. the rest of the networking
stack (accept, connect, read, write) uses proper io_uring async ops.
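the sync path is small enough to sketch. below is a minimal standalone version of what the bind/listen step boils down to, written against `std.posix` directly — `listenSync` is a hypothetical name for illustration, not the patch's actual function:

```zig
const std = @import("std");
const posix = std.posix;

// hypothetical sketch of the sync bind+listen path: plain syscalls,
// no SQE submission, so it works on kernels older than 6.11.
fn listenSync(addr: std.net.Address, backlog: u31) !posix.socket_t {
    const fd = try posix.socket(
        addr.any.family,
        posix.SOCK.STREAM | posix.SOCK.CLOEXEC,
        posix.IPPROTO.TCP,
    );
    errdefer posix.close(fd);
    // REUSEADDR so restarts don't trip over sockets stuck in TIME_WAIT.
    try posix.setsockopt(
        fd,
        posix.SOL.SOCKET,
        posix.SO.REUSEADDR,
        &std.mem.toBytes(@as(c_int, 1)),
    );
    try posix.bind(fd, &addr.any, addr.getOsSockLen());
    try posix.listen(fd, backlog);
    return fd;
}
```

only socket setup stays synchronous; the returned fd is then handed to the async IORING_OP_ACCEPT path.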
### why not upstream yet
tracked as zig issue #31723. the Uring networking layer is under active development. our patch makes pragmatic choices (sync bind/listen, specific error mappings) that may not match upstream's desired API shape. we'd want to align with whatever design decisions the zig team makes before submitting.
### regenerating the patch
the patch is pinned to zig 0.16.0-dev.3059+42e33db9d. any zig version
bump requires checking if the Uring.zig source changed and regenerating.
```sh
# to regenerate after a zig update:
diff -u /path/to/old-zig/lib/std/Io/Uring.zig /path/to/patched/Uring.zig \
    > patches/uring-networking.patch
```
## workaround 2: DNS resolution via Threaded fallback
not patched — worked around in application code.
Io.Uring does not implement netLookup (DNS resolution). instead of
patching it, subscribers route DNS through pool_io (Threaded):
```zig
// subscriber.zig:326-330
// DNS + TCP connect through pool_io (Threaded — has working netLookup).
const dns_io = self.pool_io orelse self.io;
const net_stream = try host_name.connect(dns_io, 443, .{ .mode = .stream });
```
this works because Io.Threaded.netLookup uses getaddrinfo on a worker
thread. the resulting socket handle is then used with Evented I/O for
the actual data transfer (reads/writes go through the patched Uring ops).
upstream fix: implement netLookup in Uring.zig, probably by submitting
the blocking getaddrinfo call on an io_uring worker thread
(IORING_OP_POLL_ADD + thread pool, or the newer IORING_OP_GETXATTR
pattern). not blocking — the Threaded fallback is fine.
## workaround 3: ReleaseSafe GPF in Uring fiber context
not patched — worked around by building with ReleaseFast.
```dockerfile
# Dockerfile lines 21-23
# ReleaseFast (not ReleaseSafe): Io.Uring fiber context-switch GPFs under ReleaseSafe
RUN zig build -Doptimize=ReleaseFast ...
```
under ReleaseSafe, the optimizer's inlining interacts badly with Uring's
fiber context-switch machinery. the result is a general protection fault
during normal fiber yield/resume. ReleaseFast and Debug both work.
scripts/repro_evented.zig reproduces this — three simple fiber
tests (no-sleep, yield, sleep) that pass under Debug and ReleaseFast but
GPF under ReleaseSafe.
this is likely a zig codegen/optimizer bug. we haven't filed it yet because the reproduction is minimal but the root cause analysis is incomplete — could be a safety check reading stale fiber stack, or an inlining decision that breaks the stack-swap assumptions.
upstream fix: file a bug with the repro. probably a zig compiler issue, not an Uring.zig issue.
## workaround 4: std_options.debug_io single-threaded default
not patched — worked around in src/main.zig.
```zig
// main.zig:62-64
var debug_threaded_io: Io.Threaded = undefined;
pub const std_options_debug_threaded_io: ?*Io.Threaded = &debug_threaded_io;
```
std.debug.print internally uses an Io-managed lock for output
serialization. the default (debug_io = null) assumes single-threaded
execution. zlay has multiple OS threads (frame worker pool, GC thread,
resyncer thread) that all call std.debug.print / log.*. without this
override, concurrent debug prints corrupt each other or deadlock.
upstream fix: arguably the default should be safe for multi-threaded
programs. but explicit opt-in is reasonable — it requires initializing an
Io.Threaded instance at startup which has a cost.
## workaround 5: Io.Event.reset() single-waiter assumption
not patched — worked around in pg.zig fork.
Io.Event has a reset() method with a stdlib invariant (Io.zig:1857):
it assumes no pending call to wait. when multiple threads contend for
a pooled resource (pg.Pool connections), set() wakes all waiters, one
calls reset(), and the others hit unreachable.
the pg.zig fork (5ce2355, dev branch) replaced `Io.Event` with a
monotonic u32 futex counter:

- `release()` increments the counter + `futexWake(1)` (wake one)
- `acquire()` snapshots the counter under a mutex + `futexWaitTimeout()` with the snapshot
- no `reset()`, no single-waiter constraint
upstream fix: Io.Event could support multi-waiter reset, or provide a
semaphore/condvar primitive. the futex counter pattern is well-known and
could be upstreamed to pg.zig proper.
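the counter pattern, sketched with `std.Thread.Futex` (the fork uses Io-level futex calls, so the shape below is an assumption, and `Signal`/`release`/`acquire` are hypothetical names):

```zig
const std = @import("std");
const Futex = std.Thread.Futex;

// sketch of the monotonic-counter pattern. no reset(), so any number
// of threads can wait concurrently without hitting unreachable.
const Signal = struct {
    counter: std.atomic.Value(u32) = std.atomic.Value(u32).init(0),

    // wake one waiter. safe to call with zero, one, or many waiters.
    fn release(self: *Signal) void {
        _ = self.counter.fetchAdd(1, .release);
        Futex.wake(&self.counter, 1);
    }

    // wait until the counter moves past `snapshot` (taken while the
    // caller still held the pool mutex) or the timeout expires.
    fn acquire(self: *Signal, snapshot: u32, timeout_ns: u64) error{Timeout}!void {
        while (self.counter.load(.acquire) == snapshot) {
            try Futex.timedWait(&self.counter, snapshot, timeout_ns);
        }
    }
};
```

because the counter only ever increments, a thread that sleeps on a stale snapshot wakes immediately — the classic futex sequence-counter idiom.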
## workaround 6: cross-Io Mutex/futex incompatibility
not patched — worked around by careful Io segregation.
Io.Mutex and Io.Condition use futex operations that are tied to their
Io backend. calling mutex.lockUncancelable(threaded_io) from an Evented
fiber dereferences Thread.current() — a threadlocal only set on Uring-
managed threads. on Evented fibers it's NULL → SIGSEGV or heap corruption.
this caused three separate crashes during the migration (crashes 1, 6, 8 in
docs/notes.md). the fix pattern is always the same: components that use Threaded
resources (mutexes initialized with pool_io, pg.Pool) must run on plain
std.Thread, not as Evented io.concurrent() fibers.
current segregation:
| component | runs on | why |
|---|---|---|
| GC loop | std.Thread + pool_io | uses DiskPersist mutex + pg.Pool |
| resyncer | std.Thread + pool_io | uses DiskPersist + HTTP client |
| frame workers | std.Thread + pool_io | uses Io.Mutex/Condition for queue sync |
| subscribers | io.concurrent (Evented) | pure network I/O, no shared mutexes |
| broadcast loop | io.concurrent (Evented) | lock-free ring buffer + atomics |
| health checks | Evented handlers | use atomic last_db_success, not pg.Pool |
upstream fix: there's no obvious stdlib fix here — this is architectural.
either Mutex/Condition need to detect and handle cross-Io calls, or the docs
need to clearly state the constraint. a pg.Pool that accepts an Io
parameter per-call (rather than at init) would also help.
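the segregation rule, reduced to a skeleton (function names here are illustrative, not the real zlay symbols):

```zig
const std = @import("std");

// illustrative skeleton of the segregation rule: components that touch
// Threaded-backed resources (Io.Mutex initialized with pool_io, pg.Pool)
// run on a real OS thread, where Thread.current() is valid.
fn gcLoop() void {
    // would take the DiskPersist mutex and borrow pg.Pool connections here.
}

pub fn main() !void {
    // plain OS thread: safe for Threaded mutexes and futexes.
    const gc = try std.Thread.spawn(.{}, gcLoop, .{});
    defer gc.join();

    // pure-network components stay on Evented fibers (io.concurrent);
    // they must never lock a mutex that was initialized against pool_io.
}
```

the invariant to check when adding a component is simple: list every mutex/pool it touches, and if any is Threaded-backed, it gets a `std.Thread`, not a fiber.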
## summary table
| # | issue | fix type | status | drops when |
|---|---|---|---|---|
| 1 | Uring networking stubs | patch | patches/uring-networking.patch | upstream implements (zig#31723) |
| 2 | DNS resolution missing | app workaround | Threaded fallback in subscriber | upstream implements netLookup |
| 3 | ReleaseSafe GPF | build flag | -Doptimize=ReleaseFast | upstream fixes codegen bug |
| 4 | debug_io single-threaded | app workaround | std_options_debug_threaded_io | upstream changes default or n/a |
| 5 | Io.Event single-waiter | dep fork | pg.zig futex counter | upstream adds multi-waiter Event |
| 6 | cross-Io Mutex | app architecture | Io segregation | upstream makes Mutex cross-Io safe |