atproto relay implementation in zig zlay.waow.tech

# zig stdlib patches for the Evented backend

zlay runs on Io.Evented (io_uring fiber scheduler) for network I/O. the upstream zig 0.16-dev stdlib (0.16.0-dev.3059+42e33db9d) ships several Uring networking operations as stubs that return error.NetworkDown. zlay patches these at build time and works around other stdlib limitations.

this document tracks what we had to change, why, and what upstream work would let us drop each workaround.

## patch 1: Uring networking (patches/uring-networking.patch)

applied in: Dockerfile line 13, patches lib/std/Io/Uring.zig

the upstream stdlib has these functions stubbed as *Unavailable:

```
netListenIpUnavailable    → return error.NetworkDown
netAcceptUnavailable      → return error.NetworkDown
netConnectIpUnavailable   → return error.NetworkDown
netSendUnavailable        → return error.NetworkDown
netReadUnavailable        → return error.NetworkDown
netWriteUnavailable       → return error.NetworkDown
```

without these, Io.Evented can init but any TCP operation fails immediately. the patch replaces all six with working implementations:

| function | io_uring opcode | notes |
|---|---|---|
| netListenIp | sync bind() + listen() | IORING_OP_BIND/LISTEN need kernel 6.11+, so we use sync syscalls |
| netAccept | IORING_OP_ACCEPT | fiber yields until connection arrives |
| netConnectIp | IORING_OP_CONNECT | socket creation via existing ev.socket() |
| netSend | IORING_OP_SENDMSG | iterates message array, one SENDMSG per message |
| netRead | IORING_OP_READV or IORING_OP_READ | scatter read; single-buffer fast path |
| netWrite | IORING_OP_SENDMSG | gather write with iovec assembly + splat pattern handling |

the patch also adds two helpers:

- connect() — submits IORING_OP_CONNECT SQE, handles retry on EINTR/ECANCELED
- netSendOne() — sends a single OutgoingMessage via IORING_OP_SENDMSG

### why sync bind/listen

IORING_OP_BIND and IORING_OP_LISTEN were added in linux 6.11. production runs on bookworm (kernel 6.1). bind() and listen() are fast synchronous calls anyway — no benefit from async submission. the rest of the networking stack (accept, connect, read, write) uses proper io_uring async ops.

### why not upstream yet

tracked as zig issue #31723. the Uring networking layer is under active development. our patch makes pragmatic choices (sync bind/listen, specific error mappings) that may not match upstream's desired API shape. we'd want to align with whatever design decisions the zig team makes before submitting.

### regenerating the patch

the patch is pinned to zig 0.16.0-dev.3059+42e33db9d. any zig version bump requires checking if the Uring.zig source changed and regenerating.

```sh
# to regenerate after a zig update:
diff -u /path/to/old-zig/lib/std/Io/Uring.zig /path/to/patched/Uring.zig > patches/uring-networking.patch
```

## workaround 2: DNS resolution via Threaded fallback

not patched — worked around in application code.

Io.Uring does not implement netLookup (DNS resolution). instead of patching it, subscribers route DNS through pool_io (Threaded):

```zig
// subscriber.zig:326-330
// DNS + TCP connect through pool_io (Threaded — has working netLookup).
const dns_io = self.pool_io orelse self.io;
const net_stream = try host_name.connect(dns_io, 443, .{ .mode = .stream });
```

this works because Io.Threaded.netLookup uses getaddrinfo on a worker thread. the resulting socket handle is then used with Evented I/O for the actual data transfer (reads/writes go through the patched Uring ops).
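
the same idea in miniature, as a C sketch: blocking getaddrinfo runs on a worker thread while the caller waits for the result, then the returned address would be used with evented connect/read/write. AI_NUMERICHOST is set here so no resolver is actually consulted:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <pthread.h>
#include <stdio.h>

// worker-thread DNS: the calling "fiber" parks while this runs
static void *lookup(void *arg) {
    struct addrinfo hints = {
        .ai_family = AF_INET,
        .ai_socktype = SOCK_STREAM,
        .ai_flags = AI_NUMERICHOST,  // numeric only: no DNS traffic
    };
    struct addrinfo *res = NULL;
    int rc = getaddrinfo((const char *)arg, "443", &hints, &res);
    return rc == 0 ? res : NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, lookup, "127.0.0.1");
    void *ret;
    pthread_join(t, &ret);
    assert(ret != NULL);

    struct addrinfo *res = ret;
    struct sockaddr_in *sa = (struct sockaddr_in *)res->ai_addr;
    char ip[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &sa->sin_addr, ip, sizeof(ip));
    printf("resolved %s port %d\n", ip, ntohs(sa->sin_port));
    freeaddrinfo(res);
    return 0;
}
```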

upstream fix: implement netLookup in Uring.zig, probably by submitting the blocking getaddrinfo call on an io_uring worker thread (IORING_OP_POLL_ADD + thread pool, or the newer IORING_OP_GETXATTR pattern). not blocking — the Threaded fallback is fine.

## workaround 3: ReleaseSafe GPF in Uring fiber context

not patched — worked around by building with ReleaseFast.

```dockerfile
# Dockerfile lines 21-23
# ReleaseFast (not ReleaseSafe): Io.Uring fiber context-switch GPFs under ReleaseSafe
RUN zig build -Doptimize=ReleaseFast ...
```

under ReleaseSafe, the optimizer's inlining interacts badly with Uring's fiber context-switch machinery. the result is a general protection fault during normal fiber yield/resume. ReleaseFast and Debug both work. scripts/repro_evented.zig reproduces this — three simple fiber tests (no-sleep, yield, sleep) that pass under Debug and ReleaseFast but GPF under ReleaseSafe.

this is likely a zig codegen/optimizer bug. we haven't filed it yet because the reproduction is minimal but the root cause analysis is incomplete — could be a safety check reading stale fiber stack, or an inlining decision that breaks the stack-swap assumptions.

upstream fix: file a bug with the repro. probably a zig compiler issue, not an Uring.zig issue.

## workaround 4: std_options.debug_io single-threaded default

not patched — worked around in src/main.zig.

```zig
// main.zig:62-64
var debug_threaded_io: Io.Threaded = undefined;
pub const std_options_debug_threaded_io: ?*Io.Threaded = &debug_threaded_io;
```

std.debug.print internally uses an Io-managed lock for output serialization. the default (debug_io = null) assumes single-threaded execution. zlay has multiple OS threads (frame worker pool, GC thread, resyncer thread) that all call std.debug.print / log.*. without this override, concurrent debug prints corrupt each other or deadlock.

upstream fix: arguably the default should be safe for multi-threaded programs. but explicit opt-in is reasonable — it requires initializing an Io.Threaded instance at startup which has a cost.

## workaround 5: Io.Event.reset() single-waiter assumption

not patched — worked around in pg.zig fork.

Io.Event has a reset() method with a stdlib invariant (Io.zig:1857): it assumes there is no pending call to wait(). when multiple threads contend for a pooled resource (pg.Pool connections), set() wakes all waiters, one calls reset(), and the others hit unreachable.

the pg.zig fork (5ce2355, dev branch) replaced Io.Event with a monotonic u32 futex counter:

- release() increments counter + futexWake(1) (wake one)
- acquire() snapshots counter under mutex + futexWaitTimeout() with snapshot
- no reset(), no single-waiter constraint

upstream fix: Io.Event could support multi-waiter reset, or provide a semaphore/condvar primitive. the futex counter pattern is well-known and could be upstreamed to pg.zig proper.

## workaround 6: cross-Io Mutex/futex incompatibility

not patched — worked around by careful Io segregation.

Io.Mutex and Io.Condition use futex operations that are tied to their Io backend. calling mutex.lockUncancelable(threaded_io) from an Evented fiber dereferences Thread.current() — a threadlocal only set on Uring-managed threads. on Evented fibers it's NULL → SIGSEGV or heap corruption.

this caused three separate crashes during the migration (crashes 1, 6, 8 in docs/notes.md). the fix pattern is always the same: components that use Threaded resources (mutexes initialized with pool_io, pg.Pool) must run on plain std.Thread, not as Evented io.concurrent() fibers.

current segregation:

| component | runs on | why |
|---|---|---|
| GC loop | std.Thread + pool_io | uses DiskPersist mutex + pg.Pool |
| resyncer | std.Thread + pool_io | uses DiskPersist + HTTP client |
| frame workers | std.Thread + pool_io | uses Io.Mutex/Condition for queue sync |
| subscribers | io.concurrent (Evented) | pure network I/O, no shared mutexes |
| broadcast loop | io.concurrent (Evented) | lock-free ring buffer + atomics |
| health checks | Evented handlers | use atomic last_db_success, not pg.Pool |

upstream fix: there's no obvious stdlib fix here — this is architectural. either Mutex/Condition need to detect and handle cross-Io calls, or the docs need to clearly state the constraint. a pg.Pool that accepts an Io parameter per-call (rather than at init) would also help.

## summary table

| # | issue | fix type | status | drops when |
|---|---|---|---|---|
| 1 | Uring networking stubs | patch | patches/uring-networking.patch | upstream implements (zig#31723) |
| 2 | DNS resolution missing | app workaround | Threaded fallback in subscriber | upstream implements netLookup |
| 3 | ReleaseSafe GPF | build flag | -Doptimize=ReleaseFast | upstream fixes codegen bug |
| 4 | debug_io single-threaded | app workaround | std_options_debug_threaded_io | upstream changes default or n/a |
| 5 | Io.Event single-waiter | dep fork | pg.zig futex counter | upstream adds multi-waiter Event |
| 6 | cross-Io Mutex | app architecture | Io segregation | upstream makes Mutex cross-Io safe |