 # zlay 0.16 migration — status and known issues

-last updated: 2026-04-02
+last updated: 2026-04-04
 stable production build: `a931853` (zig 0.15)
-latest 0.16 build: `0f11cfc` (not yet deployed successfully)
+latest 0.16 build: `b433403` (not yet deployed — includes all fixes through crash 8)

 ## what happened

···
 consumers over WebSocket. the zig 0.16 migration rewrote all I/O to use
 `std.Io` primitives.

-three separate crashes were found and fixed during deployment attempts:
+eight separate crashes/bugs were found and fixed during deployment attempts:

 ### crash 1: SIGSEGV on startup (fixed in `f996812`)

···
 `hasFn` check). other handshake errors (e.g. `InvalidVersion`) still get 400.
 10 tests cover all request patterns.

+### crash 6: SIGSEGV — plain threads calling Evented Io.Mutex (fixed in `6674812`)
+
+**symptom**: SIGSEGV at startup after switching to `Io.Evented` backend. crash in
+`Uring.zig` at `Thread` struct field offset from NULL.
+
+**cause**: the resyncer was spawned via `io.concurrent()` which creates an Evented
+fiber. but resync work calls `DiskPersist` methods that take
+`self.mutex.lockUncancelable(self.io)` where `self.io` is `pool_io` (Threaded).
+Threaded futex from an Evented fiber dereferences `Thread.current()` which is a
+threadlocal only set on Uring-managed threads — NULL on Evented fibers.
+
+**fix**: run resyncer on a plain `std.Thread` with `pool_io`, not as an Evented
+fiber. the thread checks `shutdown_flag` to exit.
+
+### fix 7: startup connection ramp throttling (fixed in `f9bf515`)
+
+**symptom**: event loop starvation during initial PDS connection burst (~2,750
+simultaneous connects).
+
+**fix**: throttled startup to connect in batches, preventing io_uring submission
+queue overflow.
+
+### crash 8: cross-Io heap corruption in GC + health checks (fixed in `2156d08` + `b433403`)
+
+**symptom**: steady-state heap corruption / SIGSEGV with zero downstream consumers.
+dmesg shows crash addresses in `Uring.zig` at `Thread` struct field offsets from
+NULL — same signature as crash 6 but in the GC and health check paths.
+
+**cause**: two cross-Io violations active during steady-state operation:
+
+1. **GC loop** ran as an Evented fiber (`io.concurrent(gcLoop, ...)`) but called
+   `dp.gc()` which takes `self.mutex.lockUncancelable(self.io)` where `self.io`
+   is `pool_io` (Threaded), and queries `pg.Pool` (also Threaded). Threaded futex
+   from Evented fiber → NULL `Thread.current()` → heap corruption.
+
+2. **health check endpoints** (`/_readyz`, `/_health`, `/xrpc/_health`) on both the
+   metrics server and API router executed `db.exec("SELECT 1")` through `pg.Pool`
+   (Threaded) from Evented HTTP handler context. same cross-Io violation.
+
+**fix**:
+- GC loop moved from `io.concurrent()` to `std.Thread.spawn(.{}, gcLoop, .{&dp, pool_io})`.
+  the thread is joined during shutdown before `dp.deinit()` runs (dp is stack-owned).
+- health checks replaced with `isDbHealthy()` — reads an atomic `last_db_success`
+  timestamp set by Threaded workers after successful DB queries. safe from any Io.
+- `markDbSuccess()` called from `uidForDid()` (every incoming event) and `gc()`
+  (every 10 minutes).
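
the atomic health flag described above can be sketched roughly as follows. this is a minimal illustration, not zlay's actual code: the `DbHealth` struct, the `now_s` / `max_age_s` parameters, and the "0 means never" convention are assumptions made for the example. only `std.atomic.Value` is real stdlib API. the caller passes the current time in so the sketch does not depend on any particular clock API.

```zig
const std = @import("std");

/// Hypothetical holder for the liveness timestamp. The method names mirror
/// the prose above; the struct layout and parameters are assumptions.
const DbHealth = struct {
    /// Unix seconds of the last successful DB query; 0 means "never".
    last_db_success: std.atomic.Value(i64) = std.atomic.Value(i64).init(0),

    /// Called by Threaded workers after a successful query.
    pub fn markDbSuccess(self: *DbHealth, now_s: i64) void {
        self.last_db_success.store(now_s, .release);
    }

    /// One atomic load: no mutex and no futex, so the calling context
    /// (Evented fiber or plain thread) does not matter.
    pub fn isDbHealthy(self: *const DbHealth, now_s: i64, max_age_s: i64) bool {
        const last = self.last_db_success.load(.acquire);
        return last != 0 and now_s - last <= max_age_s;
    }
};
```

since `gc()` only marks success every 10 minutes, a `max_age_s` shorter than that interval would report unhealthy on a relay with no incoming events. the right threshold is a deployment decision, not something this sketch settles.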
+
+### fix 8: broadcaster double-destroy (fixed in `72ba680`)
+
+**symptom**: use-after-free in broadcaster. `broadcast()` was destroying consumers
+that `Handler.close()` still referenced.
+
+**cause**: `broadcast()` detected dead consumers (via `alive` atomic) and called
+`consumer.shutdown()` + `self.allocator.destroy(consumer)` inline. but the
+consumer's `Handler.close()` callback was still running and would later call
+`removeConsumer()` on the already-freed pointer.
+
+**fix**: `broadcast()` now only does `swapRemove` + count decrement for dead
+consumers. `removeConsumer()` (called from `Handler.close()`) is the sole owner
+of `shutdown()` + `destroy()`.
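
the ownership split in this fix can be illustrated with a small single-threaded sketch. everything here is hypothetical (`Consumer`, `Broadcaster`, the fixed 8-slot array, the `add` helper), and the real broadcaster needs synchronization that is left out. the point is only that the fan-out path unlinks while the close path frees.

```zig
const std = @import("std");

/// Hypothetical consumer: only the `alive` flag matters for this sketch.
const Consumer = struct {
    alive: std.atomic.Value(bool) = std.atomic.Value(bool).init(true),
};

/// Hypothetical broadcaster with a fixed-capacity list. No locking is shown.
const Broadcaster = struct {
    allocator: std.mem.Allocator,
    slots: [8]*Consumer = undefined,
    count: usize = 0,

    fn add(self: *Broadcaster) !*Consumer {
        if (self.count == self.slots.len) return error.Full;
        const c = try self.allocator.create(Consumer);
        c.* = .{};
        self.slots[self.count] = c;
        self.count += 1;
        return c;
    }

    /// Fan-out path: unlinks dead consumers but never frees them.
    fn broadcast(self: *Broadcaster) void {
        var i: usize = 0;
        while (i < self.count) {
            if (!self.slots[i].alive.load(.acquire)) {
                // swap-remove: move the last entry into slot i, shrink the count
                self.count -= 1;
                self.slots[i] = self.slots[self.count];
                continue; // re-check the entry that was swapped into slot i
            }
            // ... send the frame to self.slots[i] here ...
            i += 1;
        }
    }

    /// Close path: the only place a consumer is destroyed.
    fn removeConsumer(self: *Broadcaster, c: *Consumer) void {
        // unlink if broadcast() has not already done so
        for (self.slots[0..self.count], 0..) |s, i| {
            if (s == c) {
                self.count -= 1;
                self.slots[i] = self.slots[self.count];
                break;
            }
        }
        self.allocator.destroy(c);
    }
};
```

with a single owner of `destroy`, the order in which `broadcast()` and the close callback observe a dead consumer no longer matters for memory safety.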
+
+## the cross-Io rule
+
+the single most important lesson from this migration:
+
+**`Io.Mutex`, `Io.Condition`, `io.sleep()`, and any `pg.Pool` operation must be
+called from the same Io type they were initialized with.**
+
+- Threaded futex on Evented fiber → dereferences NULL `Thread.current()` threadlocal
+  → SIGSEGV or heap corruption (crashes 1, 6, 8)
+- Evented futex on plain thread → same NULL deref in the other direction
+
+**safe cross-Io patterns**: raw atomics (`Value`, `fetchAdd`, CAS), `tryLock`
+(non-blocking CAS, no futex), MPSC ring buffers with atomic spinlocks.
+
+**unsafe cross-Io patterns**: `Io.Mutex.lockUncancelable(wrong_io)`,
+`Io.Condition`, `io.sleep(wrong_io, ...)`, `pg.Pool` queries from wrong context.
+
+**fix pattern**: components that use Threaded resources (mutexes, pg.Pool) must
+run on plain `std.Thread`, not Evented `io.concurrent()`. examples: resyncer
+(`439c678`), GC loop (`2156d08`).
+
+**known remaining**: XRPC and admin API handlers run on Evented fibers and access
+`pg.Pool` (Threaded) on external HTTP requests. not steady-state, but real.
+fixing requires either running API handlers on pool_io or making pg.Pool Io-agnostic.
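
a rough sketch of the fix pattern, assuming a hypothetical `Worker` struct in place of the real resyncer or GC state. the Threaded work itself (mutex, `pg.Pool`) is replaced by a counter, since the point is only the lifecycle: spawn a plain thread, signal it through an atomic flag, and join before the owned state goes away.

```zig
const std = @import("std");

/// Stand-in for a component that owns Threaded resources (hypothetical).
const Worker = struct {
    shutdown_flag: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    passes: std.atomic.Value(u64) = std.atomic.Value(u64).init(0),
};

/// Body of the plain thread. The real loop would do its Threaded work here
/// (locking with pool_io, querying pg.Pool); this one only counts passes.
fn workerLoop(w: *Worker) void {
    while (!w.shutdown_flag.load(.acquire)) {
        _ = w.passes.fetchAdd(1, .monotonic);
        std.atomic.spinLoopHint(); // placeholder for a real wait between passes
    }
}

/// Spawn, let the loop make progress, then shut down and join.
fn runAndStop() !u64 {
    var w = Worker{};
    const t = try std.Thread.spawn(.{}, workerLoop, .{&w});
    while (w.passes.load(.monotonic) == 0) std.atomic.spinLoopHint();
    w.shutdown_flag.store(true, .release);
    t.join(); // join before `w` leaves scope, since it is stack-owned
    return w.passes.load(.monotonic);
}
```

the flag is a raw atomic, which is one of the safe cross-Io patterns listed above, so either side may set or read it regardless of which Io it runs on.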
+
+## stdlib patches
+
+zlay patches `lib/std/Io/Uring.zig` at build time (see `patches/uring-networking.patch`,
+applied in `Dockerfile`). this is necessary because the upstream zig 0.16 stdlib ships
+these networking operations as `*Unavailable` stubs that return `error.NetworkDown`.
+
+### what's patched
+
+| function | opcode | why stubbed upstream |
+|---|---|---|
+| `netListenIp` | sync `bind()` + `listen()` | IORING_OP_BIND/LISTEN require kernel 6.11+ |
+| `netAccept` | `IORING_OP_ACCEPT` | was unimplemented |
+| `netConnectIp` | `IORING_OP_CONNECT` | was unimplemented |
+| `netSend` | `IORING_OP_SENDMSG` | was unimplemented |
+| `netRead` | `IORING_OP_READV` / `IORING_OP_READ` | was unimplemented |
+| `netWrite` | `IORING_OP_SENDMSG` | was unimplemented |
+
+also adds a `connect()` helper and `netSendOne()` for individual message sending.
+
+### why not upstream yet
+
+the Uring networking layer is under active development (see zig issue #31723).
+the patch uses sync syscalls for `bind`/`listen` (not io_uring opcodes) because
+those opcodes require kernel 6.11+ and production runs on older kernels. this is
+a pragmatic choice that may not match upstream's desired API shape.
+
+### how it's applied
+
+```dockerfile
+# Dockerfile, lines 11-13
+COPY patches/ patches/
+RUN patch /opt/zig-.../lib/std/Io/Uring.zig < patches/uring-networking.patch
+```
+
+pinned to zig `0.16.0-dev.3059+42e33db9d`. any zig version bump requires
+regenerating the patch.
+
+### other stdlib issues hit (not patched, worked around)
+
+| issue | workaround |
+|---|---|
+| `Io.Event.reset()` assumes no pending waiters — panics under contention | replaced with futex counter in pg.zig fork (crash 2) |
+| `Io.Uring` GPFs under `ReleaseSafe` (aggressive inlining + fiber context) | build with `ReleaseFast` only (`Dockerfile` line 21) |
+| `Io.Mutex` / futex cannot cross Io types | run Threaded workloads on plain `std.Thread` (crashes 1, 6, 8) |
+| `std_options.debug_io` is single-threaded by default | override in root source file for multi-threaded contexts |
+
 ## where things live

 ### repos and their roles
···

 ### concurrency architecture

-two layers, by design:
+three layers, shaped by the cross-Io constraint:

-**network I/O → `io.concurrent` tasks (Io.Threaded fibers)**
-- upstream PDS subscribers (subscriber read loops, ping loops)
+**Evented fibers (`io.concurrent` on `Io.Evented`)**
+- upstream PDS subscribers (read loops, ping loops)
 - downstream consumer write loops
 - DID resolver loops (validator)
-- background tasks: backfill, resync, cleaner, GC, metrics server
+- broadcast loop (drains queue, fans out to consumers)
 - slurper coordination
+- metrics server, backfill, cleaner
+
+**plain `std.Thread` with `pool_io` (`Io.Threaded`)**
+- GC loop — uses `DiskPersist.mutex` + `pg.Pool` (crash 8)
+- resyncer — uses `DiskPersist` + HTTP client (crash 6)
+- these MUST NOT run as Evented fibers (see "the cross-Io rule")

 **CPU-bound ordered processing → explicit `std.Thread` workers**
 - `thread_pool.zig`: `workers[host_id % N]` ensures per-key FIFO ordering
 - `frame_worker.zig`: CBOR decode, DID resolution, signature verify, DB persist
 - bounded backpressure: blocking submit when queue full → TCP backpressure
+- uses `Io.Mutex` / `Io.Condition` with `pool_io` (Threaded futex)

-the frame pool workers use `Io.Mutex` / `Io.Condition` for synchronization.
-under `Io.Threaded`, these use direct kernel futex syscalls — safe from plain
-threads. under `Io.Evented`, they would segfault (see crash 1).
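
the `workers[host_id % N]` routing above comes down to a one-line mapping. a minimal sketch, assuming `host_id` is a `u64` and the helper name `workerIndex` (neither is confirmed by this document):

```zig
const std = @import("std");

/// Pick the worker for a host. Every frame from one host maps to the same
/// index, so a per-worker FIFO queue preserves that host's frame order.
fn workerIndex(host_id: u64, worker_count: usize) usize {
    std.debug.assert(worker_count > 0);
    return @intCast(host_id % @as(u64, worker_count));
}
```

ordering holds per host only: frames from two hosts that land on different workers can be processed in any relative order, which is what lets the pool run in parallel.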
-
-### dependency versions (current `6d6c832`)
+### dependency versions (current `b433403`)

 ```
-zat v0.3.0-alpha.11 (tangled.org)
-websocket.zig 104608b (github, master)
+zat v0.3.0-alpha.16 (tangled.org)
+websocket.zig 80c6434 (github, master)
 pg.zig 5ce2355 (github, dev branch)
 rocksdb-zig cdef67b (github)
-zig 0.16.0-dev.3059+42e33db9d
+zig 0.16.0-dev.3059+42e33db9d (patched Uring networking)
 ```

 ### key env vars
···

 ## what needs to happen next

-1. **deploy `6d6c832`** — all four crashes are fixed, httpFallback dispatch
-   is in. health probes on port 3000 should work. native build, linux
-   cross-compile, fmt all pass.
+1. **deploy `b433403`** — all eight crashes/fixes are in. the steady-state
+   heap corruption (cross-Io GC + health checks) is fixed. broadcaster
+   double-destroy is fixed. health probes use atomic flag, not cross-Io DB query.

 2. **monitor after deploy** — compare against 0.15 baseline:
    - thread count (2,903 on 0.15 — should be similar under Threaded)
    - memory (24.9 GiB VmSize, 1.44 GiB RSS on 0.15)
    - throughput, reconnect behavior, ConsumerTooSlow rate
-   - verify no new crashes after extended run (hours, not seconds)
-   - specifically watch for any remaining GPF — if crash 4 fix is correct,
-     there should be zero GPFs even after hours of operation
+   - verify zero crashes after extended run (hours). the GC cross-Io bug
+     fired every 10 minutes, so 30+ minutes clean = strong signal.
+   - watch `/_health` — should report healthy once `uidForDid` or `gc()` succeeds
+     (within 10 minutes of startup at latest)
+
+3. **known issue — XRPC/admin cross-Io** (not blocking deploy):
+   - API handlers run on Evented fibers but some query `pg.Pool` (Threaded)
+   - only triggered by external HTTP requests, not steady-state
+   - fix: either run API handlers on pool_io, or make pg.Pool Io-agnostic

-3. **follow-up work** (not blocking deploy):
+4. **follow-up work**:
    - investigate Evented backend viability (frame workers → io.concurrent?)
    - consider upstreaming the client write lock to karlseguin/websocket.zig
+   - consider upstreaming Uring networking patch (zig#31723)
+   - evaluate whether pg.Pool can be made Io-agnostic (would eliminate cross-Io issues)