···
 since zat also depends on websocket.zig, a new zat alpha (`v0.3.0-alpha.9`) was
 cut to resolve the diamond dependency.

-## known issue: health probes on port 3000 return 400
+### crash 4: GPF in pingLoop after connection teardown (fixed in `6d6c832`)

-**not yet fixed.** this will cause k8s rollout to fail even if the relay itself
-is healthy.
+**symptom**: GPF in `memcpy → Writer.zig → writeAll → client.zig writeFrame`
+every ~60-90s of processing. same stack trace as crash 3 but different root
+cause — the write lock from crash 3 was necessary but not sufficient.

-### what's happening
+**cause**: `pingLoop` runs as an `io.concurrent` task and sleeps in 1s
+increments. when `readLoop` returns (connection dies), the defer chain runs:
+1. `ping_future.cancel(self.io)` — requests cancellation
+2. `client.deinit()` — frees stream/TLS buffers

-zlay serves the firehose WebSocket + HTTP API on port 3000 via the websocket
-library's `Server`. plain HTTP requests (health probes, XRPC endpoints) are
-supposed to reach zlay through an `httpFallback` mechanism:
+but `pingLoop` did `self.io.sleep(...) catch {}` — swallowing all errors
+including `error.Canceled`. so `cancel()` could not stop the task. `deinit()`
+then freed the client while `pingLoop` was still running. next iteration's
+`writeFrame(.ping, ...)` hit freed memory → GPF.

-```
-main.zig:275 → bc.http_fallback = api.handleHttpRequest
- broadcaster.Handler.httpFallback() delegates to it
-```
-
-but the websocket server never calls `httpFallback`. when a non-upgrade HTTP
-request arrives:
-
-1. `server.zig:1738` — `Handshake.parse()` fails (no websocket headers)
-2. `server.zig:1738` — `respondToHandshakeError()` sends `400 missingheaders`
-3. connection closed. the app handler never sees the request.
+**fix (zlay `6d6c832` + websocket.zig `104608b`)**:
+- `io.sleep(...) catch {}` → `catch return` — makes pingLoop
+ cancellation-cooperative. cancel() can now stop the task before deinit() runs.
+- added `Client.isClosed()` check before `writeFrame` — defense-in-depth
+ against writes to dead connections.

-the server code (`server.zig`) has **no code path** that detects "valid HTTP
-but not a websocket upgrade" and dispatches to `Handler.httpFallback()`. the
-method signature exists on the Handler, the router is fully implemented
-(`api/router.zig`), but the wiring inside the websocket library is missing.
-
-this worked on zig 0.15 — the old websocket library version had this path. it
-was lost during the 0.16 fork migration.
-
-### workaround options
+### fix 5: httpFallback dispatch for health probes (fixed in websocket.zig `4222f98`)

-1. **switch k8s probes to port 3001** — the MetricsServer (`main.zig:67-142`)
- serves `/_healthz` and `/_readyz` directly via `std.http.Server`, no
- websocket library involved. the handlers are identical to the router ones.
- this is the fastest path to a working deploy.
+**symptom**: health probes on port 3000 return 400 instead of reaching the
+application handler. k8s rollout fails.

-2. **fix the websocket server** — add a code path in `server.zig` between
- handshake parse failure and error response that checks for valid HTTP,
- parses method/url/headers/body, and calls `H.httpFallback()` if it exists
- (using `comptime std.meta.hasFn`). this is the correct long-term fix.
+**cause**: the websocket server's `_handleHandshake` sent 400 on any
+non-upgrade HTTP request. there was no code path to detect "valid HTTP but
+not a websocket upgrade" and dispatch to `Handler.httpFallback()`.

-the probes were on port 3000 since initial deploy (commit `e111e47` split them
-to `/_healthz` / `/_readyz`). they worked fine on 0.15. the 400 is a 0.16
-regression in the websocket fork.
+**fix (websocket.zig `4222f98`)**: intercepts `MissingHeaders`,
+`InvalidConnection`, and `InvalidUpgrade` errors from `Handshake.parse()`.
+for these (and only these), a standalone `parseHttpRequest()` re-parses the
+raw buffer and dispatches to `H.httpFallback()` if it exists (comptime
+`hasFn` check). other handshake errors (e.g. `InvalidVersion`) still get 400.
+10 tests cover all request patterns.

 ## where things live

···
 under `Io.Threaded`, these use direct kernel futex syscalls — safe from plain
 threads. under `Io.Evented`, they would segfault (see crash 1).

-### dependency versions (current `0f11cfc`)
+### dependency versions (current `6d6c832`)
178169179170```
-zat v0.3.0-alpha.9 (tangled.org)
-websocket.zig 0261b7d (github, master)
+zat v0.3.0-alpha.11 (tangled.org)
+websocket.zig 104608b (github, master)
 pg.zig 5ce2355 (github, dev branch)
 rocksdb-zig cdef67b (github)
 zig 0.16.0-dev.3059+42e33db9d
···

 ## what needs to happen next

-1. **decide on health probes** — either switch k8s probes to port 3001
- (immediate, works today) or fix the websocket server's httpFallback dispatch
- (correct but more work). both are valid. check ops repo at
- `relay/zlay/deploy/zlay-values.yaml` lines 28-45.
+1. **deploy `6d6c832`** — all four crashes are fixed, httpFallback dispatch
+ is in. health probes on port 3000 should work. native build, linux
+ cross-compile, fmt all pass.

-2. **deploy `0f11cfc`** — once probe issue is addressed. all three crashes are
- fixed. native build, linux cross-compile, fmt, tests all pass.
-
-3. **monitor after deploy** — compare against 0.15 baseline:
+2. **monitor after deploy** — compare against 0.15 baseline:
  - thread count (2,903 on 0.15 — should be similar under Threaded)
  - memory (24.9 GiB VmSize, 1.44 GiB RSS on 0.15)
  - throughput, reconnect behavior, ConsumerTooSlow rate
  - verify no new crashes after extended run (hours, not seconds)
+ - specifically watch for any remaining GPF — if crash 4 fix is correct,
+ there should be zero GPFs even after hours of operation

-4. **follow-up work** (not blocking deploy):
- - fix websocket server httpFallback dispatch for port 3000 HTTP
+3. **follow-up work** (not blocking deploy):
  - investigate Evented backend viability (frame workers → io.concurrent?)
  - consider upstreaming the client write lock to karlseguin/websocket.zig