atproto relay implementation in zig (zlay.waow.tech)


update NOTES.md: document crash 4 fix and httpFallback dispatch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+39 -51
NOTES.md
````diff
···
 since zat also depends on websocket.zig, a new zat alpha (`v0.3.0-alpha.9`) was
 cut to resolve the diamond dependency.

-## known issue: health probes on port 3000 return 400
+### crash 4: GPF in pingLoop after connection teardown (fixed in `6d6c832`)

-**not yet fixed.** this will cause k8s rollout to fail even if the relay itself
-is healthy.
+**symptom**: GPF in `memcpy → Writer.zig → writeAll → client.zig writeFrame`
+every ~60-90s of processing. same stack trace as crash 3 but different root
+cause — the write lock from crash 3 was necessary but not sufficient.

-### what's happening
+**cause**: `pingLoop` runs as an `io.concurrent` task and sleeps in 1s
+increments. when `readLoop` returns (connection dies), the defer chain runs:
+1. `ping_future.cancel(self.io)` — requests cancellation
+2. `client.deinit()` — frees stream/TLS buffers

-zlay serves the firehose WebSocket + HTTP API on port 3000 via the websocket
-library's `Server`. plain HTTP requests (health probes, XRPC endpoints) are
-supposed to reach zlay through an `httpFallback` mechanism:
+but `pingLoop` did `self.io.sleep(...) catch {}` — swallowing all errors
+including `error.Canceled`. so `cancel()` could not stop the task. `deinit()`
+then freed the client while `pingLoop` was still running. next iteration's
+`writeFrame(.ping, ...)` hit freed memory → GPF.

-```
-main.zig:275 → bc.http_fallback = api.handleHttpRequest
-broadcaster.Handler.httpFallback() delegates to it
-```
-
-but the websocket server never calls `httpFallback`. when a non-upgrade HTTP
-request arrives:
-
-1. `server.zig:1738` — `Handshake.parse()` fails (no websocket headers)
-2. `server.zig:1738` — `respondToHandshakeError()` sends `400 missingheaders`
-3. connection closed. the app handler never sees the request.
````
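The crash-4 race described in the diff can be reduced to a small pattern. The sketch below is illustrative only: `Client`, the exact `Io.sleep` signature, and `writeFrame` are assumptions based on the notes, not the real zlay or websocket.zig code, and the zig 0.16-dev `Io` interface is still in flux.

```zig
// Illustrative sketch of the crash-4 bug and fix; types and signatures
// here are assumptions, not zlay's actual code.
fn pingLoop(self: *Client) void {
    while (true) {
        // BUG: `catch {}` also swallows error.Canceled, so
        // `ping_future.cancel(self.io)` could never stop this task, and
        // `client.deinit()` then freed buffers underneath it:
        //
        //     self.io.sleep(1 * std.time.ns_per_s) catch {};
        //
        // FIX: return on any sleep error, including cancellation, so the
        // task exits before deinit() runs (cancellation-cooperative):
        self.io.sleep(1 * std.time.ns_per_s) catch return;

        // defense-in-depth from the same fix: never write after teardown
        if (self.isClosed()) return;
        self.writeFrame(.ping, "") catch return;
    }
}
```

The key property is that every exit from the sleep becomes an exit from the loop, so `cancel()` followed by `deinit()` can never race a live writer.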
```diff
+**fix (zlay `6d6c832` + websocket.zig `104608b`)**:
+- `io.sleep(...) catch {}` → `catch return` — makes pingLoop
+  cancellation-cooperative. cancel() can now stop the task before deinit() runs.
+- added `Client.isClosed()` check before `writeFrame` — defense-in-depth
+  against writes to dead connections.

-the server code (`server.zig`) has **no code path** that detects "valid HTTP
-but not a websocket upgrade" and dispatches to `Handler.httpFallback()`. the
-method signature exists on the Handler, the router is fully implemented
-(`api/router.zig`), but the wiring inside the websocket library is missing.
-
-this worked on zig 0.15 — the old websocket library version had this path. it
-was lost during the 0.16 fork migration.
-
-### workaround options
+### fix 5: httpFallback dispatch for health probes (fixed in websocket.zig `4222f98`)

-1. **switch k8s probes to port 3001** — the MetricsServer (`main.zig:67-142`)
-   serves `/_healthz` and `/_readyz` directly via `std.http.Server`, no
-   websocket library involved. the handlers are identical to the router ones.
-   this is the fastest path to a working deploy.
+**symptom**: health probes on port 3000 return 400 instead of reaching the
+application handler. k8s rollout fails.

-2. **fix the websocket server** — add a code path in `server.zig` between
-   handshake parse failure and error response that checks for valid HTTP,
-   parses method/url/headers/body, and calls `H.httpFallback()` if it exists
-   (using `comptime std.meta.hasFn`). this is the correct long-term fix.
+**cause**: the websocket server's `_handleHandshake` sent 400 on any
+non-upgrade HTTP request. there was no code path to detect "valid HTTP but
+not a websocket upgrade" and dispatch to `Handler.httpFallback()`.
```
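The optional-method dispatch the fix relies on can be pictured with a short comptime sketch. This is not the websocket.zig implementation, only the general `std.meta.hasFn` pattern the notes describe; `HttpRequest` and the surrounding types are invented for illustration.

```zig
const std = @import("std");

// Invented request type; websocket.zig's real parseHttpRequest() and
// handler plumbing are more involved.
const HttpRequest = struct { method: []const u8, url: []const u8 };

fn Server(comptime H: type) type {
    return struct {
        // Called when Handshake.parse() fails with MissingHeaders /
        // InvalidConnection / InvalidUpgrade, i.e. a plain HTTP request.
        fn dispatchPlainHttp(handler: *H, req: HttpRequest) !void {
            // compile-time check: only call httpFallback if H declares it
            if (comptime std.meta.hasFn(H, "httpFallback")) {
                return handler.httpFallback(req);
            }
            // handler has no fallback: keep the old 400 behavior
            return error.MissingHeaders;
        }
    };
}
```

Because the `hasFn` branch is resolved at compile time, handlers without `httpFallback` pay nothing and keep the original 400 path, while handlers that declare it get every non-upgrade HTTP request.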
````diff
-the probes were on port 3000 since initial deploy (commit `e111e47` split them
-to `/_healthz` / `/_readyz`). they worked fine on 0.15. the 400 is a 0.16
-regression in the websocket fork.
+**fix (websocket.zig `4222f98`)**: intercepts `MissingHeaders`,
+`InvalidConnection`, and `InvalidUpgrade` errors from `Handshake.parse()`.
+for these (and only these), a standalone `parseHttpRequest()` re-parses the
+raw buffer and dispatches to `H.httpFallback()` if it exists (comptime
+`hasFn` check). other handshake errors (e.g. `InvalidVersion`) still get 400.
+10 tests cover all request patterns.

 ## where things live
···
 under `Io.Threaded`, these use direct kernel futex syscalls — safe from plain
 threads. under `Io.Evented`, they would segfault (see crash 1).

-### dependency versions (current `0f11cfc`)
+### dependency versions (current `6d6c832`)

 ```
-zat            v0.3.0-alpha.9   (tangled.org)
-websocket.zig  0261b7d          (github, master)
+zat            v0.3.0-alpha.11  (tangled.org)
+websocket.zig  104608b          (github, master)
 pg.zig         5ce2355          (github, dev branch)
 rocksdb-zig    cdef67b          (github)
 zig            0.16.0-dev.3059+42e33db9d
···

 ## what needs to happen next

-1. **decide on health probes** — either switch k8s probes to port 3001
-   (immediate, works today) or fix the websocket server's httpFallback dispatch
-   (correct but more work). both are valid. check ops repo at
-   `relay/zlay/deploy/zlay-values.yaml` lines 28-45.
+1. **deploy `6d6c832`** — all four crashes are fixed, httpFallback dispatch
+   is in. health probes on port 3000 should work. native build, linux
+   cross-compile, fmt all pass.

-2. **deploy `0f11cfc`** — once probe issue is addressed. all three crashes are
-   fixed. native build, linux cross-compile, fmt, tests all pass.
````
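The three pre-deploy checks the new item 1 mentions would look something like the commands below. The notes only say "native build, linux cross-compile, fmt"; the exact cross-compile target triple is an assumption.

```
zig build                            # native build
zig build -Dtarget=x86_64-linux-gnu  # linux cross-compile (triple assumed)
zig fmt --check .                    # formatting gate
```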
```diff
-3. **monitor after deploy** — compare against 0.15 baseline:
+2. **monitor after deploy** — compare against 0.15 baseline:
    - thread count (2,903 on 0.15 — should be similar under Threaded)
    - memory (24.9 GiB VmSize, 1.44 GiB RSS on 0.15)
    - throughput, reconnect behavior, ConsumerTooSlow rate
    - verify no new crashes after extended run (hours, not seconds)
+   - specifically watch for any remaining GPF — if crash 4 fix is correct,
+     there should be zero GPFs even after hours of operation

-4. **follow-up work** (not blocking deploy):
-   - fix websocket server httpFallback dispatch for port 3000 HTTP
+3. **follow-up work** (not blocking deploy):
    - investigate Evented backend viability (frame workers → io.concurrent?)
    - consider upstreaming the client write lock to karlseguin/websocket.zig
```