experiments in a post-browser web

docs(pubsub): state machine design + task-queue updates

+581
+569
docs/pubsub-state-machine.md
# PubSub State Machine

Status: design, pre-implementation. Drafted 2026-04-23 in response to three
distinct manifestations of the same "lost publish" bug class during Phases
E/F of the tile-lifecycle FSM rollout. Layers on top of
[tile-lifecycle-fsm.md](tile-lifecycle-fsm.md) — that doc governs *when a
tile is alive*; this doc governs *what messages can flow between tiles (and
the core) in each state*.

Goal: make the class of bug "a publish was dropped somewhere in the fabric"
**impossible**. If a message is dropped, the state machine identifies
exactly which transition was invalid. Runtime enforcement + bisectable
tests derive from the machine.

## Non-goals

- Re-implementing the tile lifecycle FSM (that lives in tile-fsm.ts /
  tile-lifecycle.ts). This doc references it.
- Changing the renderer-visible API surface (`api.publish`, `api.subscribe`,
  `api.commands.register`). Semantics tighten; shape stays.
- Solving cross-machine / cross-process pubsub beyond the existing
  main-process hub. The fabric remains one hub.

## Participants

Every pubsub participant is in exactly one of these roles. Each role has a
fixed identity (source address) and a fixed set of legal capabilities.

| Role | Identity | Notes |
|---|---|---|
| **Core** | `peek://app/...` | bgWindow (`app/background.html` → `app/index.js` runs cmd registry / page / hud) + cmd panel UI window (`app/cmd/panel.html`). Both carry `trustedBuiltin`; pubsub doesn't distinguish them. First-class publisher AND subscriber; every publish must reach bgWindow. |
| **Tile** | `peek://{tileId}/{entry}` | One per tile BrowserWindow. Capability-gated via tile token. |
| **System** | `peek://system/` | Main-process code publishing on its own behalf (e.g., `tag:item-added` from datastore IPC handlers, `window:focus-changed` from app events). Not a window, not a renderer. In-process call to `publish()`; no token, no IPC hop. |
| **Webview guest** | `peek://{feature}/...` webContents inside a `<webview>` | Hosted inside tile windows. Receives broadcasts via the same pubsub forwarding path. |

### The token is the capability handle

A tile's **token is the capability handle** for all pubsub and command
operations. The token is minted on `loading → ready` and revoked on
`→ unloading` / `→ crashed` by `tile-lifecycle.ts`. This makes the token
the atomic, race-free signal of "tile is in `ready` or `visible`":

- If `validateToken(t)` returns a grant, the tile is alive.
- If it returns null, the tile is not alive — publish/subscribe rejected.

**Consequence**: the pubsub enforcement path does **not** read lifecycle
state. It validates the token and consults the grant. Token validity and
lifecycle state are kept in sync by `tile-lifecycle.ts` as the sole
owner of both. This removes a hot-path cross-module dependency between
`pubsub.ts` and `tile-lifecycle.ts`, eliminates the transition-race
question ("was the publish checked before or after the state change?"),
and keeps the pubsub FSM testable as a pure function (see §Module
layout).

**Deleted concepts**:
- `peek://{id}/lazy-stub` pseudo-source — no longer a distinct participant.
  Lazy declaration is a *state* (`registered` per the lifecycle FSM), not a
  fake source address. The load-on-dispatch hook reads lifecycle state
  directly (it lives on the lifecycle side of the boundary).
- V1 extension-host iframes — gone as of 2026-04-21.
- Legacy `IPC_CHANNELS.SUBSCRIBE` / `PUBLISH` paths — folded into tile
  pubsub. Core uses the same capability-gated path as every tile, with a
  `trustedBuiltin` grant.

## Transport layers

All paths converge at `publish()` in `pubsub.ts`. There is exactly one hub.

```
Core/Tile renderer
        │
        ▼  api.publish / api.subscribe (from tile-preload.cts)
        │
ipcMain.on('tile:pubsub:publish')   ──► capability-gated validate
ipcMain.on('tile:pubsub:subscribe')         │
                                            ▼
                                 publish() in pubsub.ts
                                            │
             ┌──────────────────────────────┼──────────────────────────────┐
             ▼                              ▼                              ▼
    in-proc callbacks               pubsubBroadcaster                 (hook)
    (topics Map)                    ├─ core window (bgWindow)         pre-publish
    — subscribers registered        ├─ tile windows                   can skip/defer
      via tile:pubsub:subscribe     └─ webview guests
```

**Binding rules**:
- No component calls `webContents.send('pubsub:...')` directly. The
  broadcaster is the only sender of pubsub IPC frames.
- The broadcaster iterates **bgWindow + all tile windows + qualifying
  webview guests**. Omitting bgWindow is a bug (historical; see
  `project_bgwindow_is_core.md`).
- Main-process code that wants to publish as System uses
  `publish(SYSTEM_ADDRESS, ..., topic, data)` directly — no IPC hop.

**Broadcaster fan-out — subscriber-indexed delivery**

The broadcaster must **not** blanket-iterate every renderer on every
publish. It consults the per-topic subscriber set maintained by
`tile:pubsub:subscribe`: a window that never subscribed to topic T
never receives frames for T. This turns fan-out from O(N windows) to
O(S subscribers-of-T), typically much smaller.

This is the single biggest perf lever in the fabric — at high widget
counts (50+ tile windows, multiple webview guests each), blanket
iteration at 60Hz (e.g., for `page:scroll`-style topics) will saturate
IPC. Subscriber-indexed delivery keeps it bounded by actual interest.

For topics with genuinely many subscribers (e.g., `tag:item-added`
during bulk sync), the fan-out is still proportional to how many
renderers actually care — which is the minimum unavoidable cost.
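
As a rough illustration of subscriber-indexed delivery, the sketch below keeps a per-topic set of window ids and fans out only to that set. `SubscriberIndex`, `WindowId`, and the `send` callback are hypothetical names chosen for this example; the real broadcaster in `pubsub.ts` may be shaped differently.

```typescript
// Illustrative sketch only: SubscriberIndex, WindowId and Send are
// hypothetical names, not identifiers from pubsub.ts.
type WindowId = number;
type Send = (windowId: WindowId, topic: string, data: unknown) => void;

class SubscriberIndex {
  // topic -> ids of the windows that subscribed to it
  private readonly byTopic = new Map<string, Set<WindowId>>();

  subscribe(windowId: WindowId, topic: string): void {
    let subscribers = this.byTopic.get(topic);
    if (!subscribers) {
      subscribers = new Set<WindowId>();
      this.byTopic.set(topic, subscribers);
    }
    subscribers.add(windowId);
  }

  // Drop every subscription held by one window (e.g. on unload or crash).
  unsubscribeAll(windowId: WindowId): void {
    this.byTopic.forEach((subscribers, topic) => {
      subscribers.delete(windowId);
      if (subscribers.size === 0) this.byTopic.delete(topic);
    });
  }

  // Deliver to subscribers of `topic` only; returns how many frames were sent.
  broadcast(topic: string, data: unknown, send: Send): number {
    const subscribers = this.byTopic.get(topic);
    if (!subscribers) return 0;
    subscribers.forEach((windowId) => send(windowId, topic, data));
    return subscribers.size;
  }
}
```

With this shape, a publish on a topic nobody subscribed to costs one map lookup and sends zero frames, which is the O(S subscribers-of-T) bound described above.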

## Topic taxonomy

Topics live in flat namespaces separated by `:`. Every topic belongs to one
of these classes. The class determines who may publish, who may subscribe,
and what capability gating applies.

| Class | Pattern | Publishers | Subscribers | Gate |
|---|---|---|---|---|
| **Command dispatch** | `cmd:execute:{name}` | Core (panel + chains), other tiles (chained execution) | Owning tile's handler (via `api.commands.register`) | Capability: tile must have `commands`. Load-on-dispatch hook may defer. |
| **Command result** | `cmd:execute:{name}:result:{uuid}` | Owning tile | Dispatcher that set `expectResult:true` | Capability: same `commands` grant (infra carve-out). |
| **Command registry** | `cmd:register`, `cmd:register-batch`, `cmd:unregister` | Tiles with valid token (= state ≥ `ready`) | Core (cmd registry) | Dynamic registrations only. `cmd:register-batch` is the bulk variant (used by noun expansion). *Declared* commands come from manifest cache (not pubsub) per tile-lifecycle-fsm.md §Invariants. |
| **Noun registry** | `noun:register-batch`, `noun:unregister` | Tiles with valid token | Core (noun registry) | Same as cmd:register. |
| **Noun dispatch** | `noun:browse:{name}`, `noun:query:{name}`, `noun:open:{name}` | Core (panel) | Noun handler tile | Capability: `commands` (nouns auto-generate commands). |
| **Lifecycle observer** | `tile:state-changed` (mirror of lifecycle transitions) | `tile-lifecycle.ts` (System) | Observers (drift detector, HUD widgets) with `lifecycle-observer` capability | Read-only observer topic. Publishers must not publish this directly. `lifecycle-observer` is a new capability declared in the tile manifest (typical grantees: drift, HUD widgets that display tile state). |
| **Domain events** | `tag:*`, `item:*`, `sync:*`, `editor:*`, `entities:*`, `page:*`, `window:*`, etc. | System (datastore IPC handlers) + tiles whose manifest declares ownership of the namespace (e.g., tags feature owns `tag:*`) | Any tile whose capability grant allows subscription to the namespace | Capability allowlist on BOTH publish and subscribe sides. Namespace ownership is a manifest field: `owns: ['tag']`. |
| **Feature-scoped** | `{feature}:{verb}` (e.g., `websearch:engine-request`) | That feature's tile(s) | That feature's tile(s) | Capability allowlist. Cross-feature publish is a violation — see §Cross-feature rule below. |
| **Settings** | `topic:core:prefs`, `settings:changed:{feature}`, `settings:navigate` | Core (settings UI) | Any tile whose capability grant allows subscription to the topic | Publishers are Core/System; subscribers pass the normal capability allowlist through the gate. |

### Cross-feature rule

A feature tile may **not** publish to another feature's topic namespace. If
A needs data from B, it calls B's registered command (`api.commands.call`
→ `cmd:execute:{b-command}` round-trip), which goes through the full
dispatch state machine. This rule exists because ad-hoc cross-feature
topics (see tasks.md: websearch bg↔home round-trip) are the most common
source of fragile pubsub.

### Topics that are NOT pubsub: private lifecycle IPC

`tile:ready` and `tile:shutdown` are **not** pubsub topics. They are
private IPC signals between a tile's preload and `tile-lifecycle.ts`:

- `tile:lifecycle:ready` — preload → main, sent once after capability
  token validated. Causes lifecycle transition `loading → ready`.
- `tile:lifecycle:shutdown` — main → preload, sent during unload grace
  window.

**Why private**: `tile:ready` is the signal that *admits a tile as a
publisher*. If it flowed through pubsub itself, the tile would be
publishing before it has been admitted — a bootstrap circularity that
either requires a carve-out or introduces a race. Private IPC removes
the circularity entirely.

Observers that need to react to lifecycle transitions (drift detector,
HUD widgets) subscribe to the mirrored `tile:state-changed` pubsub
topic, which `tile-lifecycle.ts` publishes as System after a transition
lands. One direction, after the fact.

### Deleted topics

- `cmd:request-registers` — race workaround where Core on boot asks all
  tiles to replay their cached `cmd:register` payloads. Exists only because
  subscribers could land after publishers. With the FSM's invariant that
  Core's subscribers are live before any tile reaches `ready`, this topic
  has no job. **Delete.**
- `ext:ready`, `ext:all-loaded` — v1 lifecycle handshake. Replaced entirely
  by the private `tile:lifecycle:ready` signal + cache-backed declared
  commands. **Delete.**

## Authorization rules (not state lookups)

The pubsub enforcement layer does not inspect lifecycle state. It runs
one check: `validateToken(t) → grant | null`, then consults the grant
to answer "is this (publisher, topic, op) allowed?"

Because `tile-lifecycle.ts` mints the token on `loading → ready` and
revokes it on `→ unloading` / `→ crashed`, this single check implicitly
enforces the full state-based policy. Rules:

| Publisher | Operation | Check |
|---|---|---|
| Core (`trustedBuiltin` grant) | any publish, any subscribe | always allow |
| System (in-process main) | any publish | always allow; never subscribes |
| Tile with valid token | publish to topic T | grant includes `publish` for T's class (subject to per-class rules below) |
| Tile with valid token | subscribe to topic T | grant includes `subscribe` for T's class |
| Tile with revoked/missing token | any op | reject — logged as `tile:drift` telemetry, never silently dropped |

Per-class rules (the capability-grant shape implements these):

| Topic class | Who may publish | Who may subscribe |
|---|---|---|
| `cmd:execute:{name}` | Any with `commands` grant (dispatcher) | Only the declared owner of `{name}` |
| `cmd:execute:{name}:result:{uuid}` | Only the owner of `{name}` | Only the dispatcher that set `resultTopic` (private-by-uuid) |
| `cmd:register` / `cmd:unregister` | Any with `commands` grant | Core only |
| `noun:register-batch` / `noun:unregister` | Any with `commands` grant | Core only |
| `noun:browse:{n}` / `noun:query:{n}` / `noun:open:{n}` | Core (panel) | Owner of noun `{n}` |
| `tile:state-changed` | System only (`tile-lifecycle.ts`) | Any with `lifecycle-observer` grant |
| Domain event `{ns}:{verb}` | System OR tile with `{ns}` ownership | Any tile whose capability allowlist matches |
| Feature-scoped `{feature}:{verb}` | Only the `{feature}` tile(s) | Only the `{feature}` tile(s) (cross-feature = violation) |

**Lifecycle state is not consulted at the enforcement point.** If the
token is valid, the tile is in `ready` or `visible` by definition. The
two FSMs couple through the token's lifetime, not through shared state
reads — this is the pure-function boundary that makes the pubsub FSM
testable in isolation.
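
A minimal sketch of how the authorization tables above might reduce to a pure function follows. The `Grant` shape (`capabilities`, `owns`, `subscribes`), the `AuthRequest` type, and the violation strings are assumptions made for illustration. The sketch also deliberately omits the per-command and per-noun owner checks and the Core-only publisher rule for noun dispatch, so it is looser than the full matrix.

```typescript
// Hypothetical shapes: Grant, AuthRequest and the violation strings are
// assumptions for this sketch, not the real pubsub-fsm.ts API.
type Role = "core" | "system" | "tile";
type Op = "publish" | "subscribe";
type Decision = "allow" | { violation: string };

interface Grant {
  capabilities: string[]; // e.g. ["commands", "lifecycle-observer"]
  owns: string[];         // namespaces the tile may publish to, e.g. ["tag"]
  subscribes: string[];   // namespaces the tile may subscribe to
}

interface AuthRequest {
  role: Role;
  grant: Grant | null; // null models a revoked or missing token
  topic: string;
  op: Op;
}

const has = (list: string[], item: string): boolean => list.indexOf(item) !== -1;

function allow(req: AuthRequest): Decision {
  if (req.role === "core") return "allow";
  if (req.role === "system") {
    return req.op === "publish" ? "allow" : { violation: "system-never-subscribes" };
  }
  const grant = req.grant;
  if (grant === null) return { violation: "invalid-token" };

  const { topic, op } = req;
  if (topic === "tile:state-changed") {
    if (op === "publish") return { violation: "system-only-publisher" };
    return has(grant.capabilities, "lifecycle-observer")
      ? "allow"
      : { violation: "missing-capability:lifecycle-observer" };
  }

  const ns = topic.split(":")[0];
  if (ns === "cmd" || ns === "noun") {
    if (!has(grant.capabilities, "commands")) {
      return { violation: "missing-capability:commands" };
    }
    // Registry topics are consumed by Core only.
    const isRegistry = /^(cmd|noun):(register|register-batch|unregister)$/.test(topic);
    if (isRegistry && op === "subscribe") return { violation: "core-only-subscriber" };
    return "allow"; // owner checks omitted in this sketch
  }

  // Domain and feature-scoped topics: namespace allowlist on both sides.
  const allowed = op === "publish" ? grant.owns : grant.subscribes;
  return has(allowed, ns) ? "allow" : { violation: "namespace-not-granted:" + ns };
}
```

Because the function takes the grant as an argument and never reads lifecycle state, it can be exercised with plain objects in unit tests.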

`tile:lifecycle:ready` and `tile:lifecycle:shutdown` are private IPC
(§Topics that are NOT pubsub), so they never hit this table.

### The IPC chokepoint

All `tile:*` IPC frames pass through a single main-process gate
(`tile-ipc-gate.ts`, see §Module layout). The gate runs a fixed
sequence before any handler executes:

1. **Channel allowlisted?** `registerTileIpc(channel, handler, descriptor)`
   is the only way to attach a `tile:*` listener. An unregistered
   channel receives a default handler that logs `tile:drift` and drops.
2. **Sender identity verified?** `event.sender` (the `WebContents` that
   sent the frame) must match the `WebContents` that owns the
   `payload.token`. This closes the "forged token" hole — a tile with
   XSS cannot smuggle out another tile's leaked token because the
   sender frame wouldn't match.
3. **Payload schema valid?** Each channel descriptor declares its
   expected shape. Malformed frames log `tile:drift` and drop.
4. **Token valid & grant consistent with channel?** The channel
   descriptor names the capabilities required; absent ones → reject.
5. **State-at-receive matches channel's expected transition window?**
   E.g., `tile:lifecycle:ready` may only arrive while the tile is in
   `loading`; arrival in any other state → reject + `tile:drift`.
6. **Sender role allowlisted for this channel?** E.g.,
   `tile:lifecycle:ready` must come from a tile's own preload, never
   from Core or System.

Only after all six pass does the handler run. Rejections are never
silent; every drop emits `tile:drift` with a structured reason so CI
can fail on regressions.

**Performance budget**: schema validation (step 3) uses hand-rolled
shape checks (`typeof x.foo === 'string' && Array.isArray(x.bar)`),
not a general-purpose schema library (Zod / Ajv). Total gate cost
stays under ~5μs per frame — negligible next to Electron's intrinsic
IPC serialization + cross-process handoff (~10-50μs).

This chokepoint is the main-process analog of the authorization-rules
table above — it guarantees that the rules can't be bypassed by a
handler that forgets to validate, and it adds the sender-frame check
that the pure FSM alone can't express.

## Command dispatch & result — one path

**Rule**: a command result returns to its dispatcher by publishing to the
result topic through the same capability-gated pubsub path that every
other publish uses. Specifically:

- Dispatcher publishes `cmd:execute:{name}` with `{expectResult: true,
  resultTopic: 'cmd:execute:{name}:result:{uuid}'}`.
- Handler in the owning tile publishes to that result topic when it
  completes.
- Both publishes go through `tile:pubsub:publish` → capability allowlist
  → `publish()`.

The allowlist's `cmd:execute:*` infra carve-out for tiles holding the
`commands` grant (tile-ipc.ts, landed in commit `f32063db` on
2026-04-23) is what makes result-topic publishes legal without requiring
every tile to explicitly declare every possible UUID-suffixed topic in
its manifest.

**Legacy side-channel to remove**: tile-preload currently also sends a
parallel `ipcRenderer.send('tile:command:result', ...)` frame
(tile-preload.cts:411) that hits an unrestricted main-process publish
handler (tile-ipc.ts `ipcMain.on('tile:command:result', ...)`). This
duplicate path predates the allowlist carve-out and bypasses the gate
entirely. It is redundant with the pubsub publish above and must be
deleted — one auth path, one code path, one place to reason about who
can emit a command result.
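
The single result path stated in the rule above could look roughly like the following toy model. `Bus`, `registerCommand`, `dispatch`, and the counter standing in for a UUID are all hypothetical; the in-memory bus replaces the real `tile:pubsub:publish` hop and skips capability checks and timeouts entirely.

```typescript
// Toy model only: Bus, registerCommand and dispatch are hypothetical names.
// The real path crosses IPC and the capability gate; this one does not.
type Message = Record<string, unknown>;
type Handler = (msg: Message) => void;

class Bus {
  private readonly topics = new Map<string, Set<Handler>>();

  subscribe(topic: string, handler: Handler): () => void {
    let handlers = this.topics.get(topic);
    if (!handlers) {
      handlers = new Set<Handler>();
      this.topics.set(topic, handlers);
    }
    handlers.add(handler);
    const owner = handlers;
    return () => { owner.delete(handler); };
  }

  publish(topic: string, msg: Message): void {
    const handlers = this.topics.get(topic);
    // Copy first so a handler may unsubscribe while we iterate.
    if (handlers) Array.from(handlers).forEach((h) => h(msg));
  }
}

let nextId = 0; // stand-in for a real UUID generator

// Owning-tile side: run the command, then publish the outcome to the result topic.
function registerCommand(bus: Bus, name: string, run: (args: unknown) => unknown): void {
  bus.subscribe("cmd:execute:" + name, (msg) => {
    let reply: Message;
    try {
      reply = { data: run(msg.args) };
    } catch (err) {
      reply = { error: String(err) };
    }
    const resultTopic = msg.resultTopic;
    if (msg.expectResult === true && typeof resultTopic === "string") {
      bus.publish(resultTopic, reply);
    }
  });
}

// Dispatcher side: subscribe to the private result topic, then publish the execute.
function dispatch(bus: Bus, name: string, args: unknown): Promise<unknown> {
  const resultTopic = "cmd:execute:" + name + ":result:" + String(++nextId);
  return new Promise((resolve, reject) => {
    const off = bus.subscribe(resultTopic, (msg) => {
      off(); // one-shot subscription
      const error = msg.error;
      if (typeof error === "string") reject(new Error(error));
      else resolve(msg.data);
    });
    bus.publish("cmd:execute:" + name, { args, expectResult: true, resultTopic });
  });
}
```

Note that the dispatcher subscribes to the result topic before publishing the execute, so even a handler that replies synchronously cannot race the subscription.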

```
Dispatcher                                Owning tile
(Core/panel or another tile)              (capability: commands)
    │                                         │
    │  publish('cmd:execute:X',               │
    │    { ..., expectResult:true,            │
    │      resultTopic:'cmd:execute:X:        │
    │        result:<uuid>' })                │
    │ ───────────────────────────────────────►│
    │                                         │
    │                         [load-on-dispatch hook
    │                          ensures state≥ready]
    │                                         │
    │                                  [handler runs]
    │                                         │
    │                         publish('cmd:execute:X:
    │                           result:<uuid>',
    │                           { data } | { error })
    │ ◄───────────────────────────────────────│
    │                                         │
resolve proxy promise
```

**Concrete removals**:
- `ipcRenderer.send('tile:command:result', ...)` call in
  `api.commands.register` handler (tile-preload.cts:411). Keep the
  `tile:pubsub:publish` call at line 401.
- `ipcMain.on('tile:command:result', ...)` handler in tile-ipc.ts.

## Scope semantics — delete

The `scope` argument (`SYSTEM=1 | SELF=2 | GLOBAL=3`) is vestigial.

Survey results (2026-04-23):
- `api.scopes.GLOBAL` — ~100 sites across `app/**`.
- `api.scopes.SYSTEM` — 4 sites (3 in `app/index.js`, 1 in
  `app/settings/settings.js`).
- `api.scopes.SELF` — **zero sites**.

Preload's `publishImpl` defaults missing scope to `SELF`, but since no
caller passes SELF and SELF delivery is restricted to same-pseudo-host
subscribers (which for a one-window core = same-window), the SELF code
path is unreachable.

**Plan** (single commit, no backcompat grace period):
- Delete the `scopes` constant and the `scopeCheck()` function in
  `pubsub.ts`. All `publish()` calls route to both in-proc callbacks
  (filtered by topic match only) and the broadcaster.
- Migrate the 4 `api.scopes.SYSTEM` sites to regular publishes on
  system-privileged topics (e.g., `topic:core:prefs`). Privilege moves
  from the scope argument onto the topic — the topic's capability
  allowlist controls who may publish and who may subscribe.
- Remove the `scope` parameter from `api.publish` and `api.subscribe`.
  Migrate all ~100 call sites in `app/**` in the same commit. No
  backcompat shim, no no-op tolerance.

## Subscribe-before-publish invariant

This is the missing rule that explains every "lost publish at boot" bug.

**Statement**: For any topic T and any publisher P, at the moment P
publishes T, every subscriber S that will ever receive T during this boot
must already be registered.

**Enforcement**:
- **Core subscribers land first.** `app/index.js` subscribes to
  `cmd:register`, `cmd:register-batch`, `noun:register-batch`, and all
  core-managed domain topics during its synchronous init, BEFORE any tile
  is transitioned to `loading`. The main process waits for bgWindow's
  `tile:lifecycle:ready` (private IPC) before any `registered → loading`
  transition for other tiles.
- **Tile publishes are gated by token existence.** A tile in `loading`
  has no token yet (the token is minted on `loading → ready`). Any
  publish attempt from a `loading` tile fails the chokepoint's token
  check and is rejected + logged as `tile:drift`. No capability check
  is needed — there's no token to consult.
- **No replay needed.** Because subscribers exist before any publisher
  fires, `cmd:request-registers` and the `registeredPayloads` cache in
  tile-preload.cts become unnecessary. **Delete both.**

This is the single rule that kills the class of bug "only hello-world's
commands visible in the cmd panel."
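
One way to picture the "Core subscribers land first" rule is a small gate in the main process that holds every `registered → loading` request until bgWindow has reported ready. The `BootGate` class and its method names below are hypothetical, and the callbacks stand in for whatever actually starts a tile load.

```typescript
// Hypothetical sketch: BootGate, markCoreReady and requestLoad are
// illustrative names, not part of tile-lifecycle.ts.
class BootGate {
  private coreReady = false;
  private pending: Array<() => void> = [];

  // Call when bgWindow's private lifecycle-ready IPC arrives.
  markCoreReady(): void {
    this.coreReady = true;
    const queued = this.pending;
    this.pending = [];
    queued.forEach((start) => start());
  }

  // Ask to move a tile from registered to loading. The start callback
  // runs immediately if Core is ready, otherwise once it becomes ready.
  requestLoad(start: () => void): void {
    if (this.coreReady) start();
    else this.pending.push(start);
  }
}
```

Since no tile can leave `registered` before `markCoreReady()` runs, no tile can obtain a token, and therefore no tile can publish, before Core's subscriptions exist.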

## Load-on-dispatch — timeout UX

When Core dispatches `cmd:execute:X` to a tile in `registered`, the
pre-publish hook forces a `registered → loading → ready` transition before
delivering. `LAZY_LOAD_TIMEOUT_MS` (currently 10s) bounds the wait.

**On timeout** (today: reject the pending promise with an error that the
cmd panel surfaces as a spinner hang):
- Dispatcher promise rejects with a structured error.
- Core publishes `notification:show` with `{ type: 'error', title: 'Tile
  didn't load', body: 'Command \"{name}\" couldn't run because its tile
  failed to load within 10 seconds.' }`.
- Cmd panel drops back to IDLE — no lingering spinner.

**On render-process-gone during load** (transition `loading → crashed`):
- Same notification, different body (`crashed while loading`).
- Dispatcher rejects immediately — no 10s wait.

## Bypass detection

Everything routed through the FSM is enforced at the gate; rejections
emit `tile:drift` telemetry (see §Authorization rules). The only way
for a message to escape the FSM is for code to call a lower-level
Electron API directly, sidestepping the gate entirely. The FSM cannot
prevent this — anyone writing main-process code can do anything —
so we need complementary bypass detection.

Two bypass categories:

- **Direct-send bypass** — `webContents.send('pubsub:...')` called
  outside the broadcaster. Mitigation: lint rule against
  `webContents.send` with a `pubsub:` prefix, plus a dev-mode wrap of
  `webContents.send` that throws on the same pattern.
- **Off-path window creation** — `new BrowserWindow()` with a
  `peek://{tileId}/...` URL outside the tile launcher. Mitigation: lint
  rule + dev-mode assertion in the FSM that every BrowserWindow with a
  tile URL has a corresponding `registered → loading` transition.
  (Inherited from tile-lifecycle-fsm.md §Drift detectors.)

Both are dev/CI-only assertions. Production has nothing to check —
correct code can't trip them, and incorrect code is caught at review
or in dev.

What is explicitly **not** a drift detector:
- Gate rejections (enforcement, already covered).
- Startup races (if they occur, the FSM design is wrong — fix the design,
  don't paper over it with a detector).
- Unrouted publishes (legitimate for domain events with no listeners).

**Drift emission rate limit**: `tile:drift` publishes are themselves
rate-limited to one event per `(tileId, reason)` tuple per second, with
a dropped-count aggregator. A buggy or malicious tile spamming rejected
frames cannot amplify into a drift-publish storm that saturates the
broadcaster.

## Module layout

Dependencies are one-way: `tile-lifecycle` → `pubsub`. `pubsub` never
imports from `tile-lifecycle`. This is what makes the pubsub FSM
testable in isolation and eliminates the race between state lookup and
state transition.

- **`backend/electron/pubsub-fsm.ts`** (pure, new) — the authorization
  matrix as a pure function: `allow({role, grant, topic, op}) →
  'allow' | {violation: reason}`. No Electron / Node imports. Topic
  classification lives here. Testable without spinning up main.
- **`backend/electron/tile-fsm.ts`** (pure, exists) — lifecycle
  transition table. Unchanged.
- **`backend/electron/pubsub.ts`** — the hub: delivery, broadcaster
  wiring, pre-publish hooks. Imports `pubsub-fsm.ts` only. Exposes
  `unsubscribeAllByPrefix(tileId)` for lifecycle cleanup.
- **`backend/electron/tile-ipc-gate.ts`** (new) — the single IPC
  chokepoint (see §The IPC chokepoint). Exposes `registerTileIpc(channel,
  handler, descriptor)`. Runs channel allowlist check, sender-frame
  cross-check against token owner, payload schema validation, token
  validation, state-at-receive assertion, sender-role allowlist.
  Every drop emits `tile:drift`. No `tile:*` IPC handler is attached
  except through this gate.
- **`backend/electron/tile-ipc.ts`** — individual channel handlers,
  registered via `registerTileIpc`. On `tile:pubsub:publish` /
  `tile:pubsub:subscribe`: the gate has already validated the frame, so
  the handler just calls `pubsubFsm.allow(...)` for per-topic
  authorization → publish or reject+drift. This is where the
  lifecycle↔pubsub boundary sits.
- **`backend/electron/tile-lifecycle.ts`** — state store + token
  lifecycle. On `loading → ready`: mint token. On `→ unloading` /
  `→ crashed`: revoke token, call `pubsub.unsubscribeAllByPrefix()`.
  Publishes `tile:state-changed` for observers. Imports from
  `pubsub.ts` (one-way); never imported by it.
- **`backend/electron/tile-drift.ts`** (new) — bypass detectors (the
  two from §Bypass detection) + dev-mode wrap of `webContents.send`.
  Owns the `tile:drift` GLOBAL topic used by the gate's rejection
  telemetry.
- **`backend/electron/tile-lazy.ts`** — load-on-dispatch pre-publish
  hook. Already lives on the lifecycle side (it triggers
  `registered → loading`). Continues to register as a pre-publish hook
  on `pubsub.ts`; the hook body calls into lifecycle.

Test boundaries:
- `pubsub-fsm.ts` — pure unit tests over the authorization matrix.
- `tile-fsm.ts` — pure unit tests over the transition table (exists).
- `tile-ipc.ts` — integration tests with real tokens + real grants.
- Lifecycle cleanup — integration test that `→ crashed` clears the
  tile's subscriptions.

## Phased implementation

1. **Phase 1 — Fix bgWindow broadcast + rename broadcaster.** Add
   bgWindow to the broadcaster iteration. Rename
   `extensionBroadcaster` → `pubsubBroadcaster` (the "extension" label
   is a stale v1 term — Peek now reserves "extension" for bundled
   Chromium extensions; features/tiles are the current vocabulary).
   This alone unblocks the "only hello-world visible" and "v2 result
   doesn't reach subscribers" bugs. No semantic changes.
   **Test**: tag/untag/widget-update smoke tests go green.
2. **Phase 2 — Sender-frame cross-check (security hardening).**
   Independently shippable. In every `tile:*` handler, verify
   `event.sender` matches the `WebContents` that owns
   `payload.token`. Closes the "tile with XSS forges another tile's
   token" hole. Does not require the full IPC chokepoint (Phase 8) but
   prepares for it — this check is worth landing early because it's
   security, not cleanup.
   **Test**: new unit test that simulates a mismatched sender + valid
   token → rejected with `tile:drift`.
3. **Phase 3 — Collapse to one command-result path.** Delete the
   `tile:command:result` IPC (preload send site + main-process
   handler). All command results flow through capability-gated pubsub
   publish of the result topic. **Test**: every test exercising command
   results stays green.
4. **Phase 4 — Private lifecycle IPC + subscribe-before-publish.**
   Split `tile:ready` / `tile:shutdown` off the pubsub bus onto private
   `tile:lifecycle:*` IPC channels handled directly in
   `tile-lifecycle.ts`. `app/index.js` wires all core subscribers
   during synchronous init; main process gates first tile
   `registered → loading` on bgWindow's private lifecycle-ready IPC.
   Add `tile:state-changed` as the public observer mirror. **Test**:
   unit test that simulates tile emitting `cmd:register` at `t=0` boot
   — subscriber in core gets it. No pubsub-level `tile:ready` topic
   remains.
5. **Phase 5 — Delete replay machinery.** Remove `cmd:request-registers`
   topic + `registeredPayloads` cache + `ensureCmdRequestRegistersListener`
   in tile-preload.cts. **Test**: cmd registry has full contents after
   cold boot without replay.
6. **Phase 6 — Delete scope.** Remove `scopes` constant from `pubsub.ts`
   + `api.scopes` surface in tile-preload. `api.publish`/`api.subscribe`
   lose their scope parameter. Migrate all ~100 call sites in `app/**`
   + the 4 SYSTEM sites in the same commit; no backcompat shim.
   **Test**: existing pubsub tests green; grep for `api.scopes` returns zero.
7. **Phase 7 — Timeout UX.** Wire notification publish on
   `LAZY_LOAD_TIMEOUT_MS` expiry and on `loading→crashed` during
   load-on-dispatch. **Test**: dispatcher to a tile whose preload throws
   → user-visible notification within 10s.
8. **Phase 8 — IPC chokepoint + bypass detectors.** Introduce
   `tile-ipc-gate.ts` with `registerTileIpc(channel, handler,
   descriptor)`. Migrate every existing `tile:*` `ipcMain.on` through
   the gate. Gate runs the six-step validation sequence (§The IPC
   chokepoint). Add lint rules + dev-mode wraps for the two bypass
   categories (direct `webContents.send('pubsub:...')`, off-path
   `new BrowserWindow` for tile URLs). Wire gate rejections to the
   `tile:drift` topic with structured reasons so CI can fail on any
   drift event.
   **Test**: regression suite stays green; a deliberately-seeded
   bypass in a test fixture trips the detector; every `tile:*`
   channel has at least one passing + one rejecting gate test.

Each phase is independently shippable and each has a narrow failure mode.

## Test plan

- **Pure FSM unit tests**: state × operation matrix — each legal op
  returns `allow`, each illegal op returns `violation`.
- **Integration**: cold boot → cmd panel has all declared commands before
  any tile window exists (manifest-cache path, already scoped by
  tile-lifecycle-fsm.md).
- **Integration**: cold boot → dispatch cmd to `registered` tile, assert
  load triggers, handler runs, result topic reaches dispatcher. No
  spinner-hang.
- **Integration**: dispatch to tile whose renderer throws during init →
  notification appears, panel returns to IDLE within 10s.
- **Regression**: the 2026-04-23 repro
  `tests/desktop/cmd-execute-twice.spec.ts` still passes. Add companion
  that fires 5x in a row from the panel UI (per tasks.md item).
- **Regression**: full desktop suite stays green after each phase.

## References

- [tile-lifecycle-fsm.md](tile-lifecycle-fsm.md) — tile state machine;
  this doc layers on top.
- `backend/electron/pubsub.ts` — the hub.
- `backend/electron/pubsub-fsm.ts` (new, Phase 8) — pure authorization
  matrix.
- `backend/electron/tile-ipc.ts` — individual `tile:*` channel handlers.
- `backend/electron/tile-ipc-gate.ts` (new, Phase 8) — single IPC
  chokepoint.
- `backend/electron/tile-drift.ts` (new, Phase 8) — bypass detectors +
  `tile:drift` topic owner.
- `backend/electron/tile-preload.cts` — renderer-side publish/subscribe
  API + command registration.
- `backend/electron/main.ts:216` — broadcaster registration site
  (`setExtensionBroadcaster` → renamed to `setPubsubBroadcaster` in
  Phase 1; this is the bug site for bgWindow exclusion).
- `backend/electron/tile-lazy.ts` — pre-publish hook for
  load-on-dispatch.
- `app/background.html`, `app/index.js`, `app/cmd/background.js` — core
  renderer entry points + subscribers.
- `docs/tasks.md` Current-Priority section — bugs this machine makes
  impossible.
+12
docs/tasks.md
···

---

## Current priority (drop everything)

- [ ] **Message-passing / pubsub state machine.** Cmd dispatch + pubsub + tile lifecycle have too many intersecting paths; Phase E/F surfaced three different manifestations of the same class of bug (see below). Write a formal state machine describing every legal transition: who can publish what topic from what source in what tile-state, who can subscribe to what, how results return, how lazy tiles mount. Runtime enforcement + bisectable tests derive from the machine. Goal: the "lost publish somewhere in the fabric" class of bug becomes impossible — if a message is dropped, the machine tells us exactly which transition was invalid. Explicit user directive 2026-04-23: "we're going to drop everything and write a state machine for the message passing and pubsub, so never have to worry about this again." Pre-requisite for tasks below.

- [ ] **Root-cause: only hello-world commands visible in cmd panel.** After Phases E/F, only hello-world's commands (`hello`, "hello world trace") appear in the cmd panel. Every other lazy tile's commands — tag, untag, kagi, google, ddg, lists, peeks, slides, and dozens more — are missing. Agent hypothesis: Phase E's `registerLazyTile()` publishes `cmd:register-batch` at boot before the cmd panel's core/background subscription lands. Do NOT band-aid; fix this only after the state machine is in place so the fix comes from the right abstraction. Manual test case: `yarn start`, open cmd panel, type a few characters; many matches should appear, but currently only hello-world's commands do.

- [ ] **Consolidate command-result paths into one.** Two paths exist today: (A) `tile:command:result` IPC from tile-preload's `api.commands.register` wrapper → main-process unrestricted `publish()`; (B) handlers that manually `tile:pubsub:publish` to `cmd:execute:X:result:...`. This duality is why the Phase E bug hid for so long — pre-Phase-E tests only exercised path (A), which bypassed the capability allowlist; path (B) was silently rejected until the 2026-04-23 fix (`tolwnovr f32063db`). Collapse to one (probably A). Pre-req: state machine task above.

- [ ] **UI-level tests for cmd-panel repeat invocation.** The 2026-04-23 repro (`tests/desktop/cmd-execute-twice.spec.ts`) fires on the pubsub bus directly and passes 3/3. Manual testing caught a 3rd-invocation stall that the bus test misses — the difference must be panel UI state (`state.executionTimeout`, `urlSearchTimer`, subscription lifetime in `ensureCmdRequestRegistersListener`) leaking across repeats. Add Playwright tests that drive Cmd+K → type → Enter 3-5x in a row and assert that the panel closes and no spinner remains.

---

## Tile architecture cleanup (post-conversion)

- [ ] **Merge websearch's separate background tile into the home window.** Manifest has two tile entries: `background.html` (lazy:true) for settings/engine state, `home.html` for the UI. They communicate via pubsub round-trips (`websearch:engine-request` → `websearch:engines-list`) across `peek://ext/websearch/background.html` vs `peek://websearch/home.html`. Cross-window pubsub within one feature is fragile (see 2026-04-20 session — broadcaster echo-prevention bug + cluster 3 regression risk). The round-trip also blocks 3 websearch tests from passing. With the single-file tiles model shipping (`resident: true`), websearch can collapse to one tile whose home window owns both UI and engine state directly. No IPC needed. This also applies to any feature tile with a bg + window pair that pubsubs between itself — audit other candidates.