# Convey error handling — Wave 2 design

See `scratch/recon-convey-error-wave2.md` for the call-site inventory and backend contracts. This doc records the migration choices you will implement.

## Scope (recap)

Wave 2 migrates 17 Tier-2 sites plus 1 prerequisite scaffold upgrade across 8 app files.

D9 Sol updated-days is deferred. Do not touch that client path in Wave 2. See Out-of-scope follow-ups.

## Conventions

- Keep `window.logError(err, { context: ... })` alongside every owner-visible surface. Logging and UI are separate requirements.
- Do not add retry buttons. Owner action is reload unless the site already has its own mutation button.
- Reuse existing Wave 1 helpers in-place. Do not add overlapping helpers for home, settings, or speakers.
- Use `window.apiJson` for every migrated HTTP path. Raw `fetch(...).then(r => r.json())` leaves non-2xx and malformed JSON silent.
- Use `window.SurfaceState.errorCard(...)` for first-paint section failures and `surface-state-refresh-error` only when stale content is intentionally preserved.
- Use `window.appEvents.listen(..., options, handler)` for websocket timeout/drop handling. The overload contract is at `convey/static/websocket.js:300-322`.
- **UI patience, not backend SLA.** Client-side timeouts on long-running async work (websocket listeners, polling loops) are owner-patience thresholds: the point at which the UI stops claiming "still working" and tells the owner to reload. They are not statements about how long the backend should take. Backend work may continue past the UI timeout; the client simply stops lying about its state. This framing is reusable for Wave 3 background-task health.

## Site-by-site decisions

### D1 — Home refreshVitals + refreshNarrative

Targets: `apps/home/workspace.html:1582-1630`

Use `window.apiJson('/app/home/api/pulse')` in both refreshers. Keep `malformedHomeResponse()` for 200-shape failures instead of inventing a second parse helper.

For `refreshVitals`, preserve the existing `#pulse-vitals` content on failure and append a singleton refresh-error sibling after `#pulse-vitals`. Copy: `Couldn't refresh vitals — showing last known state.` Keep the current render path untouched on success.

For `refreshNarrative`, preserve the existing `#pulse-narrative` content on failure and append a singleton refresh-error sibling after `#pulse-narrative`. Copy: `Couldn't refresh narrative — showing last known state.` Do not blank the rendered markdown on refresh failure.

These are refresh-only paths. There is already first-paint content in the HTML, so this is a stale-preserving migration, not a loading-scaffold migration.
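
The "singleton refresh-error sibling" rule above can be sketched in plain DOM terms. This is a minimal illustration, not the shipped helper: `showRefreshError`, `clearRefreshError`, and the `role="status"` attribute are assumptions introduced here, and the real code may route through `SurfaceState` instead.

```javascript
// Sketch only: helper names and role="status" are hypothetical, not app API.
const REFRESH_ERROR_CLASS = 'surface-state-refresh-error';

// Append (or reuse) exactly one refresh-error sibling directly after `anchor`.
function showRefreshError(anchor, copy) {
  let slot = anchor.nextElementSibling;
  if (!slot || !slot.classList.contains(REFRESH_ERROR_CLASS)) {
    slot = anchor.ownerDocument.createElement('div');
    slot.className = REFRESH_ERROR_CLASS;
    slot.setAttribute('role', 'status');
    anchor.insertAdjacentElement('afterend', slot);
  }
  slot.textContent = copy; // stale content inside `anchor` is left untouched
  return slot;
}

// Remove the sibling again after a later successful refresh.
function clearRefreshError(anchor) {
  const slot = anchor.nextElementSibling;
  if (slot && slot.classList.contains(REFRESH_ERROR_CLASS)) {
    slot.remove();
  }
}
```

Calling `showRefreshError()` twice for the same anchor reuses the existing sibling, so repeated refresh failures do not stack banners.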

### D2 — Link refreshStatus + refreshDevices

Targets: `apps/link/workspace.html:171-199`

Add a stable container id to the existing status row: use `#link-status-panel` on the current `.link-status-row`. `refreshStatus()` should call `window.apiJson('/app/link/api/status', ...)`, shape-validate `typeof data.enrolled === 'boolean'`, and throw `new window.ApiError({ cause: 'parse', status: 200, serverMessage: 'Unexpected status shape' })` on mismatch.

On `refreshStatus()` failure, do not call `setStatus(...)`. Leave the last visible status text and LAN nudge state unchanged. Render a refresh-error sibling after `#link-status-panel` with copy `Can't reach pairing service — reload to try again.` This is PD9.

`refreshDevices()` should also move to `window.apiJson('/app/link/api/devices', ...)`.

Use a split surface for devices:

- First failure before the first successful device render: replace `#link-devices-list` contents with `SurfaceState.errorCard({ heading: "Couldn't load paired devices", desc: "Reload to try again.", serverMessage: err.serverMessage })`.
- Later refresh failure after at least one successful render: leave the stale device list visible and append a `surface-state-refresh-error` sibling after `#link-devices-list` with copy `Couldn't refresh paired devices — showing last known state.`

Track first-success with a small local boolean. Do not treat transport failure as `offline`.
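
The shape check for `refreshStatus()` and the first-success split for `refreshDevices()` could look roughly like this. The `ApiError` class below is a stand-in so the sketch runs outside the app (the real `window.ApiError` may carry more fields), and `assertStatusShape` and `devicesFailureSurface` are hypothetical names.

```javascript
// Stand-in for window.ApiError; only the three fields named in this doc.
class ApiError extends Error {
  constructor({ cause, status, serverMessage }) {
    super(serverMessage);
    this.name = 'ApiError';
    this.cause = cause;
    this.status = status;
    this.serverMessage = serverMessage;
  }
}

// Reject a 200 response whose body does not match the expected status shape.
function assertStatusShape(data) {
  if (!data || typeof data.enrolled !== 'boolean') {
    throw new ApiError({
      cause: 'parse',
      status: 200,
      serverMessage: 'Unexpected status shape',
    });
  }
  return data;
}

// Pick the devices failure surface from the local first-success boolean.
function devicesFailureSurface(hasRenderedDevices) {
  return hasRenderedDevices ? 'refresh-error' : 'error-card';
}
```

Throwing a parse-shaped error from the validator lets a single `catch` handle transport failures and malformed 200 bodies the same way.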

### D3 — Chat listener

Target: `apps/chat/workspace.html:60-167`

Keep the live listener only for `today`, but switch it to the options overload:

- `schema: ['kind']`
- `correlationKey: 'use_id'`
- `timeout: 3 * 60 * 1000`
- `onDrop: window.logError`
- `onTimeout: handleChatTalentTimeout`

PD3 applies. Do not implement an owner-reply watchdog in this wave.

Only track spawned talent cards. Call `cleanup.pending.track(useId)` inside `appendEventFromLive()` when `kind === 'talent_spawned'` and `use_id` exists. Do not track `owner_message`, `sol_message`, or `reflection_ready`.

Add a stalled visual state for talent cards:

- New variant class: `chat-talent-card--stalled`
- New status value: `data-talent-status="stalled"`
- New copy: `Talent stopped responding — reload to retry`

On timeout, find the matching active talent card by `data-talent-use-id`, convert it in place to the stalled variant, and leave the rest of the transcript unchanged. If the card is already finished or errored, the timeout callback is a no-op.
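
A minimal sketch of the timeout handler described above. Several details are assumptions rather than the confirmed contract: that `onTimeout` receives the bare `use_id`, that terminal cards carry `data-talent-status` values of `finished` or `errored`, and that the card has a `.chat-talent-card-status` child for the copy.

```javascript
// Sketch: convert a still-active talent card to the stalled variant in place.
const STALLED_COPY = 'Talent stopped responding — reload to retry';
// Assumed terminal status values; 'stalled' is included so repeats are no-ops.
const TERMINAL_STATUSES = new Set(['finished', 'errored', 'stalled']);

function handleChatTalentTimeout(useId, root = document) {
  // Match on the dataset value instead of interpolating useId into a selector.
  const card = Array.from(root.querySelectorAll('[data-talent-use-id]'))
    .find((el) => el.dataset.talentUseId === useId);
  if (!card) return false;
  if (TERMINAL_STATUSES.has(card.dataset.talentStatus)) return false; // no-op

  card.dataset.talentStatus = 'stalled';
  card.classList.add('chat-talent-card--stalled');
  // Hypothetical child element that holds the visible status text.
  const status = card.querySelector('.chat-talent-card-status');
  if (status) status.textContent = STALLED_COPY;
  return true;
}
```

The boolean return value is only there to make the no-op cases observable; the listener itself does not need it.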

### D4 — Entities cortex listener

Target: `apps/entities/workspace.html:3216-3312`

Switch `window.appEvents.listen('cortex', ...)` to the options overload with:

- `schema: ['event', 'use_id']`
- `timeout: 2 * 60 * 1000`
- `correlationKey: 'use_id'`
- `onDrop: window.logError`
- `onTimeout: handleEntitiesTimeout`

Use 2 minutes as the UI stall threshold. There is no shorter backend contract in `apps/entities/routes.py:498-574`, and cortex still allows much longer work by default (`think/cortex.py:335-339` defaults to 10 minutes). Two minutes is a UI patience threshold for these owner-triggered, single-agent actions, not a backend SLA.

Track ids in both paths:

- `submitEntityAssist()` tracks the real `use_id` immediately after the POST succeeds and the temp id is remapped.
- `listenForAgentCompletion()` tracks the `agentId` when description generation starts.

Timeout behavior must fail both maps cleanly and only once:

- If `pendingEntities.has(useId)`, call the existing failure path with copy `Entity assist timed out — reload to retry`.
- If `pendingAgentCallbacks.has(useId)`, remove the callback entry first, then invoke the callback with a timeout-shaped failure so the textarea/button unlock and the description inline error renders.

Ignore late websocket `finish`/`error` events after timeout by deleting the pending entry before surfacing the timeout.
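
The "fail both maps cleanly and only once" rule can be sketched as follows. The maps are passed in so the ordering is easy to see; `failEntity` stands in for the existing failure path, and the `{ ok: false, error: 'timeout' }` callback argument is a hypothetical timeout-shaped failure, not the real payload.

```javascript
// Sketch: resolve a timed-out use_id against both pending maps exactly once.
const ENTITY_TIMEOUT_COPY = 'Entity assist timed out — reload to retry';

function handleEntitiesTimeout(useId, { pendingEntities, pendingAgentCallbacks, failEntity }) {
  if (pendingEntities.has(useId)) {
    const entry = pendingEntities.get(useId);
    // Delete before surfacing so a late finish/error event finds nothing.
    pendingEntities.delete(useId);
    failEntity(entry, ENTITY_TIMEOUT_COPY);
  }
  if (pendingAgentCallbacks.has(useId)) {
    const callback = pendingAgentCallbacks.get(useId);
    pendingAgentCallbacks.delete(useId);
    // Hypothetical failure shape; the callback unlocks the textarea/button.
    callback({ ok: false, error: 'timeout' });
  }
}
```

Because each entry is removed before its failure is surfaced, a second timeout or a late websocket event for the same id falls through both `has()` checks and does nothing.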

### D5 — Settings saveField

Targets: `apps/settings/workspace.html:3270-3512` plus all `[data-section][data-key]` autosave controls

Keep `saveField()` as the debounced orchestrator. Do not move debounce into `saveControl()`.

PD5: when `el` is null, or when the caller is the auto-detected `identity.timezone` path, bypass `saveControl()` and call `window.apiJson('api/config', ...)` directly inside the existing debounce. On failure, log with `window.logError(...)`. There is no UI control to revert in that path.

For normal controls, migrate `saveField()` to `prepareFieldErrorHost(el)` plus `window.saveControl(...)`.

PD7 sub-choice: use a local last-known-good snapshot per element. This is required by `saveControl()` internals:

- `convey/static/api.js:208-213` reads `readValue` before the request and uses that as `previousValue`.
- `convey/static/api.js:220` writes `savedControlValues` from the live DOM value after success.
- `convey/static/api.js:227-228` writes `previousValue` back on failure.

That ordering means:

- default `getInitialControlValue()` is wrong for JS-populated settings fields because it falls back to `defaultValue/defaultChecked` (`convey/static/api.js:98-113`)
- a per-call snapshot of the changed DOM value would also be wrong, because `previousValue` is captured before the fetch starts

Design:

- Seed `el.__lastKnownValue` for every `saveField`-managed control during `populateFields()`, after the DOM has been filled from config.
- Pass `readValue: () => el.__lastKnownValue` for all non-env controls.
- Update `el.__lastKnownValue` in `onSuccess` to the control’s current logical value.
- Keep the default `writeValue`. The field types in scope are text, textarea, select, checkbox, and password, so the stock writer is enough.

PD6 env-key nuance (revised): **do not route env fields through `saveControl()` at all.** Reason: pre-clearing the field before `saveControl()` starts is a UX regression, because on failure the user's typed secret is lost and they must retype it. Post-success clear is current UX; a pre-save clear is new behavior. Instead, env fields use a direct path:

- `prepareFieldErrorHost(el)` to seed the `<small>` host
- `window.apiJson('api/config', { method: 'PUT', ... })` with the debounced `value`
- On success: clear `el.value = ''` (preserve current post-success UX), run existing key-validation and env-status refresh, call `showFieldStatus(el, 'saved')`
- On failure: render the inline error directly into the `<small>` host using the same `[data-control-save-error]` markup shape that `saveControl()` produces (copy exactly so `.settings-field small .control-save-error` CSS applies), keep the typed secret in `el.value`, call `window.logError(...)`
- No weakmap interaction anywhere

This sidesteps the secret-caching problem entirely while preserving retry-without-retype for env saves. Non-env controls still use the `saveControl()` path per PD7.

Error behavior:

- Success path still calls `showFieldStatus(el, 'saved')`
- Failure path relies on `saveControl()` to render the field-local error in the prepared `<small>`
- Every catch still logs via `window.logError`
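
The PD7 last-known-good snapshot for non-env controls can be sketched as an options fragment. `logicalValue`, `seedLastKnownValue`, and `snapshotOptions` are hypothetical names; only `readValue`, `onSuccess`, and `el.__lastKnownValue` come from the design above, and the fragment would be merged with whatever else the real `saveControl()` call needs.

```javascript
// Read the control's logical value: checked state for checkboxes, else value.
function logicalValue(el) {
  return el.type === 'checkbox' ? el.checked : el.value;
}

// Called from populateFields() after the DOM has been filled from config.
function seedLastKnownValue(el) {
  el.__lastKnownValue = logicalValue(el);
}

// Options fragment for a non-env control.
function snapshotOptions(el) {
  return {
    // saveControl() captures this before the request and restores it on failure.
    readValue: () => el.__lastKnownValue,
    // Only a confirmed save advances the snapshot.
    onSuccess: () => { el.__lastKnownValue = logicalValue(el); },
  };
}
```

With this shape, a failed save reverts the control to the last value the server accepted, not to the HTML default and not to the value the owner just typed.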

### D6 — Entities confirmEntityDelete

Targets: `apps/entities/workspace.html:3436-3465`

Drop the optimistic delete. The order becomes:

1. Disable the confirm button.
2. Call `window.apiJson(...)` on the DELETE route.
3. On success, clear `detected-action-error`, close the modal, then call `removeEntityFromUI(entityName)`.
4. On failure, re-enable the confirm button, close the modal, and show `showInlineError('detected-action-error', err.serverMessage || "Couldn't delete entity")`.

Do not fall back to `loadEntities()` as the primary repair path. Preserve the current local success path by keeping `removeEntityFromUI()` as the post-success side effect.
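
The four-step order above can be sketched with every dependency injected, which keeps the sequencing visible and lets the function run outside the page. The `deps` object, the `url` parameter, and `clearInlineError` are illustrative assumptions; the real function reads these from the workspace scope.

```javascript
// Sketch: non-optimistic delete. The UI only changes after the server answers.
async function confirmEntityDelete(entityName, url, deps) {
  const {
    apiJson, logError, button, closeModal,
    clearInlineError, showInlineError, removeEntityFromUI,
  } = deps;

  button.disabled = true;                              // 1. block double submits
  try {
    await apiJson(url, { method: 'DELETE' });          // 2. wait for the server
    clearInlineError('detected-action-error');         // 3. clear, close, remove
    closeModal();
    removeEntityFromUI(entityName);
    return true;
  } catch (err) {
    logError(err, { context: 'confirmEntityDelete' }); // logging stays separate
    button.disabled = false;                           // 4. unlock, close, explain
    closeModal();
    showInlineError(
      'detected-action-error',
      err.serverMessage || "Couldn't delete entity",
    );
    return false;
  }
}
```

Because `removeEntityFromUI()` runs only after the request resolves, a failed delete leaves the entity visible and nothing has to be repaired with `loadEntities()`.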

### D7 — Entities loadEntities + loadJournalEntities

Targets: `apps/entities/workspace.html:1258-1261`, `2636-2703`

PD4 applies. Upgrade the loading scaffold first so `SurfaceState.replaceLoading()` can replace in place.

Before:

```html
<div id="entities-loading" class="entities-loading">
  <div class="spinner"></div>
  <p>Loading entities...</p>
</div>
```

After:

```html
<div id="entities-loading" class="entities-loading">
  <div class="surface-state surface-state--loading" role="status" aria-busy="true">
    <div class="surface-state-spinner" aria-hidden="true"></div>
    <span class="surface-state-text" data-role="loading-status">Loading entities...</span>
  </div>
</div>
```

Keep `#entities-loading` as the outer id so callers do not change. The important change is the inner `.surface-state--loading` child, because `replaceLoading()` only replaces in place when that marker exists (`convey/static/app.js:1606-1635`).

Then migrate both loaders to `window.apiJson(...)` and replace the catch-time scaffold pollution with:

- `window.logError(...)`
- `window.SurfaceState.replaceLoading('entities-loading', window.SurfaceState.errorCard({ heading: "Couldn't load entities", desc: "Reload to try again.", serverMessage: err.serverMessage }))`

Use the same pattern for `loadJournalEntities()` with journal-specific copy only if it materially improves clarity. Otherwise keep one consistent `Couldn't load entities` surface.

### D8 — Settings four loaders

Targets:

- `loadTranscribeBackends()` at `apps/settings/workspace.html:4531-4554`
- `loadObserve()` at `4675-4683`
- `loadSync()` at `4760-4768`
- `loadStorage()` at `5800-5808`

These sections do not have loading scaffolds, so `replaceLoading()` is the wrong primitive. Add one top-of-section status slot per section, immediately under the section description:

- `#transcriptionLoadState` inside `#section-transcription`
- `#observerLoadState` inside `#section-observer`
- `#syncLoadState` inside `#section-sync`
- `#storageLoadState` inside `#section-storage`

Use `SurfaceState.errorCard(...)` in those slots.

Decisions by loader:

- `loadTranscribeBackends()`: targeted transcription-only surface. Render `Couldn't load transcription backends` in `#transcriptionLoadState`, disable the backend selector until success, and do not let this failure abort `loadConfig()`. `loadConfig()` should still populate the rest of the settings page.
- `loadObserve()`: render `Couldn't load observer settings` in `#observerLoadState` and disable `#field-tmux-enabled` plus `#field-tmux-capture-interval` until success.
- `loadSync()`: render `Couldn't load sync settings` in `#syncLoadState` and disable the Plaud, Granola, and Obsidian toggles until success.
- `loadStorage()`: render `Couldn't load storage settings` in `#storageLoadState`. On first paint, disable retention controls until success. On later refreshes after cleanup actions, keep stale storage numbers visible and use the same slot for refresh-error copy instead of blanking the section.

All four loaders still log failures. Clear the section slot on the next successful load.
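
The shared slot pattern for the four loaders can be sketched as one small helper. `applySectionLoadState` and the `outcome` object are hypothetical, and `renderError` stands in for `SurfaceState.errorCard(...)`, which is assumed here to return a DOM node (the sketch would need adjusting if it returns markup).

```javascript
// Sketch: apply a loader outcome to one settings section.
// slot:     the top-of-section status element, e.g. #syncLoadState
// controls: the inputs that must stay disabled until a load succeeds
function applySectionLoadState(slot, controls, outcome, renderError) {
  slot.replaceChildren(); // clear any previous error before applying the new state
  if (outcome.ok) {
    controls.forEach((control) => { control.disabled = false; });
    return;
  }
  controls.forEach((control) => { control.disabled = true; });
  slot.appendChild(renderError({
    heading: outcome.heading, // e.g. "Couldn't load sync settings"
    desc: 'Reload to try again.',
    serverMessage: outcome.error ? outcome.error.serverMessage : undefined,
  }));
}
```

Clearing the slot at the top of every call covers the "clear the section slot on the next successful load" rule without a separate code path. The `loadStorage()` refresh case, which keeps stale numbers visible, would need its own copy and is not covered by this sketch.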

### D9 — Sol updated-days — DEFERRED

Target: `apps/sol/workspace.html:1363-1385`

PD1 applies. Do not change this client path in Wave 2.

Reason: `apps/sol/routes.py:637-644` converts backend failure into `[]`, so the client cannot distinguish empty from error. Leave the existing `.catch(() => { banner.style.display = 'none'; })` unchanged.

Flag the backend-contract follow-up in Out-of-scope follow-ups and in implementation-stage gate output.

### D10 — Sol loadIdentity

Target: `apps/sol/workspace.html:1197-1231`

PD2 applies.

Move to `window.apiJson('/app/sol/api/identity')`. Remove the dead `if (data.error)` branch, because the route returns real `500 + {error: ...}` on failure and has no 200-side disabled contract.

On failure, replace `#sol-identity` contents with:

- heading: `Couldn't load identity`
- desc: `Reload to try again.`
- server message: `err.serverMessage`

Do not hide the section anymore. A missing identity card should stop meaning “backend failed silently.”

### D11 — Health retry-import

Target: `apps/health/workspace.html:1739-1759`

PD8 applies.

Move the click handler to `window.apiJson('/app/health/api/retry-import', ...)`. Remove the `data.status === 'not_implemented'` branch entirely.

Use the existing row-local importer card as the surface:

- while pending, keep the current button-disabled `Retrying...` state
- on success, clear the row error text and show `Retry sent`
- on failure, restore the button label, re-enable the button, and write `err.serverMessage || 'Retry failed'` into that card’s `.activity-card-error`

This keeps the failure local to the affected import row and lets the server’s 501 message surface unchanged.

## Additional sites

### Speakers checkOwnerStatus + submitOwnerChoice

Targets: `apps/speakers/workspace.html:1259-1364`

Use `window.apiJson(...)` for the owner-status GET, the nested owner-detect POST, and both confirm/reject POSTs.

Keep `resolveSpeakerError()` as the message normalizer. Do not add a second speakers error helper.

Surface choices:

- `checkOwnerStatus()` top-level failure: render an owner-banner error state instead of hiding both banners. Copy baseline: `Couldn't load owner status`.
- nested detect failure: clear `ownerDetectionInFlight`, keep the owner area visible, and render `Couldn't analyze voice patterns` through the same banner surface.
- `submitOwnerChoice()` failure: re-enable the buttons and render the normalized server message in the owner banner.

On success, preserve the existing `checkOwnerStatus()` refresh behavior.

### Speakers loadReview

Target: `apps/speakers/workspace.html:1725-1752`

PD10 applies.

Keep first-paint and refresh separate:

- First-paint path: when `#spkSentences` still contains the `Loading...` placeholder and no `.spk-sentence`, replace that area with `SurfaceState.errorCard({ heading: "Couldn't load review", desc: "Reload to try again.", serverMessage: err.serverMessage })`.
- Refresh path: when `.spk-sentence` rows already exist, preserve them and render one refresh-error slot at the top of the review panel with copy `Couldn't reload review — showing last known state.`

Because there is no existing top-of-panel status slot, add one in the detail render above `#spkSentences`: `#spkReviewStatus`. Clear it on the next successful `loadReview()`.
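
The first-paint versus refresh split for `loadReview()` comes down to inspecting what `#spkSentences` currently shows. A minimal sketch, where `reviewFailureMode` is a hypothetical name and the fallback for "no rows and no placeholder" is an assumption, since the design does not specify that case:

```javascript
// Sketch: classify a loadReview() failure from the current container state.
function reviewFailureMode(container) {
  // Rendered rows mean there is stale content worth preserving.
  if (container.querySelector('.spk-sentence') !== null) {
    return 'refresh';
  }
  // The literal placeholder with no rows means nothing has painted yet.
  if (container.textContent.includes('Loading...')) {
    return 'first-paint';
  }
  // Unspecified case (assumption): prefer the non-destructive refresh surface.
  return 'refresh';
}
```

A caller would render the error card into the container for `'first-paint'` and write the refresh-error copy into `#spkReviewStatus` for `'refresh'`.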

### Speakers loadUntilFound

Target: `apps/speakers/workspace.html:1461-1474`

Move to `window.apiJson(...)`. Add a catch. Do not let the recursive hash-hunt die as an unhandled rejection.

On failure:

- log the error
- stop recursion
- preserve the current segment list
- render a segment-list refresh error in a new lightweight slot above the list, not a global modal

This is a refresh/deep-link recovery failure, not a first-paint full-panel failure.

### Entities generate-description fetch

Target: `apps/entities/workspace.html:2416-2444`

Move the kickoff POST to `window.apiJson(...)`. Validate that the response includes a usable `use_id`; otherwise throw a parse-style `ApiError`.

On kickoff failure:

- re-enable the textarea
- remove `.generating`
- call `showInlineError('description-save-error', err.serverMessage || "Couldn't start description generation")`

On websocket timeout/error from D4, route the failure through the same inline surface so the owner sees one consistent description-generation error path.

### Entities previewEntityDelete

Target: `apps/entities/workspace.html:3402-3434`

Move the preview GET to `window.apiJson(...)`. Keep `showInlineError('detected-action-error', ...)` as the local surface.

Preserve the existing modal population flow on success. On failure, keep the modal closed and show the server message inline. Do not add a second delete-preview helper.

## Implementation order

1. Entities scaffold upgrade + `loadEntities()` + `loadJournalEntities()`
2. Entities `previewEntityDelete()` + `confirmEntityDelete()` + generate-description kickoff
3. Entities cortex listener timeout/drop handling
4. Settings `saveField()`
5. Settings four loaders
6. Speakers `checkOwnerStatus()` + `submitOwnerChoice()`
7. Speakers `loadReview()` + `loadUntilFound()`
8. Home `refreshVitals()` + `refreshNarrative()`
9. Link `refreshStatus()` + `refreshDevices()`
10. Sol `loadIdentity()`
11. Health retry-import
12. Chat websocket listener
13. Audit and finalize

## Decision log table

| Decision | Choice | Status | Rationale |
| --- | --- | --- | --- |
| PD1 / D9 | Defer Sol updated-days | Shipped 2026-04-23 | Backend collapses failure into `[]`; client cannot distinguish empty from error |
| PD2 / D10 | `apiJson` + identity `errorCard`; remove `data.error` branch | Shipped 2026-04-23 | Route only has success or real 500 failure |
| PD3 / D3 | Track only `talent_spawned`; 3-minute chat stall state | Shipped 2026-04-23 | `owner_message` has no `use_id`; owner watchdog is separate work |
| D4 timeout | 2-minute entities UI stall threshold | Shipped 2026-04-23 | Interactive entity actions should not hang forever; backend still has a longer timeout |
| PD4 / D7 | Retrofit `#entities-loading` with `.surface-state--loading` child | Shipped 2026-04-23 | `replaceLoading()` only replaces in place when that marker exists |
| PD5 / D5 | Keep `saveField()` debounce; null-el timezone uses direct `apiJson` | Shipped 2026-04-23 | `saveControl()` requires an element |
| PD6 / D5 | Env fields bypass `saveControl()`; use `apiJson` + `prepareFieldErrorHost` directly | Shipped 2026-04-23 | Avoids weakmap secret caching and preserves retry-without-retype UX on failure |
| PD7 / D5 | Per-element `__lastKnownValue` snapshots | Shipped 2026-04-23 | Default snapshot is wrong for JS-populated controls |
| PD8 / D11 | Treat 501 like any other error and surface server message | Shipped 2026-04-23 | Simpler and consistent; backend message already has the right copy |
| PD9 / D2 | Preserve stale link status; do not synthesize `offline` on transport failure | Shipped 2026-04-23 | `offline` is not the same as “pairing service failed” |
| PD10 / speakers | Split `loadReview()` by first paint vs refresh | Shipped 2026-04-23 | First-paint can be replaced; mutation refresh should preserve current sentences |
| PD11 / all | Keep `logError` with every visible surface | Shipped 2026-04-23 | Telemetry and owner-facing state serve different purposes |

## As-implemented deviations

- Bundle 4: `saveField()` uses a local `saveConfigValue()` wrapper and passes `fetchArgs` as a function so `saveControl()` treats `200 { success: false, error: ... }` config saves as failures instead of false positives.
- Bundle 4: the request body stays on the live route contract, `{ section, data: { [key]: value } }`, rather than the speculative `{ section, key, value, runtime_env }` shape from the early draft.
- Bundle 7: first-paint vs refresh detection keys off the actual rendered `.spk-sentence` rows plus the literal `Loading...` placeholder; the design draft’s `.spk-sentence-row` selector does not exist in the template.
- Bundle 7: `loadReview()` still converts legacy `200 { error: ... }` payloads into thrown errors so the same first-paint / refresh error surfaces handle both transport failures and envelope failures.
- Bundle 8: home adds a local `clearPulseRefreshError()` helper because `SurfaceState.replaceLoading()` appends refresh-error siblings but does not remove them after a later successful refresh.
- Bundle 11: no behavioral deviation from the approved design; the importer retry handler kept the existing delegated click-handler shape.
- Bundle 12: chat uses a local `<style>` block in `apps/chat/workspace.html` for the stalled-card visuals and normalizes the server-rendered finished-card `data-talent-status` in `apps/chat/_chat_event.html` so the live and initial DOM states match.

## Out-of-scope follow-ups

- D9 Sol updated-days banner. Backend must return a detectable error envelope or non-2xx status before the client can migrate.
- Chat owner-reply watchdog. `owner_message` has no `use_id`, so this needs a separate correlation design.
- Any `convey/static/api.js` primitive expansion. Wave 2 must stay within existing primitives.
- Shared helper consolidation across app files. Reuse local helpers now; consolidate later if the pattern repeats again.

## Open questions needing Jer's confirmation (flag these at gate time)

1. D5 env-field UX (resolved by senior): env fields bypass `saveControl()`. Use `apiJson` + `prepareFieldErrorHost` directly. Clear on success; preserve the typed secret on failure so the owner can retry without re-typing. See revised D5 env-key section.
2. D4 timeout constant: 2 minutes is the proposed UI stall threshold for entity assist and description generation. Owner-facing threshold, not a backend SLA. Senior approves; flag to Jer at gate for visibility only.