feat(infra): split docket worker onto its own fly process group (#1359)
* feat(infra): split docket worker onto its own fly process group
before: `relay-api` ran a single process group, with the docket Worker
running inside the uvicorn process. an OOM in any background task (most
acutely `run_track_upload`, which holds full audio bytes in RAM) killed
uvicorn with it, producing the 502s and silent logouts in the
2026-04-30 incident (#1357).
after: two fly process groups.
- `app`: just uvicorn. opens a Docket *client* (no Worker) so request
handlers can enqueue tasks. sized for HTTP fan-out only.
- `worker`: just the docket Worker. dedicated entrypoint at
`backend.worker`. sized for in-memory upload-pipeline work (2GB).
split structure in `_internal/background.py`:
- `docket_client_lifespan()`: opens Docket, registers tasks, no Worker.
used by main.py.
- `docket_worker_lifespan()`: opens Docket, registers tasks, runs
Worker. used by worker.py.
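a minimal sketch of the client/worker split described above, not the code in `_internal/background.py`. `StubDocket` is a hypothetical stand-in for the real docket objects (its list plays the role of Redis), the task body is a placeholder, and the lifespans take arguments here only so the sketch runs without Redis.

```python
import asyncio
from contextlib import asynccontextmanager


class StubDocket:
    """hypothetical stand-in for a docket client; the list plays the role of Redis."""

    def __init__(self):
        self.tasks = {}
        self.queue = []

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        return False

    def register(self, fn):
        self.tasks[fn.__name__] = fn

    def enqueue(self, name, *args):
        self.queue.append((name, args))


async def run_track_upload(track_id):
    # placeholder body; the real task does the upload-pipeline work
    return f"processed {track_id}"


def register_tasks(docket):
    docket.register(run_track_upload)


@asynccontextmanager
async def docket_client_lifespan(docket):
    # app group: open the client and register tasks, never execute them
    async with docket:
        register_tasks(docket)
        yield docket


@asynccontextmanager
async def docket_worker_lifespan(docket, results):
    # worker group: same setup, plus actually executing queued tasks.
    # a real Worker loops until shutdown; this stand-in drains once.
    async with docket:
        register_tasks(docket)
        while docket.queue:
            name, args = docket.queue.pop(0)
            results.append(await docket.tasks[name](*args))
        yield docket


async def demo():
    shared = StubDocket()
    results = []
    async with docket_client_lifespan(shared) as client:
        client.enqueue("run_track_upload", 42)
    queued_after_app = list(shared.queue)  # still waiting: the app ran nothing
    ran_in_app = list(results)
    async with docket_worker_lifespan(shared, results):
        pass
    return queued_after_app, ran_in_app, results
```

in the sketch, the task enqueued under the client lifespan stays queued until a worker lifespan picks it up, which is the property the split relies on.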
`backend/worker.py` is intentionally a thin Python entrypoint rather
than execing the upstream `docket` CLI: `configure_observability()`
must run before any task module imports so logfire spans cover the
worker's full lifecycle.
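the ordering constraint can be sketched as follows. this is an illustration under assumptions, not the contents of `backend/worker.py`: every function here is a hypothetical stub that only records when it ran, and `import_task_modules()` stands in for the deferred imports a real entrypoint would perform.

```python
import asyncio


def configure_observability(order):
    # stub: the real function would set up logfire instrumentation
    order.append("configure_observability")


def import_task_modules(order):
    # stub for importing the task modules; in a real entrypoint these
    # imports would sit inside main(), below the observability call
    order.append("import task modules")


async def run_worker(order):
    # stub for entering the worker lifespan and running until shutdown
    order.append("run worker")


def main():
    order = []
    configure_observability(order)  # must come first
    import_task_modules(order)      # only now touch task modules
    asyncio.run(run_worker(order))
    return order
```

execing the upstream CLI would leave no place to put the first call ahead of the imports, which is the reason given above for keeping a thin entrypoint.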
deploy notes (in PR body): after this lands, ops needs to set machine
counts per group via `fly scale count app=2 worker=1 -a relay-api`.
the toml only declares the groups; counts are managed out-of-band.
structural fixes still tracked in #1357.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* fix(worker): set up notification_service before tasks run
reviewer caught a regression: registered docket tasks
(`tasks/hooks._send_track_notification`,
`tasks/moderation.scan_image_moderation`, and
`_internal/moderation.scan_track_for_copyright` invoked from
`tasks/copyright`) call into `notification_service` for admin DMs.
without `setup()` running on the worker process, the global instance
keeps `recipient_did = None`, so each `send_*_notification` early-returns
None — and `_send_track_notification` then sets
`track.notification_sent = True` regardless, which would have made the
loss permanent and silent on every new upload.
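the failure mode can be reproduced in a few lines. this is a simplified, synchronous sketch: the class and function bodies are hypothetical reductions of the ones named above, and `did:plc:example` is a placeholder value.

```python
from dataclasses import dataclass


@dataclass
class Track:
    title: str
    notification_sent: bool = False


class NotificationService:
    """hypothetical reduction of the global notification service."""

    def __init__(self):
        self.recipient_did = None
        self.sent = []

    def setup(self, recipient_did):
        self.recipient_did = recipient_did

    def send_track_notification(self, track):
        if self.recipient_did is None:
            return None  # early return: nothing is sent, nothing is raised
        self.sent.append((self.recipient_did, track.title))
        return "sent"


def _send_track_notification(service, track):
    service.send_track_notification(track)
    # the flag is set regardless of the result, so a dropped
    # notification is never retried
    track.notification_sent = True


# worker process without setup(): the DM is dropped but the flag flips
unconfigured = NotificationService()
lost = Track("demo track")
_send_track_notification(unconfigured, lost)

# worker process with setup(): the DM goes out
configured = NotificationService()
configured.setup("did:plc:example")
delivered = Track("demo track")
_send_track_notification(configured, delivered)
```

after the first call `lost.notification_sent` is true while `unconfigured.sent` is empty, which is why the loss would have been both silent and permanent.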
`queue_service` and `jam_service` are HTTP-only — they stay in main.py's
lifespan and intentionally don't run on the worker.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(staging): mirror process group split in fly.staging.toml
without this, the staging deploy that follows the merge would still run
the old single-process toml, so the split structure wouldn't actually
be exercised — and worse, the new main.py would only open a docket
client (no Worker), leaving staging background tasks stranded in Redis.
staging sized at 1GB per group (vs prod's app=1GB, worker=2GB) since
staging doesn't carry meaningful upload load.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>