fix(uploads): stage audio + image to shared storage before enqueueing docket (#1336)
* fix(uploads): stage audio + image to shared storage before enqueueing docket
PR #1331 moved POST /tracks/ + PUT /tracks/{id}/audio onto docket
to fix a connection-pool problem, but mechanically forwarded the same
request-handler `/tmp/...` paths over Redis. on production fly.io,
`relay-api` runs multiple machines per process group; the docket worker
frequently lands on a different machine than the request handler. that
machine has its own /tmp, so the upload silently fails:
`FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpXXXX.wav'`.
evidence (prod, 2026-04-25 darkhart.bsky.social, 7 jobs):
4 failed at varied phases (`upload`, `pds_upload`, `atproto`) — all with
the same FileNotFoundError. the 3 that succeeded all hit the same
`atproto` phase. pure luck of which worker grabbed the job. the
successful tracks also had `image_id IS NULL` in `tracks` because
`_save_image_to_storage` reads `image_path` and silently swallows the
exception (returns `(None, None, None)` on failure). that's the
"cover art shows in the player bar but not on the track page" symptom.
shape of the fix:
HTTP handler:
1. stream client upload to a request-local temp file (size limit enforced)
2. extract duration once, while bytes are still local
3. `storage.save(file, filename)` -> audio_file_id
4. stream image to memory, `storage.save` -> image_id, image_url, thumb_url
5. delete request-local temp file
6. enqueue docket task with file_id / image_id / URLs ONLY
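the handler steps above can be sketched roughly as below. this is a minimal illustration, not the project's real API — `SharedStorage`, `stage_and_enqueue`, and the placeholder duration math are all hypothetical stand-ins:

```python
# hypothetical sketch of the handler-side staging flow: spool locally,
# stage to shared storage, then enqueue ids only (never paths).
import os
import tempfile
import uuid


class SharedStorage:
    """In-memory stand-in for the shared (cross-machine) object store."""

    def __init__(self):
        self.blobs = {}

    def save(self, data: bytes, filename: str) -> str:
        file_id = str(uuid.uuid4())
        self.blobs[file_id] = (filename, data)
        return file_id


def stage_and_enqueue(upload_bytes: bytes, storage: SharedStorage, queue: list) -> dict:
    # 1-2. spool to a request-local temp file and read metadata while the
    # bytes are still local to this machine
    fd, tmp_path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(upload_bytes)
        duration_s = len(upload_bytes) / 176_400  # placeholder "duration" calc
        # 3. stage to shared storage -> a durable id any worker machine can read
        with open(tmp_path, "rb") as f:
            audio_file_id = storage.save(f.read(), "upload.wav")
    finally:
        # 5. the local temp file never leaves this request handler
        os.unlink(tmp_path)
    # 6. enqueue ids + metadata only -- no filesystem path crosses the queue
    task = {"audio_file_id": audio_file_id, "duration_s": duration_s}
    queue.append(task)
    return task
```

the point of the sketch is the last step: the task payload is machine-independent, so it no longer matters which worker machine picks it up.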
worker (`run_track_upload`, `run_track_audio_replace`):
- signatures take `audio_file_id`, never a `*_path`
- `_validate_audio` reads duration from the context (no I/O)
- `_store_audio` reuses the staged id directly for web-playable
formats; for lossless, downloads from storage, transcodes via a
worker-local /tmp (single-task, never crosses machine boundary),
saves transcoded result back to storage
- `_upload_to_pds` downloads bytes from storage when not transcoded
- `_store_image` is a no-op forward (URLs already resolved in handler)
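the `_store_audio` reuse-or-transcode decision might look like this sketch. `FakeStorage`, the format set, and the fake transcode are illustrative assumptions, not the real implementation:

```python
# hypothetical sketch: reuse the staged id for web-playable formats;
# for lossless, transcode via a worker-local temp file and re-stage.
import os
import tempfile

WEB_PLAYABLE = {"mp3", "ogg", "m4a"}


class FakeStorage:
    def __init__(self):
        self.blobs = {}
        self._next = 0

    def save(self, data: bytes, filename: str) -> str:
        self._next += 1
        file_id = f"file-{self._next}"
        self.blobs[file_id] = data
        return file_id

    def get_file_data(self, file_id: str) -> bytes:
        return self.blobs[file_id]


def store_audio(audio_file_id: str, fmt: str, storage: FakeStorage) -> str:
    """Return the id of the web-playable file, transcoding only if needed."""
    if fmt in WEB_PLAYABLE:
        # already playable: reuse the staged id directly, zero extra I/O
        return audio_file_id
    # lossless: pull bytes from shared storage, transcode via a worker-local
    # temp file (single task, never crosses a machine boundary), re-stage
    raw = storage.get_file_data(audio_file_id)
    fd, tmp = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        transcoded = raw[: len(raw) // 2]  # stand-in for a real transcode
    finally:
        os.unlink(tmp)
    return storage.save(transcoded, "transcoded.mp3")
```

the worker-local temp file is safe precisely because it is created and deleted inside a single task on a single machine.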
this preserves PR #1331's connection-pool win (handler returns once
storage is durable + docket task is enqueued) and removes the
multi-machine fragility entirely.
- drops aiofiles use on this path; uses `storage.get_file_data`
- removes the temp-file cleanup in `_process_upload_background` —
there's nothing local to clean
- audio_replace handler also captures `support_gate` up front so the
staged bytes land in the right bucket (private vs public) before
the worker sees them
regression coverage:
the structural change (`UploadContext` no longer has `file_path`,
docket task signatures no longer have `*_path` args) is the contract.
existing tests (`test_upload_session_reload`, `test_upload_phases`,
`track_audio_replace/test_pipeline.py`) exercise the orchestrator
end-to-end through the new context shape and pass green (46 tests).
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* fix(uploads): clean up staged storage on handler-side + pre-DB worker aborts
addresses three orphan-cleanup gaps the reviewer flagged on the staging refactor:
1. **handler-side**: any abort between `stage_audio_to_storage` and a
successful schedule call left staged storage objects orphaned and
the job stuck in PROCESSING. wrap staging+enqueue in try/except;
on failure delete staged audio (private if gated, public otherwise)
and image, mark the job FAILED.
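point 1 can be sketched as below. `Store`, `stage_and_schedule`, and the job dict are hypothetical stand-ins for the real handler objects:

```python
# hypothetical sketch of the handler-side cleanup: stage, try to enqueue,
# and on any failure release staged objects + mark the job FAILED.
class Store:
    def __init__(self):
        self.public = {}
        self.private = {}

    def save(self, key, data, gated):
        (self.private if gated else self.public)[key] = data

    def delete(self, key):
        self.public.pop(key, None)

    def delete_gated(self, key):
        self.private.pop(key, None)


def stage_and_schedule(job, store, schedule, gated):
    store.save("audio", b"...", gated)
    store.save("image", b"...", gated=False)
    try:
        schedule(job)  # any abort past staging lands in the except below
    except Exception:
        # release staged objects from the bucket they actually live in
        (store.delete_gated if gated else store.delete)("audio")
        store.delete("image")
        job["status"] = "FAILED"  # never left stuck in PROCESSING
        raise
```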
2. **replace orchestrator**: `new_file_id_for_rollback` was None until
`_store_audio` returned. the gated-FLAC path (handler stages new
bytes to private bucket → `_store_audio` raises "supporter-gated
tracks cannot use lossless formats yet") left those bytes stranded.
initialize from `ctx.audio_file_id` upfront, thread the playable-file
extension through `_rollback_new_files`. add `is_gated: bool`
to ReplaceContext (handler-time decision) so rollback selects the
bucket the bytes ACTUALLY live in even under a concurrent PATCH
that flips support_gate between request and worker.
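point 2 might look like this sketch — `ReplaceContext` and `rollback_new_files` are illustrative names, not the real signatures:

```python
# hypothetical sketch: is_gated is decided at handler time and pinned on
# the context, so rollback targets the bucket the staged bytes actually
# live in even if support_gate flips between request and worker.
from dataclasses import dataclass


@dataclass
class ReplaceContext:
    audio_file_id: str
    is_gated: bool  # handler-time decision, immutable for this job


def rollback_new_files(ctx, store, new_file_id=None):
    # initialized from ctx.audio_file_id up front, not only after
    # _store_audio returns -- covers the early-abort gated-FLAC path
    file_id = new_file_id or ctx.audio_file_id
    bucket = store["private"] if ctx.is_gated else store["public"]
    bucket.pop(file_id, None)
```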
3. **upload orchestrator**: phases 1-5 raise UploadPhaseError without
releasing staged bytes. add `_cleanup_staged_media_pre_db` and a
`db_row_owns_media` boundary flag — orchestrator cleans up only
before `_create_records`, deferring to its existing
reserve-then-publish cleanup past that. covers the transcoded-sibling case.
session-expired path on both workers also deletes the staged bytes
(no recovery without a fresh sign-in; orphans serve nothing).
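point 3's ownership boundary can be sketched as below. `run_phases`, the phase names, and `cleanup_staged` are hypothetical stand-ins for the orchestrator:

```python
# hypothetical sketch: the orchestrator cleans up staged media itself only
# before the DB row exists; past the create_records phase, the existing
# reserve-then-publish cleanup owns the media instead.
def run_phases(phases, ctx, cleanup_staged):
    db_row_owns_media = False
    try:
        for name, phase in phases:
            phase(ctx)
            if name == "create_records":
                db_row_owns_media = True  # ownership boundary crossed
    except Exception:
        if not db_row_owns_media:
            cleanup_staged(ctx)  # pre-DB abort: release staged bytes here
        raise  # post-DB abort: defer to reserve-then-publish cleanup
```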
regression tests:
- `tests/api/test_upload_storage_cleanup.py` (4 tests)
- `track_audio_replace/test_pipeline.py` (1 test):
early-abort rolls back staged file from the right bucket per
`ctx.is_gated`
370/370 tests pass locally; ruff + ty clean.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore: drop stray backend/loq.toml
* chore(uploads): consolidate cleanup helper, drop redundant deferred import
once-over after CI green:
- removed redundant `from backend._internal import get_session` deferred
re-import inside `_process_upload_background` — the symbol is already
imported at module scope. updated `test_upload_session_reload` to
patch where the symbol is used (`backend.api.tracks.uploads.get_session`)
rather than where it's defined, which is the right pattern anyway.
- audio_replace's handler + session-expired path were inlining the
same `delete_gated if gated else delete` pattern that uploads exposes
as `_delete_staged_audio`. import + reuse instead of duplicating.
no behavior change; 370/370 tests pass.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>