observe/transcribe: replace Linux Parakeet NeMo backend with onnx-asr
Swaps the Linux x86_64 Parakeet ASR backend from nemo_toolkit[asr] +
torch (~3GB of CUDA-shaped deps) to onnx-asr + onnxruntime per the
spike report's verdict at vpe/workspace/onnx-asr-spike-260424/. Mac
CoreML arm is unchanged. Auto quantization defaults to FP32 on both
CPU and CUDA EPs as recommended by the spike (median 100x RTFx);
INT8 is opt-in only.
The original scope's "remove base onnxruntime" framing collided with
faster-whisper's transitive onnxruntime dep, so packaging uses a
PEP 508 sys_platform marker on base ORT (Mac-only), with the Linux
ORT exclusively coming from the chosen parakeet-onnx-{cpu,cuda} extra.
For CUDA installs we force-reinstall onnxruntime-gpu after sync so it
wins the onnxruntime/capi/ file-overwrite race against faster-whisper's
cpu wheel, then validate with preload_dlls + a real provider check.
This is documented in INSTALL.md so a future ORT-version-pin lode sees
the rationale.
The deprecated `precision` config key is kept for one release as an
alias for `quantization` (canonical: auto|fp32|int8). User journals in
the wild may carry it; it will be removed in a follow-up lode tracked
under vpe/workspace/parakeet-precision-alias-removal.
The `runtime_label` field in the settings API baseline is now
environment-dependent (CPU vs CUDA based on installed ORT), so the
baseline-comparison helper strips it before diffing — same field is
asserted directly in test_settings_routes.py instead. Long-audio CUDA
benchmark on hopper's GTX 1660 Ti hits 75.36x RTFx, well above the
spike's 50x sanity floor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>