add Fly HTTP health checks to detect unresponsive machines (#1214)
after the 2026-04-02 outage, a machine became unresponsive but Fly
had no way to detect it — no health checks were configured. this left
the API down for ~6 minutes until manual intervention.
adds `[[http_service.checks]]` hitting GET /health every 10s with a
5s timeout and 30s grace period on startup, for both prod and staging.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
authored by