docs: document Neon scale-to-zero configuration for production (#747)

+50 -5

2 changed files



+17 -1

STATUS.md

@@ -47,6 +47,22 @@
 
 ### January 2026
 
+#### Neon cold start fix (Jan 11)
+
+**why**: first requests after idle periods would fail with 500 errors due to Neon serverless scaling to zero after 5 minutes of inactivity. previous mitigations (larger pool, longer timeouts) helped but didn't eliminate the problem.
+
+**fix**: disabled scale-to-zero on `plyr-prd` via Neon console. this is the [recommended approach](https://neon.com/blog/6-best-practices-for-running-neon-in-production) for production workloads.
+
+**configuration**:
+- `plyr-prd`: scale-to-zero **disabled** (`suspend_timeout_seconds: -1`)
+- `plyr-stg`, `plyr-dev`: scale-to-zero enabled (cold starts acceptable)
+
+**docs**: updated [connection-pooling.md](docs/backend/database/connection-pooling.md) with production guidance and how to verify settings via Neon MCP.
+
+closes #733
+
+---
+
 #### multi-account experience (PRs #707, #710, #712-714, Jan 3-5)
 
 **why**: many users have multiple ATProto identities (personal, artist, label). forcing re-authentication to switch was friction that discouraged uploads from secondary accounts.
@@ -407,4 +423,4 @@
 
 ---
 
-this is a living document. last updated 2026-01-09.
+this is a living document. last updated 2026-01-11.
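reviewer note: the configuration entry above can be checked mechanically. the sketch below is a minimal illustration, assuming the `suspend_timeout_seconds` semantics documented in the connection-pooling.md change in this PR (`-1` never suspends, `0` uses the 5 minute default, positive values are a custom timeout). the helper names `effective_suspend_timeout` and `matches_expected` are hypothetical and not part of the repo or of any Neon client.

```python
from typing import Optional

# 5 minute default idle timeout, as stated in this PR
DEFAULT_SUSPEND_SECONDS = 300

# expected scale-to-zero state per project, taken from the STATUS.md entry
EXPECTED_SCALE_TO_ZERO = {
    "plyr-prd": False,  # disabled: customer-facing
    "plyr-stg": True,   # enabled: cold starts acceptable
    "plyr-dev": True,   # enabled: cold starts acceptable
}


def effective_suspend_timeout(suspend_timeout_seconds: int) -> Optional[int]:
    """Return idle seconds before suspend, or None if the compute never suspends."""
    if suspend_timeout_seconds == -1:
        return None  # scale-to-zero disabled
    if suspend_timeout_seconds == 0:
        return DEFAULT_SUSPEND_SECONDS  # scale-to-zero enabled with the default
    if suspend_timeout_seconds > 0:
        return suspend_timeout_seconds  # custom timeout
    raise ValueError(f"unexpected suspend_timeout_seconds: {suspend_timeout_seconds}")


def matches_expected(project: str, suspend_timeout_seconds: int) -> bool:
    """True if the observed setting agrees with the expected state for the project."""
    enabled = effective_suspend_timeout(suspend_timeout_seconds) is not None
    return EXPECTED_SCALE_TO_ZERO[project] == enabled
```

for example, `matches_expected("plyr-prd", 0)` is false, which would flag a production compute that has drifted back to the default suspend behaviour.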

+33 -4

docs/backend/database/connection-pooling.md

@@ -76,16 +76,43 @@
 
 ## Neon serverless considerations
 
-plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling:
+plyr.fm uses Neon PostgreSQL, which can scale to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling.
 
 ### the cold start problem
 
 1. site is idle for several minutes → Neon scales down
-2. first request arrives → Neon needs 5-10s to wake up
+2. first request arrives → Neon needs 500ms-10s to wake up
 3. if pool_size is too small, all connections hang waiting for Neon
 4. new requests can't get connections → 500 errors
 
-### how we mitigate this
+### production solution: disable scale-to-zero
+
+**for production workloads, disable scale-to-zero in Neon console.** this is the recommended approach per [Neon's production best practices](https://neon.com/blog/6-best-practices-for-running-neon-in-production).
+
+**current configuration (Jan 2026):**
+
+| project | scale-to-zero | reason |
+|---------|---------------|--------|
+| plyr-prd | **disabled** | customer-facing, no cold starts |
+| plyr-stg | enabled | ok to have cold starts on staging |
+| plyr-dev | enabled | ok to have cold starts on dev |
+
+**to verify via Neon MCP:**
+
+```
+mcp__neon__list_branch_computes({ "projectId": "cold-butterfly-11920742" })
+```
+
+check `suspend_timeout_seconds` on the compute:
+- `-1` = scale-to-zero disabled (never suspend)
+- `0` = scale-to-zero enabled (uses default 5 min timeout)
+- `>0` = custom suspend timeout in seconds
+
+**to change:** Neon Console → Project → Computes → Edit → Scale to zero toggle
+
+### fallback mitigations (if scale-to-zero is enabled)
+
+if you must keep scale-to-zero enabled, these settings help survive cold starts:
 
 **larger connection pool (pool_size=10, max_overflow=5):**
 - allows 15 concurrent requests to wait for Neon wake-up
@@ -96,14 +123,16 @@
 - short enough to fail fast on true database outages
 
 **queue listener heartbeat:**
-- background task pings database every 5s
+- background task pings database every 5s via separate asyncpg connection
 - detects connection death before user requests fail
 - triggers reconnection with exponential backoff
+- note: this keeps the queue listener's connection warm, not the SQLAlchemy pool
 
 ### incident history
 
 - **2025-11-17**: first pool exhaustion outage - queue listener hung indefinitely on slow database. fix: added 15s timeout to asyncpg.connect() in queue service.
 - **2025-12-02**: cold start recurrence - 5 minute idle period caused Neon to scale down. first 5 requests after wake-up hung for 3-5 minutes each, exhausting pool. fix: increased pool_size to 10, max_overflow to 5, connection_timeout to 10s.
+- **2026-01-11**: disabled scale-to-zero on plyr-prd to eliminate cold starts entirely. kept enabled on dev/staging where cold starts are acceptable.
 
 ## production best practices
 
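reviewer note: the queue listener heartbeat described in the diff (ping every 5s, detect connection death, reconnect with exponential backoff) can be sketched with the standard library alone. this is an illustration of the pattern, not the project's actual queue service: `ping` and `reconnect` are hypothetical callables standing in for whatever the listener does over its asyncpg connection, and the backoff parameters (1s base, doubling, 30s cap) and the 10s ping timeout are assumptions. only the 5s interval comes from the doc.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, Optional


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays base, base*factor, ... with each value capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


async def heartbeat(
    ping: Callable[[], Awaitable[None]],       # hypothetical: e.g. runs a trivial query
    reconnect: Callable[[], Awaitable[None]],  # hypothetical: reopens the listener connection
    interval: float = 5.0,                     # "pings database every 5s" per the doc
    ping_timeout: float = 10.0,                # assumption: bound each ping so it cannot hang
    max_beats: Optional[int] = None,           # None = run forever; set for tests
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    beats = 0
    while max_beats is None or beats < max_beats:
        beats += 1
        try:
            await asyncio.wait_for(ping(), timeout=ping_timeout)
        except (OSError, asyncio.TimeoutError):
            # connection looks dead: retry reconnect with growing delays
            for delay in backoff_delays():
                await sleep(delay)
                try:
                    await reconnect()
                    break
                except OSError:
                    continue
        await sleep(interval)
```

bounding the ping with `asyncio.wait_for` mirrors the lesson from the 2025-11-17 incident, where an unbounded wait on a slow database hung the listener. as the diff notes, a loop like this keeps only the listener's own connection warm and does nothing for the SQLAlchemy pool.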
