Social Annotations in the Atmosphere

docs: add proxy stability plan for sure.seams.so

Documents root cause of proxy outage (pywb requests hanging indefinitely)
and proposed solution using uwsgi with harakiri timeout instead of
wayback CLI. Includes success criteria and testing checklist.

+170
history/PROXY_STABILITY_PLAN.md
# Proxy Stability Plan: sure.seams.so

## Problem Statement

On 2025-12-06, the `sure.seams.so` proxy became unresponsive. Users saw:
```
[PR03] could not find a good candidate within 1 attempts at load balancing.
last error: [PU03] unreachable worker host. the host may be unhealthy. this is a Fly issue.
```

### Root Cause Analysis

Investigation revealed:
1. **pywb requests were hanging indefinitely** - Logs showed requests taking 500+ seconds with `status: 0` (connection closed without response)
2. **Accumulated stuck connections** - `/proc/net/tcp` showed many connections in CLOSE_WAIT state
3. **No request timeouts** - pywb's `wayback` CLI has no built-in timeout for upstream fetches
4. **No health checks** - Fly.io couldn't detect pywb was stuck (only Caddy was responding)
5. **No process supervision** - `wayback` ran in background with `&`, no restart on failure

### Immediate Fix
A redeployment (`flyctl deploy -c fly.proxy.toml --no-cache`) cleared the stuck state.

---

## Current Architecture

```
Internet → Fly.io (TLS) → Caddy (:8082) → pywb/wayback (:8081)
```

### Current Dockerfile.proxy
- Runs `wayback -p 8081` in background
- Runs Caddy in foreground on :8082
- Caddy proxies `/proxy/*` and `/static/*` to pywb
- Caddy serves static files for OAuth callback, .well-known, etc.

### What Caddy Currently Does
1. **Proxies to pywb**: `/proxy/*`, `/static/*`
2. **Serves static files**: OAuth callback, .well-known (TWA), CSS/JS
3. **Redirects**: Root `/` → seams.so
4. **Strips CSP headers** from proxied responses

### OAuth Flow on sure.seams.so
- Redirect URI: `https://sure.seams.so/oauth-callback.html`
- Callback page: `proxy/static/oauth-callback.html` (built from `entrypoints/via-client/oauth-callback.ts`)
- Uses same client_id as main extension: `https://seams.so/oauth/client-metadata.json`

---

## Proposed Solution

### Key Insight from pywb Documentation
pywb recommends **uwsgi with gevent** for production, not the `wayback` CLI:

> "For larger scale production deployments, running with uwsgi server application is recommended...
> pywb must be run with the Gevent Loop Engine."

### Changes Required

#### 1. Replace `wayback` CLI with `uwsgi`
**Why**: uwsgi provides worker management, request timeouts, and automatic recycling.

**uwsgi.ini settings needed**:
```ini
[uwsgi]
http-socket = :8080
master = true
gevent = 100
# Kill requests stuck >30 seconds
harakiri = 30
harakiri-verbose = true
die-on-term = true
env = PYWB_CONFIG_FILE=config.yaml
wsgi = pywb.apps.wayback
```

#### 2. Remove Caddy (Evaluate First)
**Why**: Simplify architecture, fewer moving parts.

**Caddy's current responsibilities that must be handled**:
| Route | Current Handler | Replacement |
|-------|-----------------|-------------|
| `/proxy/*` | Caddy → pywb | uwsgi directly |
| `/static/*` | Caddy → pywb | pywb `static_routes` |
| `/oauth-callback.html` | Caddy file_server | pywb `static_routes` |
| `/.well-known/*` | Caddy file_server | pywb `static_routes` |
| `/` | Caddy redirect | pywb or uwsgi config |
| CSP header stripping | Caddy `header_down` | pywb config `enable_content_security_policy: false` |

**Risk**: OAuth callback must work. This is critical path for user login.

#### 3. Add Fly.io Health Check
**Why**: Detect and restart unhealthy machines automatically.

```toml
# fly.proxy.toml
[[services.http_checks]]
  interval = "10s"
  timeout = "5s"
  path = "/proxy/https://example.com/"
  method = "HEAD"
```

---

## Success Criteria

### Functional Requirements
- [ ] Proxy serves pages: `https://sure.seams.so/proxy/https://example.com/` returns 200
- [ ] OAuth callback works: User can log in via sure.seams.so sidebar
- [ ] Static files served: `/static/*` routes work for seams-client.js, CSS
- [ ] .well-known served: TWA asset links accessible
- [ ] Seams sidebar injects and functions on proxied pages

### Stability Requirements
- [ ] Requests time out after 30 seconds (not hang indefinitely)
- [ ] Health check detects stuck pywb and triggers restart
- [ ] No zombie process accumulation over 24+ hours
- [ ] Clean shutdown on SIGTERM (Fly.io deployments)

### Testing Checklist
1. **Basic proxy**: `curl -I https://sure.seams.so/proxy/https://example.com/`
2. **OAuth flow**: Log in via sure.seams.so sidebar, verify session persists
3. **Static files**: `curl https://sure.seams.so/static/seams-client.js`
4. **TWA links**: `curl https://sure.seams.so/.well-known/assetlinks.json`
5. **Timeout behavior**: Request a slow/hanging URL, verify 30s timeout
6. **Health check**: `flyctl checks list -a sure-seams-so`

---

## Implementation Steps

### Phase 1: Prepare (No Deployment)
1. Clean up git history (commit current OAuth callback fixes)
2. Create new branch for proxy stability work
3. Update pywb config.yaml to serve all static routes
4. Write uwsgi.ini with harakiri timeout
5. Update Dockerfile.proxy to use uwsgi

### Phase 2: Test Locally
1. Build and run container locally
2. Test all routes from checklist
3. Verify OAuth callback flow works

### Phase 3: Deploy to Staging (Optional)
1. Deploy to separate Fly.io app for testing
2. Full test checklist

### Phase 4: Production Deploy
1. Deploy to sure-seams-so
2. Monitor logs for 24 hours
3. Verify no stuck processes accumulate

---

## Rollback Plan

If issues arise after deployment:
```bash
# Revert to previous Docker image
flyctl releases -a sure-seams-so
flyctl deploy -a sure-seams-so --image registry.fly.io/sure-seams-so:deployment-<PREVIOUS_ID>
```

---

## References
- [pywb Deployment Guide](https://pywb.readthedocs.io/en/latest/manual/usage.html#deployment)
- [pywb Docker Image](https://github.com/webrecorder/pywb/blob/main/Dockerfile)
- [uwsgi.ini reference](https://github.com/webrecorder/pywb/blob/main/uwsgi.ini)
- [Fly.io Health Checks](https://fly.io/docs/reference/configuration/#services-http_checks)
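
---

## Appendix: Scripted Smoke Checks (Sketch)

The curl-based items in the Testing Checklist (basic proxy, static files, TWA links) could be scripted so they run the same way locally and after a deploy. The following is a minimal sketch, not part of the repository. The names `CheckResult`, `check`, `run_checklist`, and `client_timeout` are hypothetical, and the 35-second default is an assumption chosen to sit just above the planned 30-second harakiri limit. Only the Python standard library is used.

```python
"""Smoke checks for the Testing Checklist (hypothetical helper, not part of the repo)."""
import http.client
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckResult:
    name: str
    status: Optional[int]  # HTTP status code, or None if no response arrived
    elapsed: float         # seconds spent waiting for the outcome
    ok: bool               # True only for a 200 response


def check(name: str, url: str, method: str = "GET",
          client_timeout: float = 35.0) -> CheckResult:
    """Request `url` once and record the status and the time it took.

    The default client timeout sits a little above the planned 30 s harakiri
    limit (an assumption), so a server-side kill shows up before the client
    gives up on its own.
    """
    request = urllib.request.Request(url, method=method)
    start = time.monotonic()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(request, timeout=client_timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, just not with a success status
    except (OSError, http.client.HTTPException):
        status = None      # refused, reset, timed out, or closed without a response
    elapsed = time.monotonic() - start
    return CheckResult(name, status, elapsed, ok=(status == 200))


def run_checklist(base_url: str, client_timeout: float = 35.0) -> List[CheckResult]:
    """Run the three curl-style checks from the checklist against `base_url`."""
    checks = [
        ("basic proxy", "/proxy/https://example.com/", "HEAD"),
        ("static files", "/static/seams-client.js", "GET"),
        ("TWA links", "/.well-known/assetlinks.json", "GET"),
    ]
    return [check(name, base_url + path, method, client_timeout)
            for name, path, method in checks]
```

Calling `run_checklist("https://sure.seams.so")` and printing each result covers checklist items 1, 3, and 4. For item 5, `check` can be pointed at a proxied URL whose upstream is known to hang: a `status` of `None` or an error code with `elapsed` near 30 seconds suggests the harakiri limit fired, while an `elapsed` close to `client_timeout` suggests the request was still hanging when the client gave up. The OAuth flow and `flyctl checks list` items remain manual.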