Cooperative email for PDS operators

# Atmosphere Mail Operator Runbook

Runbook for operators of an Atmosphere Mail relay. Covers routine
maintenance, common incidents, and the "known weirdness" that has bitten
us during development so future sessions don't waste time re-discovering it.

Scope: the relay service itself, the Osprey worker, the labeler, the
relay's supporting DNS, and the three provider FBL integrations. Does
not cover atproto PDS operation (separate stack).

---

## Topology one-liner

```
member SMTP client (swaks, app)
  │  587/STARTTLS + SASL PLAIN
  ▼
atmos-relay (Hetzner VPS, 87.99.138.77, Tailscale)
  │  signs mail with d=<member-domain> (alignment) + d=atmos.email (pool FBL)
  │  stamps X-Atmos-Member-Did + Feedback-ID
  │  emits relay_attempt → Kafka
  ▼
fan-out to recipient MX
  │
  └─► Kafka (big-nix) ──► Osprey worker (SML rules) ──► Kafka (execution_results)
                                                          │
                                                          ▼
                                            atmos-relay consumer goroutine
                                                          │
                                                          ▼
                                              relay.sqlite.relay_events
                                                          │
                                                          ▼
                                          /admin/events, /admin/members/{did}
```

FBL complaints (ARF) and bounces (DSN) arrive on the smtp.atmos.email MX,
are classified, written to `relay.sqlite.inbound_messages`, attributed to
the originating member via X-Atmos-Member-Did, and fired as
`complaint_received` / `bounce_received` Osprey events.

---

## Routine tasks

### 1. Rotate operator DKIM key (atmos.email)

The operator keypair signs every outbound message with d=atmos.email. A
rotation is warranted on suspected private-key compromise, the ~annual
cadence recommended by M3AAWG, or if you simply want to force receivers
to re-evaluate the key.

```bash
# 1. On atmos-relay, rotate the key file. The relay regenerates on next
#    startup if the file is missing.
ssh root@atmos-relay rm /var/lib/atmos-relay/operator-dkim-keys.json
ssh root@atmos-relay systemctl restart atmos-relay

# 2. Grep the journal for the new public halves.
ssh root@atmos-relay "journalctl -u atmos-relay --since '2 min ago' | grep operator_dkim.first_boot"

# 3. Publish the new records at atmos<YYYYMMDD>{r,e}._domainkey.atmos.email
#    in Bunny (edit dns-infrastructure/domains.tf, open PR).

# 4. Leave the OLD records in DNS for ~48 h. Any mail signed with the
#    prior key still in flight needs its public half to verify. After
#    48 h, drop the old records in a follow-up PR.
```

### 2. Rotate member DKIM key

Triggered by member re-enrollment or explicit rotation request.

```bash
# Member re-enrolls via /enroll wizard. The relay generates a fresh
# RSA + Ed25519 pair, returns the public halves on the success page,
# hashes the API key, and stores the private halves in relay.sqlite
# member_domains.dkim_{rsa,ed}_privkey columns.
#
# Operator publishes the two TXT records at
#   atmos<YYYYMMDD>r._domainkey.<member-domain>
#   atmos<YYYYMMDD>e._domainkey.<member-domain>
#
# Labeler (on little-nix) verifies DKIM-in-DNS before issuing
# verified-mail-operator. It retries on failure, so publish → ≤ 60 s
# → labeled.
```

### 3. Approve a pending member

Pending members exist but cannot send (rejection: `535 5.7.8 pending
operator approval`). Approval is manual because shared-IP reputation
requires operator scrutiny of every new sender.

```bash
# Via admin API (Tailscale-only)
ADMIN=<admin-token-from-sops>
curl -X POST -H "Authorization: Bearer $ADMIN" \
  http://atmos-relay.internal.example:8080/admin/member/<did>/approve

# Or via the /admin/members dashboard, click Approve on the pending row.
```

Before approving: check `/admin/members/<did>` (the rich detail page).
Green on DKIM, attestation, labels, reasonable send-activity history →
approve. Anything suspicious (scraped-looking domain, no attestation
record, broken DKIM) → hold and investigate.

### 4. Emergency suspend a member

```bash
curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/suspend?reason=<url-encoded>"
```

SMTP submission from that DID starts rejecting immediately with
`535 5.7.8 account suspended`. Osprey event `relay_rejected` with
`reject_reason=suspended` fires for each attempt, so the Osprey UI
surfaces suspension activity.

### 5. Tune warming tier caps

Defaults live in `internal/relay/warming.go` `DefaultWarmingConfig`:

| Tier | Age | Hourly | Daily |
|---|---|---|---|
| warming | 0..7 d | 5 | 20 |
| ramping | 7..14 d | 20 | 100 |
| warmed | 14+ d | member's configured `HourlyLimit` | member's configured `DailyLimit` |

Tier caps compose with per-member limits via `min()`: a tighter
per-member configured cap always wins; tier caps never raise a
member's limit.

Two labels override the tier:
- `highly_trusted` (issued by Osprey's reputation rules) skips
  warming entirely.
- `burst_warming` halves the effective hourly cap during the current
  window even for warmed senders.

To change the defaults globally, edit `DefaultWarmingConfig()` and
redeploy. For per-member exceptions, stay with the existing
`HourlyLimit`/`DailyLimit` columns; those compose with the tier caps.

### 6. Route operator-classified mail to an external inbox

Mail addressed to operator aliases on `atmos.email` (`postmaster@`,
`abuse@`, `fbl@`, `support@`, `hostmaster@`) is normally accepted and
dropped at the relay; the inbound classifier logs it and moves on.
When `operatorForwardTo` is set in the relay config, those messages
are instead re-sent through the same reply-forwarder used for
per-member reply routing, landing in whatever inbox you configure.
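
The routing rule above can be sketched in a few lines of shell. This is an illustration only: `route_operator_mail` and its output words (`forward`, `drop`, `not-operator`) are hypothetical names, case-insensitive matching is an assumption, and the relay's actual Go handler may differ in detail.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the accept-and-drop vs. forward rule for operator
# aliases. Function name and output strings are hypothetical.

# route_operator_mail RECIPIENT [FORWARD_TO]
# Prints "forward", "drop", or "not-operator".
route_operator_mail() {
  local rcpt forward_to local_part domain
  # Lowercase the recipient (assumption: matching is case-insensitive).
  rcpt=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  forward_to=${2:-}
  local_part=${rcpt%@*}
  domain=${rcpt##*@}

  # Only aliases on the operator domain qualify.
  if [ "$domain" != "atmos.email" ]; then
    echo "not-operator"
    return
  fi

  case "$local_part" in
    postmaster|abuse|fbl|support|hostmaster)
      # An empty forward target keeps the accept-and-drop behaviour.
      if [ -n "$forward_to" ]; then
        echo "forward"
      else
        echo "drop"
      fi
      ;;
    *)
      echo "not-operator"
      ;;
  esac
}
```

Calling `route_operator_mail postmaster@atmos.email ops@example.com` models the case where `operatorForwardTo` is set; omitting the second argument models the empty-field case.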

This is the escape hatch for provider authorization flows: Microsoft
SNDS enrollment, Google Postmaster domain ownership emails, abuse
contact probes from CERTs, and anything else that expects a human to
read mail sent to `postmaster`.

```bash
# Current forward destination (NixOS-managed)
ssh root@atmos-relay "jq -r .operatorForwardTo /var/lib/atmos-relay/config.json"

# Change it: edit infra/nixos/default.nix, commit, let CI deploy.
# The field is config.json `operatorForwardTo`, set from the NixOS
# module. Restart is not zero-downtime for in-flight SMTP but the
# queue drains on boot.
```

Wiring lives in `internal/relay/inbound.go` (`processOperator`
handler) and `cmd/relay/main.go` (`SetOperatorForwarding` call). When
the field is empty, the old accept-and-log behaviour is preserved.
Verified by `TestInbound_OperatorForward_DeliversPostmasterMail` and
`TestInbound_OperatorForward_DisabledPreservesAcceptAndDrop`.

Testing the forward end-to-end:

```bash
# Send a test to any operator alias.
swaks --server smtp.atmos.email --ehlo smtp.atmos.email \
  --from ops-probe@example.com --to postmaster@atmos.email \
  --header "Subject: operator-forward smoke test" --body "ping"

# Confirm the forward ran.
ssh root@atmos-relay "journalctl -u atmos-relay --since '2 min ago' \
  | grep -E 'inbound\.operator|reply_forwarder\.sent'"
```

The forwarded message preserves the original From: and Subject:
headers so the downstream inbox sees the provider's original sender
(e.g. `notifications@microsoft.com`), not the relay.

---

## FBL integrations

Atmosphere Mail runs one pool-level FBL registration for each of the
three providers that offer one, anchored to the operator-DKIM domain
(`atmos.email`) and the Feedback-ID `mailer` field
(`atmosphere-mail`). Complaints flow to `fbl@atmospheremail.com`
(Porkbun forwarder → Proton alias `atmosphere.support@scottlanoue.com`).

Every outbound message carries:
- DKIM d=member-domain (DMARC alignment with From:)
- DKIM d=atmos.email (pool FBL routing)
- X-Atmos-Member-Did: <did> (attribution)
- Feedback-ID: transactional:<did>:<member-domain>:atmosphere-mail (Gmail)

### Current registration status

| Provider | Status | Evidence / next step |
|---|---|---|
| Gmail Postmaster Tools | Verified | TXT token published for `atmos.email`; dashboard live at postmaster.google.com. Reputation score needs ~48 h of sending volume to populate. |
| Microsoft SNDS | IP registered, authorization email landed via operator-forwarder | The enrollment flow required receiving a verification mail at `postmaster@atmos.email`, handled by the operator-forwarder routing described in section 6. |
| Microsoft JMRP | Registered | FBL recipient `fbl@atmospheremail.com` accepted. First complaint probe will confirm the delivery path. |
| Yahoo CFL | Verified 2026-04-20 | Domain verified via TXT (`yahoo-verification-key=…`) at the atmos.email apex. Verification record is a no-op now; tracked for removal in chainlink #144. Complaints will arrive at `fbl@atmospheremail.com` once Yahoo begins sending. |

Adding a new provider later: publish the FBL recipient as
`fbl@atmospheremail.com` if they accept an external address, otherwise
a per-provider mailbox. Any new verification mail to an operator
alias is handled by the forwarder from section 6, so no code change
is required.

### Gmail Postmaster Tools
- URL: https://postmaster.google.com
- Domain: atmos.email
- Verification: TXT record at atmos.email apex (Google provides token)
- Once verified → domain reputation + spam rate metrics visible in
  the dashboard. FBL signals fold into the Feedback-ID attribution.
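
For attributing a complaint back to a member, the `Feedback-ID` layout listed under "Every outbound message carries" can be split mechanically. A DID such as the made-up `did:plc:abc123` itself contains colons, so the sketch below peels fields off both ends rather than splitting on every colon. Both function names are hypothetical, and this is not the relay's own parser.

```bash
#!/usr/bin/env bash
# Sketch only: build and split a Feedback-ID of the form
#   <category>:<did>:<member-domain>:<mailer>
# Function names are hypothetical.

# build_feedback_id DID MEMBER_DOMAIN
build_feedback_id() {
  printf 'transactional:%s:%s:atmosphere-mail\n' "$1" "$2"
}

# parse_feedback_id VALUE
# Prints one key=value line per field.
parse_feedback_id() {
  local value=$1 rest
  local mailer=${value##*:}      # last field
  rest=${value%:*}
  local domain=${rest##*:}       # second-to-last field
  rest=${rest%:*}
  local category=${rest%%:*}     # first field
  local did=${rest#*:}           # everything in between, colons included
  printf 'category=%s\ndid=%s\ndomain=%s\nmailer=%s\n' \
    "$category" "$did" "$domain" "$mailer"
}
```

Piping a complaint's `Feedback-ID` value through `parse_feedback_id` and reading the `did=` line gives a second attribution path alongside `X-Atmos-Member-Did`.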

### Microsoft JMRP (Junk Mail Reporting Program)
- URL: https://sendersupport.olc.protection.outlook.com/pm/Postmaster
- Register sender: atmos.email
- FBL recipient: fbl@atmospheremail.com
- Microsoft mails a verification token within 24–48 h.

### Yahoo CFL (Complaint Feedback Loop)
- URL: https://senders.yahooinc.com/complaint-feedback-loop/
- Form submission: domain atmos.email, DKIM selectors
  atmos<YYYYMMDD>{r,e}, FBL recipient fbl@atmospheremail.com, contact
  postmaster@atmospheremail.com.
- Manual review, 1–5 day turnaround.

### Verifying complaints flow end to end

```bash
# 1. Confirm inbound MX accepts mail for atmos.email (Porkbun fwds
#    the atmospheremail.com aliases; atmos.email MX is smtp.atmos.email).
ssh root@atmos-relay "nix-shell -p swaks --run 'swaks --server smtp.atmos.email --to postmaster@atmos.email --helo smtp.atmos.email --from test@atmos.email --body hi 2>&1 | tail -5'"

# 2. Watch inbound_messages for the ARF row.
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT id,classification,member_did,subject FROM inbound_messages ORDER BY id DESC LIMIT 5\"'"
```

---

## Known weirdnesses (discovered the hard way)

### Druid is gone, don't bring it back

The original Osprey stack included Druid for analytics. At our event
volume (<1000/day), Druid's publishing pipeline added 2–18 minutes of
query latency and 6 containers of operational surface with no upside.
Retired in homelab-nix PR #677. The relay now consumes
`osprey.execution_results` from Kafka directly into its own SQLite and
serves `/admin/events` from there. Don't resurrect Druid.

### Osprey worker Postgres sink is flaky

`OSPREY_EXECUTION_RESULT_STORAGE_BACKEND=postgres` does not reliably
write every event to Postgres. Events do reach Kafka reliably.
Don't trust `stored_execution_result` on little-nix's Postgres as a
source of truth for anything user-visible; the relay's
`relay_events` table is authoritative.

### Porkbun email forwarding needs FQDN EHLO

A send to Porkbun's MX with bare-hostname EHLO (e.g.
`EHLO atmos-relay`) gets `504 5.5.2 need fully-qualified hostname` and
silently bounces. swaks defaults to `hostname(1)`, which is
`atmos-relay` on the VPS. Pass `--ehlo smtp.atmos.email` for any
script that sends to Porkbun forwards.

### Tangled stale branch cache

The Tangled mirror's frontend caches branch lists outside the knot's
git state. Deleting a branch at the git level (ls-remote confirms)
does not cause Tangled's UI to forget it. Escalate to Tangled
operators for a manual reindex if the dropdown bothers you.

### Gitea `run: |` blocks use dash

Bash-only array syntax (`ALIASES=(a b c)`) fails inside a workflow's
`run:` step because the runner shells out to `sh`, which is `dash` on
Debian slim. Use POSIX patterns (`for x in a b c`).

### Auth LOGIN vs PLAIN

The relay advertises only `AUTH PLAIN` after STARTTLS. swaks defaults
to LOGIN. Members pasting the quickstart verbatim get
`Auth not attempted, requested type not available`. The Step 5 swaks
block on the enroll success page now hard-codes `--auth PLAIN`.

### Em-dashes in Gitea workflow comments

One workflow we wrote broke with `import_bunny_record: command not
found` when a comment block between function use-sites contained an
em-dash (U+2014). Generator's tokenizer choked on it. Stick to ASCII
hyphens inside `run: |` blocks.

### Templ `+` in text nodes

A templ file containing `<strong>Week 3+:</strong>` fails to parse:
templ treats the `+` as an expression terminator. Rephrase
("Week 3 onward") or wrap in raw string delimiters.
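
Two of the notes in this section (bash-only syntax in `run:` steps and the em-dash breakage) are mechanical enough to check before pushing. The sketch below is one possible pre-push check, not something that exists in the repo: `lint_workflow` is a hypothetical name, and the array-assignment pattern is a rough heuristic that also flags matching lines outside `run:` blocks.

```bash
#!/usr/bin/env bash
# Sketch only: flag workflow lines of the kind that have caused trouble.
# lint_workflow is a hypothetical helper, not part of the repo.

# lint_workflow FILE
# Prints offending lines with line numbers; returns 1 if any were found.
lint_workflow() {
  local file=$1 status=0

  # Bytes outside printable ASCII (an em-dash is three such bytes in UTF-8).
  if LC_ALL=C grep -n '[^[:print:][:space:]]' "$file"; then
    echo "non-ASCII characters found in $file" >&2
    status=1
  fi

  # Bash-style array assignment such as ALIASES=(a b c), which dash rejects.
  if grep -nE '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*=\(' "$file"; then
    echo "bash-only array syntax found in $file" >&2
    status=1
  fi

  return "$status"
}
```

Running `lint_workflow .gitea/workflows/relay-deploy.yml` before a push would surface both classes of problem; a clean file produces no output and a zero exit status.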

---

## Incident response

### "Mail is not being delivered"

```bash
# Order of investigation:
# 1. Is the relay running?
ssh root@atmos-relay "systemctl status atmos-relay --no-pager | head -5"

# 2. Is SMTP reachable from the member's vantage?
swaks --server smtp.atmos.email --port 587 --tls --quit-after FIRST-HELO

# 3. Did the member's AUTH succeed? (check relay journal)
ssh root@atmos-relay "journalctl -u atmos-relay --since '15 min ago' | grep smtp.auth"

# 4. Did the message reach Kafka?
ssh root@big-nix "docker exec osprey-kafka-XXX kafka-console-consumer --bootstrap-server localhost:9092 --topic osprey.actions_input --from-beginning --timeout-ms 5000 | grep <did>"

# 5. Is the downstream MX reachable?
ssh root@atmos-relay "timeout 10 nc -zv gmail-smtp-in.l.google.com 25"

# 6. Check delivery.result log line for the specific message_id.
```

### "Everything is in spam"

Expected on day 1 for a new member. See the Step 6 copy on the
enroll success page. Real mitigation:
1. Member engages recipients (open, not-spam, reply)
2. Confirm DKIM + SPF + DMARC all pass in a received message's headers
3. Check Google Postmaster Tools domain reputation; target is
   `Medium` or better by week 2
4. Check Microsoft SNDS for the relay IP's listing

### "Member is suspended but shouldn't be"

```bash
# Reactivate
curl -X POST -H "Authorization: Bearer $ADMIN" \
  http://atmos-relay.internal.example:8080/admin/member/<did>/reactivate

# Investigate what rule suspended them
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT verdicts, raw FROM relay_events WHERE sender_did = \\\"<did>\\\" AND action_name = \\\"member_suspended\\\" ORDER BY id DESC LIMIT 1\"'"
```

### "DKIM verification fails at the receiver"

```bash
# Confirm DNS has the record the relay signed with.
MEMBER_DOMAIN=scottlanoue.com
SELECTOR=atmos20260418
for v in r e; do
  echo "=== ${SELECTOR}${v}._domainkey.${MEMBER_DOMAIN} ==="
  dig +short TXT "${SELECTOR}${v}._domainkey.${MEMBER_DOMAIN}" @8.8.8.8
done

# Confirm the relay is using that selector.
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT domain, dkim_selector FROM member_domains WHERE domain = \\\"${MEMBER_DOMAIN}\\\"\"'"

# If selector in DB does not match DNS, either publish the correct
# selector or trigger a member re-enrollment to sync them.
```

---

## Deploy + rollback

The relay auto-deploys on merge to `main` via Gitea Actions (see
`.gitea/workflows/relay-deploy.yml`). Rollback path:

```bash
# Revert the offending commit and push
cd /Users/scottlanoue/repos/atmosphere-mail
git revert <sha>
git push gitea main

# Or deploy a specific prior tag
gh pr create --title "Rollback to <sha>" --body "..." && ...
```

Relay restart is zero-downtime for accepted-but-unflushed mail because
messages are persisted in the local queue dir before handoff; the
queue is drained by the new process on boot. In-flight SMTP
connections die and the sending client retries.

---

## Operator accounts

| Service | Login | Purpose |
|---|---|---|
| Hetzner | scott@... | Relay VPS hosting |
| Bunny DNS | scott@... | atmos.email + scottlanoue.com + tryfamilia.com zones |
| Porkbun | scott@... | atmospheremail.com + threadline.tools |
| Gmail Postmaster | atmospheremail.ops@gmail.com | FBL for atmos.email |
| Microsoft JMRP | atmospheremail.ops@outlook.com | FBL for atmos.email |
| Yahoo CFL | form-based (no account) | FBL for atmos.email |
| Proton | atmosphere.support@scottlanoue.com | All operational inbound mail |

Store all of these in the password manager. Grant co-operators access
via a password-manager share, not by resetting passwords.

---

## What this runbook doesn't cover

- **atproto labeler operations**: separate stack (little-nix), its own
  state (`state_dir/labels.db`). Labeler is mostly self-healing via
  jetstream retries; restart on crash and it catches up.
- **Kafka/Osprey worker maintenance**: documented in homelab-nix.
  Single node, KRaft mode, 30-day retention. Run
  `kafka-consumer-groups --describe` to spot stuck consumers.
- **Atmosphere Office**: entirely separate product. See
  `SPEC-atmosphere-office.md`.