# Atmosphere Mail Operator Runbook

Runbook for operators of an Atmosphere Mail relay. Covers routine
maintenance, common incidents, and the "known weirdness" that has bitten
us during development so future sessions don't waste time re-discovering it.

Scope: the relay service itself, the Osprey worker, the labeler, the
relay's supporting DNS, and the three provider FBL integrations. Does
not cover atproto PDS operation (separate stack).

---

## Topology one-liner

```
member SMTP client (swaks, app)
        │ 587/STARTTLS + SASL PLAIN
        ▼
atmos-relay (Hetzner VPS, 87.99.138.77, Tailscale)
        │ signs mail with d=<member-domain> (alignment) + d=atmos.email (pool FBL)
        │ stamps X-Atmos-Member-Did + Feedback-ID
        │ emits relay_attempt → Kafka
        ▼
fan-out to recipient MX
        │
        └─► Kafka (big-nix) ──► Osprey worker (SML rules) ──► Kafka (execution_results)
                                                                        │
                                                                        ▼
                                                              atmos-relay consumer goroutine
                                                                        │
                                                                        ▼
                                                              relay.sqlite.relay_events
                                                                        │
                                                                        ▼
                                                              /admin/events, /admin/members/{did}
```

FBL complaints (ARF) and bounces (DSN) arrive at the smtp.atmos.email MX,
where they are classified, written to `relay.sqlite.inbound_messages`,
attributed to the originating member via X-Atmos-Member-Did, and fired as
`complaint_received` / `bounce_received` Osprey events.
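
The ARF/DSN split described above can be sketched from the Content-Type header alone: ARF reports (RFC 5965) are `multipart/report` with `report-type=feedback-report`, and DSNs (RFC 3464) use `report-type=delivery-status`. This is a minimal sketch, not the relay's classifier (which lives in the Go code and may consider more signals); the function name and the `complaint` / `bounce` / `other` return values are made up for illustration.

```bash
# Hypothetical helper: classify a message by its top-level Content-Type value.
# Prints one of: complaint, bounce, other.
classify_report_type() (
  # Lowercase the header value so the match is case-insensitive.
  ct=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ct" in
    multipart/report*feedback-report*) echo complaint ;;  # ARF, RFC 5965
    multipart/report*delivery-status*) echo bounce ;;     # DSN, RFC 3464
    *) echo other ;;
  esac
)
```

For example, `classify_report_type 'multipart/report; report-type=feedback-report; boundary="x"'` prints `complaint`.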

---

## Routine tasks

### 1. Rotate operator DKIM key (atmos.email)

The operator keypair signs every outbound message with d=atmos.email. A
rotation is warranted on suspected private-key compromise, the ~annual
cadence recommended by M3AAWG, or if you simply want to force receivers
to re-evaluate the key.

```bash
# 1. On atmos-relay, rotate the key file. The relay regenerates on next
#    startup if the file is missing.
ssh root@atmos-relay rm /var/lib/atmos-relay/operator-dkim-keys.json
ssh root@atmos-relay systemctl restart atmos-relay

# 2. Grep the journal for the new public halves.
ssh root@atmos-relay "journalctl -u atmos-relay --since '2 min ago' | grep operator_dkim.first_boot"

# 3. Publish the new records at atmos<YYYYMMDD>{r,e}._domainkey.atmos.email
#    in Bunny (edit dns-infrastructure/domains.tf, open PR).

# 4. Leave the OLD records in DNS for ~48 h. Any mail signed with the
#    prior key still in flight needs its public half to verify. After
#    48 h, drop the old records in a follow-up PR.
```
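
The record names in step 3 follow the `atmos<YYYYMMDD>{r,e}._domainkey.<domain>` pattern, so they can be generated rather than typed. A small sketch; `dkim_record_names` is a hypothetical helper, and the date stamp is passed in explicitly because this runbook does not say which date the relay uses when it builds a selector.

```bash
# Hypothetical helper: print the two DKIM TXT record names for a selector date.
# Usage: dkim_record_names <YYYYMMDD> <domain>
dkim_record_names() (
  stamp=$1
  domain=$2
  case "$stamp" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "stamp must be YYYYMMDD, got: $stamp" >&2; exit 1 ;;
  esac
  # One record per suffix, "r" and "e", as in the selector pattern above.
  for suffix in r e; do
    printf 'atmos%s%s._domainkey.%s\n' "$stamp" "$suffix" "$domain"
  done
)
```

The output can be fed to `dig +short TXT` once the PR has merged, to confirm both records resolve before the old ones are dropped.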

### 2. Rotate member DKIM key

Triggered by member re-enrollment or explicit rotation request.

```bash
# Member re-enrolls via /enroll wizard. The relay generates a fresh
# RSA + Ed25519 pair, returns the public halves on the success page,
# hashes the API key, and stores the private halves in relay.sqlite
# member_domains.dkim_{rsa,ed}_privkey columns.
#
# Operator publishes the two TXT records at
#   atmos<YYYYMMDD>r._domainkey.<member-domain>
#   atmos<YYYYMMDD>e._domainkey.<member-domain>
#
# Labeler (on little-nix) verifies DKIM-in-DNS before issuing
# verified-mail-operator. It retries on failure, so publish → ≤ 60 s
# → labeled.
```

### 3. Approve a pending member

Pending members exist but cannot send (rejection: `535 5.7.8 pending
operator approval`). Approval is manual because shared-IP reputation
requires operator scrutiny of every new sender.

```bash
# Via admin API (Tailscale-only). Quote the placeholders: an unquoted
# <...> is parsed by the shell as a redirection.
ADMIN='<admin-token-from-sops>'
curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/approve"

# Or via the /admin/members dashboard, click Approve on the pending row.
```

Before approving: check `/admin/members/<did>` (the rich detail page).
Green on DKIM, attestation, labels, reasonable send-activity history →
approve. Anything suspicious (scraped-looking domain, no attestation
record, broken DKIM) → hold and investigate.

### 4. Emergency suspend a member

```bash
curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/suspend?reason=<url-encoded>"
```

SMTP submission from that DID starts rejecting immediately with
`535 5.7.8 account suspended`. Osprey event `relay_rejected` with
`reject_reason=suspended` fires for each attempt, so the Osprey UI
surfaces suspension activity.
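
The `reason` query parameter has to be percent-encoded before it goes into the URL. One way to do that without extra tooling is a small shell function; `urlencode` here is a hypothetical helper, and this sketch assumes the reason is plain ASCII (multi-byte characters are not handled portably).

```bash
# Hypothetical helper: percent-encode an ASCII string for use in a query string.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;                  # unreserved, keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;          # everything else -> %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Usage (not executed here):
#   REASON=$(urlencode 'complaint rate > 0.3%')
#   curl -X POST -H "Authorization: Bearer $ADMIN" \
#     "http://atmos-relay.internal.example:8080/admin/member/<did>/suspend?reason=$REASON"
```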

### 5. Tune warming tier caps

Defaults live in `internal/relay/warming.go` `DefaultWarmingConfig`:

| Tier | Age | Hourly cap | Daily cap |
|---|---|---|---|
| warming | 0..7 d | 5 | 20 |
| ramping | 7..14 d | 20 | 100 |
| warmed | 14+ d | member's configured `HourlyLimit` | member's configured `DailyLimit` |

Tier caps compose with per-member limits via `min()`: a tighter
per-member configured cap always wins, and tier caps never raise a
member's limit.

Two labels override the tier:
- `highly_trusted` (issued by Osprey's reputation rules) skips
  warming entirely.
- `burst_warming` halves the effective hourly cap during the current
  window even for warmed senders.

To change the defaults globally, edit `DefaultWarmingConfig()` and
redeploy. For per-member exceptions, keep using the existing
`HourlyLimit`/`DailyLimit` columns; they compose with the tier caps as
described above.
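
To make the composition concrete, here is the hourly-cap arithmetic as a shell sketch. It is not the code in `warming.go`. Several details are assumptions: tier boundaries are treated as half-open (day 7 counts as ramping, day 14 as warmed), the `burst_warming` halving uses integer division and is applied after the `min()`, and labels are passed as a space-separated string.

```bash
# Hypothetical helper: effective hourly cap for a member.
# Usage: effective_hourly_cap <age_days> <member_hourly_limit> [labels]
effective_hourly_cap() (
  age_days=$1
  member_hourly=$2
  labels=${3:-}
  # Tier cap from the defaults table; 0 means "no tier cap applies".
  if [ "$age_days" -lt 7 ]; then tier_cap=5
  elif [ "$age_days" -lt 14 ]; then tier_cap=20
  else tier_cap=0
  fi
  # highly_trusted skips warming entirely.
  case " $labels " in *" highly_trusted "*) tier_cap=0 ;; esac
  # min(tier cap, member cap): the tier can lower the limit, never raise it.
  cap=$member_hourly
  if [ "$tier_cap" -gt 0 ] && [ "$tier_cap" -lt "$cap" ]; then cap=$tier_cap; fi
  # burst_warming halves whatever cap is in effect (assumed integer division).
  case " $labels " in *" burst_warming "*) cap=$((cap / 2)) ;; esac
  echo "$cap"
)
```

Under these assumptions a 3-day-old member configured for 100/hour gets 5, the same member with `highly_trusted` gets 100, and a 30-day-old member with `burst_warming` gets 50.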

### 6. Route operator-classified mail to an external inbox

Mail addressed to operator aliases on `atmos.email` (`postmaster@`,
`abuse@`, `fbl@`, `support@`, `hostmaster@`) is normally accepted and
dropped at the relay: the inbound classifier logs it and moves on.
When `operatorForwardTo` is set in the relay config, those messages
are instead re-sent through the same reply-forwarder used for
per-member reply routing, landing in whatever inbox you configure.

This is the escape hatch for provider authorization flows: Microsoft
SNDS enrollment, Google Postmaster domain ownership emails, abuse
contact probes from CERTs, and anything else that expects a human to
read mail sent to `postmaster`.

```bash
# Current forward destination (NixOS-managed)
ssh root@atmos-relay "cat /var/lib/atmos-relay/config.json | jq -r .operatorForwardTo"

# Change it: edit infra/nixos/default.nix, commit, let CI deploy.
# The field is config.json `operatorForwardTo`, set from the NixOS
# module. Restart is not zero-downtime for in-flight SMTP but the
# queue drains on boot.
```

Wiring lives in `internal/relay/inbound.go` (`processOperator`
handler) and `cmd/relay/main.go` (`SetOperatorForwarding` call). When
the field is empty, the old accept-and-log behaviour is preserved.
Verified by `TestInbound_OperatorForward_DeliversPostmasterMail` and
`TestInbound_OperatorForward_DisabledPreservesAcceptAndDrop`.

Testing the forward end-to-end:

```bash
# Send a test to any operator alias.
swaks --server smtp.atmos.email --ehlo smtp.atmos.email \
  --from ops-probe@example.com --to postmaster@atmos.email \
  --header "Subject: operator-forward smoke test" --body "ping"

# Confirm the forward ran.
ssh root@atmos-relay "journalctl -u atmos-relay --since '2 min ago' \
  | grep -E 'inbound\.operator|reply_forwarder\.sent'"
```

The forwarded message preserves the original From: and Subject:
headers so the downstream inbox sees the provider's original sender
(e.g. `notifications@microsoft.com`), not the relay.

---

## FBL integrations

Atmosphere Mail runs one pool-level FBL registration for each of the
three providers that offer one, anchored to the operator-DKIM domain
(`atmos.email`) and the Feedback-ID `mailer` field
(`atmosphere-mail`). Complaints flow to `fbl@atmospheremail.com`
(Porkbun forwarder → Proton alias `atmosphere.support@scottlanoue.com`).

Every outbound message carries:
- DKIM d=member-domain (DMARC alignment with From:)
- DKIM d=atmos.email (pool FBL routing)
- X-Atmos-Member-Did: <did> (attribution)
- Feedback-ID: transactional:<did>:<member-domain>:atmosphere-mail (Gmail)
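
Feedback-ID fields are colon-separated, and a DID such as `did:plc:...` contains colons itself, so splitting the header on the first few colons would cut the DID apart. The sketch below assumes the DID is embedded verbatim in the layout shown above and recovers it by stripping one field from the front and two from the back. Both function names and the sample DID are hypothetical.

```bash
# Hypothetical helper: build the Feedback-ID value in the layout listed above.
# Usage: feedback_id_build <did> <member-domain>
feedback_id_build() {
  printf 'transactional:%s:%s:atmosphere-mail\n' "$1" "$2"
}

# Hypothetical helper: recover the DID from a Feedback-ID value.
feedback_id_did() (
  v=${1#*:}     # drop the leading "transactional:" field
  v=${v%:*}     # drop the trailing ":atmosphere-mail" field
  v=${v%:*}     # drop the trailing ":<member-domain>" field
  printf '%s\n' "$v"
)
```

This only works while the member domain and mailer fields contain no colons, which holds for the values listed above.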

### Current registration status

| Provider | Status | Evidence / next step |
|---|---|---|
| Gmail Postmaster Tools | Verified | TXT token published for `atmos.email`; dashboard live at postmaster.google.com. Reputation score needs ~48 h of sending volume to populate. |
| Microsoft SNDS | IP registered, authorization email landed via operator-forwarder | The enrollment flow required receiving a verification mail at `postmaster@atmos.email`, which was handled by the operator-forwarder routing described in Routine tasks, section 6. |
| Microsoft JMRP | Registered | FBL recipient `fbl@atmospheremail.com` accepted. First complaint probe will confirm the delivery path. |
| Yahoo CFL | Verified 2026-04-20 | Domain verified via TXT (`yahoo-verification-key=…`) at the atmos.email apex. Verification record is a no-op now; tracked for removal in chainlink #144. Complaints will arrive at `fbl@atmospheremail.com` once Yahoo begins sending. |

Adding a new provider later: publish the FBL recipient as
`fbl@atmospheremail.com` if they accept an external address, otherwise
a per-provider mailbox. Any new verification mail to an operator
alias is handled by the forwarder from Routine tasks, section 6, so no
code change is required.

### Gmail Postmaster Tools
- URL: https://postmaster.google.com
- Domain: atmos.email
- Verification: TXT record at atmos.email apex (Google provides token)
- Once verified → domain reputation + spam rate metrics visible in
  the dashboard. FBL signals fold into the Feedback-ID attribution.

### Microsoft JMRP (Junk Mail Reporting Program)
- URL: https://sendersupport.olc.protection.outlook.com/pm/Postmaster
- Register sender: atmos.email
- FBL recipient: fbl@atmospheremail.com
- Microsoft mails a verification token within 24–48 h.

### Yahoo CFL (Complaint Feedback Loop)
- URL: https://senders.yahooinc.com/complaint-feedback-loop/
- Form submission: domain atmos.email, DKIM selectors
  atmos<YYYYMMDD>{r,e}, FBL recipient fbl@atmospheremail.com, contact
  postmaster@atmospheremail.com.
- Manual review, 1–5 day turnaround.

### Verifying complaints flow end to end

```bash
# 1. Confirm inbound MX accepts mail for atmos.email (Porkbun fwds
#    the atmospheremail.com aliases; atmos.email MX is smtp.atmos.email).
ssh root@atmos-relay "nix-shell -p swaks --run 'swaks --server smtp.atmos.email --to postmaster@atmos.email --helo smtp.atmos.email --from test@atmos.email --body hi 2>&1 | tail -5'"

# 2. Watch inbound_messages for the ARF row.
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT id,classification,member_did,subject FROM inbound_messages ORDER BY id DESC LIMIT 5\"'"
```

---

## Known weirdnesses (discovered the hard way)

### Druid is gone, don't bring it back

The original Osprey stack included Druid for analytics. At our event
volume (<1000/day), Druid's publishing pipeline added 2–18 minutes of
query latency and 6 containers of operational surface with no upside.
Retired in homelab-nix PR #677. The relay now consumes
`osprey.execution_results` from Kafka directly into its own SQLite and
serves `/admin/events` from there. Don't resurrect Druid.

### Osprey worker Postgres sink is flaky

`OSPREY_EXECUTION_RESULT_STORAGE_BACKEND=postgres` does not reliably
write every event to Postgres. Events do reach Kafka reliably.
Don't trust `stored_execution_result` on little-nix's Postgres as a
source of truth for anything user-visible; the relay's
`relay_events` table is authoritative.

### Porkbun email forwarding needs FQDN EHLO

A send to Porkbun's MX with bare-hostname EHLO (e.g.
`EHLO atmos-relay`) gets `504 5.5.2 need fully-qualified hostname` and
silently bounces. swaks defaults to `hostname(1)`, which is
`atmos-relay` on the VPS. Pass `--ehlo smtp.atmos.email` for any
script that sends to Porkbun forwards.
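
Scripts that pick their EHLO name dynamically can guard against this with a cheap check. The sketch assumes "fully qualified" means "contains a dot between non-empty labels"; Porkbun's exact rule is not documented here, and `is_fqdn` is a hypothetical helper.

```bash
# Hypothetical helper: succeed only if the name looks fully qualified.
is_fqdn() {
  case "$1" in
    ""|.*|*..*) return 1 ;;   # empty, leading dot, or an empty label
    *.*) return 0 ;;          # at least one dot between labels
    *) return 1 ;;            # bare hostname such as "atmos-relay"
  esac
}

# Usage (not executed here): fall back to the known-good name.
#   EHLO_NAME=$(hostname)
#   is_fqdn "$EHLO_NAME" || EHLO_NAME=smtp.atmos.email
#   swaks --ehlo "$EHLO_NAME" ...
```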

### Tangled stale branch cache

The Tangled mirror's frontend caches branch lists outside the knot's
git state. Deleting a branch at the git level (ls-remote confirms)
does not cause Tangled's UI to forget it. Escalate to Tangled
operators for a manual reindex if the dropdown bothers you.

### Gitea `run: |` blocks use dash

Bash-only array syntax (`ALIASES=(a b c)`) fails inside a workflow's
`run:` step because the runner shells out to `sh`, which is `dash` on
Debian slim. Use POSIX patterns (`for x in a b c`).
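
As an illustration of the POSIX pattern, the loop below iterates over the operator aliases named in Routine tasks, section 6, without an array, so it runs unchanged under `dash` and `bash`. The function name is hypothetical.

```bash
# bash-only, breaks under dash:
#   ALIASES=(postmaster abuse fbl support hostmaster)
#   for a in "${ALIASES[@]}"; do ...; done

# POSIX equivalent: list the words directly in the for loop.
list_operator_addresses() (
  domain=$1
  for alias in postmaster abuse fbl support hostmaster; do
    printf '%s@%s\n' "$alias" "$domain"
  done
)
```

If the list has to be built at run time, `set -- a b c` followed by `for x in "$@"` gives array-like behaviour in plain `sh`.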

### Auth LOGIN vs PLAIN

The relay advertises only `AUTH PLAIN` after STARTTLS. swaks defaults
to LOGIN. Members pasting the quickstart verbatim get
`Auth not attempted, requested type not available`. The Step 5 swaks
block on the enroll success page now hard-codes `--auth PLAIN`.

### Em-dashes in Gitea workflow comments

One workflow we wrote broke with `import_bunny_record: command not
found` when a comment block between function use-sites contained an
em-dash (U+2014). The generator's tokenizer choked on it. Stick to
ASCII hyphens inside `run: |` blocks.
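
A pre-commit or CI check can catch this before the workflow runs. The sketch below flags every line that contains a byte outside printable ASCII and whitespace. It scans whole files rather than only `run: |` blocks, which is a deliberate simplification, and `find_non_ascii` is a hypothetical helper name.

```bash
# Hypothetical helper: report non-ASCII (or non-printable) bytes in the given files.
# Prints file:line:text for each offending line.
# Exit status follows grep: 0 if anything was found, 1 if all files are clean.
find_non_ascii() {
  # LC_ALL=C makes grep treat input as bytes, so a UTF-8 em-dash shows up
  # as three non-printable bytes. /dev/null forces the filename prefix
  # even when only one file is passed.
  LC_ALL=C grep -n '[^[:print:][:space:]]' "$@" /dev/null
}

# Usage (not executed here):
#   if find_non_ascii .gitea/workflows/*.yml; then
#     echo "non-ASCII characters found in workflow files" >&2
#     exit 1
#   fi
```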

### Templ `+` in text nodes

A templ file containing `<strong>Week 3+:</strong>` fails to parse
because templ treats the `+` as an expression terminator. Rephrase
("Week 3 onward") or wrap in raw string delimiters.

---

## Incident response

### "Mail is not being delivered"

```bash
# Order of investigation:
# 1. Is the relay running?
ssh root@atmos-relay "systemctl status atmos-relay --no-pager | head -5"

# 2. Is SMTP reachable from the member's vantage?
swaks --server smtp.atmos.email --port 587 --tls --quit-after FIRST-HELO

# 3. Did the member's AUTH succeed? (check relay journal)
ssh root@atmos-relay "journalctl -u atmos-relay --since '15 min ago' | grep smtp.auth"

# 4. Did the message reach Kafka? (quote the DID so the remote shell
#    does not treat <...> as a redirection)
ssh root@big-nix "docker exec osprey-kafka-XXX kafka-console-consumer --bootstrap-server localhost:9092 --topic osprey.actions_input --from-beginning --timeout-ms 5000 | grep '<did>'"

# 5. Is the downstream MX reachable?
ssh root@atmos-relay "timeout 10 nc -zv gmail-smtp-in.l.google.com 25"

# 6. Check delivery.result log line for the specific message_id.
```

### "Everything is in spam"

Expected on day 1 for a new member. See the Step 6 copy on the
enroll success page. Real mitigation:
1. Member engages recipients (open, not-spam, reply)
2. Confirm DKIM + SPF + DMARC all pass in a received message's headers
3. Check Google Postmaster Tools domain reputation; target is
   `Medium` or better by week 2
4. Check Microsoft SNDS for the relay IP's listing
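
Step 2 can be spot-checked by pasting the receiver's `Authentication-Results` header (RFC 8601) into a small filter. This is a rough sketch with a hypothetical function name. It does plain substring matching, so with two DKIM signatures on every message it reports success as soon as either one passes; read the header by hand when the per-signature result matters.

```bash
# Hypothetical helper: read an Authentication-Results header on stdin and
# succeed only if dkim, spf, and dmarc each show at least one "pass".
auth_results_all_pass() (
  header=$(cat)
  for method in dkim spf dmarc; do
    case "$header" in
      *"$method=pass"*) ;;                                   # found a pass
      *) echo "$method did not pass" >&2; exit 1 ;;
    esac
  done
)

# Usage (not executed here):
#   grep -i -A3 '^Authentication-Results:' message.eml | auth_results_all_pass
```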

### "Member is suspended but shouldn't be"

```bash
# Reactivate
curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/reactivate"

# Investigate what rule suspended them
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT verdicts, raw FROM relay_events WHERE sender_did = \\\"<did>\\\" AND action_name = \\\"member_suspended\\\" ORDER BY id DESC LIMIT 1\"'"
```

### "DKIM verification fails at the receiver"

```bash
# Confirm DNS has the record the relay signed with.
MEMBER_DOMAIN=scottlanoue.com
SELECTOR=atmos20260418
for v in r e; do
  echo "=== ${SELECTOR}${v}._domainkey.${MEMBER_DOMAIN} ==="
  dig +short TXT "${SELECTOR}${v}._domainkey.${MEMBER_DOMAIN}" @8.8.8.8
done

# Confirm the relay is using that selector.
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT domain, dkim_selector FROM member_domains WHERE domain = \\\"${MEMBER_DOMAIN}\\\"\"'"

# If selector in DB does not match DNS, either publish the correct
# selector or trigger a member re-enrollment to sync them.
```

---

## Deploy + rollback

The relay auto-deploys on merge to `main` via Gitea Actions (see
`.gitea/workflows/relay-deploy.yml`). Rollback path:

```bash
# Revert the offending commit and push
cd /Users/scottlanoue/repos/atmosphere-mail
git revert <sha>
git push gitea main

# Or deploy a specific prior tag
gh pr create --title "Rollback to <sha>" --body "..." && ...
```

Relay restart is zero-downtime for accepted-but-unflushed mail because
messages are persisted in the local queue dir before handoff; the
queue is drained by the new process on boot. In-flight SMTP
connections die and the sending client retries.

---

## Operator accounts

| Service | Login | Purpose |
|---|---|---|
| Hetzner | scott@... | Relay VPS hosting |
| Bunny DNS | scott@... | atmos.email + scottlanoue.com + tryfamilia.com zones |
| Porkbun | scott@... | atmospheremail.com + threadline.tools |
| Gmail Postmaster | atmospheremail.ops@gmail.com | FBL for atmos.email |
| Microsoft JMRP | atmospheremail.ops@outlook.com | FBL for atmos.email |
| Yahoo CFL | form-based (no account) | FBL for atmos.email |
| Proton | atmosphere.support@scottlanoue.com | All operational inbound mail |

Keep all of these in the password manager. Grant co-operators access
via a password-manager share, not by resetting passwords.

---

## What this runbook doesn't cover

- **atproto labeler operations**: separate stack (little-nix), its own
  state (`state_dir/labels.db`). Labeler is mostly self-healing via
  jetstream retries; restart on crash and it catches up.
- **Kafka/Osprey worker maintenance**: documented in homelab-nix.
  Single node, KRaft mode, 30-day retention. Run
  `kafka-consumer-groups --describe` to spot stuck consumers.
- **Atmosphere Office**: entirely separate product. See
  `SPEC-atmosphere-office.md`.
432 `SPEC-atmosphere-office.md`.