Cooperative email for PDS operators

Atmosphere Mail Operator Runbook

Runbook for operators of an Atmosphere Mail relay. Covers routine maintenance, common incidents, and the "known weirdness" that has bitten us during development, so future sessions don't waste time rediscovering it.

Scope: the relay service itself, the relay's supporting DNS, the three provider FBL integrations, and the Osprey worker and labeler where they touch the relay. Does not cover atproto PDS operation (separate stack); other exclusions are listed at the end of this runbook.


Topology one-liner

member SMTP client (swaks, app)
        │  587/STARTTLS + SASL PLAIN
        ▼
atmos-relay (Hetzner VPS, 87.99.138.77, Tailscale)
  │  signs mail with d=<member-domain> (alignment) + d=atmos.email (pool FBL)
  │  stamps X-Atmos-Member-Did + Feedback-ID
  │  emits relay_attempt → Kafka
  ▼
fan-out to recipient MX
  │
  └─► Kafka (big-nix) ──► Osprey worker (SML rules) ──► Kafka (execution_results)
                                                          │
                                                          ▼
                                         atmos-relay consumer goroutine
                                                          │
                                                          ▼
                                            relay.sqlite.relay_events
                                                          │
                                                          ▼
                                         /admin/events, /admin/members/{did}

FBL complaints (ARF) and bounces (DSN) arrive at the smtp.atmos.email MX. Each one is classified, written to relay.sqlite.inbound_messages, attributed to the originating member via X-Atmos-Member-Did, and fired as a complaint_received or bounce_received Osprey event.
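
When triaging a raw inbound message by hand, the ARF/DSN split can be approximated from the shell. The sketch below is not the relay's classifier; it is a minimal stand-in that keys off the report-type parameter of the top-level Content-Type header, which is feedback-report for ARF (RFC 5965) and delivery-status for DSNs (RFC 3464). The function name classify_inbound and its output labels are illustrative assumptions.

```shell
# Hypothetical triage helper: classify a raw RFC 5322 message file by the
# report-type parameter of its top-level Content-Type header.
# $1 = path to the raw message.
classify_inbound() {
  # Take the header block only, unfold continuation lines, then normalise
  # the Content-Type header (lowercase, no whitespace or quotes).
  ctype=$(tr -d '\r' < "$1" | sed '/^$/q' | awk '
      /^[ \t]/ { buf = buf " " $0; next }
      { if (buf != "") print buf; buf = $0 }
      END { if (buf != "") print buf }' \
    | grep -i '^Content-Type:' | head -n 1 \
    | tr '[:upper:]' '[:lower:]' | tr -d ' \t"')
  case $ctype in
    *report-type=feedback-report*) echo complaint ;;
    *report-type=delivery-status*) echo bounce ;;
    *) echo other ;;
  esac
}
```

For example, `classify_inbound /tmp/sample.eml` prints complaint, bounce, or other; the path is a placeholder for wherever you saved the message.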


Routine tasks

1. Rotate operator DKIM key (atmos.email)

The operator keypair signs every outbound message with d=atmos.email. Rotate it on suspected private-key compromise, on the regular rotation cadence (~annual here; M3AAWG's DKIM key rotation guidance recommends at least every six months), or whenever you want to force receivers to re-evaluate the key.

# 1. On atmos-relay, remove the key file. The relay generates a fresh
#    keypair on next startup if the file is missing.
ssh root@atmos-relay rm /var/lib/atmos-relay/operator-dkim-keys.json
ssh root@atmos-relay systemctl restart atmos-relay

# 2. Grep the journal for the new public halves.
ssh root@atmos-relay "journalctl -u atmos-relay --since '2 min ago' | grep operator_dkim.first_boot"

# 3. Publish the new records at atmos<YYYYMMDD>{r,e}._domainkey.atmos.email
#    in Bunny (edit dns-infrastructure/domains.tf, open PR).

# 4. Leave the OLD records in DNS for ~48 h. Any mail signed with the
#    prior key still in flight needs its public half to verify. After
#    48 h, drop the old records in a follow-up PR.
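
During the 48 h overlap both the old and the new selector pairs should resolve. The helper below is a sketch for checking that; it assumes the r suffix is the RSA key and the e suffix the Ed25519 key (as in the member-key section), and the function names are made up for illustration. The DNS check needs dig and network access.

```shell
# Print the two selector record names for a rotation date.
# $1 = date as YYYYMMDD, $2 = signing domain (defaults to atmos.email).
selector_records() {
  for variant in r e; do
    printf 'atmos%s%s._domainkey.%s\n' "$1" "$variant" "${2:-atmos.email}"
  done
}

# Check that every operator record for the given dates has a TXT answer.
# Prints one line per record name, prefixed ok or MISSING.
check_rotation_overlap() {
  for day in "$@"; do
    for name in $(selector_records "$day"); do
      if [ -n "$(dig +short TXT "$name" @8.8.8.8)" ]; then
        echo "ok      $name"
      else
        echo "MISSING $name"
      fi
    done
  done
}
```

Call it with the old and new dates, for example `check_rotation_overlap 20250418 20260418` (both dates are placeholders), and drop the old records only once the overlap window has passed.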

2. Rotate member DKIM key

Triggered by member re-enrollment or explicit rotation request.

# Member re-enrolls via /enroll wizard. The relay generates a fresh
# RSA + Ed25519 pair, returns the public halves on the success page,
# hashes the API key, and stores the private halves in relay.sqlite
# member_domains.dkim_{rsa,ed}_privkey columns.
#
# Operator publishes the two TXT records at
#   atmos<YYYYMMDD>r._domainkey.<member-domain>
#   atmos<YYYYMMDD>e._domainkey.<member-domain>
#
# Labeler (on little-nix) verifies DKIM-in-DNS before issuing
# verified-mail-operator. It retries on failure, so publish → ≤ 60 s
# → labeled.

3. Approve a pending member

Pending members exist but cannot send (rejection: 535 5.7.8 pending operator approval). Approval is manual — shared-IP reputation requires operator scrutiny of every new sender.

# Via admin API (Tailscale-only)
ADMIN=<admin-token-from-sops>
curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/approve"

# Or via the /admin/members dashboard, click Approve on the pending row.

Before approving: check /admin/members/<did> (the rich detail page). Green on DKIM, attestation, labels, reasonable send-activity history → approve. Anything suspicious (scraped-looking domain, no attestation record, broken DKIM) → hold and investigate.

4. Emergency suspend a member

curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/suspend?reason=<url-encoded>"

SMTP submission from that DID is rejected immediately with 535 5.7.8 account suspended. An Osprey relay_rejected event with reject_reason=suspended fires for each attempt, so the Osprey UI surfaces suspension activity.
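
The reason parameter has to be percent-encoded before it goes into the query string. A minimal sketch of doing that in plain POSIX sh follows; urlencode and suspend_url are hypothetical helper names, and the encoder is only intended for ASCII reasons.

```shell
# Percent-encode a string, leaving only RFC 3986 unreserved characters as-is.
# Runs in a subshell so the LC_ALL override does not leak to the caller.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}           # everything after the first character
    c=${s%"$rest"}        # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Build the suspend URL used in this section.
# $1 = member DID, $2 = free-text reason.
suspend_url() {
  printf 'http://atmos-relay.internal.example:8080/admin/member/%s/suspend?reason=%s\n' \
    "$1" "$(urlencode "$2")"
}
```

Usage: `curl -X POST -H "Authorization: Bearer $ADMIN" "$(suspend_url "$DID" "complaint spike")"`, where DID holds the member's DID. If you would rather not hand-roll the encoding, curl's own `-G --data-urlencode "reason=..."` combined with `-X POST` does the same job.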

5. Tune warming tier caps

Defaults live in internal/relay/warming.go DefaultWarmingConfig:

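| Tier    | Age     | Hourly cap                      | Daily cap                      |
|---------|---------|---------------------------------|--------------------------------|
| warming | 0..7 d  | 5                               | 20                             |
| ramping | 7..14 d | 20                              | 100                            |
| warmed  | 14+ d   | member's configured HourlyLimit | member's configured DailyLimit |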

Tier caps compose with per-member limits via min() — a tighter per-member configured cap always wins; tier caps never raise a member's limit.

Two labels override the tier:

  • highly_trusted (issued by Osprey's reputation rules) skips warming entirely.
  • burst_warming halves the effective hourly cap during the current window even for warmed senders.

To change the defaults globally, edit DefaultWarmingConfig() and redeploy. For per-member exceptions, keep using the existing HourlyLimit/DailyLimit columns; they compose with the tier caps via min() as described above.
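
To sanity-check what cap a member should be seeing, the rules in this section can be written out as a small calculator. This is a sketch of the documented behaviour, not a port of warming.go. It assumes a boundary day (7 or 14) belongs to the later tier and that halving uses integer division; neither detail is stated above.

```shell
# Effective hourly cap for a member, following the rules in this section.
# $1 = account age in whole days
# $2 = member's configured HourlyLimit
# $3 = optional label: highly_trusted or burst_warming
effective_hourly_cap() {
  age=$1
  member=$2
  label=${3:-}
  if [ "$label" = highly_trusted ] || [ "$age" -ge 14 ]; then
    cap=$member          # warmed, or warming skipped entirely
  elif [ "$age" -ge 7 ]; then
    cap=20               # ramping tier
  else
    cap=5                # warming tier
  fi
  # Tier caps never raise a limit: take min(tier cap, member cap).
  if [ "$cap" -gt "$member" ]; then cap=$member; fi
  # burst_warming halves the effective hourly cap, even for warmed senders.
  if [ "$label" = burst_warming ]; then cap=$((cap / 2)); fi
  echo "$cap"
}
```

For a 3-day-old member with HourlyLimit 50 this prints 5; the same member with HourlyLimit 2 gets 2, because the tighter per-member cap wins.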

6. Route operator-classified mail to an external inbox

Mail addressed to operator aliases on atmos.email (postmaster@, abuse@, fbl@, support@, hostmaster@) is normally accepted and dropped at the relay — the inbound classifier logs it and moves on. When operatorForwardTo is set in the relay config, those messages are instead re-sent through the same reply-forwarder used for per-member reply routing, landing in whatever inbox you configure.

This is the escape hatch for provider authorization flows: Microsoft SNDS enrollment, Google Postmaster domain ownership emails, abuse contact probes from CERTs, and anything else that expects a human to read mail sent to postmaster.

# Current forward destination (NixOS-managed)
ssh root@atmos-relay "cat /var/lib/atmos-relay/config.json | jq -r .operatorForwardTo"

# Change it: edit infra/nixos/default.nix, commit, let CI deploy.
# The field is config.json `operatorForwardTo`, set from the NixOS
# module. Restart is not zero-downtime for in-flight SMTP but the
# queue drains on boot.

Wiring lives in internal/relay/inbound.go (processOperator handler) and cmd/relay/main.go (SetOperatorForwarding call). When the field is empty, the old accept-and-log behaviour is preserved. Verified by TestInbound_OperatorForward_DeliversPostmasterMail and TestInbound_OperatorForward_DisabledPreservesAcceptAndDrop.

Testing the forward end-to-end:

# Send a test to any operator alias.
swaks --server smtp.atmos.email --ehlo smtp.atmos.email \
      --from ops-probe@example.com --to postmaster@atmos.email \
      --header "Subject: operator-forward smoke test" --body "ping"

# Confirm the forward ran.
ssh root@atmos-relay "journalctl -u atmos-relay --since '2 min ago' \
  | grep -E 'inbound\.operator|reply_forwarder\.sent'"

The forwarded message preserves the original From: and Subject: headers so the downstream inbox sees the provider's original sender (e.g. notifications@microsoft.com), not the relay.
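
To exercise every alias rather than just postmaster@, the smoke test can be looped. A sketch, assuming the five aliases listed at the top of this section are the complete set; the function names are illustrative, and the probe needs swaks and network access.

```shell
# The five operator aliases named in this section, one address per line.
operator_aliases() {
  for local_part in postmaster abuse fbl support hostmaster; do
    printf '%s@atmos.email\n' "$local_part"
  done
}

# Send one probe per alias, reusing the swaks invocation from above.
probe_operator_aliases() {
  for rcpt in $(operator_aliases); do
    swaks --server smtp.atmos.email --ehlo smtp.atmos.email \
          --from ops-probe@example.com --to "$rcpt" \
          --header "Subject: operator-forward smoke test ($rcpt)" --body "ping" \
      || echo "probe failed for $rcpt" >&2
  done
}
```

Putting the recipient in the subject line makes it easy to see in the destination inbox which aliases arrived.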


FBL integrations

Atmosphere Mail runs one pool-level FBL registration for each of the three providers that offer one, anchored to the operator-DKIM domain (atmos.email) and the Feedback-ID mailer field (atmosphere-mail). Complaints flow to fbl@atmospheremail.com (Porkbun forwarder → Proton alias atmosphere.support@scottlanoue.com).

Every outbound message carries:

  • DKIM d=member-domain (DMARC alignment with From:)
  • DKIM d=atmos.email (pool FBL routing)
  • X-Atmos-Member-Did: (attribution)
  • Feedback-ID: transactional:::atmosphere-mail (Gmail)
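
A quick way to confirm a delivered message carries all four markers is to save its raw source and inspect the header block. The helper below is a sketch with hypothetical function names. It unfolds headers first, and its DKIM check is a simplification that expects the d= tag to sit between other tags, as it does in typical signatures.

```shell
# Print the unfolded header block of a raw message file.
unfold_headers() {
  tr -d '\r' < "$1" | sed '/^$/q' | awk '
    /^[ \t]/ { buf = buf " " $0; next }
    { if (buf != "") print buf; buf = $0 }
    END { if (buf != "") print buf }'
}

# Succeed if some DKIM-Signature header carries d=<domain>.
has_dkim_domain() {
  unfold_headers "$1" | grep -i '^DKIM-Signature:' | tr -d ' \t' \
    | grep -qiF ";d=$2;"
}

# Succeed if the named header is present.
has_header() {
  unfold_headers "$1" | grep -qi "^$2:"
}

# Run a check and print its label prefixed with ok or MISSING.
check_marker() {
  label=$1; shift
  if "$@"; then echo "ok      $label"; else echo "MISSING $label"; fi
}

# $1 = raw message file, $2 = member domain expected in the aligned signature.
check_outbound_markers() {
  check_marker "DKIM d=$2"          has_dkim_domain "$1" "$2"
  check_marker "DKIM d=atmos.email" has_dkim_domain "$1" atmos.email
  check_marker "X-Atmos-Member-Did" has_header "$1" X-Atmos-Member-Did
  check_marker "Feedback-ID"        has_header "$1" Feedback-ID
}
```

Example: `check_outbound_markers message.eml example.org`, where both arguments are placeholders. This only checks that the markers are present; whether the signatures verify is a separate question for the receiver's Authentication-Results.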

Current registration status

| Provider | Status | Evidence / next step |
|----------|--------|----------------------|
| Gmail Postmaster Tools | Verified | TXT token published for atmos.email; dashboard live at postmaster.google.com. Reputation score needs ~48 h of sending volume to populate. |
| Microsoft SNDS | IP registered, authorization email landed via operator-forwarder | The enrollment flow required receiving a verification mail at postmaster@atmos.email, handled by the operator-forwarder routing described in section 6. |
| Microsoft JMRP | Registered | FBL recipient fbl@atmospheremail.com accepted. First complaint probe will confirm the delivery path. |
| Yahoo CFL | Verified 2026-04-20 | Domain verified via TXT (yahoo-verification-key=…) at the atmos.email apex. Verification record is a no-op now; tracked for removal in chainlink #144. Complaints will arrive at fbl@atmospheremail.com once Yahoo begins sending. |

Adding a new provider later: publish the FBL recipient as fbl@atmospheremail.com if they accept an external address, otherwise a per-provider mailbox. Any new verification mail to an operator alias is handled by the forwarder from section 6 — no code change required.

Gmail Postmaster Tools

  • URL: https://postmaster.google.com
  • Domain: atmos.email
  • Verification: TXT record at atmos.email apex (Google provides token)
  • Once verified → domain reputation + spam rate metrics visible in the dashboard. FBL signals fold into the Feedback-ID attribution.

Microsoft JMRP (Junk Mail Reporting Program)

Yahoo CFL (Complaint Feedback Loop)

Verifying complaints flow end to end

# 1. Confirm inbound MX accepts mail for atmos.email (Porkbun fwds
#    the atmospheremail.com aliases; atmos.email MX is smtp.atmos.email).
ssh root@atmos-relay "nix-shell -p swaks --run 'swaks --server smtp.atmos.email --to postmaster@atmos.email --helo smtp.atmos.email --from test@atmos.email --body hi 2>&1 | tail -5'"

# 2. Check the newest inbound_messages rows. A real FBL complaint shows up
#    as an ARF-classified row with member_did set.
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT id,classification,member_did,subject FROM inbound_messages ORDER BY id DESC LIMIT 5\"'"

Known weirdnesses (discovered the hard way)

Druid is gone, don't bring it back

The original Osprey stack included Druid for analytics. At our event volume (<1000/day), Druid's publishing pipeline added 2–18 minutes of query latency and 6 containers of operational surface with no upside. Retired in homelab-nix PR #677. The relay now consumes osprey.execution_results from Kafka directly into its own SQLite and serves /admin/events from there. Don't resurrect Druid.

Osprey worker Postgres sink is flaky

OSPREY_EXECUTION_RESULT_STORAGE_BACKEND=postgres does not reliably write every event to Postgres. Events do reach Kafka reliably. Don't trust stored_execution_result on little-nix's Postgres as a source of truth for anything user-visible; the relay's relay_events table is authoritative.

Porkbun email forwarding needs FQDN EHLO

A send to Porkbun's MX with bare-hostname EHLO (e.g. EHLO atmos-relay) gets 504 5.5.2 need fully-qualified hostname and silently bounces. swaks defaults to hostname(1), which is atmos-relay on the VPS. Pass --ehlo smtp.atmos.email for any script that sends to Porkbun forwards.
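
Scripts that send to Porkbun forwards can guard against this before swaks ever connects. The wrapper below is a sketch: is_fqdn and send_with_fqdn_ehlo are hypothetical names, EHLO_NAME is an illustrative variable, and the check is deliberately simple (at least two non-empty dot-separated labels, and a trailing dot is rejected).

```shell
# Succeed if the name has at least two non-empty dot-separated labels.
is_fqdn() {
  case $1 in
    .*|*.|*..*) return 1 ;;   # leading dot, trailing dot, or empty label
    *.*) return 0 ;;
    *) return 1 ;;            # bare hostname such as atmos-relay
  esac
}

# Refuse to call swaks with a bare-hostname EHLO. Extra arguments are
# passed through to swaks unchanged.
send_with_fqdn_ehlo() {
  ehlo=${EHLO_NAME:-$(hostname)}
  if ! is_fqdn "$ehlo"; then
    echo "refusing to send: EHLO name '$ehlo' is not fully qualified" >&2
    return 1
  fi
  swaks --ehlo "$ehlo" "$@"
}
```

On the VPS, exporting EHLO_NAME=smtp.atmos.email first lets the wrapper pass; without it the bare hostname trips the guard instead of producing a silent bounce.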

Tangled stale branch cache

The Tangled mirror's frontend caches branch lists outside the knot's git state. Deleting a branch at the git level (ls-remote confirms) does not cause Tangled's UI to forget it. Escalate to Tangled operators for a manual reindex if the dropdown bothers you.

Gitea run: | blocks use dash

Bash-only array syntax (ALIASES=(a b c)) fails inside a workflow's run: step because the runner shells out to sh, which is dash on Debian slim. Use POSIX patterns (for x in a b c).
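
A side-by-side of the failing form and two POSIX replacements, as a small sketch (the function names are only for illustration):

```shell
# Bash-only; dash rejects the parenthesis with a syntax error:
#   ALIASES=(postmaster abuse fbl)
#   for a in "${ALIASES[@]}"; do echo "$a"; done

# POSIX option 1: iterate over a literal word list.
list_aliases() {
  for a in postmaster abuse fbl; do
    printf '%s@atmos.email\n' "$a"
  done
}

# POSIX option 2: the positional parameters are the one array sh provides.
# set -- inside a function only replaces that call's parameters.
count_aliases() {
  set -- postmaster abuse fbl
  echo "$#"
}
```

The second form is the one to reach for when items can contain spaces, since "$@" preserves each element as a single word.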

Auth LOGIN vs PLAIN

The relay advertises only AUTH PLAIN after STARTTLS. swaks defaults to LOGIN. Members pasting the quickstart verbatim get Auth not attempted, requested type not available. The Step 5 swaks block on the enroll success page now hard-codes --auth PLAIN.
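
For debugging a client that cannot be switched to swaks, it helps to know what AUTH PLAIN actually sends: a base64 string of an empty authorization identity, the username, and the password, with a NUL byte before each of the last two (RFC 4616). A minimal sketch for building that token by hand; sasl_plain_token is a hypothetical helper name.

```shell
# Build the base64 initial response for AUTH PLAIN.
# $1 = username, $2 = password (the member's API key).
sasl_plain_token() {
  printf '\000%s\000%s' "$1" "$2" | base64 | tr -d '\n'
}

# Equivalent check with swaks, stopping right after authentication.
# <username> and <api-key> are placeholders for the member's credentials:
#   swaks --server smtp.atmos.email --port 587 --tls \
#         --auth PLAIN --auth-user '<username>' --auth-password '<api-key>' \
#         --quit-after AUTH
```

The token is what follows `AUTH PLAIN ` on the wire. It is only an encoding, not encryption, so treat it like the API key itself.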

Em-dashes in Gitea workflow comments

One workflow we wrote broke with import_bunny_record: command not found when a comment block between function use-sites contained an em-dash (U+2014). Generator's tokenizer choked on it. Stick to ASCII hyphens inside run: | blocks.

Templ + in text nodes

A templ file containing <strong>Week 3+:</strong> fails to parse — templ treats the + as an expression terminator. Rephrase ("Week 3 onward") or wrap in raw string delimiters.


Incident response

"Mail is not being delivered"

# Order of investigation:
# 1. Is the relay running?
ssh root@atmos-relay "systemctl status atmos-relay --no-pager | head -5"

# 2. Is SMTP reachable from the member's vantage?
swaks --server smtp.atmos.email --port 587 --tls --quit-after FIRST-HELO

# 3. Did the member's AUTH succeed? (check relay journal)
ssh root@atmos-relay "journalctl -u atmos-relay --since '15 min ago' | grep smtp.auth"

# 4. Did the message reach Kafka?
ssh root@big-nix "docker exec osprey-kafka-XXX kafka-console-consumer --bootstrap-server localhost:9092 --topic osprey.actions_input --from-beginning --timeout-ms 5000 | grep '<did>'"

# 5. Is the downstream MX reachable?
ssh root@atmos-relay "timeout 10 nc -zv gmail-smtp-in.l.google.com 25"

# 6. Check delivery.result log line for the specific message_id.
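
Step 6 can be done with a plain filter over the journal. The sketch below relies only on the line containing the literal text delivery.result and the message id; it makes no other assumption about the log format, and the function name is illustrative.

```shell
# Read log lines on stdin; keep delivery.result lines mentioning the id.
# $1 = message id to look for (matched as a fixed string).
delivery_results_for() {
  grep -F 'delivery.result' | grep -F -- "$1"
}

# Usage, with <message-id> as a placeholder:
#   ssh root@atmos-relay "journalctl -u atmos-relay --since '1 hour ago'" \
#     | delivery_results_for '<message-id>'
```

Matching with grep -F avoids surprises from the dots and angle brackets that message ids usually contain.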

"Everything is in spam"

Expected on day 1 for a new member. See the Step 6 copy on the enroll success page. Real mitigation:

  1. Member gets recipients to engage (open, mark as not-spam, reply)
  2. Confirm DKIM + SPF + DMARC all pass in a received message's headers
  3. Check Google Postmaster Tools domain reputation; target is Medium or better by week 2
  4. Check Microsoft SNDS for the relay IP's listing
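
For item 2, the receiver's verdicts are recorded in the Authentication-Results header (RFC 8601) of the delivered message. A sketch for pulling them out of a saved raw message follows; the function names are hypothetical, and it uses grep -o, which GNU and BSD grep both support.

```shell
# Print the dkim/spf/dmarc verdicts from Authentication-Results headers,
# one per line (for example dkim=pass).
# $1 = raw message file saved from the recipient's mailbox.
auth_verdicts() {
  tr -d '\r' < "$1" | sed '/^$/q' | awk '
      /^[ \t]/ { buf = buf " " $0; next }
      { if (buf != "") print buf; buf = $0 }
      END { if (buf != "") print buf }' \
    | grep -i '^Authentication-Results:' \
    | grep -oiE '(dkim|spf|dmarc)=[a-z]+'
}

# Succeed only if each mechanism has at least one pass verdict.
all_auth_pass() {
  verdicts=$(auth_verdicts "$1")
  for mech in dkim spf dmarc; do
    printf '%s\n' "$verdicts" | grep -qix "$mech=pass" || return 1
  done
}
```

Because every message is signed twice, expect two dkim= entries; the check passes as long as at least one of them is pass, so read the full auth_verdicts output when alignment is in question.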

"Member is suspended but shouldn't be"

# Reactivate
curl -X POST -H "Authorization: Bearer $ADMIN" \
  "http://atmos-relay.internal.example:8080/admin/member/<did>/reactivate"

# Investigate what rule suspended them
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT verdicts, raw FROM relay_events WHERE sender_did = \\\"<did>\\\" AND action_name = \\\"member_suspended\\\" ORDER BY id DESC LIMIT 1\"'"

"DKIM verification fails at the receiver"

# Confirm DNS has the record the relay signed with.
MEMBER_DOMAIN=scottlanoue.com
SELECTOR=atmos20260418
for v in r e; do
  echo "=== ${SELECTOR}${v}._domainkey.${MEMBER_DOMAIN} ==="
  dig +short TXT "${SELECTOR}${v}._domainkey.${MEMBER_DOMAIN}" @8.8.8.8
done

# Confirm the relay is using that selector.
ssh root@atmos-relay "nix-shell -p sqlite --run 'sqlite3 /var/lib/atmos-relay/relay.sqlite \"SELECT domain, dkim_selector FROM member_domains WHERE domain = \\\"${MEMBER_DOMAIN}\\\"\"'"

# If selector in DB does not match DNS, either publish the correct
# selector or trigger a member re-enrollment to sync them.
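
A record can exist in DNS and still fail verification if its p= tag is empty, which marks the key as revoked (RFC 6376, section 3.6.1), or if a long key was split into chunks and pasted back incorrectly. The sketch below joins the quoted chunks that dig +short prints and checks for a non-empty p=; the helper names are illustrative.

```shell
# Join the quoted chunks dig +short prints for a long TXT record.
# Reads dig output on stdin, writes the bare record value.
join_txt_chunks() {
  sed -e 's/" "//g' -e 's/^"//' -e 's/"$//'
}

# Succeed if the record value carries a non-empty p= tag.
dkim_record_has_key() {
  printf '%s\n' "$1" | tr -d ' \t' | tr ';' '\n' | grep -Eq '^p=.+'
}

# Usage with the variables from the loop above:
#   rec=$(dig +short TXT "${SELECTOR}r._domainkey.${MEMBER_DOMAIN}" @8.8.8.8 | join_txt_chunks)
#   dkim_record_has_key "$rec" || echo "record missing or key revoked"
```

This checks only that key material is published, not that it matches the private half stored in member_domains.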

Deploy + rollback

The relay auto-deploys on merge to main via Gitea Actions (see .gitea/workflows/relay-deploy.yml). Rollback path:

# Revert the offending commit and push
cd /Users/scottlanoue/repos/atmosphere-mail
git revert <sha>
git push gitea main

# Or deploy a specific prior tag
gh pr create --title "Rollback to <sha>" --body "..." && ...

Relay restart is zero-downtime for accepted-but-unflushed mail because messages are persisted in the local queue dir before handoff; the queue is drained by the new process on boot. In-flight SMTP connections die and the sending client retries.


Operator accounts

| Service | Login | Purpose |
|---------|-------|---------|
| Hetzner | scott@... | Relay VPS hosting |
| Bunny DNS | scott@... | atmos.email + scottlanoue.com + tryfamilia.com zones |
| Porkbun | scott@... | atmospheremail.com + threadline.tools |
| Gmail Postmaster | atmospheremail.ops@gmail.com | FBL for atmos.email |
| Microsoft JMRP | atmospheremail.ops@outlook.com | FBL for atmos.email |
| Yahoo CFL | form-based (no account) | FBL for atmos.email |
| Proton | atmosphere.support@scottlanoue.com | All operational inbound mail |

Keep all of these in the password manager. Grant co-operators access via a password-manager share, not by resetting passwords.


What this runbook doesn't cover

  • atproto labeler operations: separate stack (little-nix), its own state (state_dir/labels.db). Labeler is mostly self-healing via jetstream retries; restart on crash and it catches up.
  • Kafka/Osprey worker maintenance: documented in homelab-nix. Single node, KRaft mode, 30-day retention. Run kafka-consumer-groups --describe to spot stuck consumers.
  • Atmosphere Office: entirely separate product. See SPEC-atmosphere-office.md.