···4848 working-directory: osprey/tests
4949 run: |
5050 set -eu
5151- docker compose -f docker-compose.test.yml -p osprey-test up -d
5151+ docker-compose -f docker-compose.test.yml -p osprey-test up -d
5252 # Surface the initial layout prep so a rules-layout mistake is
5353 # obvious in the log rather than hiding behind a timeout.
5454- docker compose -f docker-compose.test.yml -p osprey-test logs rules-prep
5454+ docker-compose -f docker-compose.test.yml -p osprey-test logs rules-prep
55555656 - name: Wait for worker to consume
5757 run: |
···6060 # osprey's Kafka consumer init. 90s cap; past that the stack
6161 # is wedged on something structural.
6262 for i in $(seq 1 90); do
6363- if docker compose -f osprey/tests/docker-compose.test.yml -p osprey-test \
6363+ if docker-compose -f osprey/tests/docker-compose.test.yml -p osprey-test \
6464 logs osprey-worker 2>&1 | grep -q -iE "starting to consume|subscribed|consumer started"; then
6565 echo "worker is consuming"
6666 exit 0
···6868 sleep 1
6969 done
7070 echo "worker did not reach consume state within 90s — logs follow:"
7171- docker compose -f osprey/tests/docker-compose.test.yml -p osprey-test logs osprey-worker | tail -100
7171+ docker-compose -f osprey/tests/docker-compose.test.yml -p osprey-test logs osprey-worker | tail -100
7272 exit 1
73737474 - name: Install harness deps
···8484 - name: Dump worker logs on failure
8585 if: failure()
8686 run: |
8787- docker compose -f osprey/tests/docker-compose.test.yml -p osprey-test logs osprey-worker | tail -200
8787+ docker-compose -f osprey/tests/docker-compose.test.yml -p osprey-test logs osprey-worker | tail -200
88888989 - name: Tear down test stack
9090 if: always()
9191 run: |
9292- docker compose -f osprey/tests/docker-compose.test.yml -p osprey-test down -v || true
9292+ docker-compose -f osprey/tests/docker-compose.test.yml -p osprey-test down -v || true
+27
CHANGELOG.md
···6677## [Unreleased]
8899+### Added
1010+- Queue.DeliverFunc injection point + dispatch lifecycle test (#228, installment 4). Production change: new `QueueConfig.DeliverFunc` field defaulting to the existing `deliverMessage` (real MX lookup + SMTP). Any caller that doesn't set the field keeps the original behavior — `cmd/relay/main.go` doesn't set it, so production is unchanged. New integration test `TestIntegration_QueueDispatchesViaDeliverFunc` injects a fake delivery function to assert the full lifecycle: SMTP submit → onAccept → Queue.Enqueue → Queue.Run() worker dispatches → injected DeliverFunc fires → onDelivery callback receives a "sent" terminal result. Closes the unit-test gap that previously could only be filled by mocking DNS or running a fake SMTP at the MX-lookup edge
1111+- Multi-recipient + capacity pre-check tests added to the integration harness (#228, installment 3). Two new tests: (a) `TestIntegration_SMTPSubmit_MultiRecipient` drives a 3-recipient submission, asserts all three round-trip through Store + Queue, and pins the AggregateRecipientOutcomes contract (succeeded=3, failed=0, retryAll=false); (b) `TestIntegration_SMTPSubmit_CapacityPreCheckRejectsBatch` pins the #226 invariant — when `HasCapacity(len(to))` returns false, the WHOLE batch must be rejected with 451 BEFORE any Store.InsertMessage runs, preventing the duplicate-delivery scenario where M of N recipients persist then the client retries. Zero production code touched
1212+- Suppression-list test layer added to the integration harness (#228, installment 2). Two new tests in `internal/relay/integration_smoke_test.go`: (a) `TestIntegration_SMTPSubmit_SuppressionDropsRecipient` pre-inserts a suppression and submits to one suppressed + one clean recipient, asserting only the clean one round-trips through Store + Queue while the suppressed one drops silently — the exact behavior `cmd/relay/main.go` lines 648-681 implements; (b) `TestIntegration_SMTPSubmit_AllSuppressedRejects` covers the boundary where every RCPT TO has a live suppression and the SMTP submit returns 550. Zero production code touched
1313+- First installment of the cross-component SMTP integration harness (#228). New `internal/relay/integration_smoke_test.go` wires real `Store` + `RateLimiter` + `Queue` + `SMTPServer` together — the same shape `cmd/relay/main()` builds — and asserts that one SMTP submission flows all the way from AUTH → RCPT → DATA → onAccept → `Store.InsertMessage` → `Queue.Enqueue`. Acts as a tripwire for cross-component contract drift (Queue.Enqueue signature, MemberLookupFunc shape, OnAcceptFunc parameters) ahead of the larger #217 cmd/relay refactor. Zero-risk additive change — no production code touched. Subsequent PRs will layer in suppression, partial-delivery aggregation, real fake-SMTP delivery, and admin enroll-approval → SMTP-AUTH-with-new-credentials
1414+- Content spray detection promoted from shadow → live enforcement (#196). The fingerprint pipeline (sha256 over normalized subject+body, `relay_events.content_fingerprint` index, `Store.GetSameContentRecipientsSince` query, Osprey `same_content_recipients_last_hour` enrichment) was already wired; this PR removes the `shadow:` prefix from the labels and adds a `DeclareVerdict(verdict='reject')` to `ExtremeContentSpray`. Two-tier policy: `ContentSpray` (15+ same-content recipients/hr → 12h observational `content_spray` label, no verdict) and `ExtremeContentSpray` (50+ → 3-day `content_spray_extreme` label + 550 reject). Bake-in audit before promotion confirmed zero `shadow:content_spray*` firings against Osprey's `entity_labels` table across the entire shadow window. Two new test fixtures under `osprey/tests/fixtures/` cover the moderate (label-only) and extreme (label+reject) paths. Privacy: only the sha256 hash + scalar count cross the relay→Osprey boundary; recipient addresses and body content stay relay-side
1515+- Periodic PLC tombstone check (#248). New `internal/scheduler.TombstoneChecker` runs daily, polls `plc.directory` for every did:plc with active labels, and negates all of a DID's labels when PLC returns 410 Gone (the canonical tombstone signal). Closes the gap where a member retiring their atproto identity post-enrollment would leave Atmosphere Mail vouching for a non-existent account indefinitely. did:web is skipped (no PLC). 5xx and non-410 4xx responses are explicitly NOT misread as tombstones — labels stay live across PLC outages. Defaults: 24h interval, 500ms between requests (= 2 req/s, fits PLC fair-use). Configurable via `plcTombstoneCheckInterval` and `plcRequestDelay`; set the interval `<=0` to disable. Exposes `labeler_plc_status_checks_total{result=ok|tombstoned|err}` and `labeler_plc_status_last_run_unix_seconds` on `/metrics`
1616+- New `services.restic-offsite-copy` NixOS module (`infra/nixos/restic-offsite.nix`) that copies the local restic repo to an offsite destination on a daily timer. Backend-agnostic (B2, S3, SFTP-via-Tailnet, REST). Imported into both `atmos-relay` and `atmos-ops` configs but ships dormant (`enable = false`) — activation requires picking a destination and provisioning credentials per `docs/offsite-backups.md`. Closes the failure mode where a single Hetzner volume failure destroys both data and "backups" simultaneously (#221)
1717+- Hetzner-native daily snapshots enabled on both `atmos-relay` and `atmos-ops` VPS resources (terraform `backups = true`). 7-day retention, +20% server cost (~€3.20/mo for both). Survives volume failure that would destroy local restic backups (#221) since snapshots live on Hetzner's separate storage cluster. Apply via the `relay-provision` workflow with `action=apply` after merge (#231)
1818+- New `GET /admin/sender-reputation?did=&since=` admin endpoint returning per-DID rolling-window send/bounce/complaint counts plus current suspension state. Reads from `Store.SenderReputation` over relay_events + inbound_messages (FBL-ARF) + members.status. Default window is 30 days, capped at 365. Sets up the data path for the labeler's clean-sender computation in #245 (#244)
1919+- /account/manage shows a "Publish attestation" form for any signed-in domain whose attestation_rkey is empty — lets members who completed enrollment but never ran the publish OAuth round-trip self-recover without operator action (#235)
2020+- End-to-end enrollment-funnel integration test covering wizard finish → atomic publish redirect → callback. Pins both the success path (PutRecord lands, SetAttestationPublished stamps, credentials render) and the publish-failure path (credentials preserved, /account/manage retry link present). Closes the test gap that let #233 ship (#237)
2121+- /account/manage renders a "Label status" section showing the live verified-mail-operator and relay-member state from the labeler XRPC, plus a re-publish form when labels are missing despite a published attestation. Closes the silent-failure mode where attestation_rkey is set but the labeler rejected DKIM, leaving SMTP sending broken with no diagnostic on the manage page (#240)
2222+2323+### Fixed
2424+- Osprey rules now actually deploy on merge. Previously the `osprey-rules-sync` systemd service had `RemainAfterExit=true`, so it ran exactly once per atmos-ops boot and any rule change merged after that silently never reached the running worker. Discovered when verifying #196's content_spray promotion — the production worker (image from 2026-04-22) had only 13 of 14 rule files, and content_spray.sml had never loaded, meaning the entire shadow-mode bake-in was a no-op. This PR (a) drops `RemainAfterExit=true` so the service is freely re-runnable, (b) adds a content-hash compare so the worker only restarts when rules actually changed, (c) adds an hourly systemd timer for defense-in-depth autosync, (d) adds `osprey/**` to ops-deploy.yml's path filter so merges trigger an immediate deploy, (e) adds an explicit `systemctl start osprey-rules-sync.service` step to the deploy workflow so rule changes propagate within the deploy window rather than waiting up to an hour for the timer (#251)
2525+- ops-deploy.yml path filter no longer misses transitive labeler dependencies — added `internal/{config,did,dns,domain,jetstream,label,loghash,scheduler,server,store}/**` and `infra/nixos/**`. PR #340 (DID hardening) merged but didn't deploy to atmos-ops because the only filter entries were `cmd/label{,er}/**` and a non-existent `internal/labeler/**`; the labeler ran stale code for ~17 minutes until #341 happened to touch `cmd/labeler/main.go` and finally tripped the filter. Same gap fixed in relay-deploy.yml (added `internal/did/**`, `internal/loghash/**`). Comment in both workflows tells future devs to re-derive via `go list -deps` whenever a new internal package is introduced (#249)
2626+- Account UX papercuts on /account/* navigation. (a) Round-tripping back to /account from any sub-page (e.g. /account/deliverability) no longer re-prompts a signed-in member for sign-in — handleLanding now redirects to /account/manage when a valid recovery cookie is present, falling through to the form on stale cookies so there's no redirect loop. (b) /account/deliverability collapsed the doubled-up topnav stack (publicLayout's "← home" plus a redundant "← Account" breadcrumb) into a single nav band — the parent-link is preserved as an inline "← Back to account" beneath the lede (#239)
2727+928### Changed
2929+- Extract config.go and message.go from cmd/relay/main.go (#264)
3030+- Extract background workers + delivery callback from main() — periodic goroutines, shutdown (#262)
3131+- About §1 marketing copy now correctly states the relay is AGPL-3.0-licensed, not MIT (#227). Surface had been stale since the license switch landed earlier in this Unreleased window.
3232+- Privacy policy §4 and About §3 now accurately distinguish public atproto labels (verified-mail-operator, relay-member, signed and network-visible via labeler.atmos.email) from internal-only Osprey reputation signals (highly_trusted, auto_suspended, used for SMTP-time enforcement only). Prior copy claimed Osprey labels were atproto-published, which was never wired in code (#243)
3333+- Atomic enroll+publish: the wizard now kicks the publish-OAuth round-trip automatically on /enroll/verify success and reveals credentials only on the post-publish callback. Closes the funnel cliff that stranded richferro.com and self.surf — closing the tab after seeing credentials is now harmless because the attestation is already on the PDS (#234)
3434+- Soften the credentials-page warning copy: replace "the only remedy is to re-enroll" with a /account self-service rotation reference, since `/recover/start` (now `/account/start`) lets members rotate the API key without re-enrolling (#236)
1035- License changed from MIT to AGPL-3.0-or-later
1136- Add SPDX-License-Identifier headers to all Go source files
12371338### Security
3939+- Remove DNS grace period and enforce DNS checks from first send (#263)
4040+- harden(labeler): unified DID syntax validation (`internal/did.Valid`) replaces three diverging copies that disagreed on whether did:web could contain `%3A` port-encoding — the admin and diagnostics endpoints would 400 on member DIDs that the labeler had already verified. Adds a 253-byte length cap to did:web (DNS hostname limit) where the prior label-side regex had no cap. Five label/manager.go log sites now redact DIDs via the new `internal/loghash` package. `PerDIDRateLimiter.Allow("")` now rejects empty DIDs up-front so a code path that loses the DID can't silently flood the global bucket via the implicit empty-string window. (#247)
1441- Add DID validation to admin handleMember endpoint (#16)
1542- Narrow OAuth scope from transition:generic to repo:email.atmos.attestation (#189)
1643- sec(account): SameSite=Strict blocks cookie after OAuth cross-site redirect — switch to Lax (#180)
+97
MIRROR.md
···11+# Mirror notice
22+33+The repository at `tangled.org/scottlanoue.com/atmosphere-mail` is a
44+**published mirror** of a private development repository. The two are
55+kept in sync by `scripts/sync-tangled.sh`, which runs on demand (not on
66+every commit).
77+88+This file exists so anyone reading the public mirror knows what they
99+are looking at and what is *not* visible.
1010+1111+## What is the same
1212+1313+- All Go source code under `cmd/` and `internal/`
1414+- The lexicons and design docs under `lexicons/` and `docs/`
1515+- The Nix/Terraform infrastructure definitions under `infra/`
1616+- Tests, fixtures, and the AGPL-3.0 license
1717+1818+If a reader builds the binary from this mirror and runs it, they
1919+get the same software the operator runs in production.
2020+2121+## What is scrubbed before publishing
2222+2323+The sync script applies `git filter-repo` with the following rules:
2424+2525+1. **Internal hostnames replaced** with neutral substitutes. The
2626+ operator's tailnet uses `*.internal.example` host names; these
2727+ get rewritten to `*.internal` (e.g. `kafka-broker.internal`,
2828+ `db.internal`, `ops.internal`). The substitution is mechanical and
2929+ is applied to both file contents and commit messages.
3030+2. **Internal IPs redacted.** Specific Tailscale-internal IPs are
3131+ replaced with the literal `<internal-ip>`.
3232+3. **AI co-author trailers removed.** Commit messages with
3333+ `Co-Authored-By: ...claude...` or `...anthropic.com` lines have
3434+ those lines stripped before publishing. The commit content is
3535+ unchanged.
3636+4. **Author normalization.** All commits show the same author
3737+ (`Scott Lanoue <scott@lanoue.dev>`); historical aliases are
3838+ collapsed via mailmap. Co-contributor commits, if and when they
3939+ exist, will be additive — this rule normalizes the operator's own
4040+ shifting email aliases, not third-party contributors.
4141+5. **History squashed into phase commits.** Rather than mirroring
4242+ every individual commit, the sync collapses runs of related work
4343+ into a single descriptive commit per "phase." Each phase commit
4444+ represents a coherent theme (e.g. `feat: integration test
4545+ harness, outbound deliver-path verification, and multi-domain
4646+ enrollment`). The full per-merge history lives on the private
4747+ side; the public mirror gets the synthesized story.
4848+4949+## What is removed entirely
5050+5151+Some paths are excluded from the mirror because they are operationally
5252+sensitive or inherently project-internal:
5353+5454+- `scripts/` — the publishing script itself, plus repo automation
5555+ helpers. The script's contents would partially undo the scrub list
5656+ (e.g. a reader could see what hostnames are being substituted *to*
5757+ what, which is strictly more information than not having the script
5858+ at all).
5959+- `.claude/` — agent worktree caches and logs.
6060+- `.agent-skills.md` — internal collaboration notes.
6161+- A subset of `.gitea/workflows/` — specifically the deploy and
6262+ ops-automation workflows (`relay-deploy.yml`, `ops-deploy.yml`,
6363+ `relay-provision.yml`, the DNS-record manipulation workflows, and
6464+ the Hetzner firewall workflow). These reveal infrastructure topology
6565+ and credentials-handling patterns that the operator considers
6666+ reasonable to defer until the project has more usage history.
6767+- The code-verification workflows (`go-tests.yml`, `govulncheck.yml`,
6868+ `template-escape-lint.yml`, `sml-tests.yml`, `validate-sml.yml`)
6969+ **are** published, so anyone can see how PRs get gated before merge.
7070+7171+## Sync cadence
7272+7373+There is no SLA. The sync is run by the operator manually, generally
7474+when a coherent batch of work has accumulated. In practice that's
7575+been every few weeks during the alpha. The mirror **is delayed from
7676+production** — typically by hours to days, occasionally longer.
7777+7878+## Why this setup
7979+8080+The relay project's mission is to lower the cost of running legitimate
8181+self-hosted email by pooling reputation across small operators. That
8282+mission is served by the source being inspectable, AGPL-licensed, and
8383+reproducible. It is **not** served by exposing operational secrets
8484+that would let an attacker degrade the relay's deliverability for
8585+everyone using it.
8686+8787+The compromise reflected here — publish the source and the verification
8888+workflows, hold back the deploy automation — is a deliberate alpha-
8989+phase tradeoff. As the project gains more members and external
9090+contributors, the bar for "what's worth holding back" should naturally
9191+shift toward smaller. The operator intends to revisit periodically.
9292+9393+## Reporting issues
9494+9595+For source code questions, use the `tangled.org` issue tracker on this
9696+repo. For deliverability or operational issues with the relay itself,
9797+mail `postmaster@atmos.email`.
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package main
44+55+// Helpers for the SMTP submission pipeline.
66+//
77+// emitRelayAttemptEvent is the simplest phase to extract: it's the
88+// last block of onAccept, has a clearly bounded set of inputs (member
99+// info + recipient count + content fingerprint), and its output is a
1010+// single Osprey event emission. It reads from store via 6 lookups but
1111+// doesn't write anything, so behavior is observable purely through
1212+// the emitted event.
1313+1414+import (
1515+ "context"
1616+ "time"
1717+1818+ "atmosphere-mail/internal/osprey"
1919+ "atmosphere-mail/internal/relay"
2020+ "atmosphere-mail/internal/relaystore"
2121+)
2222+2323+// emitRelayAttemptEvent collects velocity counters from the store
2424+// and emits a single relay_attempt event. Lookups are best-effort —
2525+// a query error emits 0 rather than blocking send. Mirrors the inline
2626+// block that lived at lines 843-869 of main.go's onAccept closure
2727+// before the extraction.
2828+//
2929+// Why this exists as a function: it's pure data assembly. No SMTP
3030+// state, no per-recipient mutation, no error returns to the caller.
3131+// onAccept can fire-and-forget it after every successful batch
3232+// without juggling per-phase outcomes.
3333+func emitRelayAttemptEvent(
3434+ ctx context.Context,
3535+ store *relaystore.Store,
3636+ emitter *osprey.Emitter,
3737+ member *relay.AuthMember,
3838+ recipientCount int,
3939+ contentFP string,
4040+) {
4141+ memberAge := int(time.Since(member.CreatedAt).Hours() / 24)
4242+ now := time.Now().UTC()
4343+4444+ sendsLastHour, _ := store.GetRateCount(ctx, member.DID, relaystore.WindowHourly, now.Truncate(time.Hour))
4545+ sendsLastMinute, _ := store.GetSendCountSince(ctx, member.DID, now.Add(-time.Minute))
4646+ sendsLast5Min, _ := store.GetSendCountSince(ctx, member.DID, now.Add(-5*time.Minute))
4747+ uniqueDomains, _ := store.GetUniqueRecipientDomainsSince(ctx, member.DID, now.Add(-time.Hour))
4848+ _, bounced24h, _ := store.GetMessageCounts(ctx, member.DID, now.Add(-24*time.Hour))
4949+ sameContentRecipients, _ := store.GetSameContentRecipientsSince(ctx, member.DID, contentFP, now.Add(-time.Hour))
5050+5151+ emitter.Emit(ctx, osprey.EventData{
5252+ EventType: osprey.EventRelayAttempt,
5353+ SenderDID: member.DID,
5454+ SenderDomain: member.Domain,
5555+ RecipientCount: recipientCount,
5656+ SendCount: member.SendCount,
5757+ MemberAgeDays: memberAge,
5858+ SendsLastMinute: sendsLastMinute,
5959+ SendsLast5Minutes: sendsLast5Min,
6060+ SendsLastHour: sendsLastHour,
6161+ HardBouncesLast24h: int(bounced24h),
6262+ UniqueRecipientDomainsLastHour: uniqueDomains,
6363+ ContentFingerprint: contentFP,
6464+ SameContentRecipientsLastHour: sameContentRecipients,
6565+ })
6666+}
+197
cmd/relay/inbound.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package main
44+55+// Inbound SMTP server setup. Bundles the member-hash cache, bounce
66+// handler, reply forwarder, metrics, audit log, and FBL/ARF ingestion
77+// into one setup function.
88+//
99+// The FBL handler captures a forward-declared notifier that gets
1010+// bound later to adminAPI.FireFBLComplaint — admin and inbound are
1111+// initialized at different points in main(), and the inbound side
1212+// needs to be running before adminAPI exists so bounces don't pile
1313+// up. setupInboundServer returns a setter so main() can wire the
1414+// notifier after adminAPI is constructed.
1515+1616+import (
1717+ "context"
1818+ "log"
1919+ "time"
2020+2121+ "atmosphere-mail/internal/osprey"
2222+ "atmosphere-mail/internal/relay"
2323+ "atmosphere-mail/internal/relaystore"
2424+)
2525+2626+// fblNotifierFunc is the signature the FBL handler invokes when a
2727+// complaint is attributed to a known member. main() binds this to
2828+// adminAPI.FireFBLComplaint after adminAPI is constructed.
2929+type fblNotifierFunc func(ctx context.Context, memberDID, senderDomain, recipientDomain, feedbackType, provider string)
3030+3131+// inboundDeps gathers everything setupInboundServer needs from main().
3232+// Pulled into a struct so the call site reads as a labelled-arg
3333+// invocation rather than a 7-positional-arg function call.
3434+type inboundDeps struct {
3535+ cfg *RelayConfig
3636+ store *relaystore.Store
3737+ metrics *relay.Metrics
3838+ ospreyEmitter *osprey.Emitter
3939+ bounceProcessor *relay.BounceProcessor
4040+}
4141+4242+// inboundSetup is what setupInboundServer hands back to main(). The
4343+// server is for ListenAndServe + Close at shutdown; the member-hash
4444+// cache is for the periodic rebuild ticker (which needs the
4545+// shutdown ctx that's created in main()); SetFBLNotifier closes the
4646+// late-binding loop with adminAPI.
4747+type inboundSetup struct {
4848+ Server *relay.InboundServer
4949+ MemberHashCache *relay.MemberHashCache
5050+ SetFBLNotifier func(fblNotifierFunc)
5151+}
5252+5353+// setupInboundServer wires the inbound SMTP server (port 25) used for
5454+// bounces, replies, operator mail, and FBL/ARF complaint ingestion.
5555+// Returns an inboundSetup carrying the live server, the member-hash
5656+// cache (which main() reuses for a periodic-rebuild ticker tied to the
5757+// shutdown ctx), and a setter for the FBL notifier (bound later by
5858+// main() to adminAPI.FireFBLComplaint).
5959+func setupInboundServer(deps inboundDeps) inboundSetup {
6060+ cfg := deps.cfg
6161+ store := deps.store
6262+ metrics := deps.metrics
6363+ ospreyEmitter := deps.ospreyEmitter
6464+ bounceProcessor := deps.bounceProcessor
6565+6666+ // Inbound SMTP server for bounce processing (port 25). The cache below
6767+ // answers VERP "is this hash a member?" lookups without hitting the DB
6868+ // on every inbound. Both a positive cache (rebuilt at most every 30s)
6969+ // and a negative cache (5min TTL, 10k entries) defend against random-
7070+ // VERP DoS protection.
7171+ memberHashCache := relay.NewMemberHashCache(relay.MemberHashCacheConfig{
7272+ Rebuild: func() (map[string]string, error) {
7373+ members, err := store.ListMembers(context.Background())
7474+ if err != nil {
7575+ return nil, err
7676+ }
7777+ out := make(map[string]string, len(members))
7878+ for _, mb := range members {
7979+ out[relay.MemberHashFromDID(mb.DID)] = mb.DID
8080+ }
8181+ return out, nil
8282+ },
8383+ Metrics: metrics,
8484+ })
8585+8686+ inboundMemberLookup := func(_ context.Context, memberHash string) (string, bool) {
8787+ return memberHashCache.Lookup(memberHash)
8888+ }
8989+9090+ inboundBounceHandler := func(ctx context.Context, memberDID, recipient, bounceType, details string) {
9191+ if bounceType == "hard" {
9292+ metrics.BouncesTotal.WithLabelValues("hard").Inc()
9393+ } else {
9494+ metrics.BouncesTotal.WithLabelValues("soft").Inc()
9595+ }
9696+9797+ ospreyEmitter.Emit(ctx, osprey.EventData{
9898+ EventType: osprey.EventBounceReceived,
9999+ SenderDID: memberDID,
100100+ RecipientDomain: recipientDomain(recipient),
101101+ BounceType: bounceType,
102102+ Details: details,
103103+ })
104104+105105+ action, err := bounceProcessor.RecordBounce(ctx, memberDID, recipient, details)
106106+ if err != nil {
107107+ log.Printf("inbound.bounce_error: did=%s recipient=%s error=%v", memberDID, recipient, err)
108108+ } else if action != "none" {
109109+ log.Printf("inbound.bounce_action: did=%s recipient=%s bounce_type=%s action=%s", memberDID, recipient, bounceType, action)
110110+ if action == "suspend" {
111111+ ospreyEmitter.Emit(ctx, osprey.EventData{
112112+ EventType: osprey.EventMemberSuspended,
113113+ SenderDID: memberDID,
114114+ Reason: "bounce_rate_exceeded",
115115+ })
116116+ }
117117+ }
118118+ }
119119+120120+ inboundServer := relay.NewInboundServer(relay.InboundConfig{
121121+ ListenAddr: cfg.InboundAddr,
122122+ Domain: cfg.Domain,
123123+ RateLimitMsgsPerMinute: cfg.InboundRateLimitMsgsPerMinute,
124124+ RateLimitBurst: cfg.InboundRateLimitBurst,
125125+ }, inboundBounceHandler, inboundMemberLookup)
126126+127127+ // Inbound reply forwarding: classify inbound mail and deliver replies
128128+ // to the member's registered forward_to mailbox. SRS key is persisted
129129+ // so bounces of already-forwarded mail remain verifiable across
130130+ // restarts. Key generation mirrors the unsub key pattern.
131131+ srsKey, err := relay.LoadOrCreateUnsubKey(cfg.StateDir + "/srs.key")
132132+ if err != nil {
133133+ log.Fatalf("load srs key: %v", err)
134134+ }
135135+ srsRewriter := relay.NewSRSRewriter(srsKey, cfg.Domain)
136136+ forwarder := relay.NewForwarder(srsRewriter, cfg.Domain)
137137+ domainLookup := func(ctx context.Context, domain string) (string, bool) {
138138+ d, err := store.GetMemberDomain(ctx, domain)
139139+ if err != nil || d == nil {
140140+ return "", false
141141+ }
142142+ return d.ForwardTo, true
143143+ }
144144+ inboundServer.SetReplyForwarding(domainLookup, forwarder, srsRewriter)
145145+ // Operator mail (postmaster@, abuse@, fbl@ at the relay's own domain)
146146+ // is forwarded externally when configured. Without this, provider
147147+ // verification emails (Microsoft SNDS authorization, Yahoo CFL
148148+ // confirmation) and ops-team mail would land in the audit log and
149149+ // never reach a human.
150150+ if cfg.OperatorForwardTo != "" {
151151+ inboundServer.SetOperatorForwarding(forwarder, cfg.OperatorForwardTo)
152152+ log.Printf("operator_forward.enabled: to=%s", cfg.OperatorForwardTo)
153153+ }
154154+ inboundServer.SetMetrics(metrics)
155155+ // Persistent audit log — every accepted inbound message lands in the
156156+ // inbound_messages table. Failures inside LogInbound are swallowed so
157157+ // SMTP delivery is never affected.
158158+ inboundServer.SetInboundLogger(&relayInboundLogger{store: store})
159159+ log.Printf("inbound.reply_forwarding.enabled: srs_domain=%s", cfg.Domain)
160160+161161+ // FBL / ARF feedback-report ingestion. Inbound reports to fbl@<domain>
162162+ // get parsed, attributed to the sending member, and emitted as an
163163+ // Osprey complaint event so rules can react (e.g. auto-suspend after
164164+ // N complaints in 24h). memberExists guards against spoofed reports
165165+ // naming DIDs we never issued.
166166+ //
167167+ // fblNotify is forward-declared and bound later by main() via the
168168+ // setter we return. Closures capture by reference, so the late
169169+ // binding takes effect when SetFBL's callback eventually fires.
170170+ var fblNotify fblNotifierFunc
171171+ memberExists := func(ctx context.Context, did string) bool {
172172+ m, err := store.GetMember(ctx, did)
173173+ return err == nil && m != nil
174174+ }
175175+ inboundServer.SetFBL(func(ctx context.Context, memberDID, senderDomain, recipientDomain, feedbackType, providerUA string, arrival time.Time) {
176176+ provider := normalizeProviderUA(providerUA)
177177+ metrics.ComplaintsTotal.WithLabelValues(feedbackType, provider).Inc()
178178+ ospreyEmitter.Emit(ctx, osprey.EventData{
179179+ EventType: osprey.EventComplaintReceived,
180180+ SenderDID: memberDID,
181181+ SenderDomain: senderDomain,
182182+ RecipientDomain: recipientDomain,
183183+ FeedbackType: feedbackType,
184184+ ProviderUA: providerUA,
185185+ })
186186+ if fblNotify != nil {
187187+ fblNotify(ctx, memberDID, senderDomain, recipientDomain, feedbackType, provider)
188188+ }
189189+ }, memberExists)
190190+ log.Printf("inbound.fbl.enabled: inbox=fbl@%s", cfg.Domain)
191191+192192+ return inboundSetup{
193193+ Server: inboundServer,
194194+ MemberHashCache: memberHashCache,
195195+ SetFBLNotifier: func(fn fblNotifierFunc) { fblNotify = fn },
196196+ }
197197+}
+112-1712
cmd/relay/main.go
···33package main
4455import (
66- "bufio"
76 "context"
88- "crypto/ed25519"
99- "crypto/rsa"
107 "crypto/tls"
1111- "crypto/x509"
1212- "encoding/json"
1313- "errors"
88+ "encoding/hex"
149 "flag"
1515- "fmt"
1610 "log"
1711 "net"
1812 "net/http"
1919- "net/textproto"
2020- "net/url"
2113 "os"
2214 "os/signal"
2315 "path/filepath"
2424- "strings"
2516 "syscall"
2617 "time"
27182828- "atmosphere-mail/internal/admin"
2929- adminui "atmosphere-mail/internal/admin/ui"
3030- "atmosphere-mail/internal/atpoauth"
3131- "atmosphere-mail/internal/config"
3219 "atmosphere-mail/internal/dns"
3320 "atmosphere-mail/internal/enroll"
3434- "atmosphere-mail/internal/notify"
3521 "atmosphere-mail/internal/osprey"
3622 "atmosphere-mail/internal/relay"
3723 "atmosphere-mail/internal/relaystore"
38243939- "github.com/emersion/go-smtp"
4025 "github.com/prometheus/client_golang/prometheus"
4141- "github.com/prometheus/client_golang/prometheus/promhttp"
2626+ "github.com/prometheus/client_golang/prometheus/collectors"
4227)
43284444-// PublicDomain describes a single host served on the public HTTPS listener.
4545-// Each host can have its own TLS cert (via SNI) and a role that determines
4646-// what handlers it answers.
4747-type PublicDomain struct {
4848- Host string `json:"host"` // SNI / Host header match, e.g. "atmosphereemail.org"
4949- CertFile string `json:"certFile"` // path to TLS cert (fullchain)
5050- KeyFile string `json:"keyFile"` // path to TLS private key
5151- Role string `json:"role"` // "site", "infra", or "redirect"
5252- RedirectTo string `json:"redirectTo"` // for Role=="redirect": target URL prefix, e.g. "https://atmosphereemail.org"
5353-}
5454-5555-// RelayConfig holds the relay-specific configuration.
5656-type RelayConfig struct {
5757- // SMTP submission
5858- SMTPAddr string `json:"smtpAddr"` // default ":587"
5959- Domain string `json:"domain"` // relay domain, e.g. "atmos.email"
6060-6161- // Inbound SMTP (bounce processing)
6262- InboundAddr string `json:"inboundAddr"` // default ":25" (port 25 for receiving bounces)
6363-6464- // InboundRateLimitMsgsPerMinute caps per-source-IP message rate at
6565- // MAIL FROM on the inbound listener. Zero or negative disables.
6666- // Default: 30. Provider bounce traffic and FBL reports come from
6767- // many IPs, so per-IP caps don't affect legitimate volume.
6868- InboundRateLimitMsgsPerMinute float64 `json:"inboundRateLimitMsgsPerMinute"`
6969- // InboundRateLimitBurst is the per-IP token-bucket capacity. Zero
7070- // defaults to 10. Higher values tolerate larger short bursts at the
7171- // cost of weaker abuse protection.
7272- InboundRateLimitBurst int `json:"inboundRateLimitBurst"`
7373-7474- // Admin API
7575- AdminAddr string `json:"adminAddr"` // default ":8080" (Tailscale-only)
7676- AdminToken string `json:"adminToken"` // Bearer token for admin API
7777-7878- // AdminOrigins is the CSRF allowlist for the admin dashboard UI
7979- // (/ui/* POSTs). Must contain the externally-reachable origin of the
8080- // dashboard, e.g. "https://atmos-relay.internal.example". Empty
8181- // list fails-closed — every state-changing admin POST returns 403.
8282- AdminOrigins []string `json:"adminOrigins"`
8383-8484- // Labeler
8585- LabelerURL string `json:"labelerURL"` // XRPC URL for label checks
8686-8787- // TLS
8888- TLSCertFile string `json:"tlsCertFile"` // path to TLS cert
8989- TLSKeyFile string `json:"tlsKeyFile"` // path to TLS key
9090-9191- // Public HTTPS listener — serves /u/{token} for List-Unsubscribe one-click.
9292- // If empty, the public server is disabled and List-Unsubscribe headers
9393- // are not emitted. Set to ":443" in production.
9494- PublicAddr string `json:"publicAddr"`
9595- // PublicBaseURL is the externally-reachable URL prefix for the public
9696- // HTTPS listener's INFRASTRUCTURE endpoints (unsubscribe). Used inside
9797- // List-Unsubscribe header values. Always points at the smtp.* host.
9898- PublicBaseURL string `json:"publicBaseURL"`
9999-100100- // PublicDomains lists every hostname the public HTTPS listener answers
101101- // for, with its role. When empty, the listener falls back to legacy
102102- // single-cert behavior using TLSCertFile/TLSKeyFile and serves the
103103- // full enroll handler + unsubscribe on whatever Host is requested.
104104- //
105105- // When populated, each domain gets its own TLS cert (via SNI) and is
106106- // routed by Role:
107107- // "site" — marketing + legal + enrollment wizard (e.g. atmosphereemail.org)
108108- // "infra" — operational endpoints only (e.g. smtp.atmos.email): /u/, /healthz
109109- // "redirect" — 301 permanent redirect to RedirectTo + request path
110110- PublicDomains []PublicDomain `json:"publicDomains"`
111111-112112- // Storage
113113- StateDir string `json:"stateDir"` // default "./state"
114114-115115- // Rate limits
116116- HourlyLimit int `json:"hourlyLimit"` // default 100
117117- DailyLimit int `json:"dailyLimit"` // default 1000
118118- GlobalPerMinute int `json:"globalPerMinute"` // default 500
119119-120120- // Osprey integration (optional — leave empty to disable)
121121- KafkaBroker string `json:"kafkaBroker"` // e.g. "localhost:9092"
122122- OspreyURL string `json:"ospreyURL"` // e.g. "https://osprey-api.example.com"
123123-124124- // Site-facing OAuth for self-service attestation publishing.
125125- // When SiteBaseURL is set, the public listener serves both the
126126- // atproto OAuth client metadata (at
127127- // /.well-known/atproto-oauth-client-metadata.json) and the
128128- // /enroll/attest/{start,callback} wizard routes.
129129- //
130130- // SiteBaseURL MUST be the externally-reachable https:// origin of the
131131- // marketing/site host (e.g. "https://atmospheremail.com"). It MUST
132132- // match the origin the public listener serves, since the atproto
133133- // spec requires client_id == metadata URL.
134134- SiteBaseURL string `json:"siteBaseURL"`
135135-136136- // Pool-level operator DKIM. Every outbound message gets a second DKIM
137137- // signature with d=OperatorDKIMDomain so FBL complaints (Microsoft JMRP,
138138- // Yahoo CFL, etc.) route to one pool-level registration instead of each
139139- // member registering individually. Omit to disable operator signing
140140- // (message gets member-domain-only signatures).
141141- OperatorDKIMKeyPath string `json:"operatorDKIMKeyPath"` // default: StateDir/operator-dkim-keys.json
142142- OperatorDKIMDomain string `json:"operatorDKIMDomain"` // default: Domain (relay domain)
143143-144144- // OperatorForwardTo is the external mailbox that receives inbound
145145- // postmaster@ / abuse@ / fbl@ mail for the relay's own domain
146146- // (e.g. atmos.email). Provider authorization emails (Microsoft
147147- // SNDS, Yahoo CFL) are delivered to these addresses; without a
148148- // forward the messages land in the audit log and never reach a
149149- // human. Omit to preserve the accept-and-drop-only behavior.
150150- OperatorForwardTo string `json:"operatorForwardTo"`
151151-152152- // OperatorWebhookURL is the HTTP endpoint the relay POSTs
153153- // structured operator events to — new pending enrollments,
154154- // approvals, suspensions, reactivations. Each operator wires this
155155- // to their own sink (Slack incoming webhook, Matrix bot, ntfy.sh,
156156- // PagerDuty integration, etc.) so the relay doesn't couple to any
157157- // particular notification channel. Empty disables notifications.
158158- OperatorWebhookURL string `json:"operatorWebhookURL"`
159159- // OperatorWebhookSecret is an HMAC-SHA256 secret used to sign
160160- // every webhook POST (X-Atmos-Signature header). Strongly
161161- // recommended when the webhook URL is anywhere the receiver can't
162162- // otherwise authenticate the sender.
163163- OperatorWebhookSecret string `json:"operatorWebhookSecret"`
164164-}
165165-166166-var flagConfigPath = flag.String("config", "./relay-config.json", "path to relay config file")
167167-168168-// storeDomainLister adapts *relaystore.Store to the narrow
169169-// adminui.DomainLister interface so the enrollment landing can show
170170-// existing domains without a full store import.
171171-type storeDomainLister struct{ store *relaystore.Store }
172172-173173-func (s storeDomainLister) ListMemberDomains(ctx context.Context, did string) ([]string, error) {
174174- domains, err := s.store.ListMemberDomains(ctx, did)
175175- if err != nil {
176176- return nil, err
177177- }
178178- names := make([]string, len(domains))
179179- for i, d := range domains {
180180- names[i] = d.Domain
181181- }
182182- return names, nil
183183-}
184184-18529func main() {
18630 flag.Parse()
18731···2054920650 // Open relay store
20751 dbPath := cfg.StateDir + "/relay.sqlite"
208208- store, err := relaystore.New(dbPath)
5252+ var piiKey relaystore.PIIKey
5353+ if raw := os.Getenv("RELAY_PII_KEY"); raw != "" {
5454+ k, err := hex.DecodeString(raw)
5555+ if err != nil || len(k) != 32 {
5656+ log.Fatalf("RELAY_PII_KEY must be 64 hex chars (32 bytes AES-256)")
5757+ }
5858+ piiKey = relaystore.PIIKey(k)
5959+ }
6060+ store, err := relaystore.NewWithPIIKey(dbPath, piiKey)
20961 if err != nil {
21062 log.Fatalf("open store: %v", err)
21163 }
···2166821769 // Prometheus metrics
21870 metricsRegistry := prometheus.NewRegistry()
219219- metricsRegistry.MustRegister(prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}))
220220- metricsRegistry.MustRegister(prometheus.NewGoCollector())
7171+ metricsRegistry.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
7272+ metricsRegistry.MustRegister(collectors.NewGoCollector())
22173 metrics := relay.NewMetrics(metricsRegistry)
222222- // Wire panic recovery for background goroutines (#209). Every
223223- // relay.GoSafe call below counts recovered panics into
224224- // metrics.GoroutineCrashes; without this wire the panics are
225225- // still logged but not counted.
7474+ // Wire panic recovery for background goroutines. Every
7575+ // relay.GoSafe call counts recovered panics into
7676+ // metrics.GoroutineCrashes.
22677 relay.SetPanicRecorder(metrics)
227227- // Wire SQLITE_BUSY classification at hot-path writers (#210).
228228- // The store reports busy errors via metrics.SQLiteBusyErrors;
229229- // the periodic pool-stats sampler is started below once the
230230- // cancellable ctx is in scope.
7878+ // Wire SQLITE_BUSY classification at hot-path writers. The store
7979+ // reports busy errors via metrics.SQLiteBusyErrors.
23180 store.SetBusyRecorder(metrics)
2328123382 // Label checker
···305154 if cfg.OspreyURL != "" {
306155 ospreyEnforcer = relay.NewOspreyEnforcer(cfg.OspreyURL, &http.Client{Timeout: 5 * time.Second})
307156 // Persist labelcheck cache so a relay restart doesn't reset
308308- // to fully cold (#215). The fail-closed branch in
309309- // activeLabelsFor is the safety net for the rare case where
310310- // snapshot read fails AND Osprey is unreachable.
157157+ // to fully cold. The fail-closed branch in activeLabelsFor
158158+ // is the safety net for the rare case where snapshot read
159159+ // fails AND Osprey is unreachable.
311160 snapPath := filepath.Join(cfg.StateDir, "osprey-cache.json")
312161 ospreyEnforcer.SetSnapshotPath(snapPath)
313162 ospreyEnforcer.SetColdCacheRecorder(metrics)
···331180 Dropped: metrics.OspreyEventsDropped,
332181 SpoolDepth: metrics.OspreySpoolDepth,
333182 })
334334- // On-disk DLQ for failed Kafka writes (#214). Without this an
183183+ // On-disk DLQ for failed Kafka writes. Without this an
335184 // atmos-ops outage silently drops every event the relay emits
336185 // during the window — labels stop propagating and trust scoring
337186 // freezes on stale data with no operator-visible signal. The
···352201 log.Printf("osprey.disabled: kafkaBroker not configured — relay events will not be propagated")
353202 }
354203355355-356356- // Delivery queue
357357- queue := relay.NewQueue(func(result relay.DeliveryResult) {
358358- status := result.Status
359359- if status == "sent" {
360360- if err := store.UpdateMessageStatus(context.Background(), result.EntryID, relaystore.MsgSent, result.SMTPCode); err != nil {
361361- if errors.Is(err, relaystore.ErrMessageNotFound) {
362362- // Spool entry without a backing DB row — the
363363- // orphan signature from #208. Log + count so the
364364- // reconciliation janitor's effectiveness is
365365- // observable; do NOT surface to delivery state.
366366- log.Printf("delivery.orphan: entry_id=%d status=sent — DB row missing", result.EntryID)
367367- metrics.OrphanDeliveries.WithLabelValues("sent").Inc()
368368- } else {
369369- log.Printf("delivery.update_error: entry_id=%d status=sent error=%v", result.EntryID, err)
370370- }
371371- }
372372- ospreyEmitter.Emit(context.Background(), osprey.EventData{
373373- EventType: osprey.EventDeliveryResult,
374374- SenderDID: result.MemberDID,
375375- RecipientDomain: recipientDomain(result.Recipient),
376376- DeliveryStatus: "sent",
377377- SMTPCode: result.SMTPCode,
378378- })
379379- } else {
380380- if err := store.UpdateMessageStatus(context.Background(), result.EntryID, relaystore.MsgBounced, result.SMTPCode); err != nil {
381381- if errors.Is(err, relaystore.ErrMessageNotFound) {
382382- log.Printf("delivery.orphan: entry_id=%d status=bounced — DB row missing", result.EntryID)
383383- metrics.OrphanDeliveries.WithLabelValues("bounced").Inc()
384384- } else {
385385- log.Printf("delivery.update_error: entry_id=%d status=bounced error=%v", result.EntryID, err)
386386- }
387387- }
388388- if result.SMTPCode >= 500 {
389389- metrics.BouncesTotal.WithLabelValues("hard").Inc()
390390- } else {
391391- metrics.BouncesTotal.WithLabelValues("soft").Inc()
392392- }
393393-394394- ospreyEmitter.Emit(context.Background(), osprey.EventData{
395395- EventType: osprey.EventDeliveryResult,
396396- SenderDID: result.MemberDID,
397397- RecipientDomain: recipientDomain(result.Recipient),
398398- DeliveryStatus: "bounced",
399399- SMTPCode: result.SMTPCode,
400400- })
401401-402402- // Process bounce: record event, evaluate rate, potentially auto-suspend
403403- action, err := bounceProcessor.RecordBounce(context.Background(), result.MemberDID, result.Recipient, result.Error)
404404- if err != nil {
405405- log.Printf("bounce.process_error: did=%s entry_id=%d error=%v",
406406- result.MemberDID, result.EntryID, err)
407407- } else if action != "none" {
408408- log.Printf("bounce.action: did=%s entry_id=%d action=%s",
409409- result.MemberDID, result.EntryID, action)
410410- if action == "suspend" {
411411- ospreyEmitter.Emit(context.Background(), osprey.EventData{
412412- EventType: osprey.EventMemberSuspended,
413413- SenderDID: result.MemberDID,
414414- Reason: "bounce_rate_exceeded",
415415- })
416416- }
417417- }
418418- }
419419- }, func() relay.QueueConfig {
204204+ // Delivery queue. The result handler updates DB status, emits Osprey
205205+ // events, and feeds the bounce processor. See cmd/relay/delivery.go.
206206+ deliveryHandler := &deliveryResultHandler{
207207+ store: store,
208208+ metrics: metrics,
209209+ ospreyEmitter: ospreyEmitter,
210210+ bounceProcessor: bounceProcessor,
211211+ }
212212+ queue := relay.NewQueue(deliveryHandler.Handle, func() relay.QueueConfig {
420213 qc := relay.DefaultQueueConfig()
421214 qc.RelayDomain = cfg.Domain
422215 return qc
···432225 log.Printf("spool.reload: recovered %d queued messages", reloaded)
433226 }
434227435435- // Member lookup for SMTP AUTH — returns member + all domains so auth
436436- // can match the API key to a specific domain.
437437- memberLookup := func(ctx context.Context, did string) (*relay.MemberWithDomains, error) {
438438- member, domains, err := store.GetMemberWithDomains(ctx, did)
439439- if err != nil {
440440- return nil, err
441441- }
442442-443443- // Fallback: if DID lookup fails and username doesn't look like a DID,
444444- // try domain-based lookup. This supports SMTP clients (e.g. nodemailer)
445445- // that can't preserve percent-encoded colons in URL userinfo, making
446446- // DID-based usernames impossible via SMTP URL configuration.
447447- if member == nil && !strings.HasPrefix(did, "did:") {
448448- m, d, err := store.GetMemberByDomain(ctx, did)
449449- if err != nil {
450450- return nil, err
451451- }
452452- if m != nil {
453453- member = m
454454- domains = []relaystore.MemberDomain{*d}
455455- }
456456- }
457457-458458- if member == nil {
459459- return nil, nil
460460- }
461461-462462- domainInfos := make([]relay.DomainInfo, len(domains))
463463- for i, d := range domains {
464464- rsaKey, edKey, err := deserializeDKIMKeys(d.DKIMRSAPriv, d.DKIMEdPriv)
465465- if err != nil {
466466- return nil, fmt.Errorf("deserialize DKIM keys for %s/%s: %w", did, d.Domain, err)
467467- }
468468- domainInfos[i] = relay.DomainInfo{
469469- Domain: d.Domain,
470470- APIKeyHash: d.APIKeyHash,
471471- DKIMKeys: &relay.DKIMKeys{
472472- Selector: d.DKIMSelector,
473473- RSAPriv: rsaKey,
474474- EdPriv: edKey,
475475- },
476476- DKIMSelector: d.DKIMSelector,
477477- CreatedAt: d.CreatedAt,
478478- }
479479- }
480480-481481- mwd := &relay.MemberWithDomains{
482482- DID: member.DID,
483483- Status: member.Status,
484484- SendCount: member.SendCount,
485485- HourlyLimit: member.HourlyLimit,
486486- DailyLimit: member.DailyLimit,
487487- CreatedAt: member.CreatedAt,
488488- Domains: domainInfos,
489489- }
490490-491491- // Auth-time Osprey check: derive policy from labels. Suspended DIDs
492492- // are blocked at the session level. Trust/throttle labels flow
493493- // through to rate-limit computation at send time. Fail-stale: uses
494494- // cached value if Osprey is unreachable — a previously suspended
495495- // DID stays blocked even during a network partition.
496496- if ospreyEnforcer != nil && mwd.Status == relaystore.StatusActive {
497497- policy, err := ospreyEnforcer.GetPolicy(ctx, member.DID)
498498- if errors.Is(err, relay.ErrOspreyColdCache) {
499499- // Cold cache + Osprey unreachable. #215: block AUTH
500500- // rather than fail-open. The rejection is transient
501501- // from the client's POV; once Osprey returns, the
502502- // policy resolves normally.
503503- log.Printf("osprey.enforce: did=%s action=block_auth reason=cold_cache_unreachable", member.DID)
504504- mwd.Status = relaystore.StatusSuspended
505505- }
506506- if policy != nil && policy.Suspended {
507507- log.Printf("osprey.enforce: did=%s action=block_auth reason=%s", member.DID, policy.SuspendReason)
508508- mwd.Status = relaystore.StatusSuspended
509509- }
510510- if ospreyEnforcer.Reachable() {
511511- metrics.OspreyReachable.Set(1)
512512- } else {
513513- metrics.OspreyReachable.Set(0)
514514- }
515515- }
516516-517517- return mwd, nil
518518- }
519519-520520- // Send check: rate limits (with warming + label policy) + label verification
521521- sendCheck := func(ctx context.Context, member *relay.AuthMember, from, to string) error {
522522- // Fetch the member's Osprey-derived policy up front so both rate
523523- // limits and suspension checks use the same snapshot.
524524- var policy *relay.LabelPolicy
525525- if ospreyEnforcer != nil {
526526- p, err := ospreyEnforcer.GetPolicy(ctx, member.DID)
527527- if errors.Is(err, relay.ErrOspreyColdCache) {
528528- // #215: cold cache + Osprey unreachable → 451 SMTP
529529- // deferral. Client retries; by then either Osprey
530530- // is back or the cache has been warmed.
531531- return fmt.Errorf("451 osprey unreachable, please retry")
532532- }
533533- policy = p
534534- }
535535-536536- // Apply warming limits + label policy (highly_trusted skips warming,
537537- // burst_warming halves the hourly limit, etc.).
538538- hourly, daily := relay.WarmingLimitsForPolicy(warmingCfg, member.CreatedAt, member.HourlyLimit, member.DailyLimit, policy)
539539-540540- // Check rate limits
541541- if err := rateLimiter.Check(ctx, member.DID, hourly, daily); err != nil {
542542- if rle, ok := err.(*relay.RateLimitError); ok {
543543- rle.Tier = relay.MemberTier(warmingCfg, member.CreatedAt, time.Now())
544544- metrics.RateLimitHits.WithLabelValues(rle.LimitType).Inc()
545545- }
546546- log.Printf("ratelimit.hit: did=%s hourly_limit=%d daily_limit=%d error=%v",
547547- member.DID, hourly, daily, err)
548548- metrics.MessagesRejected.WithLabelValues("rate_limit").Inc()
549549- ospreyEmitter.Emit(ctx, osprey.EventData{
550550- EventType: osprey.EventRelayRejected,
551551- SenderDID: member.DID,
552552- SenderDomain: member.Domain,
553553- RejectReason: "rate_limit",
554554- })
555555- return err
556556- }
557557-558558- // Osprey send-time enforcement. Reuses the policy we fetched at
559559- // the top of sendCheck so we only hit the enforcer cache once per
560560- // session. Fail-stale: stale cache > fail-open.
561561- if ospreyEnforcer != nil {
562562- if metrics.OspreyReachable != nil {
563563- if ospreyEnforcer.Reachable() {
564564- metrics.OspreyReachable.Set(1)
565565- } else {
566566- metrics.OspreyReachable.Set(0)
567567- }
568568- }
569569- if policy != nil && policy.Suspended {
570570- log.Printf("osprey.enforce: did=%s action=block_send reason=%s", member.DID, policy.SuspendReason)
571571- metrics.OspreyChecksTotal.WithLabelValues("blocked").Inc()
572572- metrics.MessagesRejected.WithLabelValues("osprey_suspended").Inc()
573573- ospreyEmitter.Emit(ctx, osprey.EventData{
574574- EventType: osprey.EventRelayRejected,
575575- SenderDID: member.DID,
576576- SenderDomain: member.Domain,
577577- RejectReason: "osprey_auto_suspended",
578578- })
579579- return &smtp.SMTPError{
580580- Code: 550,
581581- EnhancedCode: smtp.EnhancedCode{5, 7, 1},
582582- Message: "Account suspended by safety system — check status: GET /member/status?did=" + member.DID + " with Authorization: Bearer header",
583583- }
584584- }
585585- metrics.OspreyChecksTotal.WithLabelValues("allowed").Inc()
586586- }
587587-588588- // Check labels (fail-closed)
589589- ok, err := labelChecker.CheckLabels(ctx, member.DID, member.SendCount)
590590- if err != nil {
591591- log.Printf("label.check: did=%s result=error labeler_reachable=false error=%v", member.DID, err)
592592- metrics.LabelerReachable.Set(0)
593593- metrics.MessagesRejected.WithLabelValues("label_denied").Inc()
594594- ospreyEmitter.Emit(ctx, osprey.EventData{
595595- EventType: osprey.EventRelayRejected,
596596- SenderDID: member.DID,
597597- SenderDomain: member.Domain,
598598- RejectReason: "label_unavailable",
599599- })
600600- return fmt.Errorf("451 temporary error — label verification unavailable")
601601- }
602602- metrics.LabelerReachable.Set(1)
603603- if !ok {
604604- log.Printf("label.check: did=%s result=denied", member.DID)
605605- metrics.MessagesRejected.WithLabelValues("label_denied").Inc()
606606- ospreyEmitter.Emit(ctx, osprey.EventData{
607607- EventType: osprey.EventRelayRejected,
608608- SenderDID: member.DID,
609609- SenderDomain: member.Domain,
610610- RejectReason: "label_denied",
611611- })
612612- return fmt.Errorf("550 sending not authorized — required labels missing")
613613- }
614614- log.Printf("label.check: did=%s result=ok cache_hit=false", member.DID)
615615- return nil
616616- }
617617-618618- // On message accepted: batch rate check, DKIM sign, queue for delivery
619619- onAccept := func(member *relay.AuthMember, from string, to []string, data []byte) error {
620620- // Pre-check queue capacity for the full batch BEFORE consuming rate budget.
621621- // This prevents partial delivery: if we enqueue 2 of 5 recipients then fail,
622622- // the client retries all 5, duplicating the first 2.
623623- if !queue.HasCapacity(len(to)) {
624624- metrics.MessagesRejected.WithLabelValues("queue_full").Inc()
625625- return fmt.Errorf("451 delivery queue full — try again later")
626626- }
627627-628628- // Filter out suppressed recipients BEFORE consuming rate budget so
629629- // an unsubscribed recipient doesn't count against the member's daily
630630- // limit. Rejecting the whole batch here would surprise senders who
631631- // include a mix of subscribed and unsubscribed addresses — instead
632632- // we quietly drop suppressed recipients and proceed with the rest.
633633- // If ALL recipients are suppressed, return 550.
634634- var deliverable []string
635635- var suppressedCount int
636636- if unsubscriber != nil {
637637- for _, r := range to {
638638- supp, err := store.IsSuppressed(context.Background(), member.DID, r)
639639- if err != nil {
640640- log.Printf("suppression.check_error: did=%s recipient=%s error=%v", member.DID, r, err)
641641- // Fail open — a DB read error shouldn't block a legitimate send.
642642- deliverable = append(deliverable, r)
643643- continue
644644- }
645645- if supp {
646646- suppressedCount++
647647- log.Printf("smtp.suppressed: did=%s recipient=%s", member.DID, r)
648648- metrics.MessagesRejected.WithLabelValues("suppressed").Inc()
649649- continue
650650- }
651651- deliverable = append(deliverable, r)
652652- }
653653- if len(deliverable) == 0 {
654654- log.Printf("smtp.all_suppressed: did=%s recipients=%d", member.DID, len(to))
655655- return &smtp.SMTPError{
656656- Code: 550,
657657- EnhancedCode: smtp.EnhancedCode{5, 7, 1},
658658- Message: "All recipients have unsubscribed",
659659- }
660660- }
661661- if suppressedCount > 0 {
662662- log.Printf("smtp.partial_suppressed: did=%s total=%d suppressed=%d deliverable=%d",
663663- member.DID, len(to), suppressedCount, len(deliverable))
664664- }
665665- } else {
666666- deliverable = to
667667- }
668668-669669- // Atomically check rate limits AND record the sends for the full batch.
670670- // This eliminates the TOCTOU race where concurrent sessions could both pass
671671- // a check-only call before either records. Uses the same label policy as
672672- // sendCheck above (highly_trusted skips warming, burst_warming throttles).
673673- var batchPolicy *relay.LabelPolicy
674674- if ospreyEnforcer != nil {
675675- p, err := ospreyEnforcer.GetPolicy(context.Background(), member.DID)
676676- if errors.Is(err, relay.ErrOspreyColdCache) {
677677- // #215: same cold-cache fail-closed as the per-msg
678678- // path; reject the batch with 451 so the sender
679679- // retries when Osprey is healthy again.
680680- return fmt.Errorf("451 osprey unreachable, please retry")
681681- }
682682- batchPolicy = p
683683- }
684684- hourly, daily := relay.WarmingLimitsForPolicy(warmingCfg, member.CreatedAt, member.HourlyLimit, member.DailyLimit, batchPolicy)
685685- if err := rateLimiter.CheckBatchAndRecord(context.Background(), member.DID, len(deliverable), hourly, daily); err != nil {
686686- if rle, ok := err.(*relay.RateLimitError); ok {
687687- rle.Tier = relay.MemberTier(warmingCfg, member.CreatedAt, time.Now())
688688- metrics.RateLimitHits.WithLabelValues(rle.LimitType).Inc()
689689- }
690690- log.Printf("ratelimit.batch_reject: did=%s recipients=%d hourly_limit=%d daily_limit=%d error=%v",
691691- member.DID, len(deliverable), hourly, daily, err)
692692- metrics.MessagesRejected.WithLabelValues("rate_limit").Inc()
693693- return err
694694- }
695695-696696- // Content fingerprint computed once from the original data (before
697697- // per-recipient headers are prepended). Used for both the messages
698698- // table (content-spray detection) and the Osprey event.
699699- subject, body := extractSubjectAndBody(data)
700700- contentFP := relay.ContentFingerprint(subject, body)
701701-702702- // Multi-RCPT DATA fans out to one queue entry per recipient. If the
703703- // loop returns early on a per-recipient error, recipients 1..N-1 are
704704- // already enqueued and the SMTP client will retry the entire DATA
705705- // (because we returned a transient error), duplicating those
706706- // recipients. Instead, we collect per-recipient outcomes and only
707707- // reject the whole DATA when ZERO recipients succeeded. See #226.
708708- outcomes := make([]relay.RecipientOutcome, 0, len(deliverable))
709709- for _, recipient := range deliverable {
710710- outcome := relay.RecipientOutcome{Recipient: recipient}
711711-712712- verpFrom := relay.VERPReturnPath(member.DID, recipient, cfg.Domain)
713713-714714- // Build per-recipient message with its own List-Unsubscribe header.
715715- // The header references a per-recipient token, so each recipient
716716- // can unsubscribe only themselves (not the whole batch).
717717- perMsgData := data
718718- if unsubscriber != nil {
719719- lu, lup := unsubscriber.HeaderValues(member.DID, recipient, time.Now())
720720- perMsgData = prependListUnsubHeaders(data, lu, lup)
721721- }
722722- // X-Atmos-Member-Did: stamps the sending member's DID on every
723723- // outbound message so inbound FBL/ARF reports can be attributed
724724- // back to a member. Preserved by all major providers in Part 3
725725- // of their ARF reports. Must come before DKIM signing so the
726726- // signature covers it (and the DKIM signer includes X-Atmos-*
727727- // headers in its signed list).
728728- perMsgData = prependHeader(perMsgData, "X-Atmos-Member-Did", member.DID)
729729-730730- // Stamp Feedback-ID BEFORE signing so both the member and operator
731731- // DKIM signatures cover it. Receivers (Gmail in particular) only
732732- // trust the Feedback-ID for FBL routing when it's authenticated.
733733- // Category is "transactional" for all relay mail today; widen
734734- // when marketing/bulk categories are introduced.
735735- perMsgData = relay.PrependFeedbackID(perMsgData, "transactional", member.DID, member.Domain)
736736-737737- // DKIM sign per-recipient (required because the prepended headers
738738- // differ per recipient — a shared signature would break on the other
739739- // recipients). Slight perf cost acceptable for the deliverability win.
740740- //
741741- // Dual-domain: member signature first (d=member.Domain, required
742742- // for DMARC alignment) → operator signature on top (d=atmos.email,
743743- // carries FBL routing).
744744- signer := relay.NewDualDomainSigner(member.DKIMKeys, operatorKeys, member.Domain, cfg.OperatorDKIMDomain)
745745- signed, signErr := signer.Sign(strings.NewReader(string(perMsgData)))
746746- if signErr != nil {
747747- outcome.Err = fmt.Errorf("DKIM sign: %w", signErr)
748748- log.Printf("smtp.recipient_failed: did=%s recipient=%s stage=dkim error=%v", member.DID, recipient, signErr)
749749- outcomes = append(outcomes, outcome)
750750- continue
751751- }
752752-753753- // Log message to store
754754- msgID, insErr := store.InsertMessage(context.Background(), &relaystore.Message{
755755- MemberDID: member.DID,
756756- FromAddr: from,
757757- ToAddr: recipient,
758758- MessageID: extractMessageID(string(data)),
759759- Status: relaystore.MsgQueued,
760760- CreatedAt: time.Now().UTC(),
761761- ContentFingerprint: contentFP,
762762- })
763763- if insErr != nil {
764764- outcome.Err = fmt.Errorf("log message: %w", insErr)
765765- log.Printf("smtp.recipient_failed: did=%s recipient=%s stage=insert error=%v", member.DID, recipient, insErr)
766766- outcomes = append(outcomes, outcome)
767767- continue
768768- }
769769- outcome.MsgID = msgID
770770-771771- // Enqueue for delivery — capacity was pre-checked above so this
772772- // should only fail on spool I/O errors, not capacity.
773773- if enqErr := queue.Enqueue(&relay.QueueEntry{
774774- ID: msgID,
775775- From: verpFrom,
776776- To: recipient,
777777- Data: signed,
778778- MemberDID: member.DID,
779779- }); enqErr != nil {
780780- // Mark the row as failed so it doesn't masquerade as queued
781781- // (the orphan-reconciliation janitor would catch it eventually,
782782- // but immediate update keeps the messages table consistent).
783783- if updErr := store.UpdateMessageStatus(context.Background(), msgID, relaystore.MsgFailed, 0); updErr != nil {
784784- log.Printf("smtp.mark_failed_error: did=%s msg_id=%d error=%v", member.DID, msgID, updErr)
785785- }
786786- outcome.Err = fmt.Errorf("queue.enqueue: %w", enqErr)
787787- log.Printf("smtp.recipient_failed: did=%s recipient=%s stage=enqueue msg_id=%d error=%v", member.DID, recipient, msgID, enqErr)
788788- outcomes = append(outcomes, outcome)
789789- continue
790790- }
791791-792792- // Only count the send AFTER successful enqueue — failed recipients
793793- // shouldn't burn lifetime send-count budget. Rate counters were
794794- // pre-recorded for the full batch by CheckBatchAndRecord above; that
795795- // over-counts on partial failure but the warming/limit window is
796796- // short enough that the impact is negligible vs. the complexity of
797797- // rolling back per-recipient rate-counter rows.
798798- store.IncrementSendCount(context.Background(), member.DID)
799799-800800- outcomes = append(outcomes, outcome)
801801- }
802802-803803- succeeded, failed, retryAll, lastErr := relay.AggregateRecipientOutcomes(outcomes)
804804- if metrics.PartialDeliveryRecipients != nil {
805805- if succeeded > 0 {
806806- metrics.PartialDeliveryRecipients.WithLabelValues("succeeded").Add(float64(succeeded))
807807- }
808808- if failed > 0 {
809809- metrics.PartialDeliveryRecipients.WithLabelValues("failed").Add(float64(failed))
810810- }
811811- }
812812- if retryAll {
813813- metrics.MessagesRejected.WithLabelValues("delivery_failed").Inc()
814814- log.Printf("smtp.delivery_all_failed: did=%s recipients=%d last_error=%v", member.DID, len(deliverable), lastErr)
815815- return fmt.Errorf("451 delivery queue error — try again later: %w", lastErr)
816816- }
817817- if failed > 0 {
818818- if metrics.PartialDeliveries != nil {
819819- metrics.PartialDeliveries.Inc()
820820- }
821821- log.Printf("smtp.partial_delivery: did=%s succeeded=%d failed=%d last_error=%v", member.DID, succeeded, failed, lastErr)
822822- }
823823-824824- // Emit relay_attempt event after successful queuing. Enrich with
825825- // velocity counters so Osprey rules can do stateless burst + bounce
826826- // reputation checks (SML has no windowed-count primitive). Lookups
827827- // are best-effort — a query error emits 0 rather than blocking send.
828828- memberAge := int(time.Since(member.CreatedAt).Hours() / 24)
829829- now := time.Now().UTC()
830830- sendsLastHour, _ := store.GetRateCount(context.Background(), member.DID, relaystore.WindowHourly, now.Truncate(time.Hour))
831831- sendsLastMinute, _ := store.GetSendCountSince(context.Background(), member.DID, now.Add(-time.Minute))
832832- sendsLast5Min, _ := store.GetSendCountSince(context.Background(), member.DID, now.Add(-5*time.Minute))
833833- uniqueDomains, _ := store.GetUniqueRecipientDomainsSince(context.Background(), member.DID, now.Add(-time.Hour))
834834- _, bounced24h, _ := store.GetMessageCounts(context.Background(), member.DID, now.Add(-24*time.Hour))
835835- sameContentRecipients, _ := store.GetSameContentRecipientsSince(context.Background(), member.DID, contentFP, now.Add(-time.Hour))
836836- ospreyEmitter.Emit(context.Background(), osprey.EventData{
837837- EventType: osprey.EventRelayAttempt,
838838- SenderDID: member.DID,
839839- SenderDomain: member.Domain,
840840- RecipientCount: len(deliverable),
841841- SendCount: member.SendCount,
842842- MemberAgeDays: memberAge,
843843- SendsLastMinute: sendsLastMinute,
844844- SendsLast5Minutes: sendsLast5Min,
845845- SendsLastHour: sendsLastHour,
846846- HardBouncesLast24h: int(bounced24h),
847847- UniqueRecipientDomainsLastHour: uniqueDomains,
848848- ContentFingerprint: contentFP,
849849- SameContentRecipientsLastHour: sameContentRecipients,
850850- })
851851-852852- return nil
228228+ // SMTP submission pipeline. The handler bundles AUTH-time member
229229+ // lookup, per-message rate / label checking, and DATA-time
230230+ // acceptance into one struct that owns its deps explicitly.
231231+ submissions := &submissionHandler{
232232+ store: store,
233233+ queue: queue,
234234+ metrics: metrics,
235235+ rateLimiter: rateLimiter,
236236+ labelChecker: labelChecker,
237237+ ospreyEnforcer: ospreyEnforcer,
238238+ ospreyEmitter: ospreyEmitter,
239239+ unsubscriber: unsubscriber,
240240+ operatorKeys: operatorKeys,
241241+ cfg: cfg,
242242+ warmingCfg: warmingCfg,
853243 }
244244+ memberLookup := submissions.Lookup
245245+ sendCheck := submissions.Check
246246+ onAccept := submissions.Accept
854247855855- // Inbound SMTP server for bounce processing (port 25). The cache below
856856- // answers VERP "is this hash a member?" lookups without hitting the DB
857857- // on every inbound. Both a positive cache (rebuilt at most every 30s)
858858- // and a negative cache (5min TTL, 10k entries) defend against random-
859859- // VERP DoS — see #218.
860860- memberHashCache := relay.NewMemberHashCache(relay.MemberHashCacheConfig{
861861- Rebuild: func() (map[string]string, error) {
862862- members, err := store.ListMembers(context.Background())
863863- if err != nil {
864864- return nil, err
865865- }
866866- out := make(map[string]string, len(members))
867867- for _, mb := range members {
868868- out[relay.MemberHashFromDID(mb.DID)] = mb.DID
869869- }
870870- return out, nil
871871- },
872872- Metrics: metrics,
248248+ // Inbound SMTP server (port 25) for bounce processing, reply
249249+ // forwarding, operator mail, and FBL/ARF complaint ingestion. The
250250+ // FBL notifier is bound later (see fblNotify = adminAPI.FireFBLComplaint
251251+ // further down) once adminAPI exists; setFBLNotifier closes that
252252+ // loop. See cmd/relay/inbound.go.
253253+ inbound := setupInboundServer(inboundDeps{
254254+ cfg: cfg,
255255+ store: store,
256256+ metrics: metrics,
257257+ ospreyEmitter: ospreyEmitter,
258258+ bounceProcessor: bounceProcessor,
873259 })
874874-875875- inboundMemberLookup := func(_ context.Context, memberHash string) (string, bool) {
876876- return memberHashCache.Lookup(memberHash)
877877- }
878878-879879- inboundBounceHandler := func(ctx context.Context, memberDID, recipient, bounceType, details string) {
880880- if bounceType == "hard" {
881881- metrics.BouncesTotal.WithLabelValues("hard").Inc()
882882- } else {
883883- metrics.BouncesTotal.WithLabelValues("soft").Inc()
884884- }
885885-886886- ospreyEmitter.Emit(ctx, osprey.EventData{
887887- EventType: osprey.EventBounceReceived,
888888- SenderDID: memberDID,
889889- RecipientDomain: recipientDomain(recipient),
890890- BounceType: bounceType,
891891- Details: details,
892892- })
893893-894894- action, err := bounceProcessor.RecordBounce(ctx, memberDID, recipient, details)
895895- if err != nil {
896896- log.Printf("inbound.bounce_error: did=%s recipient=%s error=%v", memberDID, recipient, err)
897897- } else if action != "none" {
898898- log.Printf("inbound.bounce_action: did=%s recipient=%s bounce_type=%s action=%s", memberDID, recipient, bounceType, action)
899899- if action == "suspend" {
900900- ospreyEmitter.Emit(ctx, osprey.EventData{
901901- EventType: osprey.EventMemberSuspended,
902902- SenderDID: memberDID,
903903- Reason: "bounce_rate_exceeded",
904904- })
905905- }
906906- }
907907- }
908908-909909- inboundServer := relay.NewInboundServer(relay.InboundConfig{
910910- ListenAddr: cfg.InboundAddr,
911911- Domain: cfg.Domain,
912912- RateLimitMsgsPerMinute: cfg.InboundRateLimitMsgsPerMinute,
913913- RateLimitBurst: cfg.InboundRateLimitBurst,
914914- }, inboundBounceHandler, inboundMemberLookup)
915915-916916- // Inbound reply forwarding: classify inbound mail and deliver replies
917917- // to the member's registered forward_to mailbox. SRS key is persisted
918918- // so bounces of already-forwarded mail remain verifiable across
919919- // restarts. Key generation mirrors the unsub key pattern.
920920- srsKey, err := relay.LoadOrCreateUnsubKey(cfg.StateDir + "/srs.key")
921921- if err != nil {
922922- log.Fatalf("load srs key: %v", err)
923923- }
924924- srsRewriter := relay.NewSRSRewriter(srsKey, cfg.Domain)
925925- forwarder := relay.NewForwarder(srsRewriter, cfg.Domain)
926926- domainLookup := func(ctx context.Context, domain string) (string, bool) {
927927- d, err := store.GetMemberDomain(ctx, domain)
928928- if err != nil || d == nil {
929929- return "", false
930930- }
931931- return d.ForwardTo, true
932932- }
933933- inboundServer.SetReplyForwarding(domainLookup, forwarder, srsRewriter)
934934- // Operator mail (postmaster@, abuse@, fbl@ at the relay's own domain)
935935- // is forwarded externally when configured. Without this, provider
936936- // verification emails (Microsoft SNDS authorization, Yahoo CFL
937937- // confirmation) and ops-team mail would land in the audit log and
938938- // never reach a human.
939939- if cfg.OperatorForwardTo != "" {
940940- inboundServer.SetOperatorForwarding(forwarder, cfg.OperatorForwardTo)
941941- log.Printf("operator_forward.enabled: to=%s", cfg.OperatorForwardTo)
942942- }
943943- inboundServer.SetMetrics(metrics)
944944- // Persistent audit log — every accepted inbound message lands in the
945945- // inbound_messages table. Failures inside LogInbound are swallowed so
946946- // SMTP delivery is never affected.
947947- inboundServer.SetInboundLogger(&relayInboundLogger{store: store})
948948- log.Printf("inbound.reply_forwarding.enabled: srs_domain=%s", cfg.Domain)
949949-950950- // FBL / ARF feedback-report ingestion. Inbound reports to fbl@<domain>
951951- // get parsed, attributed to the sending member, and emitted as an
952952- // Osprey complaint event so rules can react (e.g. auto-suspend after
953953- // N complaints in 24h). memberExists guards against spoofed reports
954954- // naming DIDs we never issued.
955955- var fblNotify func(ctx context.Context, memberDID, senderDomain, recipientDomain, feedbackType, provider string)
956956- memberExists := func(ctx context.Context, did string) bool {
957957- m, err := store.GetMember(ctx, did)
958958- return err == nil && m != nil
959959- }
960960- inboundServer.SetFBL(func(ctx context.Context, memberDID, senderDomain, recipientDomain, feedbackType, providerUA string, arrival time.Time) {
961961- provider := normalizeProviderUA(providerUA)
962962- metrics.ComplaintsTotal.WithLabelValues(feedbackType, provider).Inc()
963963- ospreyEmitter.Emit(ctx, osprey.EventData{
964964- EventType: osprey.EventComplaintReceived,
965965- SenderDID: memberDID,
966966- SenderDomain: senderDomain,
967967- RecipientDomain: recipientDomain,
968968- FeedbackType: feedbackType,
969969- ProviderUA: providerUA,
970970- })
971971- if fblNotify != nil {
972972- fblNotify(ctx, memberDID, senderDomain, recipientDomain, feedbackType, provider)
973973- }
974974- }, memberExists)
975975- log.Printf("inbound.fbl.enabled: inbox=fbl@%s", cfg.Domain)
260260+ inboundServer := inbound.Server
976261977262 // Context for graceful shutdown
978263 ctx, cancel := context.WithCancel(context.Background())
979264 defer cancel()
980265981981- // Osprey labelcheck cache snapshotter (#215). Persists the
982982- // in-memory enforcer cache every 60s so a relay restart doesn't
983983- // reset to fully cold. Combined with fail-closed-on-cold-cache
984984- // in the enforcer, this turns the previously-load-bearing
985985- // fail-open path into a rare edge case (snapshot read failed
986986- // AND Osprey unreachable AND DID has never been seen).
987987- if ospreyEnforcer != nil {
988988- relay.GoSafe("osprey.cache_snapshot", func() {
989989- t := time.NewTicker(60 * time.Second)
990990- defer t.Stop()
991991- for {
992992- select {
993993- case <-ctx.Done():
994994- if err := ospreyEnforcer.Snapshot(); err != nil {
995995- log.Printf("osprey.cache.snapshot_error_on_shutdown: %v", err)
996996- }
997997- return
998998- case <-t.C:
999999- if err := ospreyEnforcer.Snapshot(); err != nil {
10001000- log.Printf("osprey.cache.snapshot_error: %v", err)
10011001- }
10021002- }
10031003- }
10041004- })
10051005- }
10061006-10071007- // Osprey DLQ replayer (#214). Drains the on-disk spool back to
10081008- // Kafka every 30s. A sustained Kafka outage manifests as a
10091009- // growing osprey_spool_depth gauge without permanent loss until
10101010- // the cap is hit. Started here, after ctx is in scope, so the
10111011- // loop respects the same shutdown signal as the rest of the
10121012- // long-lived goroutines.
10131013- if ospreyEmitter.Enabled() {
10141014- relay.GoSafe("osprey.replayer", func() {
10151015- t := time.NewTicker(30 * time.Second)
10161016- defer t.Stop()
10171017- for {
10181018- select {
10191019- case <-ctx.Done():
10201020- return
10211021- case <-t.C:
10221022- n, failed, err := ospreyEmitter.ReplaySpool(ctx)
10231023- if err != nil {
10241024- log.Printf("osprey.replay.error: %v", err)
10251025- continue
10261026- }
10271027- if n > 0 || failed > 0 {
10281028- log.Printf("osprey.replay: replayed=%d failed=%d", n, failed)
10291029- }
10301030- }
10311031- }
10321032- })
10331033- }
10341034-1035266 sigCh := make(chan os.Signal, 1)
1036267 signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
1037268 relay.GoSafe("signal.shutdown", func() {
···1040271 cancel()
1041272 })
104227310431043- // Start SMTP server. TLS uses CertReloader (#216) so ACME cert
10441044- // renewals are picked up automatically without a process restart.
10451045- // Previously the ACME reloadServices hook restarted the relay
10461046- // every 60-90 days, dropping in-flight SMTP/HTTP sessions and
10471047- // triggering the spool-reload race in #208. With GetCertificate,
10481048- // the next TLS handshake after renewal serves the new cert with
10491049- // zero session disruption.
274274+ // Start SMTP server. TLS uses CertReloader so ACME cert renewals
275275+ // are picked up automatically without a process restart.
1050276 var tlsConfig *tls.Config
1051277 if cfg.TLSCertFile != "" && cfg.TLSKeyFile != "" {
1052278 reloader, err := relay.NewCertReloader(cfg.TLSCertFile, cfg.TLSKeyFile)
···1090316 // Still needed for /enroll/resolve (handle→DID lookup in the wizard).
1091317 didResolver := relay.NewDIDResolver(&http.Client{Timeout: 10 * time.Second}, "")
109231810931093- // Periodically sweep expired pending enrollments so stale rows don't
10941094- // accumulate. One-per-hour is plenty given 24h TTL and UNIQUE(domain)
10951095- // already guarantees the table stays small in practice.
10961096- relay.GoSafe("pending_enrollment_cleanup", func() {
10971097- t := time.NewTicker(1 * time.Hour)
10981098- defer t.Stop()
10991099- for range t.C {
11001100- cutoff := time.Now().UTC()
11011101- if n, err := store.CleanExpiredPendingEnrollments(context.Background(), cutoff); err != nil {
11021102- log.Printf("pending_enrollment_cleanup: error=%v", err)
11031103- } else if n > 0 {
11041104- log.Printf("pending_enrollment_cleanup: expired=%d", n)
11051105- }
11061106- }
11071107- })
11081108-11091109- // SQLite pool-stats sampler (#210). Polls sql.DB.Stats() every
11101110- // 10s and republishes the values as Prometheus gauges so
11111111- // operators can graph pool pressure (open/in-use/idle) and
11121112- // contention (WaitCount, WaitDuration) without a busy-error
11131113- // ever escaping the 5s busy_timeout PRAGMA. Combined with
11141114- // metrics.SQLiteBusyErrors at hot writers, this turns the
11151115- // previously-invisible contention surface into both a leading
11161116- // indicator (pool waits climbing) AND a firing one (busy
11171117- // errors actually returned).
11181118- relay.GoSafe("sqlite.stats", func() {
11191119- t := time.NewTicker(10 * time.Second)
11201120- defer t.Stop()
11211121- for {
11221122- select {
11231123- case <-ctx.Done():
11241124- return
11251125- case <-t.C:
11261126- ps := store.SampleStats()
11271127- metrics.SetSQLiteStats(ps.OpenConnections, ps.InUse, ps.Idle, ps.WaitCount, ps.WaitDurationSecond)
11281128- }
11291129- }
11301130- })
11311131-11321132- // Orphan-reconciliation janitor (#208). Finds messages rows that
11331133- // are still status=queued long after creation but have no spool
11341134- // file backing them, and marks them failed so dashboards stop
11351135- // showing them as in-flight forever and operators can see the
11361136- // rate via metrics.OrphanReconciled.
11371137- //
11381138- // Why this is necessary: a multi-recipient batch where recipient
11391139- // N's queue.Enqueue fails after recipients 1..N-1 succeeded
11401140- // leaves an N-th row at status=queued with no spool entry. The
11411141- // SMTP session returns 4xx; the client retries; rows for
11421142- // recipients 1..N-1 get duplicated; the original N-th row is
11431143- // orphaned. Fixing the duplicate-delivery side requires changing
11441144- // the SMTP session to accept partial success (#226 follow-up);
11451145- // this janitor closes the orphan accounting in the meantime.
11461146- //
11471147- // orphanMinAge gives Enqueue plenty of time to land its spool
11481148- // file before we second-guess. 5 minutes is far longer than any
11491149- // reasonable Enqueue path.
11501150- const orphanMinAge = 5 * time.Minute
11511151- relay.GoSafe("orphan_reconcile", func() {
11521152- t := time.NewTicker(5 * time.Minute)
11531153- defer t.Stop()
11541154- for range t.C {
11551155- ids, err := store.ListQueuedMessageIDsOlderThan(context.Background(), orphanMinAge, 500)
11561156- if err != nil {
11571157- log.Printf("orphan_reconcile: list_error=%v", err)
11581158- continue
11591159- }
11601160- closed := 0
11611161- for _, id := range ids {
11621162- if spool.Exists(id) {
11631163- continue
11641164- }
11651165- if err := store.UpdateMessageStatus(context.Background(), id, relaystore.MsgFailed, 0); err != nil {
11661166- log.Printf("orphan_reconcile: update_error id=%d error=%v", id, err)
11671167- continue
11681168- }
11691169- closed++
11701170- metrics.OrphanReconciled.Inc()
11711171- }
11721172- if closed > 0 {
11731173- log.Printf("orphan_reconcile: scanned=%d closed=%d", len(ids), closed)
11741174- }
11751175- }
11761176- })
11771177-11781178- // Periodic refresh of the inbound member-hash cache (#218). The cache
11791179- // rebuilds on-miss too, but that path is rate-limited to one rebuild
11801180- // per 30s; this background ticker guarantees newly enrolled members
11811181- // become resolvable within ~60s without needing a miss to trigger it.
11821182- relay.GoSafe("member_hash_refresh", func() {
11831183- memberHashCache.PeriodicRebuild(ctx, 60*time.Second)
11841184- })
11851185-11861186- // Bypass-expiry janitor (#213). Runs every 5min; removes bypass
11871187- // entries whose expires_at has passed and writes 'expired' audit
11881188- // rows. Without this, an admin token compromise that issued a
11891189- // long bypass would persist past any reasonable detection
11901190- // window — even with the expiry recorded, removal needs an
11911191- // active sweep. Legacy bypass entries (expires_at='') are NOT
11921192- // touched; operators must explicitly re-add with expiry.
11931193- relay.GoSafe("bypass_expiry", func() {
11941194- t := time.NewTicker(5 * time.Minute)
11951195- defer t.Stop()
11961196- for {
11971197- select {
11981198- case <-ctx.Done():
11991199- return
12001200- case <-t.C:
12011201- // Snapshot the live set before purge so we can mirror
12021202- // the eviction into the labelChecker's in-memory bypass
12031203- // list. The store path uses formatTime cutoffs; the
12041204- // in-memory set is just a string slice, so we recompute
12051205- // the diff: anything in labelChecker.BypassDIDs() that
12061206- // isn't in the post-purge store list has expired.
12071207- n, err := store.PurgeExpiredBypassDIDs(context.Background())
12081208- if err != nil {
12091209- log.Printf("bypass_expiry: error=%v", err)
12101210- continue
12111211- }
12121212- if n == 0 {
12131213- continue
12141214- }
12151215- active, err := store.ListBypassDIDs(context.Background())
12161216- if err != nil {
12171217- log.Printf("bypass_expiry: list_error=%v", err)
12181218- continue
12191219- }
12201220- keep := make(map[string]struct{}, len(active))
12211221- for _, d := range active {
12221222- keep[d] = struct{}{}
12231223- }
12241224- for _, d := range labelChecker.BypassDIDs() {
12251225- if _, ok := keep[d]; !ok {
12261226- labelChecker.RemoveBypassDID(d)
12271227- }
12281228- }
12291229- log.Printf("bypass_expiry: removed=%d", n)
12301230- }
12311231- }
12321232- })
12331233-12341234- // Start admin API (includes /metrics endpoint)
12351235- adminAPI := admin.NewComplete(store, cfg.AdminToken, cfg.Domain, labelChecker, spfChecker, domainVerifier)
12361236- // Register the operator DKIM copy-paste view. Admin-token-authenticated
12371237- // (same as the rest of /admin/*), Tailscale-only via the admin mux bind.
12381238- if operatorKeys != nil {
12391239- adminAPI.SetOperatorDKIM(operatorKeys, cfg.OperatorDKIMDomain)
12401240- }
12411241-12421242- // Operator notification webhook. Pluggable per deployment — the
12431243- // operator brings their own sink (Slack/Matrix/ntfy/etc.). Empty
12441244- // URL disables notifications; SetNotifier tolerates a nil sender.
12451245- // Validate scheme/host at startup so a misconfig fails fast rather
12461246- // than silently posting credentials over plaintext or to file://.
12471247- if err := config.ValidateWebhookURL(cfg.OperatorWebhookURL); err != nil {
12481248- log.Fatalf("invalid operatorWebhookURL: %v", err)
12491249- }
12501250- if notifier := notify.NewSender(cfg.OperatorWebhookURL, cfg.OperatorWebhookSecret); notifier != nil {
12511251- adminAPI.SetNotifier(notifier)
12521252- // Log host only — Slack/Discord/Matrix incoming webhooks carry
12531253- // authorization material in the URL path, so the full URL must
12541254- // not land in journald.
12551255- log.Printf("notify.enabled: host=%s signed=%v", webhookHostForLog(cfg.OperatorWebhookURL), cfg.OperatorWebhookSecret != "")
12561256- }
12571257-12581258- // System-mail helper: operator-ping on enroll, member-welcome on approve,
12591259- // key-regenerated on rotate. Signs with the operator DKIM keypair and
12601260- // delivers via the same direct-MX path as member mail. Disabled if no
12611261- // operator DKIM is configured.
12621262- if operatorKeys != nil {
12631263- opSigner := relay.NewDKIMSigner(operatorKeys, cfg.OperatorDKIMDomain)
12641264- opMailer := relay.NewOpMailer(
12651265- relay.OpMailContext{RelayDomain: cfg.OperatorDKIMDomain},
12661266- opSigner,
12671267- relay.DefaultOpMailSender(),
12681268- relay.WithOpMailMetrics(metrics),
12691269- )
12701270- adminAPI.SetOpMailer(opMailer, cfg.OperatorForwardTo, cfg.PublicBaseURL)
12711271- }
12721272-12731273- fblNotify = adminAPI.FireFBLComplaint
12741274-12751275- // Operator-initiated warmup sends. Seed addresses come from a
12761276- // sops-encrypted env var so they never appear in the repo. Empty
12771277- // WARMUP_SEED_ADDRESSES disables the feature (button hidden in UI).
12781278- if seeds := os.Getenv("WARMUP_SEED_ADDRESSES"); seeds != "" {
12791279- seedList := strings.Split(seeds, ",")
12801280- for i := range seedList {
12811281- seedList[i] = strings.TrimSpace(seedList[i])
12821282- }
12831283- var fromParts []string
12841284- if fp := os.Getenv("WARMUP_FROM_LOCAL_PARTS"); fp != "" {
12851285- for _, p := range strings.Split(fp, ",") {
12861286- fromParts = append(fromParts, strings.TrimSpace(p))
12871287- }
12881288- }
12891289- ws := relay.NewWarmupSender(relay.WarmupConfig{
12901290- SeedAddresses: seedList,
12911291- FromLocalParts: fromParts,
12921292- MemberLookup: memberLookup,
12931293- Queue: queue,
12941294- OperatorKeys: operatorKeys,
12951295- OperatorDKIMDomain: cfg.OperatorDKIMDomain,
12961296- RelayDomain: cfg.Domain,
12971297- InsertMessage: func(ctx context.Context, did, from, to, msgID string) (int64, error) {
12981298- return store.InsertMessage(ctx, &relaystore.Message{
12991299- MemberDID: did,
13001300- FromAddr: from,
13011301- ToAddr: to,
13021302- MessageID: msgID,
13031303- Status: relaystore.MsgQueued,
13041304- CreatedAt: time.Now().UTC(),
13051305- })
13061306- },
13071307- IncrSendCount: func(ctx context.Context, did string) {
13081308- store.IncrementSendCount(ctx, did)
13091309- },
13101310- })
13111311- adminAPI.SetWarmupSender(ws)
13121312- log.Printf("warmup.enabled: seed_count=%d", len(seedList))
13131313-13141314- if warmupDIDsEnv := os.Getenv("WARMUP_DIDS"); warmupDIDsEnv != "" {
13151315- var warmupDIDs []string
13161316- for _, d := range strings.Split(warmupDIDsEnv, ",") {
13171317- warmupDIDs = append(warmupDIDs, strings.TrimSpace(d))
13181318- }
13191319- warmupSched := relay.NewWarmupScheduler(relay.WarmupSchedulerConfig{
13201320- Sender: ws,
13211321- ListDIDs: func(ctx context.Context) ([]string, error) {
13221322- return warmupDIDs, nil
13231323- },
13241324- })
13251325- warmupSched.Start(ctx)
13261326- defer warmupSched.Stop()
13271327- log.Printf("warmup.scheduler: dids=%v", warmupDIDs)
13281328- }
13291329- }
13301330-13311331- // Durable notification queue worker (audit #158). Drains
13321332- // pending_notifications rows that RegenerateKey / FireMemberWelcome
13331333- // enqueue, dispatching each via the admin API's kind-aware
13341334- // DeliverNotification. Failures retry with exponential backoff and
13351335- // dead-letter after MaxNotificationAttempts. 15s tick is fast enough
13361336- // that rotation mail lands within a minute under normal conditions,
13371337- // slow enough not to hammer an already-struggling downstream.
13381338- notifyWorker := notify.NewQueueWorker(store, adminAPI.DeliverNotification, 15*time.Second)
13391339- relay.GoSafe("notify.queue", func() {
13401340- if err := notifyWorker.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
13411341- log.Printf("notify.queue: %v", err)
13421342- }
319319+ // Admin server (Tailscale-only) — admin API + dashboard UI + events,
320320+ // inbound, and review-queue UI handlers + opMailer + warmup +
321321+ // notify-queue worker. See cmd/relay/admin.go.
322322+ adminServer, adminAPI := setupAdminServer(adminDeps{
323323+ ctx: ctx,
324324+ cfg: cfg,
325325+ store: store,
326326+ metrics: metrics,
327327+ metricsRegistry: metricsRegistry,
328328+ queue: queue,
329329+ labelChecker: labelChecker,
330330+ spfChecker: spfChecker,
331331+ domainVerifier: domainVerifier,
332332+ operatorKeys: operatorKeys,
333333+ memberLookup: memberLookup,
334334+ bindFBLNotifier: inbound.SetFBLNotifier,
1343335 })
13441344- log.Printf("notify.queue.enabled: tick=15s max_attempts=%d", relaystore.MaxNotificationAttempts)
134533613461346- dashboardUI := adminui.NewWithQueue(store, labelChecker, func() int { return queue.Depth() })
13471347- // CSRF allowlist for /ui/* POSTs. Empty list fails-closed: dashboard
13481348- // becomes read-only until operator populates adminOrigins in config.
13491349- dashboardUI.AllowOrigins(cfg.AdminOrigins)
13501350- if len(cfg.AdminOrigins) == 0 {
13511351- log.Printf("system.startup.warn: adminOrigins is empty — admin UI state-changing POSTs will be rejected by CSRF middleware")
13521352- }
13531353- // Wire the UI approve path to fire the member-welcome email via the
13541354- // admin API. Goroutined inside the handler so the htmx response isn't
13551355- // blocked on the mail send.
13561356- dashboardUI.SetApproveHook(func(did, domain, contactEmail string) {
13571357- adminAPI.FireMemberWelcome(context.Background(), domain, contactEmail)
13581358- })
13591359- // Wire the UI regenerate-key button through the admin API's
13601360- // transport-agnostic RegenerateKey core — same rotation semantics
13611361- // as the HTTP endpoint (shape of errors, atomic hash update,
13621362- // notification email fired automatically).
13631363- dashboardUI.SetRegenerateKeyHook(func(did, domain string) (string, string, error) {
13641364- selected, apiKey, err := adminAPI.RegenerateKey(context.Background(), did, domain)
13651365- return apiKey, selected, err
13661366- })
13671367- // Mirror UI-side state changes (suspend/reactivate/reject/approve)
13681368- // into the operator notification webhook so operators see the same
13691369- // event stream regardless of which interface triggered it.
13701370- dashboardUI.SetNotifyStateChangeHook(adminAPI.NotifyStateChange)
13711371- if adminAPI.WarmupSeedCount() > 0 {
13721372- dashboardUI.SetWarmupHook(func(ctx context.Context, did string) (int, int, []string, error) {
13731373- result, err := adminAPI.SendWarmup(ctx, did)
13741374- if err != nil {
13751375- return 0, 0, nil, err
13761376- }
13771377- return result.Sent, result.Failed, result.Errors, nil
13781378- }, adminAPI.WarmupSeedCount())
13791379- }
13801380- eventsUI := adminui.NewEventsHandler(store)
13811381- inboundUI := adminui.NewInboundHandler(store)
13821382- reviewQueueUI := adminui.NewReviewQueueHandler(store)
13831383- adminMux := http.NewServeMux()
13841384- adminMux.HandleFunc("GET /{$}", func(w http.ResponseWriter, r *http.Request) {
13851385- http.Redirect(w, r, "/ui/", http.StatusFound)
13861386- })
13871387- adminMux.Handle("/ui/", dashboardUI)
13881388- // Relay-local event mirror pages — /admin/events, /admin/members/{did}/events,
13891389- // /admin/rules. These replace the old Druid-backed Osprey UI and run on
13901390- // the Tailscale-only admin listener.
13911391- eventsUI.Register(adminMux)
13921392- // Inbound audit log pages — /admin/inbound, /admin/inbound/{id}.
13931393- // Same Tailscale-only mux.
13941394- inboundUI.Register(adminMux)
13951395- // Human review queue for auto-suspension overrides —
13961396- // /admin/review-queue and POST actions under it.
13971397- reviewQueueUI.Register(adminMux)
13981398- adminMux.Handle("/", adminAPI)
13991399- adminMux.Handle("/metrics", promhttp.HandlerFor(metricsRegistry, promhttp.HandlerOpts{}))
14001400- adminServer := &http.Server{
14011401- Addr: cfg.AdminAddr,
14021402- Handler: adminMux,
14031403- ReadTimeout: 10 * time.Second,
14041404- WriteTimeout: 30 * time.Second,
14051405- IdleTimeout: 120 * time.Second,
14061406- }
14071407- relay.GoSafe("admin.serve", func() {
14081408- log.Printf("admin API listening on %s", cfg.AdminAddr)
14091409- if err := adminServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
14101410- log.Printf("admin server: %v", err)
14111411- }
337337+ // Public HTTPS listener — site mux (marketing + enrollment + OAuth),
338338+ // infra mux (unsubscribe + healthz), SNI cert routing, and listener
339339+ // goroutine. See cmd/relay/public.go.
340340+ public := setupPublicServer(publicDeps{
341341+ cfg: cfg,
342342+ adminAPI: adminAPI,
343343+ didResolver: didResolver,
344344+ unsubscriber: unsubscriber,
345345+ store: store,
346346+ metrics: metrics,
347347+ labelChecker: labelChecker,
348348+ tlsConfig: tlsConfig,
1412349 })
141335014141414- // Public HTTPS listener — answers on multiple hostnames with different
14151415- // roles. See internal/relay/publicrouter.go for the routing rules:
14161416- //
14171417- // "site" — full marketing / legal / enrollment (atmospheremail.com)
14181418- // "infra" — operational endpoints only (smtp.atmos.email): /u/, /healthz
14191419- // "redirect" — 301 to canonical apex (atmos.email → atmospheremail.com)
14201420- //
14211421- // When PublicDomains is empty the listener falls back to serving the full
14221422- // handler set on any Host with the legacy single-cert TLS config — this
14231423- // keeps local/dev deploys simple.
14241424- //
14251425- // Admin UI stays Tailscale-only on :8080. The enroll handler invokes the
14261426- // admin API in-process via httptest — it never forwards the caller's
14271427- // Authorization header, so admin credentials can't leak to the public
14281428- // listener regardless of which Host the request came in on.
14291429- var publicServer *http.Server
14301430- // Captured for graceful shutdown — the RecoverHandler spawns a
14311431- // background prune ticker that must be stopped explicitly. Nil
14321432- // when the OAuth client isn't configured.
14331433- var recoverHandlerForShutdown *adminui.RecoverHandler
14341434- if cfg.PublicAddr != "" && unsubscriber != nil {
14351435- enrollHandler := adminui.NewEnrollHandler(adminAPI, didResolver)
14361436- enrollHandler.SetDomainLister(storeDomainLister{store: store})
14371437- enrollHandler.SetFunnelRecorder(metrics)
14381438- // Bind enrollment to OAuth-verified DIDs (#207). Without this
14391439- // wire, /admin/enroll-start and /admin/enroll accept any DID
14401440- // from a request body — letting an attacker who only owns a
14411441- // domain enroll under any victim's atproto identity.
14421442- adminAPI.SetEnrollAuthVerifier(enrollHandler)
14431443- // Enable /enroll/label-status for the success-page polling UX.
14441444- // LabelChecker is tailnet-only; proxying through the relay keeps
14451445- // labeler connectivity private.
14461446- if labelChecker != nil {
14471447- enrollHandler.SetLabelStatusQuerier(labelChecker)
14481448- }
14491449- healthHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
14501450- w.WriteHeader(http.StatusOK)
14511451- _, _ = w.Write([]byte("ok\n"))
14521452- })
14531453-14541454- // Site mux: full public site (marketing, enrollment, legal, plus
14551455- // the functional unsubscribe endpoint and a health check so callers
14561456- // hitting the canonical host for those still work).
14571457- siteMux := http.NewServeMux()
14581458- siteMux.Handle("/", enrollHandler)
14591459- siteMux.Handle("/enroll", enrollHandler)
14601460- siteMux.Handle("/enroll/", enrollHandler)
14611461- siteMux.Handle("/u/", unsubscriber.Handler())
14621462- siteMux.Handle("/healthz", healthHandler)
14631463- siteMux.HandleFunc("/verify-email", adminAPI.HandleVerifyEmail)
14641464-14651465- // Self-service attestation publishing (atproto OAuth). Only active
14661466- // when the operator has configured SiteBaseURL — the client_id MUST
14671467- // equal the metadata URL per spec, so without a baseURL the metadata
14681468- // endpoint would publish a client_id we can't honor. The wizard's
14691469- // Step 4 button (in EnrollSuccess) POSTs to /enroll/attest/start.
14701470- if cfg.SiteBaseURL != "" {
14711471- oauthCfg := atpoauth.Config{
14721472- ClientID: cfg.SiteBaseURL + "/.well-known/atproto-oauth-client-metadata.json",
14731473- CallbackURL: cfg.SiteBaseURL + "/enroll/attest/callback",
14741474- Scopes: []string{"atproto", "repo:email.atmos.attestation"},
14751475- SigningKeyPath: cfg.StateDir + "/oauth-signing-key.pem",
14761476- }
14771477- oauthClient, err := atpoauth.NewClient(oauthCfg, store)
14781478- if err != nil {
14791479- log.Fatalf("atpoauth.NewClient: %v", err)
14801480- }
14811481- siteMux.Handle("/.well-known/atproto-oauth-client-metadata.json",
14821482- adminui.NewMetadataHandler(oauthClient, "Atmosphere Mail", cfg.SiteBaseURL))
14831483- pub := &adminui.AtpoauthPublisher{C: oauthClient}
14841484- attestHandler := adminui.NewAttestHandler(pub, store)
14851485- attestHandler.SetFunnelRecorder(metrics)
14861486- attestHandler.SetDIDHandleResolver(didResolver)
14871487- attestHandler.RegisterRoutes(siteMux)
14881488-14891489- // Self-service credential recovery. Shares the attest OAuth
14901490- // callback (indigo only supports one redirect URI per client)
14911491- // and dispatches on whether the session carries an attestation
14921492- // payload — empty means recovery. regenFn wraps the admin API's
14931493- // transport-agnostic RegenerateKey so the rotation path is
14941494- // identical between operator-triggered and member-triggered.
14951495- recoverHandler := adminui.NewRecoverHandler(pub, store, cfg.SiteBaseURL,
14961496- func(did, domain string) (string, error) {
14971497- _, apiKey, err := adminAPI.RegenerateKey(context.Background(), did, domain)
14981498- return apiKey, err
14991499- })
15001500- recoverHandler.SetHandleResolver(didResolver)
15011501- recoverHandler.SetContactEmailChangedHook(func(ctx context.Context, domain, contactEmail string) {
15021502- adminAPI.TriggerEmailVerification(ctx, domain, contactEmail)
15031503- })
15041504- recoverHandler.RegisterRoutes(siteMux)
15051505- attestHandler.SetRecoveryIssuer(recoverHandler)
15061506- attestHandler.SetEnrollAuthIssuer(enrollHandler)
15071507- enrollHandler.SetPublisher(pub)
15081508- enrollHandler.SetAccountTicketIssuer(recoverHandler)
15091509- recoverHandlerForShutdown = recoverHandler
15101510-15111511- log.Printf("atpoauth.enabled: client_id=%s callback=%s confidential=%v",
15121512- oauthCfg.ClientID, oauthCfg.CallbackURL, oauthClient.IsConfidential())
15131513- }
15141514-15151515- // Infra mux: narrow surface for the SMTP domain. The List-Unsubscribe
15161516- // header points here by design (PublicBaseURL), so /u/ must remain
15171517- // addressable, but we deliberately don't serve the marketing UI on
15181518- // the infra host.
15191519- infraMux := http.NewServeMux()
15201520- infraMux.Handle("/u/", unsubscriber.Handler())
15211521- infraMux.Handle("/healthz", healthHandler)
15221522- infraMux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
15231523- http.Error(w, "not found", http.StatusNotFound)
15241524- })
15251525-15261526- var publicHandler http.Handler
15271527- var publicTLS *tls.Config
15281528-15291529- if len(cfg.PublicDomains) > 0 {
15301530- // Build a SNI-aware cert map. One cert per PublicDomain; the
15311531- // TLS stack picks the right one via ClientHelloInfo.ServerName.
15321532- certByHost := make(map[string]*tls.Certificate, len(cfg.PublicDomains))
15331533- routes := make(map[string]relay.HostRoute, len(cfg.PublicDomains))
15341534- var anyCert *tls.Certificate
15351535- for _, pd := range cfg.PublicDomains {
15361536- role, err := parseHostRole(pd.Role)
15371537- if err != nil {
15381538- log.Fatalf("publicDomains: host=%s: %v", pd.Host, err)
15391539- }
15401540- routes[pd.Host] = relay.HostRoute{Role: role, RedirectTo: pd.RedirectTo}
15411541- if pd.CertFile == "" || pd.KeyFile == "" {
15421542- // Redirect-only hosts may share a wildcard cert with
15431543- // another entry — an empty cert path means "inherit".
15441544- continue
15451545- }
15461546- c, err := tls.LoadX509KeyPair(pd.CertFile, pd.KeyFile)
15471547- if err != nil {
15481548- log.Printf("public.tls_unavailable: host=%s error=%v (domain will fail TLS handshake until cert is provisioned)", pd.Host, err)
15491549- continue
15501550- }
15511551- cert := c
15521552- certByHost[strings.ToLower(pd.Host)] = &cert
15531553- if anyCert == nil {
15541554- anyCert = &cert
15551555- }
15561556- log.Printf("public.tls_loaded: host=%s cert=%s", pd.Host, pd.CertFile)
15571557- }
15581558- if anyCert == nil {
15591559- log.Fatalf("publicDomains: at least one domain must have a loadable cert")
15601560- }
15611561- publicTLS = &tls.Config{
15621562- MinVersion: tls.VersionTLS12,
15631563- GetCertificate: func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
15641564- if c, ok := certByHost[strings.ToLower(hello.ServerName)]; ok {
15651565- return c, nil
15661566- }
15671567- // Unknown SNI: hand back any cert so the client gets a
15681568- // response (it'll fail hostname verification, which is
15691569- // the right outcome). Better than leaking a handshake
15701570- // error at the TCP layer.
15711571- return anyCert, nil
15721572- },
15731573- }
15741574- // Default fallback = siteMux — misdirected requests to unknown
15751575- // hosts get the marketing site, not a 404.
15761576- publicHandler = relay.NewPublicRouter(routes, siteMux, infraMux, siteMux)
15771577- for h, r := range routes {
15781578- log.Printf("public.route: host=%s role=%d redirect_to=%s", h, r.Role, r.RedirectTo)
15791579- }
15801580- } else {
15811581- // Legacy single-cert mode: every Host gets the full site handler.
15821582- if tlsConfig == nil {
15831583- log.Printf("public.tls_unavailable: skipping public listener (no cert and no publicDomains)")
15841584- } else {
15851585- publicTLS = tlsConfig
15861586- publicHandler = siteMux
15871587- log.Printf("public.mode: legacy_single_cert")
15881588- }
15891589- }
15901590-15911591- if publicHandler != nil && publicTLS != nil {
15921592- publicServer = &http.Server{
15931593- Addr: cfg.PublicAddr,
15941594- Handler: metrics.HTTPMiddleware(publicHandler),
15951595- TLSConfig: publicTLS,
15961596- ReadTimeout: 10 * time.Second,
15971597- WriteTimeout: 10 * time.Second,
15981598- IdleTimeout: 60 * time.Second,
15991599- }
16001600- publicErrCh := make(chan error, 1)
16011601- relay.GoSafe("public.serve", func() {
16021602- log.Printf("public HTTPS listening on %s", cfg.PublicAddr)
16031603- if err := publicServer.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
16041604- publicErrCh <- err
16051605- }
16061606- })
16071607- relay.GoSafe("public.errwatch", func() {
16081608- if err := <-publicErrCh; err != nil {
16091609- log.Fatalf("public server: %v", err)
16101610- }
16111611- })
16121612- }
16131613- }
16141614-1615351 // Start inbound SMTP server (bounce processing)
1616352 relay.GoSafe("inbound.serve", func() {
1617353 log.Printf("inbound SMTP server listening on %s", cfg.InboundAddr)
···1645381 log.Printf("relay_events.enabled: broker=%s topic=%s", cfg.KafkaBroker, relay.OspreyOutputTopic)
1646382 }
164738316481648- // Update member counts on startup
16491649- updateMemberMetrics := func() {
16501650- active, suspended, pending, err := store.MemberCountsByStatus(context.Background())
16511651- if err != nil {
16521652- log.Printf("metrics.member_count_error: %v", err)
16531653- return
16541654- }
16551655- metrics.MembersTotal.WithLabelValues("active").Set(float64(active))
16561656- metrics.MembersTotal.WithLabelValues("suspended").Set(float64(suspended))
16571657- metrics.MembersTotal.WithLabelValues("pending").Set(float64(pending))
16581658- }
16591659- updateMemberMetrics()
16601660-16611661- // Background health probes: update labeler/osprey reachability gauges
16621662- // independently of SMTP traffic so Grafana shows real health, not
16631663- // "was-queried-recently". Without this the gauges falsely report
16641664- // unreachable during quiet periods (between sends) — an outage at
16651665- // 3 AM would look identical to idle.
16661666- relay.GoSafe("health.probe", func() {
16671667- // Short initial delay so the first probe runs ~10s after startup,
16681668- // giving dependent services time to become ready after a deploy.
16691669- initialDelay := time.NewTimer(10 * time.Second)
16701670- defer initialDelay.Stop()
16711671- select {
16721672- case <-ctx.Done():
16731673- return
16741674- case <-initialDelay.C:
16751675- }
16761676-16771677- ticker := time.NewTicker(30 * time.Second)
16781678- defer ticker.Stop()
16791679-16801680- probe := func() {
16811681- probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
16821682- defer cancel()
16831683-16841684- // Labeler probe: a cheap queryLabels call with a sentinel DID.
16851685- // Any non-error HTTP response (including 4xx) means the labeler
16861686- // is reachable; transport error means it isn't.
16871687- req, _ := http.NewRequestWithContext(probeCtx, http.MethodGet,
16881688- cfg.LabelerURL+"/xrpc/com.atproto.label.queryLabels?uriPatterns=did:plc:healthprobe", nil)
16891689- if resp, err := (&http.Client{Timeout: 5 * time.Second}).Do(req); err != nil {
16901690- metrics.LabelerReachable.Set(0)
16911691- } else {
16921692- resp.Body.Close()
16931693- metrics.LabelerReachable.Set(1)
16941694- }
16951695-16961696- // Osprey probe (only if enforcer is configured).
16971697- if ospreyEnforcer != nil {
16981698- if ospreyEnforcer.Reachable() {
16991699- metrics.OspreyReachable.Set(1)
17001700- } else {
17011701- // Fall back to a direct HTTP probe — Reachable() returns
17021702- // false for quiet periods too, so this disambiguates.
17031703- req, _ := http.NewRequestWithContext(probeCtx, http.MethodGet,
17041704- cfg.OspreyURL+"/entities/labels?entity_id=did:plc:healthprobe&entity_type=SenderDID", nil)
17051705- if resp, err := (&http.Client{Timeout: 5 * time.Second}).Do(req); err != nil {
17061706- metrics.OspreyReachable.Set(0)
17071707- } else {
17081708- resp.Body.Close()
17091709- metrics.OspreyReachable.Set(1)
17101710- }
17111711- }
17121712- }
17131713- }
17141714-17151715- probe() // first probe fires immediately after initial delay
17161716- for {
17171717- select {
17181718- case <-ctx.Done():
17191719- return
17201720- case <-ticker.C:
17211721- probe()
17221722- }
17231723- }
17241724- })
17251725-17261726- // Periodic rate counter cleanup (every hour)
17271727- relay.GoSafe("rate_counter.cleanup", func() {
17281728- ticker := time.NewTicker(1 * time.Hour)
17291729- defer ticker.Stop()
17301730- for {
17311731- select {
17321732- case <-ctx.Done():
17331733- return
17341734- case <-ticker.C:
17351735- cutoff := time.Now().UTC().Add(-48 * time.Hour)
17361736- deleted, err := rateLimiter.Cleanup(ctx, cutoff)
17371737- if err != nil {
17381738- log.Printf("rate cleanup: %v", err)
17391739- } else if deleted > 0 {
17401740- log.Printf("rate cleanup: deleted %d old counters", deleted)
17411741- }
17421742-17431743- // Evict expired label cache entries
17441744- if evicted := labelChecker.CleanExpired(); evicted > 0 {
17451745- log.Printf("label cache cleanup: evicted %d expired entries", evicted)
17461746- }
17471747-17481748- // Evict expired Osprey enforcer cache entries
17491749- if ospreyEnforcer != nil {
17501750- if evicted := ospreyEnforcer.CleanExpired(); evicted > 0 {
17511751- log.Printf("osprey cache cleanup: evicted %d expired entries", evicted)
17521752- }
17531753- }
17541754-17551755- // Clean expired OAuth pending rows. These accumulate when
17561756- // users start the attestation flow and walk away — each
17571757- // row carries a PKCE verifier + DPoP key material we want off
17581758- // disk once the window for a legitimate callback has closed.
17591759- if evicted, err := store.CleanupExpiredOAuth(ctx, time.Now().UTC()); err != nil {
17601760- log.Printf("oauth cleanup: error=%v", err)
17611761- } else if evicted > 0 {
17621762- log.Printf("oauth cleanup: evicted %d expired auth requests", evicted)
17631763- }
17641764-17651765- // Update member metrics
17661766- updateMemberMetrics()
17671767-17681768- // Purge terminal messages older than 30 days
17691769- msgCutoff := time.Now().UTC().Add(-30 * 24 * time.Hour)
17701770- purged, err := store.PurgeOldMessages(ctx, msgCutoff)
17711771- if err != nil {
17721772- log.Printf("message purge: %v", err)
17731773- } else if purged > 0 {
17741774- log.Printf("message purge: deleted %d old messages", purged)
17751775- }
17761776- }
17771777- }
384384+ // Background maintenance workers — cache snapshots, DLQ replay,
385385+ // janitors, health probes, rate cleanup. See cmd/relay/workers.go
386386+ // See cmd/relay/workers.go.
387387+ startBackgroundWorkers(workerDeps{
388388+ ctx: ctx,
389389+ cfg: cfg,
390390+ store: store,
391391+ metrics: metrics,
392392+ ospreyEnforcer: ospreyEnforcer,
393393+ ospreyEmitter: ospreyEmitter,
394394+ rateLimiter: rateLimiter,
395395+ labelChecker: labelChecker,
396396+ memberHashCache: inbound.MemberHashCache,
397397+ spool: spool,
1778398 })
17793991780400 <-ctx.Done()
···1786406 smtpServer.Close()
1787407 inboundServer.Close()
1788408 adminServer.Shutdown(shutdownCtx)
17891789- if publicServer != nil {
17901790- publicServer.Shutdown(shutdownCtx)
409409+ if public.Server != nil {
410410+ public.Server.Shutdown(shutdownCtx)
1791411 }
17921792- // Stop the recovery-ticket prune ticker. Idempotent — safe under
17931793- // the (unlikely) case where the public listener was up but OAuth
17941794- // wasn't configured, leaving recoverHandlerForShutdown nil.
17951795- if recoverHandlerForShutdown != nil {
17961796- recoverHandlerForShutdown.Close()
412412+ if public.RecoverHandler != nil {
413413+ public.RecoverHandler.Close()
414414+ }
415415+ if public.EnrollHandler != nil {
416416+ public.EnrollHandler.Close()
1797417 }
1798418 // Close the Osprey events consumer — unblocks its ReadMessage.
1799419 if eventsConsumer != nil {
···1814434 log.Printf("shutdown complete (queue depth: %d)", queue.Depth())
1815435}
181643618171817-// parseHostRole maps the string role in RelayConfig.PublicDomains to the
18181818-// typed relay.HostRole enum. Returns an error for unknown roles so typos
18191819-// fail loudly at startup rather than silently falling through to fallback.
18201820-func parseHostRole(s string) (relay.HostRole, error) {
18211821- switch strings.ToLower(strings.TrimSpace(s)) {
18221822- case "site":
18231823- return relay.RoleSite, nil
18241824- case "infra":
18251825- return relay.RoleInfra, nil
18261826- case "redirect":
18271827- return relay.RoleRedirect, nil
18281828- default:
18291829- return 0, fmt.Errorf("unknown role %q (must be site, infra, or redirect)", s)
18301830- }
18311831-}
18321832-18331833-// webhookHostForLog returns the host portion of a webhook URL so we can
18341834-// log "webhook enabled" without leaking auth material embedded in the
18351835-// path (Slack/Discord incoming webhooks carry tokens in the URL). On a
18361836-// parse error we fall back to "<malformed>" rather than echoing the
18371837-// raw value.
18381838-func webhookHostForLog(raw string) string {
18391839- if raw == "" {
18401840- return "<unset>"
18411841- }
18421842- u, err := url.Parse(raw)
18431843- if err != nil || u.Host == "" {
18441844- return "<malformed>"
18451845- }
18461846- return u.Host
18471847-}
18481848-18491849-func loadConfig(path string) (*RelayConfig, error) {
18501850- data, err := os.ReadFile(path)
18511851- if err != nil {
18521852- return nil, fmt.Errorf("read config %s: %w", path, err)
18531853- }
18541854-18551855- var cfg RelayConfig
18561856- if err := json.Unmarshal(data, &cfg); err != nil {
18571857- return nil, fmt.Errorf("parse config %s: %w", path, err)
18581858- }
18591859-18601860- // Env var overrides
18611861- if v := os.Getenv("ADMIN_TOKEN"); v != "" {
18621862- cfg.AdminToken = v
18631863- }
18641864- if v := os.Getenv("LABELER_URL"); v != "" {
18651865- cfg.LabelerURL = v
18661866- }
18671867-18681868- // Defaults
18691869- if cfg.SMTPAddr == "" {
18701870- cfg.SMTPAddr = ":587"
18711871- }
18721872- if cfg.AdminAddr == "" {
18731873- cfg.AdminAddr = ":8080"
18741874- }
18751875- if cfg.StateDir == "" {
18761876- cfg.StateDir = "./state"
18771877- }
18781878- if cfg.Domain == "" {
18791879- cfg.Domain = "atmos.email"
18801880- }
18811881- if cfg.InboundRateLimitMsgsPerMinute == 0 {
18821882- cfg.InboundRateLimitMsgsPerMinute = 30
18831883- }
18841884- if cfg.InboundRateLimitBurst == 0 {
18851885- cfg.InboundRateLimitBurst = 10
18861886- }
18871887- if cfg.InboundAddr == "" {
18881888- cfg.InboundAddr = ":25"
18891889- }
18901890- if cfg.LabelerURL == "" {
18911891- log.Fatalf("labelerURL is required (set in config or LABELER_URL env var)")
18921892- }
18931893- if cfg.HourlyLimit == 0 {
18941894- cfg.HourlyLimit = 100
18951895- }
18961896- if cfg.DailyLimit == 0 {
18971897- cfg.DailyLimit = 1000
18981898- }
18991899- if cfg.GlobalPerMinute == 0 {
19001900- cfg.GlobalPerMinute = 500
19011901- }
19021902- if cfg.OperatorDKIMKeyPath == "" {
19031903- cfg.OperatorDKIMKeyPath = cfg.StateDir + "/operator-dkim-keys.json"
19041904- }
19051905- if cfg.OperatorDKIMDomain == "" {
19061906- cfg.OperatorDKIMDomain = cfg.Domain
19071907- }
19081908-19091909- return &cfg, nil
19101910-}
19111911-19121912-func deserializeDKIMKeys(rsaBytes, edBytes []byte) (*rsa.PrivateKey, ed25519.PrivateKey, error) {
19131913- rsaRaw, err := x509.ParsePKCS8PrivateKey(rsaBytes)
19141914- if err != nil {
19151915- return nil, nil, fmt.Errorf("parse RSA key: %w", err)
19161916- }
19171917- rsaKey, ok := rsaRaw.(*rsa.PrivateKey)
19181918- if !ok {
19191919- return nil, nil, fmt.Errorf("expected RSA key, got %T", rsaRaw)
19201920- }
19211921-19221922- edRaw, err := x509.ParsePKCS8PrivateKey(edBytes)
19231923- if err != nil {
19241924- return nil, nil, fmt.Errorf("parse Ed25519 key: %w", err)
19251925- }
19261926- edKey, ok := edRaw.(ed25519.PrivateKey)
19271927- if !ok {
19281928- return nil, nil, fmt.Errorf("expected Ed25519 key, got %T", edRaw)
19291929- }
19301930-19311931- return rsaKey, edKey, nil
19321932-}
19331933-19341934-// extractMessageID extracts the Message-ID header from raw message data.
19351935-// Handles folded headers per RFC 5322.
19361936-func extractMessageID(data string) string {
19371937- r := textproto.NewReader(bufio.NewReader(strings.NewReader(data)))
19381938- header, err := r.ReadMIMEHeader()
19391939- if err != nil {
19401940- return fmt.Sprintf("<%d@relay>", time.Now().UnixNano())
19411941- }
19421942- if mid := header.Get("Message-Id"); mid != "" {
19431943- return mid
19441944- }
19451945- return fmt.Sprintf("<%d@relay>", time.Now().UnixNano())
19461946-}
19471947-19481948-// extractSubjectAndBody pulls the Subject header and the message body out
19491949-// of raw RFC 5322 bytes for content fingerprinting. On parse errors it
19501950-// returns what it found (possibly empty strings) — the fingerprint is a
19511951-// best-effort correlation signal and must never block a send, so we'd
19521952-// rather emit a fingerprint of ("", "") than fail the outbound pipeline.
19531953-//
19541954-// Body extraction is intentionally naive (everything after the first
19551955-// blank line). Multipart MIME walking is a future improvement — for v1
19561956-// the goal is "two identical messages fingerprint the same", and the
19571957-// raw bytes after the headers are stable enough to deliver that.
19581958-func extractSubjectAndBody(data []byte) (string, string) {
19591959- br := bufio.NewReader(strings.NewReader(string(data)))
19601960- r := textproto.NewReader(br)
19611961- header, err := r.ReadMIMEHeader()
19621962- if err != nil {
19631963- return "", ""
19641964- }
19651965- subject := header.Get("Subject")
19661966- // textproto consumed the headers + the terminating blank line; whatever
19671967- // is left on the reader is the body.
19681968- var body strings.Builder
19691969- for {
19701970- line, err := br.ReadString('\n')
19711971- body.WriteString(line)
19721972- if err != nil {
19731973- break
19741974- }
19751975- }
19761976- return subject, body.String()
19771977-}
19781978-19791979-// normalizeProviderUA maps a raw FBL User-Agent string to a canonical
19801980-// provider bucket for the complaints_total metric.
19811981-func normalizeProviderUA(ua string) string {
19821982- ua = strings.ToLower(ua)
19831983- switch {
19841984- case strings.Contains(ua, "google") || strings.Contains(ua, "gmail"):
19851985- return "gmail"
19861986- case strings.Contains(ua, "microsoft") || strings.Contains(ua, "outlook"):
19871987- return "microsoft"
19881988- case strings.Contains(ua, "yahoo"):
19891989- return "yahoo"
19901990- default:
19911991- return "other"
19921992- }
19931993-}
19941994-19951995-// recipientDomain extracts the domain part from an email address.
19961996-func recipientDomain(addr string) string {
19971997- if i := strings.LastIndex(addr, "@"); i >= 0 {
19981998- return addr[i+1:]
19991999- }
20002000- return addr
20012001-}
20022002-20032003-// prependListUnsubHeaders inserts List-Unsubscribe and List-Unsubscribe-Post
20042004-// headers at the top of the raw message bytes. It must be called BEFORE DKIM
20052005-// signing so the new headers are covered by the signature — otherwise mail
20062006-// servers like Gmail will see a List-Unsubscribe header that isn't in the
20072007-// signed-headers list and treat it as unauthenticated.
20082008-//
20092009-// The headers go at the top of the message (before the existing headers)
20102010-// which keeps the function allocation-light and lets the DKIM signer cover
20112011-// them as part of its normal "from, to, subject, ..." header set when we
20122012-// add "list-unsubscribe" and "list-unsubscribe-post" to the signed list.
20132013-//
20142014-// Note: DKIM signer config currently doesn't include these header names in
20152015-// its signed-headers list. This function documents the wiring; the signer
20162016-// change lives in internal/relay/dkim.go.
20172017-func prependListUnsubHeaders(data []byte, listUnsub, listUnsubPost string) []byte {
20182018- prefix := "List-Unsubscribe: " + listUnsub + "\r\n" +
20192019- "List-Unsubscribe-Post: " + listUnsubPost + "\r\n"
20202020- out := make([]byte, 0, len(prefix)+len(data))
20212021- out = append(out, prefix...)
20222022- out = append(out, data...)
20232023- return out
20242024-}
20252025-20262026-// prependHeader adds a single `Name: value` header at the top of the raw
20272027-// message bytes. Like prependListUnsubHeaders, must be called before DKIM
20282028-// signing so the signature covers it. Used for internal attribution
20292029-// headers (X-Atmos-Member-Did) that need to survive FBL round-trips.
20302030-func prependHeader(data []byte, name, value string) []byte {
20312031- prefix := name + ": " + value + "\r\n"
20322032- out := make([]byte, 0, len(prefix)+len(data))
20332033- out = append(out, prefix...)
20342034- out = append(out, data...)
20352035- return out
20362036-}
+128
cmd/relay/message.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package main
44+55+import (
66+ "bufio"
77+ "crypto/ed25519"
88+ "crypto/rsa"
99+ "crypto/x509"
1010+ "fmt"
1111+ "net/textproto"
1212+ "strings"
1313+ "time"
1414+)
1515+1616+func deserializeDKIMKeys(rsaBytes, edBytes []byte) (*rsa.PrivateKey, ed25519.PrivateKey, error) {
1717+ rsaRaw, err := x509.ParsePKCS8PrivateKey(rsaBytes)
1818+ if err != nil {
1919+ return nil, nil, fmt.Errorf("parse RSA key: %w", err)
2020+ }
2121+ rsaKey, ok := rsaRaw.(*rsa.PrivateKey)
2222+ if !ok {
2323+ return nil, nil, fmt.Errorf("expected RSA key, got %T", rsaRaw)
2424+ }
2525+2626+ edRaw, err := x509.ParsePKCS8PrivateKey(edBytes)
2727+ if err != nil {
2828+ return nil, nil, fmt.Errorf("parse Ed25519 key: %w", err)
2929+ }
3030+ edKey, ok := edRaw.(ed25519.PrivateKey)
3131+ if !ok {
3232+ return nil, nil, fmt.Errorf("expected Ed25519 key, got %T", edRaw)
3333+ }
3434+3535+ return rsaKey, edKey, nil
3636+}
3737+3838+// extractMessageID extracts the Message-ID header from raw message data.
3939+// Handles folded headers per RFC 5322.
4040+func extractMessageID(data string) string {
4141+ r := textproto.NewReader(bufio.NewReader(strings.NewReader(data)))
4242+ header, err := r.ReadMIMEHeader()
4343+ if err != nil {
4444+ return fmt.Sprintf("<%d@relay>", time.Now().UnixNano())
4545+ }
4646+ if mid := header.Get("Message-Id"); mid != "" {
4747+ return mid
4848+ }
4949+ return fmt.Sprintf("<%d@relay>", time.Now().UnixNano())
5050+}
5151+5252+// extractSubjectAndBody pulls the Subject header and the message body out
5353+// of raw RFC 5322 bytes for content fingerprinting. On parse errors it
5454+// returns what it found (possibly empty strings) — the fingerprint is a
5555+// best-effort correlation signal and must never block a send, so we'd
5656+// rather emit a fingerprint of ("", "") than fail the outbound pipeline.
5757+//
5858+// Body extraction is intentionally naive (everything after the first
5959+// blank line). Multipart MIME walking is a future improvement — for v1
6060+// the goal is "two identical messages fingerprint the same", and the
6161+// raw bytes after the headers are stable enough to deliver that.
6262+func extractSubjectAndBody(data []byte) (string, string) {
6363+ br := bufio.NewReader(strings.NewReader(string(data)))
6464+ r := textproto.NewReader(br)
6565+ header, err := r.ReadMIMEHeader()
6666+ if err != nil {
6767+ return "", ""
6868+ }
6969+ subject := header.Get("Subject")
7070+ // textproto consumed the headers + the terminating blank line; whatever
7171+ // is left on the reader is the body.
7272+ var body strings.Builder
7373+ for {
7474+ line, err := br.ReadString('\n')
7575+ body.WriteString(line)
7676+ if err != nil {
7777+ break
7878+ }
7979+ }
8080+ return subject, body.String()
8181+}
8282+8383+// normalizeProviderUA maps a raw FBL User-Agent string to a canonical
8484+// provider bucket for the complaints_total metric.
8585+func normalizeProviderUA(ua string) string {
8686+ ua = strings.ToLower(ua)
8787+ switch {
8888+ case strings.Contains(ua, "google") || strings.Contains(ua, "gmail"):
8989+ return "gmail"
9090+ case strings.Contains(ua, "microsoft") || strings.Contains(ua, "outlook"):
9191+ return "microsoft"
9292+ case strings.Contains(ua, "yahoo"):
9393+ return "yahoo"
9494+ default:
9595+ return "other"
9696+ }
9797+}
9898+9999+// recipientDomain extracts the domain part from an email address.
100100+func recipientDomain(addr string) string {
101101+ if i := strings.LastIndex(addr, "@"); i >= 0 {
102102+ return addr[i+1:]
103103+ }
104104+ return addr
105105+}
106106+107107+// prependListUnsubHeaders inserts List-Unsubscribe and List-Unsubscribe-Post
108108+// headers at the top of the raw message bytes. It must be called BEFORE DKIM
109109+// signing so the new headers are covered by the signature.
110110+func prependListUnsubHeaders(data []byte, listUnsub, listUnsubPost string) []byte {
111111+ prefix := "List-Unsubscribe: " + listUnsub + "\r\n" +
112112+ "List-Unsubscribe-Post: " + listUnsubPost + "\r\n"
113113+ out := make([]byte, 0, len(prefix)+len(data))
114114+ out = append(out, prefix...)
115115+ out = append(out, data...)
116116+ return out
117117+}
118118+119119+// prependHeader adds a single `Name: value` header at the top of the raw
120120+// message bytes. Like prependListUnsubHeaders, must be called before DKIM
121121+// signing so the signature covers it.
122122+func prependHeader(data []byte, name, value string) []byte {
123123+ prefix := name + ": " + value + "\r\n"
124124+ out := make([]byte, 0, len(prefix)+len(data))
125125+ out = append(out, prefix...)
126126+ out = append(out, data...)
127127+ return out
128128+}
+299
cmd/relay/public.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package main
44+55+// Public HTTPS listener setup. Bundles the SNI-aware multi-host TLS
66+// config, the site mux (marketing + enrollment + OAuth attestation +
77+// recovery), the infra mux (unsubscribe + healthz), the public router,
88+// and the listener goroutine.
99+1010+import (
1111+ "context"
1212+ "crypto/tls"
1313+ "fmt"
1414+ "log"
1515+ "net/http"
1616+ "strings"
1717+ "time"
1818+1919+ "atmosphere-mail/internal/admin"
2020+ adminui "atmosphere-mail/internal/admin/ui"
2121+ "atmosphere-mail/internal/atpoauth"
2222+ "atmosphere-mail/internal/relay"
2323+ "atmosphere-mail/internal/relaystore"
2424+)
2525+2626+// PublicDomain describes a single host served on the public HTTPS listener.
2727+// Each host can have its own TLS cert (via SNI) and a role that determines
2828+// what handlers it answers.
2929+type PublicDomain struct {
3030+ Host string `json:"host"` // SNI / Host header match, e.g. "atmosphereemail.org"
3131+ CertFile string `json:"certFile"` // path to TLS cert (fullchain)
3232+ KeyFile string `json:"keyFile"` // path to TLS private key
3333+ Role string `json:"role"` // "site", "infra", or "redirect"
3434+ RedirectTo string `json:"redirectTo"` // for Role=="redirect": target URL prefix, e.g. "https://atmosphereemail.org"
3535+}
3636+3737+// storeDomainLister adapts *relaystore.Store to the narrow
3838+// adminui.DomainLister interface so the enrollment landing can show
3939+// existing domains without a full store import.
4040+type storeDomainLister struct{ store *relaystore.Store }
4141+4242+func (s storeDomainLister) ListMemberDomains(ctx context.Context, did string) ([]string, error) {
4343+ domains, err := s.store.ListMemberDomains(ctx, did)
4444+ if err != nil {
4545+ return nil, err
4646+ }
4747+ names := make([]string, len(domains))
4848+ for i, d := range domains {
4949+ names[i] = d.Domain
5050+ }
5151+ return names, nil
5252+}
5353+5454+// publicDeps gathers everything setupPublicServer needs from main().
5555+type publicDeps struct {
5656+ cfg *RelayConfig
5757+ adminAPI *admin.API
5858+ didResolver *relay.DIDResolver
5959+ unsubscriber *relay.Unsubscriber
6060+ store *relaystore.Store
6161+ metrics *relay.Metrics
6262+ labelChecker *relay.LabelChecker
6363+ tlsConfig *tls.Config // SMTP TLS config reused as legacy single-cert fallback
6464+}
6565+6666+// publicSetup is what setupPublicServer hands back to main(). All fields
6767+// may be nil when the public listener is disabled (no PublicAddr or no
6868+// unsubscriber). The RecoverHandler and EnrollHandler own background
6969+// prune tickers that need explicit Close() at shutdown.
7070+type publicSetup struct {
7171+ Server *http.Server
7272+ RecoverHandler *adminui.RecoverHandler
7373+ EnrollHandler *adminui.EnrollHandler
7474+}
7575+7676+// setupPublicServer wires the public HTTPS listener — site mux, infra mux,
7777+// SNI cert routing, OAuth attestation, credential recovery, and the
7878+// listener goroutine. The goroutine is started before return; the returned
7979+// server is solely for Shutdown() at process exit. Returns a zero
8080+// publicSetup when the public listener is disabled.
8181+func setupPublicServer(deps publicDeps) publicSetup {
8282+ cfg := deps.cfg
8383+ if cfg.PublicAddr == "" || deps.unsubscriber == nil {
8484+ return publicSetup{}
8585+ }
8686+8787+ adminAPI := deps.adminAPI
8888+ store := deps.store
8989+ metrics := deps.metrics
9090+9191+ enrollHandler := adminui.NewEnrollHandler(adminAPI, deps.didResolver)
9292+ enrollHandler.SetDomainLister(storeDomainLister{store: store})
9393+ enrollHandler.SetFunnelRecorder(metrics)
9494+ // Bind enrollment to OAuth-verified DIDs. Without this wire,
9595+ // /admin/enroll-start and /admin/enroll accept any DID from a
9696+ // request body — letting an attacker who only owns a domain
9797+ // enroll under any victim's atproto identity.
9898+ adminAPI.SetEnrollAuthVerifier(enrollHandler)
9999+ if deps.labelChecker != nil {
100100+ enrollHandler.SetLabelStatusQuerier(deps.labelChecker)
101101+ }
102102+ healthHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
103103+ w.WriteHeader(http.StatusOK)
104104+ _, _ = w.Write([]byte("ok\n"))
105105+ })
106106+107107+ // Site mux: full public site (marketing, enrollment, legal, plus
108108+ // the functional unsubscribe endpoint and a health check so callers
109109+ // hitting the canonical host for those still work).
110110+ siteMux := http.NewServeMux()
111111+ siteMux.Handle("/", enrollHandler)
112112+ siteMux.Handle("/enroll", enrollHandler)
113113+ siteMux.Handle("/enroll/", enrollHandler)
114114+ siteMux.Handle("/u/", deps.unsubscriber.Handler())
115115+ siteMux.Handle("/healthz", healthHandler)
116116+ siteMux.HandleFunc("/verify-email", adminAPI.HandleVerifyEmail)
117117+118118+ // Self-service attestation publishing (atproto OAuth). Only active
119119+ // when the operator has configured SiteBaseURL — the client_id MUST
120120+ // equal the metadata URL per spec, so without a baseURL the metadata
121121+ // endpoint would publish a client_id we can't honor. The wizard's
122122+ // Step 4 button (in EnrollSuccess) POSTs to /enroll/attest/start.
123123+ var recoverHandler *adminui.RecoverHandler
124124+ if cfg.SiteBaseURL != "" {
125125+ oauthCfg := atpoauth.Config{
126126+ ClientID: cfg.SiteBaseURL + "/.well-known/atproto-oauth-client-metadata.json",
127127+ CallbackURL: cfg.SiteBaseURL + "/enroll/attest/callback",
128128+ Scopes: []string{"atproto", "repo:email.atmos.attestation"},
129129+ SigningKeyPath: cfg.StateDir + "/oauth-signing-key.pem",
130130+ }
131131+ oauthClient, err := atpoauth.NewClient(oauthCfg, store)
132132+ if err != nil {
133133+ log.Fatalf("atpoauth.NewClient: %v", err)
134134+ }
135135+ siteMux.Handle("/.well-known/atproto-oauth-client-metadata.json",
136136+ adminui.NewMetadataHandler(oauthClient, "Atmosphere Mail", cfg.SiteBaseURL))
137137+ pub := &adminui.AtpoauthPublisher{C: oauthClient}
138138+ attestHandler := adminui.NewAttestHandler(pub, store)
139139+ attestHandler.SetFunnelRecorder(metrics)
140140+ attestHandler.SetDIDHandleResolver(deps.didResolver)
141141+ attestHandler.RegisterRoutes(siteMux)
142142+143143+ // Self-service credential recovery. Shares the attest OAuth
144144+ // callback (indigo only supports one redirect URI per client)
145145+ // and dispatches on whether the session carries an attestation
146146+ // payload — empty means recovery. regenFn wraps the admin API's
147147+ // transport-agnostic RegenerateKey so the rotation path is
148148+ // identical between operator-triggered and member-triggered.
149149+ recoverHandler = adminui.NewRecoverHandler(pub, store, cfg.SiteBaseURL,
150150+ func(did, domain string) (string, error) {
151151+ _, apiKey, err := adminAPI.RegenerateKey(context.Background(), did, domain)
152152+ return apiKey, err
153153+ })
154154+ recoverHandler.SetHandleResolver(deps.didResolver)
155155+ recoverHandler.SetContactEmailChangedHook(func(ctx context.Context, domain, contactEmail string) {
156156+ adminAPI.TriggerEmailVerification(ctx, domain, contactEmail)
157157+ })
158158+ // Surface live label state on /account/manage. The existing
159159+ // labelChecker already speaks queryLabels XRPC for the SMTP
160160+ // fail-closed gate; reusing it means the manage page sees
161161+ // exactly the labels the relay does.
162162+ recoverHandler.SetLabelStatusQuerier(deps.labelChecker)
163163+ recoverHandler.RegisterRoutes(siteMux)
164164+ attestHandler.SetRecoveryIssuer(recoverHandler)
165165+ attestHandler.SetEnrollAuthIssuer(enrollHandler)
166166+ // Atomic enroll+publish: the wizard stashes credentials in
167167+ // enrollHandler when it kicks the publish-OAuth round-trip;
168168+ // attestHandler consumes them on a successful callback so the
169169+ // post-publish page can reveal the API key for the first time.
170170+ attestHandler.SetEnrollCredentialsStash(enrollHandler)
171171+ enrollHandler.SetPublisher(pub)
172172+ enrollHandler.SetAccountTicketIssuer(recoverHandler)
173173+174174+ log.Printf("atpoauth.enabled: client_id=%s callback=%s confidential=%v",
175175+ oauthCfg.ClientID, oauthCfg.CallbackURL, oauthClient.IsConfidential())
176176+ }
177177+178178+ // Infra mux: narrow surface for the SMTP domain. The List-Unsubscribe
179179+ // header points here by design (PublicBaseURL), so /u/ must remain
180180+ // addressable, but we deliberately don't serve the marketing UI on
181181+ // the infra host.
182182+ infraMux := http.NewServeMux()
183183+ infraMux.Handle("/u/", deps.unsubscriber.Handler())
184184+ infraMux.Handle("/healthz", healthHandler)
185185+ infraMux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
186186+ http.Error(w, "not found", http.StatusNotFound)
187187+ })
188188+189189+ var publicHandler http.Handler
190190+ var publicTLS *tls.Config
191191+192192+ if len(cfg.PublicDomains) > 0 {
193193+ // Build a SNI-aware cert map. One cert per PublicDomain; the
194194+ // TLS stack picks the right one via ClientHelloInfo.ServerName.
195195+ certByHost := make(map[string]*tls.Certificate, len(cfg.PublicDomains))
196196+ routes := make(map[string]relay.HostRoute, len(cfg.PublicDomains))
197197+ var anyCert *tls.Certificate
198198+ for _, pd := range cfg.PublicDomains {
199199+ role, err := parseHostRole(pd.Role)
200200+ if err != nil {
201201+ log.Fatalf("publicDomains: host=%s: %v", pd.Host, err)
202202+ }
203203+ routes[pd.Host] = relay.HostRoute{Role: role, RedirectTo: pd.RedirectTo}
204204+ if pd.CertFile == "" || pd.KeyFile == "" {
205205+ // Redirect-only hosts may share a wildcard cert with
206206+ // another entry — an empty cert path means "inherit".
207207+ continue
208208+ }
209209+ c, err := tls.LoadX509KeyPair(pd.CertFile, pd.KeyFile)
210210+ if err != nil {
211211+ log.Printf("public.tls_unavailable: host=%s error=%v (domain will fail TLS handshake until cert is provisioned)", pd.Host, err)
212212+ continue
213213+ }
214214+ cert := c
215215+ certByHost[strings.ToLower(pd.Host)] = &cert
216216+ if anyCert == nil {
217217+ anyCert = &cert
218218+ }
219219+ log.Printf("public.tls_loaded: host=%s cert=%s", pd.Host, pd.CertFile)
220220+ }
221221+ if anyCert == nil {
222222+ log.Fatalf("publicDomains: at least one domain must have a loadable cert")
223223+ }
224224+ publicTLS = &tls.Config{
225225+ MinVersion: tls.VersionTLS12,
226226+ GetCertificate: func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
227227+ if c, ok := certByHost[strings.ToLower(hello.ServerName)]; ok {
228228+ return c, nil
229229+ }
230230+ // Unknown SNI: hand back any cert so the client gets a
231231+ // response (it'll fail hostname verification, which is
232232+ // the right outcome). Better than leaking a handshake
233233+ // error at the TCP layer.
234234+ return anyCert, nil
235235+ },
236236+ }
237237+ // Default fallback = siteMux — misdirected requests to unknown
238238+ // hosts get the marketing site, not a 404.
239239+ publicHandler = relay.NewPublicRouter(routes, siteMux, infraMux, siteMux)
240240+ for h, r := range routes {
241241+ log.Printf("public.route: host=%s role=%d redirect_to=%s", h, r.Role, r.RedirectTo)
242242+ }
243243+ } else {
244244+ // Legacy single-cert mode: every Host gets the full site handler.
245245+ if deps.tlsConfig == nil {
246246+ log.Printf("public.tls_unavailable: skipping public listener (no cert and no publicDomains)")
247247+ } else {
248248+ publicTLS = deps.tlsConfig
249249+ publicHandler = siteMux
250250+ log.Printf("public.mode: legacy_single_cert")
251251+ }
252252+ }
253253+254254+ var server *http.Server
255255+ if publicHandler != nil && publicTLS != nil {
256256+ server = &http.Server{
257257+ Addr: cfg.PublicAddr,
258258+ Handler: metrics.HTTPMiddleware(publicHandler),
259259+ TLSConfig: publicTLS,
260260+ ReadTimeout: 10 * time.Second,
261261+ WriteTimeout: 10 * time.Second,
262262+ IdleTimeout: 60 * time.Second,
263263+ }
264264+ publicErrCh := make(chan error, 1)
265265+ relay.GoSafe("public.serve", func() {
266266+ log.Printf("public HTTPS listening on %s", cfg.PublicAddr)
267267+ if err := server.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
268268+ publicErrCh <- err
269269+ }
270270+ })
271271+ relay.GoSafe("public.errwatch", func() {
272272+ if err := <-publicErrCh; err != nil {
273273+ log.Fatalf("public server: %v", err)
274274+ }
275275+ })
276276+ }
277277+278278+ return publicSetup{
279279+ Server: server,
280280+ RecoverHandler: recoverHandler,
281281+ EnrollHandler: enrollHandler,
282282+ }
283283+}
284284+285285+// parseHostRole maps the string role in RelayConfig.PublicDomains to the
286286+// typed relay.HostRole enum. Returns an error for unknown roles so typos
287287+// fail loudly at startup rather than silently falling through to fallback.
288288+func parseHostRole(s string) (relay.HostRole, error) {
289289+ switch strings.ToLower(strings.TrimSpace(s)) {
290290+ case "site":
291291+ return relay.RoleSite, nil
292292+ case "infra":
293293+ return relay.RoleInfra, nil
294294+ case "redirect":
295295+ return relay.RoleRedirect, nil
296296+ default:
297297+ return 0, fmt.Errorf("unknown role %q (must be site, infra, or redirect)", s)
298298+ }
299299+}
+476
cmd/relay/submission.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package main
44+55+// SMTP submission pipeline. The three core SMTP behaviors — AUTH-time
66+// member lookup, per-message rate/label checking, and DATA-time
77+// acceptance — are methods on a submissionHandler struct that owns its
88+// deps explicitly.
99+// Introducing interfaces would let us unit-test these methods in
1010+// isolation, but the integration harness already exercises them
1111+// against real components, so the abstraction would add code without
1212+// adding coverage.
1313+1414+import (
1515+ "context"
1616+ "errors"
1717+ "fmt"
1818+ "log"
1919+ "strings"
2020+ "time"
2121+2222+ "atmosphere-mail/internal/osprey"
2323+ "atmosphere-mail/internal/relay"
2424+ "atmosphere-mail/internal/relaystore"
2525+2626+ "github.com/emersion/go-smtp"
2727+)
2828+2929+// submissionHandler bundles the dependencies and methods that drive
3030+// the SMTP submission pipeline: AUTH-time member lookup, per-message
3131+// rate / label checking, and message acceptance into the queue.
3232+type submissionHandler struct {
3333+ store *relaystore.Store
3434+ queue *relay.Queue
3535+ metrics *relay.Metrics
3636+ rateLimiter *relay.RateLimiter
3737+ labelChecker *relay.LabelChecker
3838+ ospreyEnforcer *relay.OspreyEnforcer
3939+ ospreyEmitter *osprey.Emitter
4040+ unsubscriber *relay.Unsubscriber
4141+ operatorKeys *relay.DKIMKeys
4242+ cfg *RelayConfig
4343+ warmingCfg relay.WarmingConfig
4444+}
4545+4646+// Lookup is the SMTP AUTH-time member lookup. Resolves a DID (or, as
4747+// a fallback, a domain string) to a *relay.MemberWithDomains with
4848+// fully reconstructed DKIM keys for each registered domain. Applies
4949+// Osprey-derived auth-time policy: suspended DIDs and cold-cache
5050+// states with an unreachable Osprey are blocked here so the SMTP
5151+// session never gets to MAIL FROM.
5252+func (h *submissionHandler) Lookup(ctx context.Context, did string) (*relay.MemberWithDomains, error) {
5353+ member, domains, err := h.store.GetMemberWithDomains(ctx, did)
5454+ if err != nil {
5555+ return nil, err
5656+ }
5757+5858+ // Fallback: if DID lookup fails and username doesn't look like a DID,
5959+ // try domain-based lookup. This supports SMTP clients (e.g. nodemailer)
6060+ // that can't preserve percent-encoded colons in URL userinfo, making
6161+ // DID-based usernames impossible via SMTP URL configuration.
6262+ if member == nil && !strings.HasPrefix(did, "did:") {
6363+ m, d, err := h.store.GetMemberByDomain(ctx, did)
6464+ if err != nil {
6565+ return nil, err
6666+ }
6767+ if m != nil {
6868+ member = m
6969+ domains = []relaystore.MemberDomain{*d}
7070+ }
7171+ }
7272+7373+ if member == nil {
7474+ return nil, nil
7575+ }
7676+7777+ domainInfos := make([]relay.DomainInfo, len(domains))
7878+ for i, d := range domains {
7979+ rsaKey, edKey, err := deserializeDKIMKeys(d.DKIMRSAPriv, d.DKIMEdPriv)
8080+ if err != nil {
8181+ return nil, fmt.Errorf("deserialize DKIM keys for %s/%s: %w", did, d.Domain, err)
8282+ }
8383+ domainInfos[i] = relay.DomainInfo{
8484+ Domain: d.Domain,
8585+ APIKeyHash: d.APIKeyHash,
8686+ DKIMKeys: &relay.DKIMKeys{
8787+ Selector: d.DKIMSelector,
8888+ RSAPriv: rsaKey,
8989+ EdPriv: edKey,
9090+ },
9191+ DKIMSelector: d.DKIMSelector,
9292+ CreatedAt: d.CreatedAt,
9393+ }
9494+ }
9595+9696+ mwd := &relay.MemberWithDomains{
9797+ DID: member.DID,
9898+ Status: member.Status,
9999+ SendCount: member.SendCount,
100100+ HourlyLimit: member.HourlyLimit,
101101+ DailyLimit: member.DailyLimit,
102102+ CreatedAt: member.CreatedAt,
103103+ Domains: domainInfos,
104104+ }
105105+106106+ // Auth-time Osprey check: derive policy from labels. Suspended DIDs
107107+ // are blocked at the session level. Trust/throttle labels flow
108108+ // through to rate-limit computation at send time. Fail-stale: uses
109109+ // cached value if Osprey is unreachable — a previously suspended
110110+ // DID stays blocked even during a network partition.
111111+ if h.ospreyEnforcer != nil && mwd.Status == relaystore.StatusActive {
112112+ policy, err := h.ospreyEnforcer.GetPolicy(ctx, member.DID)
113113+ if errors.Is(err, relay.ErrOspreyColdCache) {
114114+ // Cold cache + Osprey unreachable: block AUTH rather
115115+ // than fail-open. The rejection is transient from
116116+ // the client's POV; once Osprey returns, the policy
117117+ // resolves normally.
118118+ log.Printf("osprey.enforce: did=%s action=block_auth reason=cold_cache_unreachable", member.DID)
119119+ mwd.Status = relaystore.StatusSuspended
120120+ }
121121+ if policy != nil && policy.Suspended {
122122+ log.Printf("osprey.enforce: did=%s action=block_auth reason=%s", member.DID, policy.SuspendReason)
123123+ mwd.Status = relaystore.StatusSuspended
124124+ }
125125+ if h.ospreyEnforcer.Reachable() {
126126+ h.metrics.OspreyReachable.Set(1)
127127+ } else {
128128+ h.metrics.OspreyReachable.Set(0)
129129+ }
130130+ }
131131+132132+ return mwd, nil
133133+}
134134+135135+// Check is the per-message pre-acceptance gate: rate limits with
136136+// warming + label policy applied, plus Osprey send-time enforcement
137137+// and labeler verification. Called once per RCPT TO at the SMTP
138138+// session boundary.
139139+func (h *submissionHandler) Check(ctx context.Context, member *relay.AuthMember, from, to string) error {
140140+ // Fetch the member's Osprey-derived policy up front so both rate
141141+ // limits and suspension checks use the same snapshot.
142142+ var policy *relay.LabelPolicy
143143+ if h.ospreyEnforcer != nil {
144144+ p, err := h.ospreyEnforcer.GetPolicy(ctx, member.DID)
145145+ if errors.Is(err, relay.ErrOspreyColdCache) {
146146+ // Cold cache + Osprey unreachable → 451 SMTP deferral.
147147+ // Client retries; by then either Osprey is back or
148148+ // the cache has been warmed.
149149+ return fmt.Errorf("451 osprey unreachable, please retry")
150150+ }
151151+ policy = p
152152+ }
153153+154154+ // Apply warming limits + label policy (highly_trusted skips warming,
155155+ // burst_warming halves the hourly limit, etc.).
156156+ hourly, daily := relay.WarmingLimitsForPolicy(h.warmingCfg, member.CreatedAt, member.HourlyLimit, member.DailyLimit, policy)
157157+158158+ // Check rate limits
159159+ if err := h.rateLimiter.Check(ctx, member.DID, hourly, daily); err != nil {
160160+ if rle, ok := err.(*relay.RateLimitError); ok {
161161+ rle.Tier = relay.MemberTier(h.warmingCfg, member.CreatedAt, time.Now())
162162+ h.metrics.RateLimitHits.WithLabelValues(rle.LimitType).Inc()
163163+ }
164164+ log.Printf("ratelimit.hit: did=%s hourly_limit=%d daily_limit=%d error=%v",
165165+ member.DID, hourly, daily, err)
166166+ h.metrics.MessagesRejected.WithLabelValues("rate_limit").Inc()
167167+ h.ospreyEmitter.Emit(ctx, osprey.EventData{
168168+ EventType: osprey.EventRelayRejected,
169169+ SenderDID: member.DID,
170170+ SenderDomain: member.Domain,
171171+ RejectReason: "rate_limit",
172172+ })
173173+ return err
174174+ }
175175+176176+ // Osprey send-time enforcement. Reuses the policy we fetched at
177177+ // the top of Check so we only hit the enforcer cache once per
178178+ // session. Fail-stale: stale cache > fail-open.
179179+ if h.ospreyEnforcer != nil {
180180+ if h.metrics.OspreyReachable != nil {
181181+ if h.ospreyEnforcer.Reachable() {
182182+ h.metrics.OspreyReachable.Set(1)
183183+ } else {
184184+ h.metrics.OspreyReachable.Set(0)
185185+ }
186186+ }
187187+ if policy != nil && policy.Suspended {
188188+ log.Printf("osprey.enforce: did=%s action=block_send reason=%s", member.DID, policy.SuspendReason)
189189+ h.metrics.OspreyChecksTotal.WithLabelValues("blocked").Inc()
190190+ h.metrics.MessagesRejected.WithLabelValues("osprey_suspended").Inc()
191191+ h.ospreyEmitter.Emit(ctx, osprey.EventData{
192192+ EventType: osprey.EventRelayRejected,
193193+ SenderDID: member.DID,
194194+ SenderDomain: member.Domain,
195195+ RejectReason: "osprey_auto_suspended",
196196+ })
197197+ return &smtp.SMTPError{
198198+ Code: 550,
199199+ EnhancedCode: smtp.EnhancedCode{5, 7, 1},
200200+ Message: "Account suspended by safety system — check status: GET /member/status?did=" + member.DID + " with Authorization: Bearer header",
201201+ }
202202+ }
203203+ h.metrics.OspreyChecksTotal.WithLabelValues("allowed").Inc()
204204+ }
205205+206206+ // Check labels (fail-closed)
207207+ ok, err := h.labelChecker.CheckLabels(ctx, member.DID, member.SendCount)
208208+ if err != nil {
209209+ log.Printf("label.check: did=%s result=error labeler_reachable=false error=%v", member.DID, err)
210210+ h.metrics.LabelerReachable.Set(0)
211211+ h.metrics.MessagesRejected.WithLabelValues("label_denied").Inc()
212212+ h.ospreyEmitter.Emit(ctx, osprey.EventData{
213213+ EventType: osprey.EventRelayRejected,
214214+ SenderDID: member.DID,
215215+ SenderDomain: member.Domain,
216216+ RejectReason: "label_unavailable",
217217+ })
218218+ return fmt.Errorf("451 temporary error — label verification unavailable")
219219+ }
220220+ h.metrics.LabelerReachable.Set(1)
221221+ if !ok {
222222+ log.Printf("label.check: did=%s result=denied", member.DID)
223223+ h.metrics.MessagesRejected.WithLabelValues("label_denied").Inc()
224224+ h.ospreyEmitter.Emit(ctx, osprey.EventData{
225225+ EventType: osprey.EventRelayRejected,
226226+ SenderDID: member.DID,
227227+ SenderDomain: member.Domain,
228228+ RejectReason: "label_denied",
229229+ })
230230+ return fmt.Errorf("550 sending not authorized — required labels missing")
231231+ }
232232+ log.Printf("label.check: did=%s result=ok cache_hit=false", member.DID)
233233+ return nil
234234+}
235235+236236+// Accept is the SMTP DATA-acceptance handler: per-batch capacity
237237+// pre-check, suppression filtering, atomic batch rate reservation,
238238+// per-recipient DKIM signing + persistence + enqueue, and finally
239239+// the relay_attempt event emission. Returns an error to the SMTP
240240+// client only when the entire batch should be rejected; partial
241241+// recipient failures are aggregated and logged but DON'T fail the
242242+// whole DATA (see duplicate-delivery rationale below).
243243+func (h *submissionHandler) Accept(member *relay.AuthMember, from string, to []string, data []byte) error {
244244+ // Pre-check queue capacity for the full batch BEFORE consuming rate budget.
245245+ // This prevents partial delivery: if we enqueue 2 of 5 recipients then fail,
246246+ // the client retries all 5, duplicating the first 2.
247247+ if !h.queue.HasCapacity(len(to)) {
248248+ h.metrics.MessagesRejected.WithLabelValues("queue_full").Inc()
249249+ return fmt.Errorf("451 delivery queue full — try again later")
250250+ }
251251+252252+ // Classify the message once from the X-Atmos-Category header.
253253+ // User-initiated transactional categories (login-link, password-reset,
254254+ // mfa-otp, verification) bypass List-Unsubscribe and the suppression
255255+ // list — both behaviors break the auth/login flow the recipient just
256256+ // initiated. Header is stripped before DKIM signing further down so
257257+ // the internal classification doesn't leak to receivers.
258258+ category := relay.ParseCategory(data)
259259+ data = relay.StripCategoryHeader(data)
260260+ isTransactional := category.IsUserInitiatedTransactional()
261261+262262+ // Filter out suppressed recipients BEFORE consuming rate budget so
263263+ // an unsubscribed recipient doesn't count against the member's daily
264264+ // limit. Rejecting the whole batch here would surprise senders who
265265+ // include a mix of subscribed and unsubscribed addresses — instead
266266+ // we quietly drop suppressed recipients and proceed with the rest.
267267+ // If ALL recipients are suppressed, return 550.
268268+ //
269269+ // Skip the suppression check entirely for user-initiated transactional
270270+ // mail: a stray unsub click on a previous OTP must not silently drop
271271+ // the next login link.
272272+ var deliverable []string
273273+ var suppressedCount int
274274+ if h.unsubscriber != nil && !isTransactional {
275275+ for _, r := range to {
276276+ supp, err := h.store.IsSuppressed(context.Background(), member.DID, r)
277277+ if err != nil {
278278+ log.Printf("suppression.check_error: did=%s recipient=%s error=%v", member.DID, r, err)
279279+ // Fail open — a DB read error shouldn't block a legitimate send.
280280+ deliverable = append(deliverable, r)
281281+ continue
282282+ }
283283+ if supp {
284284+ suppressedCount++
285285+ log.Printf("smtp.suppressed: did=%s recipient=%s", member.DID, r)
286286+ h.metrics.MessagesRejected.WithLabelValues("suppressed").Inc()
287287+ continue
288288+ }
289289+ deliverable = append(deliverable, r)
290290+ }
291291+ if len(deliverable) == 0 {
292292+ log.Printf("smtp.all_suppressed: did=%s recipients=%d", member.DID, len(to))
293293+ return &smtp.SMTPError{
294294+ Code: 550,
295295+ EnhancedCode: smtp.EnhancedCode{5, 7, 1},
296296+ Message: "All recipients have unsubscribed",
297297+ }
298298+ }
299299+ if suppressedCount > 0 {
300300+ log.Printf("smtp.partial_suppressed: did=%s total=%d suppressed=%d deliverable=%d",
301301+ member.DID, len(to), suppressedCount, len(deliverable))
302302+ }
303303+ } else {
304304+ deliverable = to
305305+ }
306306+307307+ // Atomically check rate limits AND record the sends for the full batch.
308308+ // This eliminates the TOCTOU race where concurrent sessions could both pass
309309+ // a check-only call before either records. Uses the same label policy as
310310+ // Check above (highly_trusted skips warming, burst_warming throttles).
311311+ var batchPolicy *relay.LabelPolicy
312312+ if h.ospreyEnforcer != nil {
313313+ p, err := h.ospreyEnforcer.GetPolicy(context.Background(), member.DID)
314314+ if errors.Is(err, relay.ErrOspreyColdCache) {
315315+ // Same cold-cache fail-closed as the per-msg path;
316316+ // reject the batch with 451 so the sender retries
317317+ // when Osprey is healthy again.
318318+ return fmt.Errorf("451 osprey unreachable, please retry")
319319+ }
320320+ batchPolicy = p
321321+ }
322322+ hourly, daily := relay.WarmingLimitsForPolicy(h.warmingCfg, member.CreatedAt, member.HourlyLimit, member.DailyLimit, batchPolicy)
323323+ if err := h.rateLimiter.CheckBatchAndRecord(context.Background(), member.DID, len(deliverable), hourly, daily); err != nil {
324324+ if rle, ok := err.(*relay.RateLimitError); ok {
325325+ rle.Tier = relay.MemberTier(h.warmingCfg, member.CreatedAt, time.Now())
326326+ h.metrics.RateLimitHits.WithLabelValues(rle.LimitType).Inc()
327327+ }
328328+ log.Printf("ratelimit.batch_reject: did=%s recipients=%d hourly_limit=%d daily_limit=%d error=%v",
329329+ member.DID, len(deliverable), hourly, daily, err)
330330+ h.metrics.MessagesRejected.WithLabelValues("rate_limit").Inc()
331331+ return err
332332+ }
333333+334334+ // Content fingerprint computed once from the original data (before
335335+ // per-recipient headers are prepended). Used for both the messages
336336+ // table (content-spray detection) and the Osprey event.
337337+ subject, body := extractSubjectAndBody(data)
338338+ contentFP := relay.ContentFingerprint(subject, body)
339339+340340+ // Multi-RCPT DATA fans out to one queue entry per recipient. If the
341341+ // loop returns early on a per-recipient error, recipients 1..N-1 are
342342+ // already enqueued and the SMTP client will retry the entire DATA
343343+ // (because we returned a transient error), duplicating those
344344+ // recipients. Instead, we collect per-recipient outcomes and only
345345+ // reject the whole DATA when ZERO recipients succeeded.
346346+ outcomes := make([]relay.RecipientOutcome, 0, len(deliverable))
347347+ for _, recipient := range deliverable {
348348+ outcome := relay.RecipientOutcome{Recipient: recipient}
349349+350350+ verpFrom := relay.VERPReturnPath(member.DID, recipient, h.cfg.Domain)
351351+352352+ // Build per-recipient message with its own List-Unsubscribe header.
353353+ // The header references a per-recipient token, so each recipient
354354+ // can unsubscribe only themselves (not the whole batch).
355355+ //
356356+ // Skip List-Unsubscribe entirely for user-initiated transactional
357357+ // mail: adding it to a login link or OTP encourages clicks
358358+ // that would lock the recipient out of their own auth flow.
359359+ perMsgData := data
360360+ if h.unsubscriber != nil && !isTransactional {
361361+ lu, lup := h.unsubscriber.HeaderValues(member.DID, recipient, time.Now())
362362+ perMsgData = prependListUnsubHeaders(data, lu, lup)
363363+ }
364364+ // X-Atmos-Member-Did: stamps the sending member's DID on every
365365+ // outbound message so inbound FBL/ARF reports can be attributed
366366+ // back to a member. Preserved by all major providers in Part 3
367367+ // of their ARF reports. Must come before DKIM signing so the
368368+ // signature covers it (and the DKIM signer includes X-Atmos-*
369369+ // headers in its signed list).
370370+ perMsgData = prependHeader(perMsgData, "X-Atmos-Member-Did", member.DID)
371371+372372+ // Stamp Feedback-ID BEFORE signing so both the member and operator
373373+ // DKIM signatures cover it. Receivers (Gmail in particular) only
374374+ // trust the Feedback-ID for FBL routing when it's authenticated.
375375+ // Category derives from the X-Atmos-Category header —
376376+ // user-initiated transactional mail collapses to "transactional"
377377+ // so receivers don't see internal product distinctions.
378378+ perMsgData = relay.PrependFeedbackID(perMsgData, category.FeedbackIDValue(), member.DID, member.Domain)
379379+380380+ // DKIM sign per-recipient (required because the prepended headers
381381+ // differ per recipient — a shared signature would break on the other
382382+ // recipients). Slight perf cost acceptable for the deliverability win.
383383+ //
384384+ // Dual-domain: member signature first (d=member.Domain, required
385385+ // for DMARC alignment) → operator signature on top (d=atmos.email,
386386+ // carries FBL routing).
387387+ signer := relay.NewDualDomainSigner(member.DKIMKeys, h.operatorKeys, member.Domain, h.cfg.OperatorDKIMDomain)
388388+ signed, signErr := signer.Sign(strings.NewReader(string(perMsgData)))
389389+ if signErr != nil {
390390+ outcome.Err = fmt.Errorf("DKIM sign: %w", signErr)
391391+ log.Printf("smtp.recipient_failed: did=%s recipient=%s stage=dkim error=%v", member.DID, recipient, signErr)
392392+ outcomes = append(outcomes, outcome)
393393+ continue
394394+ }
395395+396396+ // Log message to store
397397+ msgID, insErr := h.store.InsertMessage(context.Background(), &relaystore.Message{
398398+ MemberDID: member.DID,
399399+ FromAddr: from,
400400+ ToAddr: recipient,
401401+ MessageID: extractMessageID(string(data)),
402402+ Status: relaystore.MsgQueued,
403403+ CreatedAt: time.Now().UTC(),
404404+ ContentFingerprint: contentFP,
405405+ })
406406+ if insErr != nil {
407407+ outcome.Err = fmt.Errorf("log message: %w", insErr)
408408+ log.Printf("smtp.recipient_failed: did=%s recipient=%s stage=insert error=%v", member.DID, recipient, insErr)
409409+ outcomes = append(outcomes, outcome)
410410+ continue
411411+ }
412412+ outcome.MsgID = msgID
413413+414414+ // Enqueue for delivery — capacity was pre-checked above so this
415415+ // should only fail on spool I/O errors, not capacity.
416416+ if enqErr := h.queue.Enqueue(&relay.QueueEntry{
417417+ ID: msgID,
418418+ From: verpFrom,
419419+ To: recipient,
420420+ Data: signed,
421421+ MemberDID: member.DID,
422422+ }); enqErr != nil {
423423+ // Mark the row as failed so it doesn't masquerade as queued
424424+ // (the orphan-reconciliation janitor would catch it eventually,
425425+ // but immediate update keeps the messages table consistent).
426426+ if updErr := h.store.UpdateMessageStatus(context.Background(), msgID, relaystore.MsgFailed, 0); updErr != nil {
427427+ log.Printf("smtp.mark_failed_error: did=%s msg_id=%d error=%v", member.DID, msgID, updErr)
428428+ }
429429+ outcome.Err = fmt.Errorf("queue.enqueue: %w", enqErr)
430430+ log.Printf("smtp.recipient_failed: did=%s recipient=%s stage=enqueue msg_id=%d error=%v", member.DID, recipient, msgID, enqErr)
431431+ outcomes = append(outcomes, outcome)
432432+ continue
433433+ }
434434+435435+ // Only count the send AFTER successful enqueue — failed recipients
436436+ // shouldn't burn lifetime send-count budget. Rate counters were
437437+ // pre-recorded for the full batch by CheckBatchAndRecord above; that
438438+ // over-counts on partial failure but the warming/limit window is
439439+ // short enough that the impact is negligible vs. the complexity of
440440+ // rolling back per-recipient rate-counter rows.
441441+ if err := h.store.IncrementSendCount(context.Background(), member.DID); err != nil {
442442+ log.Printf("smtp.send_count_increment_error: did=%s msg_id=%d error=%v", member.DID, msgID, err)
443443+ }
444444+445445+ outcomes = append(outcomes, outcome)
446446+ }
447447+448448+ succeeded, failed, retryAll, lastErr := relay.AggregateRecipientOutcomes(outcomes)
449449+ if h.metrics.PartialDeliveryRecipients != nil {
450450+ if succeeded > 0 {
451451+ h.metrics.PartialDeliveryRecipients.WithLabelValues("succeeded").Add(float64(succeeded))
452452+ }
453453+ if failed > 0 {
454454+ h.metrics.PartialDeliveryRecipients.WithLabelValues("failed").Add(float64(failed))
455455+ }
456456+ }
457457+ if retryAll {
458458+ h.metrics.MessagesRejected.WithLabelValues("delivery_failed").Inc()
459459+ log.Printf("smtp.delivery_all_failed: did=%s recipients=%d last_error=%v", member.DID, len(deliverable), lastErr)
460460+ return fmt.Errorf("451 delivery queue error — try again later: %w", lastErr)
461461+ }
462462+ if failed > 0 {
463463+ if h.metrics.PartialDeliveries != nil {
464464+ h.metrics.PartialDeliveries.Inc()
465465+ }
466466+ log.Printf("smtp.partial_delivery: did=%s succeeded=%d failed=%d last_error=%v", member.DID, succeeded, failed, lastErr)
467467+ }
468468+469469+ // Emit relay_attempt event after successful queuing. Enriched
470470+ // with velocity counters so Osprey rules can do stateless
471471+ // burst + bounce reputation checks (SML has no windowed-count
472472+ // primitive). See cmd/relay/events.go for the field set.
473473+ emitRelayAttemptEvent(context.Background(), h.store, h.ospreyEmitter, member, len(deliverable), contentFP)
474474+475475+ return nil
476476+}
···9494 covers every member.
9595- **Atproto OAuth** (PAR + DPoP + PKCE + `private_key_jwt`) for
9696 self-service enrollment. Works against `bsky.social` and any
9797- federating ePDS — we've validated the full handshake with at
9898- least one non-bsky PDS.
9797+ federating self-hosted PDS — we've validated the full handshake
9898+ with at least one non-bsky PDS.
9999100100## What changed from the original plan
101101
+193
docs/offsite-backups.md
···11+# Offsite Restic Backups — Activation Runbook
22+33+Atmosphere Mail's local restic backup runs every 6 hours on each VPS,
44+writing snapshots to a Hetzner Cloud Volume attached to that same VPS.
55+Hetzner-native VPS snapshots (PR #337) cover the case where the volume
66+itself fails. This document covers the third layer: an offsite copy that
77+survives Hetzner-account-level loss (account suspension, region-wide
88+incident, billing failure).
99+1010+The `services.restic-offsite-copy` NixOS module ships dormant. Activate
1111+it per host using the runbook below.
1212+1313+## 1. Pick a destination
1414+1515+Three reasonable destinations, in increasing order of operational
1616+independence from Hetzner:
1717+1818+| Destination | Cost (5GB) | Vendor-loss protection | Setup effort |
1919+|---|---|---|---|
2020+| **SFTP via Tailnet** to a homelab host | $0 | Partial — homelab + Hetzner are independent failure domains | Lowest — SSH key only |
2121+| **Hetzner Storage Box** (BX11) | ~€3.20/mo for 1TB | None — same Hetzner account | Low — Robot console |
2222+| **Backblaze B2** | ~$0.03/mo at 5GB | Full — separate vendor | Medium — new account |
2323+2424+Recommendation: **SFTP via Tailnet to a homelab host** for the immediate
2525+gap (geographic + vendor independence at zero marginal cost), graduating
2626+to **B2 later** once the cooperative grows past a handful of members.
2727+2828+## 2. Provision credentials
2929+3030+### Option A: SFTP via Tailnet (recommended for now)
3131+3232+On the destination host (e.g. `big-nix`):
3333+3434+```bash
3535+# Create the backup directory and a dedicated user
3636+sudo useradd -m -d /srv/atmos-backup atmos-backup
3737+sudo install -d -o atmos-backup -g atmos-backup -m 0700 /srv/atmos-backup/relay
3838+sudo install -d -o atmos-backup -g atmos-backup -m 0700 /srv/atmos-backup/ops
3939+4040+# Generate an SSH key on each VPS, then authorize them here.
4141+# (Run on atmos-relay and atmos-ops separately to get two pubkeys.)
4242+sudo -u atmos-backup mkdir -p /srv/atmos-backup/.ssh
4343+sudo -u atmos-backup tee -a /srv/atmos-backup/.ssh/authorized_keys < /tmp/relay-and-ops.pub
4444+sudo chmod 600 /srv/atmos-backup/.ssh/authorized_keys
4545+```
4646+4747+On each VPS (atmos-relay, atmos-ops), generate the SSH key the offsite
4848+job will use:
4949+5050+```bash
5151+ssh root@atmos-relay 'ssh-keygen -t ed25519 -N "" -f /root/.ssh/restic-offsite -C atmos-relay-offsite'
5252+ssh root@atmos-relay 'cat /root/.ssh/restic-offsite.pub' # paste into authorized_keys above
5353+```
5454+5555+Then capture the destination host's SSH host key for pinning:
5656+5757+```bash
5858+ssh-keyscan -t ed25519 kafka-broker.internal > /tmp/restic-offsite-known-hosts
5959+```
6060+6161+Store that file's contents as a sops secret named
6262+`restic_offsite_known_hosts` (one per host in `relay.yaml` / `ops.yaml`).
6363+6464+### Option B: Backblaze B2
6565+6666+1. Create a Backblaze account, then create a private bucket per host:
6767+ `atmos-relay-backup` and `atmos-ops-backup`.
6868+2. Create an Application Key scoped to those buckets with `read+write`
6969+ capabilities. Save the `keyID` and `applicationKey`.
7070+3. Add to sops:
7171+7272+```bash
7373+sops infra/secrets/relay.yaml
7474+# add:
7575+# restic_b2_account_id: <keyID>
7676+# restic_b2_account_key: <applicationKey>
7777+7878+sops infra/secrets/ops.yaml
7979+# same keys
8080+```
8181+8282+## 3. Wire the sops template
8383+8484+Add to the host's NixOS config (in `default.nix` for atmos-relay, or
8585+`atmos-ops.nix` for atmos-ops) inside the existing sops block:
8686+8787+```nix
8888+# For B2:
8989+sops.secrets.restic_b2_account_id = {
9090+ owner = "root"; group = "root"; mode = "0400";
9191+ sopsFile = ../secrets/relay.yaml; # or ops.yaml
9292+};
9393+sops.secrets.restic_b2_account_key = {
9494+ owner = "root"; group = "root"; mode = "0400";
9595+ sopsFile = ../secrets/relay.yaml;
9696+};
9797+sops.templates."restic-offsite-env" = {
9898+ owner = "root"; group = "root"; mode = "0400";
9999+ content = ''
100100+ B2_ACCOUNT_ID=${config.sops.placeholder.restic_b2_account_id}
101101+ B2_ACCOUNT_KEY=${config.sops.placeholder.restic_b2_account_key}
102102+ '';
103103+};
104104+105105+# For SFTP via Tailnet (no env vars needed; SSH key + known_hosts only):
106106+sops.secrets.restic_offsite_known_hosts = {
107107+ owner = "root"; group = "root"; mode = "0400";
108108+ sopsFile = ../secrets/relay.yaml;
109109+};
110110+```
111111+112112+## 4. Enable the module
113113+114114+In the same file:
115115+116116+```nix
117117+services.restic-offsite-copy = {
118118+ enable = true;
119119+ sourceRepo = "/var/lib/atmos-backup/restic-repo";
120120+121121+ # B2:
122122+ destRepo = "b2:atmos-relay-backup:atmos-relay";
123123+ environmentFile = config.sops.templates."restic-offsite-env".path;
124124+125125+ # OR — SFTP via Tailnet:
126126+ destRepo = "sftp:atmos-backup@kafka-broker.internal:/srv/atmos-backup/relay";
127127+ sshKnownHostsFile = config.sops.secrets.restic_offsite_known_hosts.path;
128128+};
129129+```
130130+131131+(Pick exactly one `destRepo` per host — comment out the other.)
132132+133133+## 5. Deploy + verify
134134+135135+```bash
136136+# Deploy via Gitea Actions ops-deploy / relay-deploy workflow
137137+# (don't bypass CI — let the deploy run the standard path).
138138+139139+# After deploy, on the VPS:
140140+ssh root@atmos-relay 'systemctl list-timers restic-offsite-copy'
141141+ssh root@atmos-relay 'systemctl start restic-offsite-copy.service'
142142+ssh root@atmos-relay 'journalctl -u restic-offsite-copy.service --no-pager | tail -50'
143143+144144+# First run initializes the destination repo. You should see:
145145+# "Destination repo ... not initialized; initializing"
146146+# "created restic repository ... at <destRepo>"
147147+# "Copying snapshots from ... to ..."
148148+# <snapshot count>
149149+# "Offsite copy complete"
150150+151151+# Verify offsite contents (B2):
152152+ssh root@atmos-relay '
153153+ source <(grep ^B2_ /run/secrets/.../restic-offsite-env)
154154+ export B2_ACCOUNT_ID B2_ACCOUNT_KEY
155155+ restic --repo b2:atmos-relay-backup:atmos-relay \
156156+ --password-file /root/.restic-password \
157157+ snapshots
158158+'
159159+160160+# Verify offsite contents (SFTP):
161161+ssh root@atmos-relay '
162162+ restic --repo "sftp:atmos-backup@kafka-broker.internal:/srv/atmos-backup/relay" \
163163+ --password-file /root/.restic-password \
164164+ snapshots
165165+'
166166+```
167167+168168+The timer fires daily at 02:00 UTC (with up to 1h randomized delay).
169169+170170+## Recovery drill
171171+172172+Once a quarter, restore a snapshot to a scratch directory and verify:
173173+174174+```bash
175175+# From atmos-relay:
176176+mkdir /tmp/restore-test
177177+restic --repo <destRepo> --password-file /root/.restic-password \
178178+ restore latest --target /tmp/restore-test
179179+sqlite3 /tmp/restore-test/var/lib/atmos-backup/dumps/relay.sqlite "SELECT COUNT(*) FROM members"
180180+# ... compare against live `relay.sqlite`'s member count ±drift since snapshot
181181+rm -rf /tmp/restore-test
182182+```
183183+184184+If the count looks wildly wrong, the snapshot is suspect — investigate
185185+`backupPrepareCommand` in `default.nix` and the source SQLite hot-backup
186186+output before the next quarterly drill.
187187+188188+## Pricing note (B2 path)
189189+190190+5GB stored = ~$0.03/mo storage. Daily copies of incremental data ~50MB
191191+each = ~$0.0006/day in egress (Hetzner egress bills separately and is
192192+generous up to 20TB on cpx21 — well below). Total <$0.05/mo all-in for
193193+the foreseeable cooperative size.
···1010 image = "debian-12" # nixos-anywhere replaces with NixOS
1111 location = "ash" # Ashburn, VA
12121313+ # Hetzner-native daily snapshots, 7-day retention. +20% server price
1414+ # (~€1.60/mo). The relay volume holds member DKIM private keys,
1515+ # member records, attestation rkeys, and contact emails — none of
1616+ # which are reproducible elsewhere. Local restic backups live on the
1717+ # same volume as the data (#221), so a volume failure today destroys
1818+ # both data and backups simultaneously. VPS-level snapshots live on
1919+ # Hetzner's separate storage cluster and survive that failure mode.
2020+ backups = true
2121+1322 # Cloud-init: lock root password, inject SSH key for bootstrap.
1423 # chpasswd.expire: false prevents PAM from requiring password change
1524 # (Hetzner images mark root password expired by default).
···166175 server_type = "cpx21"
167176 image = "debian-12"
168177 location = "ash"
178178+179179+ # Hetzner-native daily snapshots, 7-day retention. +20% server price
180180+ # (~€1.60/mo). atmos-ops holds the labeler signing key, Osprey rule
181181+ # state, and the labels SQLite — recreating the labeler from scratch
182182+ # means re-issuing every label and breaks atproto label-history
183183+ # auditability. Snapshots are the single recovery primitive that
184184+ # survives volume failure.
185185+ backups = true
169186170187 user_data = <<-EOF
171188 #cloud-config
+90-4
infra/nixos/atmos-ops.nix
···1111{
1212 imports = [
1313 ./disko.nix
1414+ ./restic-offsite.nix
1415 ];
15161617 options = {
···119120120121 sops.secrets.bunny_api_key = {};
121122 sops.secrets.ntfy_gatus_token = {};
123123+ sops.secrets.restic_offsite_known_hosts = {};
124124+ sops.templates."restic-offsite-known-hosts" = {
125125+ content = config.sops.placeholder.restic_offsite_known_hosts;
126126+ };
122127123128 # Environment file for label-api (PG DSN)
124129 sops.templates."label-api-env" = {
···247252 dependsOn = [ "osprey-kafka" "osprey-postgres" ];
248253 };
249254250250- # Clone osprey rules from Gitea before worker starts
255255+ # Clone osprey rules from Gitea, sync into the bind-mount path, and
256256+ # restart the worker if anything changed. Shipped without
257257+ # RemainAfterExit=true (#251) — the previous one-shot-then-active
258258+ # pattern meant the unit ran exactly once on boot, after which any
259259+ # rule changes in the repo silently never reached production. The
260260+ # service is now idempotent and free to be re-triggered by:
261261+ # - the timer below (hourly autosync, defense in depth)
262262+ # - ops-deploy.yml after a NixOS switch on osprey/** path changes
263263+ # - manual `systemctl start osprey-rules-sync` for one-off pushes.
251264 systemd.services.osprey-rules-sync = {
252252- description = "Clone Osprey rules from Gitea";
265265+ description = "Sync Osprey rules from Gitea, restart worker on change";
253266 after = [ "network-online.target" ];
254267 wants = [ "network-online.target" ];
268268+ # Still wantedBy/before docker-osprey-worker so the FIRST boot
269269+ # gets rules in place before the worker tries to load them.
255270 wantedBy = [ "docker-osprey-worker.service" ];
256271 before = [ "docker-osprey-worker.service" ];
257272 serviceConfig = {
258273 Type = "oneshot";
259259- RemainAfterExit = true;
260274 EnvironmentFile = config.sops.templates."gitea-env".path;
261275 };
262262- path = [ pkgs.git ];
276276+ path = [ pkgs.git pkgs.coreutils pkgs.systemd ];
263277 script = ''
278278+ set -eu
264279 REPO_DIR=/var/lib/osprey-rules/repo
265280 COMBINED=/var/lib/osprey-rules/combined
266281 mkdir -p /var/lib/osprey-rules
···274289 "$REPO_DIR"
275290 fi
276291292292+ # Compute pre-sync content hash so we only restart the worker
293293+ # when something actually changed. Using a deterministic file
294294+ # listing (sort) so directory iteration order doesn't make the
295295+ # hash flap. Empty COMBINED/ on first boot hashes to the
296296+ # constant-empty-list digest, which is fine — different from
297297+ # any populated state.
298298+ PRE_HASH=""
299299+ if [ -d "$COMBINED" ]; then
300300+ PRE_HASH=$(find "$COMBINED" -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)
301301+ fi
302302+277303 rm -rf "$COMBINED"
278304 mkdir -p "$COMBINED/config"
279305 cp -r "$REPO_DIR"/osprey/rules/. "$COMBINED/"
280306 cp "$REPO_DIR"/osprey/config/*.yaml "$COMBINED/config/"
307307+308308+ POST_HASH=$(find "$COMBINED" -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)
309309+310310+ if [ "$PRE_HASH" != "$POST_HASH" ]; then
311311+ echo "osprey-rules-sync: rules changed (pre=$PRE_HASH post=$POST_HASH)"
312312+ # --no-block: don't deadlock on the worker's own pre-stop
313313+ # hooks, which can take 30s under Kafka rebalance. We're
314314+ # firing-and-forgetting; the next sync run will retry if
315315+ # the restart silently failed.
316316+ systemctl --no-block restart docker-osprey-worker.service || true
317317+ else
318318+ echo "osprey-rules-sync: no changes"
319319+ fi
281320 '';
321321+ };
322322+323323+ # Hourly resync as defense-in-depth so a missed deploy or unmerged
324324+ # local edit on a Gitea runner can't leave production stale for
325325+ # days. OnBootSec=5min lets boot finish before the first sync; the
326326+ # initial wantedBy/before docker-osprey-worker pairing already
327327+ # covered the boot-time sync via the service's own ordering.
328328+ systemd.timers.osprey-rules-sync = {
329329+ description = "Periodic Osprey rules sync from Gitea";
330330+ wantedBy = [ "timers.target" ];
331331+ timerConfig = {
332332+ OnBootSec = "5min";
333333+ OnUnitActiveSec = "1h";
334334+ Persistent = true;
335335+ };
282336 };
283337284338 # -------------------------------------------------------------------
···749803 Persistent = true;
750804 RandomizedDelaySec = "30m";
751805 };
806806+ pruneOpts = [
807807+ "--keep-daily 7"
808808+ "--keep-weekly 4"
809809+ "--keep-monthly 3"
810810+ ];
811811+ };
812812+813813+ # -------------------------------------------------------------------
814814+ # Offsite backup — daily copy to big-nix via Tailscale SFTP
815815+ # -------------------------------------------------------------------
816816+ systemd.services.restic-offsite-keygen = {
817817+ description = "Generate SSH key for offsite restic copy if missing";
818818+ after = [ "local-fs.target" ];
819819+ wantedBy = [ "multi-user.target" ];
820820+ serviceConfig.Type = "oneshot";
821821+ serviceConfig.RemainAfterExit = true;
822822+ script = ''
823823+ if [ ! -f /root/.ssh/restic-offsite ]; then
824824+ mkdir -p /root/.ssh && chmod 0700 /root/.ssh
825825+ ${pkgs.openssh}/bin/ssh-keygen -t ed25519 -N "" \
826826+ -f /root/.ssh/restic-offsite -C "atmos-ops-offsite"
827827+ chmod 0400 /root/.ssh/restic-offsite
828828+ fi
829829+ '';
830830+ };
831831+832832+ services.restic-offsite-copy = {
833833+ enable = true;
834834+ sourceRepo = "/var/lib/atmos-backup/restic-repo";
835835+ destRepo = "sftp:atmos-backup@kafka-broker.internal:/srv/atmos-backup/ops";
836836+ sshKnownHostsFile = config.sops.templates."restic-offsite-known-hosts".path;
837837+ afterUnits = [ "restic-password-init.service" "restic-offsite-keygen.service" "local-fs.target" ];
752838 pruneOpts = [
753839 "--keep-daily 7"
754840 "--keep-weekly 4"
···11+# SPDX-License-Identifier: AGPL-3.0-or-later
22+#
33+# Reusable NixOS module: copy a local restic repository to an offsite
44+# destination on a timer.
55+#
66+# Why this exists (#221):
77+# The local restic backups on atmos-relay and atmos-ops live on the
88+# same Hetzner Cloud Volume as the data they back up, with the
99+# restic password on the boot disk of the same VPS. A single volume
1010+# failure (or vendor-side incident on that VPS) destroys data and
1111+# "backups" simultaneously. PR #337 enabled Hetzner-native VPS
1212+# snapshots which survive volume failure but still live in the same
1313+# Hetzner account; this module adds a third layer that survives
1414+# account-level loss too.
1515+#
1616+# Design choices:
1717+# - Vendor-agnostic. `destRepo` accepts any restic-supported URL:
1818+# b2:bucket-name:path
1919+# s3:s3.example.com/bucket/path
2020+# sftp:user@host:/path/to/repo (works over Tailnet too)
2121+# rest:https://host:8000/path
2222+# - Copies the existing local repo rather than re-running the
2323+# backup. `restic copy --from-repo X` ships the snapshot graph
2424+# verbatim, so local and offsite always represent the same state
2525+# and there's no double work generating dumps.
2626+# - Default-off (`enable = false`). Importers wire it dormant; flip
2727+# `enable = true` only after the destination is provisioned and
2828+# credentials are in sops. No credential reference is made when
2929+# `enable = false` — sops never sees a missing-key error.
3030+# - Fails closed on missing source repo / source password — emits a
3131+# warning and exits 0 rather than spamming a failure-mail loop on
3232+# a freshly-provisioned host where the local repo isn't ready yet.
3333+{ config, lib, pkgs, ... }:
3434+3535+let
3636+ cfg = config.services.restic-offsite-copy;
3737+3838+ # Extract "user@host" from "sftp:user@host:/path" for the SSH command.
3939+ # restic's -o sftp.command needs the full SSH invocation including user@host.
4040+ sftpUserHost = builtins.head (builtins.split ":" (lib.removePrefix "sftp:" cfg.destRepo));
4141+4242+ sftpFlags = lib.optionalString (cfg.sshKnownHostsFile != null)
4343+ "-o 'sftp.command=ssh -i ${cfg.sshKeyPath} -o UserKnownHostsFile=${cfg.sshKnownHostsFile} -o StrictHostKeyChecking=yes ${sftpUserHost} -s sftp'";
4444+in
4545+{
4646+ options.services.restic-offsite-copy = {
4747+ enable = lib.mkEnableOption "Periodic copy of a local restic repository to an offsite restic repository";
4848+4949+ sourceRepo = lib.mkOption {
5050+ type = lib.types.str;
5151+ default = "";
5252+ description = ''
5353+ Filesystem path to the local restic repository to copy from.
5454+ Empty string is rejected by an assertion when `enable = true`,
5555+ so the option may be left unset on hosts that do not enable
5656+ the module.
5757+ '';
5858+ example = "/var/lib/atmos-backup/restic-repo";
5959+ };
6060+6161+ sourcePasswordFile = lib.mkOption {
6262+ type = lib.types.str;
6363+ default = "/root/.restic-password";
6464+ description = "Path to the password file for the local repo.";
6565+ };
6666+6767+ destRepo = lib.mkOption {
6868+ type = lib.types.str;
6969+ default = "";
7070+ description = ''
7171+ restic-formatted repository URL for the offsite destination.
7272+ Empty string is rejected by an assertion when `enable = true`.
7373+ Examples:
7474+ "b2:atmos-relay-backup:atmos-relay" (Backblaze B2)
7575+ "s3:s3.amazonaws.com/atmos-backup/atmos-ops" (AWS S3)
7676+ "sftp:scott@kafka-broker.internal:/srv/atmos-backup/relay" (SFTP via Tailnet)
7777+ '';
7878+ example = "b2:atmos-relay-backup:atmos-relay";
7979+ };
8080+8181+ destPasswordFile = lib.mkOption {
8282+ type = lib.types.str;
8383+ default = "/root/.restic-password";
8484+ description = ''
8585+ Path to the password file for the offsite repo. Defaults to the
8686+ same file as the source so a single rotated secret covers both —
8787+ the trade-off is that loss of /root/.restic-password requires
8888+ recovering it from the offsite copy via the volume-resident
8989+ copy, since the offsite is encrypted with the same key.
9090+ '';
9191+ };
9292+9393+ environmentFile = lib.mkOption {
9494+ type = lib.types.nullOr lib.types.path;
9595+ default = null;
9696+ description = ''
9797+ File providing backend-specific credentials as systemd
9898+ environment variables. Examples:
9999+ B2_ACCOUNT_ID=...
100100+ B2_ACCOUNT_KEY=...
101101+ AWS_ACCESS_KEY_ID=...
102102+ AWS_SECRET_ACCESS_KEY=...
103103+ Typically a sops template at /run/secrets/.../restic-offsite-env.
104104+ Owned by root, mode 0400.
105105+ '';
106106+ };
107107+108108+ afterUnits = lib.mkOption {
109109+ type = lib.types.listOf lib.types.str;
110110+ default = [ "restic-password-init.service" "local-fs.target" ];
111111+ description = "systemd units the copy must wait for before running.";
112112+ };
113113+114114+ onCalendar = lib.mkOption {
115115+ type = lib.types.str;
116116+ default = "*-*-* 02:00:00";
117117+ description = ''
118118+ systemd OnCalendar expression for the offsite-copy timer. Daily
119119+ at 02:00 by default — late enough that the every-6h local
120120+ backup at 00:00 has finished, early enough that any failure has
121121+ time to alert before the next business day.
122122+ '';
123123+ };
124124+125125+ randomizedDelaySec = lib.mkOption {
126126+ type = lib.types.str;
127127+ default = "1h";
128128+ description = "systemd RandomizedDelaySec for the offsite-copy timer.";
129129+ };
130130+131131+ sshKnownHostsFile = lib.mkOption {
132132+ type = lib.types.nullOr lib.types.str;
133133+ default = null;
134134+ description = ''
135135+ Path to a known_hosts file used when destRepo is an sftp:// URL.
136136+ Required for sftp destinations to avoid TOFU prompts on first
137137+ run. Typically populated via a sops template containing the
138138+ target host's SSH public key.
139139+ '';
140140+ example = "/run/secrets/restic-offsite-known-hosts";
141141+ };
142142+143143+ sshKeyPath = lib.mkOption {
144144+ type = lib.types.str;
145145+ default = "/root/.ssh/restic-offsite";
146146+ description = "Path to the SSH private key for SFTP destinations.";
147147+ };
148148+149149+ pruneOpts = lib.mkOption {
150150+ type = lib.types.listOf lib.types.str;
151151+ default = [];
152152+ description = ''
153153+ Retention policy flags passed to `restic forget --prune` on the
154154+ destination repo after each copy. Empty list skips pruning
155155+ (destination grows unbounded — not recommended).
156156+ '';
157157+ example = [ "--keep-daily 7" "--keep-weekly 4" "--keep-monthly 3" ];
158158+ };
159159+ };
160160+161161+ config = lib.mkIf cfg.enable {
162162+ assertions = [
163163+ {
164164+ assertion = cfg.sourceRepo != "";
165165+ message = "services.restic-offsite-copy.enable = true but sourceRepo is empty";
166166+ }
167167+ {
168168+ assertion = cfg.destRepo != "";
169169+ message = "services.restic-offsite-copy.enable = true but destRepo is empty";
170170+ }
171171+ ];
172172+173173+ systemd.services.restic-offsite-copy = {
174174+ description = "Copy local restic snapshots to offsite repository";
175175+ after = cfg.afterUnits;
176176+177177+ serviceConfig = {
178178+ Type = "oneshot";
179179+ User = "root";
180180+ Group = "root";
181181+ } // lib.optionalAttrs (cfg.environmentFile != null) {
182182+ EnvironmentFile = cfg.environmentFile;
183183+ };
184184+185185+ path = [ pkgs.restic pkgs.openssh ];
186186+187187+ script = ''
188188+ set -euo pipefail
189189+190190+ # Skip silently if the local repo isn't ready yet — happens on
191191+ # a freshly-provisioned host before the first local backup
192192+ # timer has fired. Better than crashing the timer in a loop.
193193+ if [ ! -f "${cfg.sourceRepo}/config" ]; then
194194+ echo "Source restic repo at ${cfg.sourceRepo} not yet initialized; skipping"
195195+ exit 0
196196+ fi
197197+ if [ ! -f "${cfg.sourcePasswordFile}" ]; then
198198+ echo "Source password file ${cfg.sourcePasswordFile} missing; skipping"
199199+ exit 0
200200+ fi
201201+202202+ # Initialize the destination repo if it doesn't yet exist.
203203+ # --copy-chunker-params makes the destination share the source's
204204+ # chunking params so subsequent `restic copy` calls don't have
205205+ # to recompute hashes — once initialized this flag is ignored.
206206+ if ! restic ${sftpFlags} --repo "${cfg.destRepo}" \
207207+ --password-file "${cfg.destPasswordFile}" \
208208+ cat config >/dev/null 2>&1; then
209209+ echo "Destination repo ${cfg.destRepo} not initialized; initializing"
210210+ restic ${sftpFlags} --repo "${cfg.destRepo}" \
211211+ --password-file "${cfg.destPasswordFile}" \
212212+ init \
213213+ --copy-chunker-params \
214214+ --from-repo "${cfg.sourceRepo}" \
215215+ --from-password-file "${cfg.sourcePasswordFile}"
216216+ fi
217217+218218+ echo "Copying snapshots from ${cfg.sourceRepo} to ${cfg.destRepo}"
219219+ restic ${sftpFlags} --repo "${cfg.destRepo}" \
220220+ --password-file "${cfg.destPasswordFile}" \
221221+ copy \
222222+ --from-repo "${cfg.sourceRepo}" \
223223+ --from-password-file "${cfg.sourcePasswordFile}"
224224+225225+ ${lib.optionalString (cfg.pruneOpts != []) ''
226226+ echo "Pruning offsite repo with: ${lib.concatStringsSep " " cfg.pruneOpts}"
227227+ restic ${sftpFlags} --repo "${cfg.destRepo}" \
228228+ --password-file "${cfg.destPasswordFile}" \
229229+ forget --prune ${lib.concatStringsSep " " cfg.pruneOpts}
230230+ ''}
231231+232232+ echo "Offsite copy complete"
233233+ '';
234234+ };
235235+236236+ systemd.timers.restic-offsite-copy = {
237237+ wantedBy = [ "timers.target" ];
238238+ timerConfig = {
239239+ OnCalendar = cfg.onCalendar;
240240+ Persistent = true;
241241+ RandomizedDelaySec = cfg.randomizedDelaySec;
242242+ };
243243+ };
244244+ };
245245+}
···2727// Splitting the handler into discrete phases (validate → load+verify →
2828// authorize → provision → persist → dispatch → respond) makes each step
2929// individually unit-testable and keeps handleEnroll itself a short
3030-// orchestration function. See #223.
3030+// orchestration function.
3131type enrollHTTPError struct {
3232 Status int
3333 Message string
···7373// --- Phase 2: load + verify -------------------------------------------------
74747575// loadAndVerifyPending fetches the pending enrollment by token, runs the
7676-// OAuth-cookie identity gate (#207), enforces the expiry cutoff, and
7676+// OAuth-cookie identity gate, enforces the expiry cutoff, and
7777// re-runs DNS TXT verification. Returns the pending row on success or an
7878// HTTP error otherwise.
7979//
···9191 return nil, enrollErrf(http.StatusNotFound, "token not found or already used")
9292 }
93939494- // OAuth-verified DID gate, second layer (#207). The pending row was
9494+ // OAuth-verified DID gate, second layer. The pending row was
9595 // created by handleEnrollStart, which already enforces the same
9696 // check, but a stale pending row from before the verifier was wired
9797 // or a path that bypasses /admin/enroll-start altogether (e.g. an
+146
internal/admin/enroll_start_phases.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package admin
44+55+import (
66+ "context"
77+ "encoding/json"
88+ "io"
99+ "log"
1010+ "net/http"
1111+ "strings"
1212+ "time"
1313+1414+ didpkg "atmosphere-mail/internal/did"
1515+ "atmosphere-mail/internal/enroll"
1616+ "atmosphere-mail/internal/relaystore"
1717+)
1818+1919+// enrollStartParsed holds the validated, normalized fields parsed from an
2020+// enroll-start request. Produced by validateEnrollStartRequest, consumed
2121+// by subsequent phases.
2222+type enrollStartParsed struct {
2323+ DID string
2424+ Domain string
2525+ ContactEmail string
2626+ Terms bool
2727+}
2828+2929+// --- Phase 1: validate -------------------------------------------------------
3030+3131+// validateEnrollStartRequest parses and normalises the JSON body from
3232+// POST /admin/enroll-start. Returns the parsed fields or an HTTP error.
3333+func validateEnrollStartRequest(r *http.Request) (*enrollStartParsed, *enrollHTTPError) {
3434+ var req EnrollStartRequest
3535+ if err := json.NewDecoder(io.LimitReader(r.Body, 4096)).Decode(&req); err != nil {
3636+ return nil, enrollErrf(http.StatusBadRequest, "invalid JSON body")
3737+ }
3838+ did := strings.TrimSpace(req.DID)
3939+ domain := strings.TrimSpace(strings.ToLower(req.Domain))
4040+ contactEmail := strings.TrimSpace(req.ContactEmail)
4141+4242+ if did == "" || domain == "" {
4343+ return nil, enrollErrf(http.StatusBadRequest, "did and domain fields required")
4444+ }
4545+ if !didpkg.Valid(did) {
4646+ return nil, enrollErrf(http.StatusBadRequest, "invalid DID format")
4747+ }
4848+ if !isValidDomain(domain) {
4949+ return nil, enrollErrf(http.StatusBadRequest, "invalid domain format")
5050+ }
5151+ if contactEmail != "" && !strings.Contains(contactEmail, "@") {
5252+ return nil, enrollErrf(http.StatusBadRequest, "contactEmail must be a valid email address")
5353+ }
5454+5555+ return &enrollStartParsed{
5656+ DID: did,
5757+ Domain: domain,
5858+ ContactEmail: contactEmail,
5959+ Terms: req.TermsAccepted,
6060+ }, nil
6161+}
6262+6363+// --- Phase 2: OAuth gate -----------------------------------------------------
6464+6565+// checkEnrollStartOAuth verifies that the caller's OAuth cookie matches
6666+// the claimed DID. When enrollAuthVerifier is nil (legacy deployments or
6767+// tests), this is a no-op.
6868+func (a *API) checkEnrollStartOAuth(r *http.Request, did string) *enrollHTTPError {
6969+ if a.enrollAuthVerifier == nil {
7070+ return nil
7171+ }
7272+ verifiedDID, ok := a.enrollAuthVerifier.VerifyAuthCookie(r)
7373+ if !ok {
7474+ log.Printf("admin.enroll_start.no_oauth: claimed_did=%s", did)
7575+ return enrollErrf(http.StatusForbidden, "identity verification required — sign in with your handle before enrolling a domain")
7676+ }
7777+ if !strings.EqualFold(verifiedDID, did) {
7878+ log.Printf("admin.enroll_start.did_mismatch: claimed=%s verified=%s", did, verifiedDID)
7979+ return enrollErrf(http.StatusForbidden, "claimed DID does not match the verified identity from your sign-in")
8080+ }
8181+ return nil
8282+}
8383+8484+// --- Phase 3: domain eligibility ---------------------------------------------
8585+8686+// checkEnrollStartEligibility confirms the domain is unclaimed and the DID
8787+// hasn't exceeded its per-account quota. Returns the existing domains list
8888+// (needed by the create phase for contact-email fallback) or an HTTP error.
8989+func (a *API) checkEnrollStartEligibility(ctx context.Context, did, domain string) ([]relaystore.MemberDomain, *enrollHTTPError) {
9090+ existing, err := a.store.GetMemberDomain(ctx, domain)
9191+ if err != nil {
9292+ log.Printf("admin.enroll_start: did=%s error=%v", did, err)
9393+ return nil, enrollErrf(http.StatusInternalServerError, "internal error")
9494+ }
9595+ if existing != nil {
9696+ if existing.DID == did {
9797+ return nil, enrollErrf(http.StatusConflict, "You've already enrolled this domain. Sign in at /account to manage it.")
9898+ }
9999+ return nil, enrollErrf(http.StatusConflict, "This domain is registered to another account.")
100100+ }
101101+102102+ existingDomains, err := a.store.ListMemberDomains(ctx, did)
103103+ if err != nil {
104104+ log.Printf("admin.enroll_start: did=%s list_domains_error=%v", did, err)
105105+ return nil, enrollErrf(http.StatusInternalServerError, "internal error")
106106+ }
107107+ if len(existingDomains) >= maxDomainsPerMember {
108108+ return nil, enrollErrf(http.StatusConflict, "domain limit reached — your account currently supports up to %d sending domains", maxDomainsPerMember)
109109+ }
110110+111111+ return existingDomains, nil
112112+}
113113+114114+// --- Phase 4: create pending -------------------------------------------------
115115+116116+// createPendingEnrollment generates a token and persists the pending row.
117117+// Returns the response payload or an HTTP error.
118118+func (a *API) createPendingEnrollment(ctx context.Context, parsed *enrollStartParsed, contactEmail string) (*EnrollStartResponse, *enrollHTTPError) {
119119+ token, err := enroll.NewToken()
120120+ if err != nil {
121121+ log.Printf("admin.enroll_start: did=%s token_error=%v", parsed.DID, err)
122122+ return nil, enrollErrf(http.StatusInternalServerError, "internal error")
123123+ }
124124+ now := time.Now().UTC()
125125+ pending := &relaystore.PendingEnrollment{
126126+ Token: token,
127127+ DID: parsed.DID,
128128+ Domain: parsed.Domain,
129129+ ContactEmail: contactEmail,
130130+ TermsAccepted: parsed.Terms,
131131+ CreatedAt: now,
132132+ ExpiresAt: now.Add(pendingEnrollmentTTL),
133133+ }
134134+ if err := a.store.CreatePendingEnrollment(ctx, pending); err != nil {
135135+ log.Printf("admin.enroll_start: did=%s domain=%s error=%v", parsed.DID, parsed.Domain, err)
136136+ return nil, enrollErrf(http.StatusInternalServerError, "internal error")
137137+ }
138138+139139+ log.Printf("admin.enroll_start: did=%s domain=%s token_created=true", parsed.DID, parsed.Domain)
140140+ return &EnrollStartResponse{
141141+ Token: token,
142142+ DNSName: enroll.RecordName(parsed.Domain),
143143+ DNSValue: enroll.ExpectedValue(token),
144144+ ExpiresAt: pending.ExpiresAt.Format(time.RFC3339),
145145+ }, nil
146146+}
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package admin
44+55+// Cross-component integration test: full self-service enrollment funnel
66+// through to SMTP AUTH success. The credential seam tested here:
77+//
88+// POST /admin/enroll-start
99+// → publish DNS TXT (stubbed via fakeLookuper)
1010+// → POST /admin/enroll (returns APIKey, member is Pending)
1111+// → SMTP AUTH must FAIL (the Pending gate)
1212+// → POST /admin/member/{did}/approve (operator approval)
1313+// → SMTP AUTH must SUCCEED (same APIKey)
1414+// → MAIL/RCPT/DATA round-trip — message lands in store
1515+//
1616+// This is installment 5 of #228, the final one in the integration-test
1717+// series. It pins the contract that an APIKey produced by /admin/enroll
1818+// is the same byte-for-byte string that SMTP AUTH accepts after the
1919+// operator approves the member — three components (admin API, store,
2020+// SMTP server) all agreeing on the credential lifecycle.
2121+//
2222+// Risk profile: zero — entirely additive, no production code touched.
2323+// Inlines its own cert-gen + SMTP server wiring rather than reaching
2424+// into the relay package's unexported test helpers, so package admin
2525+// doesn't grow new dependencies and the relay package's API stays
2626+// minimal.
2727+2828+import (
2929+ "bytes"
3030+ "context"
3131+ "crypto/ecdsa"
3232+ "crypto/elliptic"
3333+ "crypto/rand"
3434+ "crypto/tls"
3535+ "crypto/x509"
3636+ "crypto/x509/pkix"
3737+ "encoding/json"
3838+ "fmt"
3939+ "math/big"
4040+ "net"
4141+ "net/http"
4242+ "net/http/httptest"
4343+ gosmtp "net/smtp"
4444+ "sync"
4545+ "testing"
4646+ "time"
4747+4848+ "atmosphere-mail/internal/relay"
4949+ "atmosphere-mail/internal/relaystore"
5050+)
5151+5252+func TestIntegration_EnrollApprovalThenSMTPAuth(t *testing.T) {
5353+ ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
5454+ defer cancel()
5555+5656+ // --- Admin API + store, wired for self-service enroll ---
5757+ api, store, lk := testEnrollAPI(t)
5858+5959+ did := "did:plc:enrollroundtripaaaaaaaaa"
6060+ domain := "roundtrip.example.com"
6161+6262+ // --- Step 1: enroll-start ---
6363+ start := startEnrollment(t, api, did, domain)
6464+ if start.Token == "" {
6565+ t.Fatal("enroll-start returned empty token")
6666+ }
6767+ if start.DNSName == "" || start.DNSValue == "" {
6868+ t.Fatalf("enroll-start missing DNS instructions: name=%q value=%q", start.DNSName, start.DNSValue)
6969+ }
7070+7171+ // --- Step 2: simulate DNS publication ---
7272+ lk.records["_atmos-enroll."+domain] = []string{start.DNSValue}
7373+7474+ // --- Step 3: enroll completion → APIKey ---
7575+ body, _ := json.Marshal(EnrollRequest{Token: start.Token})
7676+ req := httptest.NewRequest(http.MethodPost, "/admin/enroll", bytes.NewReader(body))
7777+ w := httptest.NewRecorder()
7878+ api.ServeHTTP(w, req)
7979+ if w.Code != http.StatusOK {
8080+ t.Fatalf("/admin/enroll: status=%d body=%s", w.Code, w.Body.String())
8181+ }
8282+ var er EnrollResponse
8383+ if err := json.NewDecoder(w.Body).Decode(&er); err != nil {
8484+ t.Fatalf("decode enroll response: %v", err)
8585+ }
8686+ apiKey := er.APIKey
8787+ if apiKey == "" {
8888+ t.Fatal("enroll response missing APIKey — the credential seam this test pins")
8989+ }
9090+9191+ // Sanity: member must exist as Pending (not Active) — the operator
9292+ // approval gate is what installment 5 is here to exercise.
9393+ member, err := store.GetMember(ctx, did)
9494+ if err != nil || member == nil {
9595+ t.Fatalf("member not persisted after enroll: err=%v", err)
9696+ }
9797+ if member.Status != relaystore.StatusPending {
9898+ t.Fatalf("post-enroll member status=%q, want %q (the approval gate)", member.Status, relaystore.StatusPending)
9999+ }
100100+101101+ // --- Step 4: build a real SMTP server pointed at the same store ---
102102+ rateLimiter := relay.NewRateLimiter(store, relay.RateLimiterConfig{
103103+ DefaultHourlyLimit: 100,
104104+ DefaultDailyLimit: 1000,
105105+ GlobalPerMinute: 1000,
106106+ })
107107+108108+ const queueMaxSize = 4
109109+ var deliveryResults []relay.DeliveryResult
110110+ var deliveryMu sync.Mutex
111111+ queue := relay.NewQueue(func(r relay.DeliveryResult) {
112112+ deliveryMu.Lock()
113113+ deliveryResults = append(deliveryResults, r)
114114+ deliveryMu.Unlock()
115115+ }, relay.QueueConfig{MaxSize: queueMaxSize, RelayDomain: "relay.test"})
116116+117117+ lookup := func(ctx context.Context, lookupDID string) (*relay.MemberWithDomains, error) {
118118+ m, err := store.GetMember(ctx, lookupDID)
119119+ if err != nil || m == nil {
120120+ return nil, err
121121+ }
122122+ domains, err := store.ListMemberDomains(ctx, lookupDID)
123123+ if err != nil {
124124+ return nil, err
125125+ }
126126+ di := make([]relay.DomainInfo, 0, len(domains))
127127+ for _, d := range domains {
128128+ di = append(di, relay.DomainInfo{
129129+ Domain: d.Domain,
130130+ APIKeyHash: d.APIKeyHash,
131131+ })
132132+ }
133133+ return &relay.MemberWithDomains{
134134+ DID: m.DID,
135135+ Status: m.Status,
136136+ HourlyLimit: m.HourlyLimit,
137137+ DailyLimit: m.DailyLimit,
138138+ SendCount: m.SendCount,
139139+ CreatedAt: m.CreatedAt,
140140+ Domains: di,
141141+ }, nil
142142+ }
143143+144144+ sendCheck := func(ctx context.Context, member *relay.AuthMember, from, to string) error {
145145+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
146146+ }
147147+148148+ var enqueuedIDs []int64
149149+ var enqueueMu sync.Mutex
150150+ onAccept := func(member *relay.AuthMember, from string, to []string, data []byte) error {
151151+ if !queue.HasCapacity(len(to)) {
152152+ return fmt.Errorf("451 queue full")
153153+ }
154154+ for _, recipient := range to {
155155+ msgID, err := store.InsertMessage(context.Background(), &relaystore.Message{
156156+ MemberDID: member.DID,
157157+ FromAddr: from,
158158+ ToAddr: recipient,
159159+ Status: relaystore.MsgQueued,
160160+ CreatedAt: time.Now().UTC(),
161161+ })
162162+ if err != nil {
163163+ return fmt.Errorf("InsertMessage: %w", err)
164164+ }
165165+ if err := queue.Enqueue(&relay.QueueEntry{
166166+ ID: msgID,
167167+ From: from,
168168+ To: recipient,
169169+ Data: data,
170170+ MemberDID: member.DID,
171171+ }); err != nil {
172172+ return fmt.Errorf("Enqueue: %w", err)
173173+ }
174174+ enqueueMu.Lock()
175175+ enqueuedIDs = append(enqueuedIDs, msgID)
176176+ enqueueMu.Unlock()
177177+ }
178178+ return nil
179179+ }
180180+181181+ smtpAddr, smtpCleanup := startTestSMTPServerForAdmin(t, lookup, sendCheck, onAccept)
182182+ defer smtpCleanup()
183183+184184+ // --- Step 5: SMTP AUTH must FAIL while member is Pending ---
185185+ //
186186+ // This is the inverse direction of the seam: the relay must reject
187187+ // authenticated submissions for a member who completed enrollment
188188+ // but hasn't been approved yet. If this assertion ever flips, the
189189+ // approval gate has been bypassed and shared-IP reputation is at
190190+ // risk from un-vetted self-service members.
191191+ if err := tryAuthOnly(smtpAddr, did, apiKey); err == nil {
192192+ t.Fatal("SMTP AUTH succeeded with Pending member — operator-approval gate is bypassed")
193193+ }
194194+195195+ // --- Step 6: operator approval ---
196196+ approveReq := httptest.NewRequest(http.MethodPost, "/admin/member/"+did+"/approve", nil)
197197+ approveReq.Header.Set("Authorization", "Bearer test-admin-token")
198198+ approveW := httptest.NewRecorder()
199199+ api.ServeHTTP(approveW, approveReq)
200200+ if approveW.Code != http.StatusOK {
201201+ t.Fatalf("/admin/member/%s/approve: status=%d body=%s", did, approveW.Code, approveW.Body.String())
202202+ }
203203+204204+ // Sanity: approval must have flipped the status in the store.
205205+ approved, err := store.GetMember(ctx, did)
206206+ if err != nil || approved == nil {
207207+ t.Fatalf("post-approve member lookup failed: err=%v", err)
208208+ }
209209+ if approved.Status != relaystore.StatusActive {
210210+ t.Fatalf("post-approve status=%q, want %q", approved.Status, relaystore.StatusActive)
211211+ }
212212+213213+ // --- Step 7: SMTP AUTH + full submission round-trip with SAME APIKey ---
214214+ if err := submitOneMessage(smtpAddr, did, apiKey, domain); err != nil {
215215+ t.Fatalf("post-approval SMTP submission failed: %v", err)
216216+ }
217217+218218+ // --- Assertions: end-to-end persistence ---
219219+ enqueueMu.Lock()
220220+ gotEnqueues := len(enqueuedIDs)
221221+ gotID := int64(-1)
222222+ if gotEnqueues > 0 {
223223+ gotID = enqueuedIDs[0]
224224+ }
225225+ enqueueMu.Unlock()
226226+ if gotEnqueues != 1 {
227227+ t.Fatalf("onAccept fired %d times, want exactly 1 after approval", gotEnqueues)
228228+ }
229229+ if gotID <= 0 {
230230+ t.Fatalf("InsertMessage returned id=%d, want > 0", gotID)
231231+ }
232232+233233+ msg, err := store.GetMessage(ctx, gotID)
234234+ if err != nil {
235235+ t.Fatalf("GetMessage(%d): %v", gotID, err)
236236+ }
237237+ if msg == nil {
238238+ t.Fatalf("GetMessage(%d) returned nil — message not persisted", gotID)
239239+ }
240240+ if msg.MemberDID != did {
241241+ t.Errorf("stored MemberDID=%q, want %q", msg.MemberDID, did)
242242+ }
243243+ if msg.FromAddr != "alice@"+domain {
244244+ t.Errorf("stored FromAddr=%q, want alice@%s", msg.FromAddr, domain)
245245+ }
246246+ if msg.Status != relaystore.MsgQueued {
247247+ t.Errorf("stored Status=%q, want %q", msg.Status, relaystore.MsgQueued)
248248+ }
249249+}
250250+251251+// startTestSMTPServerForAdmin builds a real relay.SMTPServer on a random
252252+// port with a self-signed cert for STARTTLS. This is the package-admin
253253+// counterpart to relay's internal testSMTPServer — it uses only the
254254+// exported relay surface so package admin doesn't need privileged access
255255+// into package relay's test internals.
256256+func startTestSMTPServerForAdmin(t *testing.T, lookup relay.MemberLookupFunc, check relay.SendCheckFunc, accept relay.OnAcceptFunc) (string, func()) {
257257+ t.Helper()
258258+259259+ cert, err := generateSelfSignedCertForAdminTest()
260260+ if err != nil {
261261+ t.Fatalf("generate test cert: %v", err)
262262+ }
263263+264264+ ln, err := net.Listen("tcp", "127.0.0.1:0")
265265+ if err != nil {
266266+ t.Fatalf("listen: %v", err)
267267+ }
268268+ addr := ln.Addr().String()
269269+ ln.Close()
270270+271271+ srv := relay.NewSMTPServer(relay.SMTPConfig{
272272+ ListenAddr: addr,
273273+ Domain: "relay.test",
274274+ TLSConfig: &tls.Config{
275275+ Certificates: []tls.Certificate{cert},
276276+ },
277277+ MaxMsgSize: 1024 * 1024,
278278+ }, lookup, check, accept)
279279+280280+ go srv.ListenAndServe()
281281+ for i := 0; i < 50; i++ {
282282+ conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
283283+ if err == nil {
284284+ conn.Close()
285285+ break
286286+ }
287287+ time.Sleep(10 * time.Millisecond)
288288+ }
289289+ return addr, func() { srv.Close() }
290290+}
291291+292292+// generateSelfSignedCertForAdminTest mirrors relay's generateTestCert but
293293+// is duplicated here because the relay one is unexported and only visible
294294+// inside the relay package's _test files.
295295+func generateSelfSignedCertForAdminTest() (tls.Certificate, error) {
296296+ key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
297297+ if err != nil {
298298+ return tls.Certificate{}, err
299299+ }
300300+ template := &x509.Certificate{
301301+ SerialNumber: big.NewInt(1),
302302+ Subject: pkix.Name{Organization: []string{"AdminIntegrationTest"}},
303303+ NotBefore: time.Now(),
304304+ NotAfter: time.Now().Add(time.Hour),
305305+ KeyUsage: x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
306306+ ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
307307+ IPAddresses: []net.IP{net.ParseIP("127.0.0.1")},
308308+ DNSNames: []string{"localhost"},
309309+ }
310310+ certDER, err := x509.CreateCertificate(rand.Reader, template, template, &key.PublicKey, key)
311311+ if err != nil {
312312+ return tls.Certificate{}, err
313313+ }
314314+ return tls.Certificate{Certificate: [][]byte{certDER}, PrivateKey: key}, nil
315315+}
316316+317317+// tryAuthOnly opens an SMTP session, does STARTTLS, and tries AUTH PLAIN.
318318+// Returns nil on AUTH success, error otherwise. Used by the test to
319319+// assert that a Pending member's APIKey is REJECTED at AUTH.
320320+func tryAuthOnly(addr, did, apiKey string) error {
321321+ c, err := gosmtp.Dial(addr)
322322+ if err != nil {
323323+ return fmt.Errorf("dial: %w", err)
324324+ }
325325+ defer c.Close()
326326+ if err := c.StartTLS(&tls.Config{InsecureSkipVerify: true, ServerName: "127.0.0.1"}); err != nil {
327327+ return fmt.Errorf("starttls: %w", err)
328328+ }
329329+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
330330+ if err := c.Auth(auth); err != nil {
331331+ return fmt.Errorf("auth: %w", err)
332332+ }
333333+ _ = c.Quit()
334334+ return nil
335335+}
336336+337337+// submitOneMessage drives a full SMTP submission: dial → STARTTLS → AUTH →
338338+// MAIL → RCPT → DATA → QUIT. Returns nil on success.
339339+func submitOneMessage(addr, did, apiKey, fromDomain string) error {
340340+ c, err := gosmtp.Dial(addr)
341341+ if err != nil {
342342+ return fmt.Errorf("dial: %w", err)
343343+ }
344344+ defer c.Close()
345345+ if err := c.StartTLS(&tls.Config{InsecureSkipVerify: true, ServerName: "127.0.0.1"}); err != nil {
346346+ return fmt.Errorf("starttls: %w", err)
347347+ }
348348+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
349349+ if err := c.Auth(auth); err != nil {
350350+ return fmt.Errorf("auth: %w", err)
351351+ }
352352+ if err := c.Mail("alice@" + fromDomain); err != nil {
353353+ return fmt.Errorf("mail: %w", err)
354354+ }
355355+ if err := c.Rcpt("bob@example.org"); err != nil {
356356+ return fmt.Errorf("rcpt: %w", err)
357357+ }
358358+ dw, err := c.Data()
359359+ if err != nil {
360360+ return fmt.Errorf("data open: %w", err)
361361+ }
362362+ body := fmt.Sprintf(
363363+ "From: alice@%s\r\nTo: bob@example.org\r\nSubject: enroll-roundtrip\r\n\r\nintegration test body\r\n",
364364+ fromDomain,
365365+ )
366366+ if _, err := fmt.Fprint(dw, body); err != nil {
367367+ return fmt.Errorf("data write: %w", err)
368368+ }
369369+ if err := dw.Close(); err != nil {
370370+ return fmt.Errorf("data close: %w", err)
371371+ }
372372+ if err := c.Quit(); err != nil {
373373+ return fmt.Errorf("quit: %w", err)
374374+ }
375375+ return nil
376376+}
+8-8
internal/admin/operator_dkim.go
···1414// Separate from relay.DKIMKeys so we never accidentally surface the private
1515// halves.
1616type operatorDKIMView struct {
1717- Domain string
1818- Selector string
1919- RSASelector string
2020- EdSelector string
2121- RSADNSName string
2222- EdDNSName string
2323- RSADNSValue string
2424- EdDNSValue string
1717+ Domain string
1818+ Selector string
1919+ RSASelector string
2020+ EdSelector string
2121+ RSADNSName string
2222+ EdDNSName string
2323+ RSADNSValue string
2424+ EdDNSValue string
2525}
26262727// SetOperatorDKIM attaches the operator DKIM keys to the admin API and
+68-1
internal/admin/ui/attest.go
···4949// leaked cookie cannot be replayed from a different browser. The
5050// legacy no-UA helper (IssueRecoveryTicket on *RecoverHandler) is
5151// retained for tests but deliberately NOT exposed here so production
5252-// callers can't accidentally bypass the binding (#212).
5252+// callers can't accidentally bypass the binding.
5353type RecoveryIssuer interface {
5454 IssueRecoveryTicketWithUA(did, domain, ua string) string
5555}
···7979 enrollAuthIssuer EnrollAuthIssuer
8080 funnel FunnelRecorder
8181 didResolver DIDHandleResolver
8282+ // credsStash, when set, is consulted on a successful publish to
8383+ // retrieve the credentials the wizard stashed before kicking the
8484+ // OAuth round-trip (atomic enroll+publish). Nil = legacy
8585+ // /account/manage publish flow: callback renders the minimal
8686+ // "attestation published" page only.
8787+ credsStash EnrollCredentialsStash
8288}
83898490// NewAttestHandler constructs the handler. pub and store must both be non-nil.
···108114// SetDIDHandleResolver wires DID→handle resolution for OAuth metrics.
109115func (h *AttestHandler) SetDIDHandleResolver(r DIDHandleResolver) {
110116 h.didResolver = r
117117+}
118118+119119+// SetEnrollCredentialsStash wires the wizard credentials carry-through.
120120+// When set, a successful publish callback consumes the stash entry for
121121+// (DID, domain) and renders the credentials inline as part of the
122122+// "attestation published" page (atomic enroll+publish).
123123+func (h *AttestHandler) SetEnrollCredentialsStash(s EnrollCredentialsStash) {
124124+ h.credsStash = s
111125}
112126113127func (h *AttestHandler) resolveHandle(ctx context.Context, did string) string {
···283297 rkey := sess.Domain() // lexicon says "key: any" — we use the domain
284298 if err := sess.PutRecord(ctx, "email.atmos.attestation", rkey, record); err != nil {
285299 log.Printf("attest.callback: did=%s put_record_error=%v", sess.AccountDID(), err)
300300+ // Atomic-publish failure path. If the wizard had stashed
301301+ // credentials, render them on a retry page so the user keeps
302302+ // their API key — they're already enrolled, just not yet
303303+ // published. The publish button on /account/manage (added in
304304+ //) covers retry.
305305+ if creds, ok := h.consumeStash(sess.AccountDID(), sess.Domain()); ok {
306306+ w.Header().Set("Content-Type", "text/html; charset=utf-8")
307307+ _ = templates.EnrollAttestationRetry(templates.EnrollAttestationRetryData{
308308+ DID: sess.AccountDID(),
309309+ Domain: sess.Domain(),
310310+ APIKey: creds.APIKey,
311311+ SMTPHost: creds.SMTPHost,
312312+ SMTPPort: creds.SMTPPort,
313313+ DKIMSelector: creds.DKIMSelector,
314314+ DKIMRSAName: creds.DKIMRSAName,
315315+ DKIMRSARecord: creds.DKIMRSARecord,
316316+ DKIMEdName: creds.DKIMEdName,
317317+ DKIMEdRecord: creds.DKIMEdRecord,
318318+ PublishError: "PDS rejected the record. This is usually transient — try again from /account in a few minutes.",
319319+ }).Render(r.Context(), w)
320320+ return
321321+ }
286322 h.renderError(w, r, "PDS rejected the record — please try again later")
287323 return
288324 }
···301337 log.Printf("attest.callback: did=%s domain=%s rkey=%s published=true",
302338 sess.AccountDID(), sess.Domain(), rkey)
303339 w.Header().Set("Content-Type", "text/html; charset=utf-8")
340340+ // Atomic-publish success path. When credentials were stashed
341341+ // at the wizard's /enroll/verify step, this is the user's first
342342+ // view of their API key — render it inline along with the
343343+ // "attestation published" confirmation. Otherwise (e.g., user
344344+ // reached publish via /account/manage's button per) fall back
345345+ // to the minimal page.
346346+ if creds, ok := h.consumeStash(sess.AccountDID(), sess.Domain()); ok {
347347+ _ = templates.EnrollAttestationCompleteWithCredentials(templates.AttestationPublishedData{
348348+ DID: sess.AccountDID(),
349349+ Domain: sess.Domain(),
350350+ APIKey: creds.APIKey,
351351+ SMTPHost: creds.SMTPHost,
352352+ SMTPPort: creds.SMTPPort,
353353+ DKIMSelector: creds.DKIMSelector,
354354+ DKIMRSAName: creds.DKIMRSAName,
355355+ DKIMRSARecord: creds.DKIMRSARecord,
356356+ DKIMEdName: creds.DKIMEdName,
357357+ DKIMEdRecord: creds.DKIMEdRecord,
358358+ }).Render(r.Context(), w)
359359+ return
360360+ }
304361 _ = templates.EnrollAttestationComplete(sess.AccountDID(), sess.Domain()).Render(r.Context(), w)
362362+}
363363+364364+// consumeStash pulls (and deletes) any stashed credentials for the given
365365+// (DID, domain). Returns (zero, false) if no stash is wired or the entry
366366+// is absent / expired.
367367+func (h *AttestHandler) consumeStash(did, domain string) (EnrollCredentials, bool) {
368368+ if h.credsStash == nil {
369369+ return EnrollCredentials{}, false
370370+ }
371371+ return h.credsStash.Consume(did, domain)
305372}
306373307374func (h *AttestHandler) renderError(w http.ResponseWriter, r *http.Request, message string) {
+519
internal/admin/ui/attest_atomic_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package ui
44+55+// Tests for #234 atomic enroll+publish: at the end of the wizard the
66+// credentials page is no longer rendered directly. Instead, the handler
77+// stashes the credentials and kicks the publish-OAuth round-trip; the
88+// post-publish callback renders the credentials. A user who closes the
99+// tab still has their attestation published — the funnel cliff that
1010+// stranded richferro.com and self.surf is closed.
1111+//
1212+// Tests for #236 (soften credentials warning) live alongside.
1313+1414+import (
1515+ "context"
1616+ "errors"
1717+ "net/http"
1818+ "net/http/httptest"
1919+ "strings"
2020+ "testing"
2121+ "time"
2222+2323+ "atmosphere-mail/internal/atpoauth"
2424+)
2525+2626+// fakeCompletedSession satisfies the CompletedSession interface for
2727+// callback-side tests. It records PutRecord invocations and lets a
2828+// per-call error be injected to drive the failure path.
2929+type fakeCompletedSession struct {
3030+ did string
3131+ domain string
3232+ attestation []byte
3333+3434+ putErr error
3535+ putCalled int
3636+ putLastCol string
3737+ putLastRkey string
3838+ putLastRecord any
3939+ closeCalledTimes int
4040+}
4141+4242+func (s *fakeCompletedSession) AccountDID() string { return s.did }
4343+func (s *fakeCompletedSession) Domain() string { return s.domain }
4444+func (s *fakeCompletedSession) Attestation() []byte { return s.attestation }
4545+func (s *fakeCompletedSession) PutRecord(ctx context.Context, collection, rkey string, record any) error {
4646+ s.putCalled++
4747+ s.putLastCol = collection
4848+ s.putLastRkey = rkey
4949+ s.putLastRecord = record
5050+ return s.putErr
5151+}
5252+func (s *fakeCompletedSession) Close(ctx context.Context) { s.closeCalledTimes++ }
5353+5454+// programmablePublisher mirrors fakePublisher but lets tests configure
5555+// what CompleteCallback returns. fakePublisher (in recover_test.go) hard-codes
5656+// nil/nil and is unsuitable for callback-flow tests.
5757+type programmablePublisher struct {
5858+ startURL string
5959+ startState string
6060+ startErr error
6161+ startCalled int
6262+ startOpts atpoauth.StartOptions
6363+ startID string
6464+6565+ completeSess *fakeCompletedSession
6666+ completeErr error
6767+}
6868+6969+func (p *programmablePublisher) StartAuthFlow(ctx context.Context, identifier string, opts atpoauth.StartOptions) (string, string, error) {
7070+ p.startCalled++
7171+ p.startOpts = opts
7272+ p.startID = identifier
7373+ if p.startErr != nil {
7474+ return "", "", p.startErr
7575+ }
7676+ state := p.startState
7777+ if state == "" {
7878+ state = "state-prog"
7979+ }
8080+ url := p.startURL
8181+ if url == "" {
8282+ url = "https://pds.example/oauth/authorize?x=1"
8383+ }
8484+ return url, state, nil
8585+}
8686+8787+func (p *programmablePublisher) CompleteCallback(ctx context.Context, params map[string][]string) (CompletedSession, error) {
8888+ if p.completeErr != nil {
8989+ return nil, p.completeErr
9090+ }
9191+ if p.completeSess == nil {
9292+ return nil, errors.New("programmablePublisher: completeSess unset in test")
9393+ }
9494+ return p.completeSess, nil
9595+}
9696+9797+// stashAttestStore satisfies AttestationStore for callback tests; records
9898+// SetAttestationPublished invocations so we can pin the stamp path.
9999+type stashAttestStore struct {
100100+ calls []string
101101+}
102102+103103+func (s *stashAttestStore) SetAttestationPublished(ctx context.Context, domain, rkey string, at time.Time) error {
104104+ s.calls = append(s.calls, domain+":"+rkey)
105105+ return nil
106106+}
107107+108108+// --- /enroll/verify flow tests (PR 2 / #234) ---
109109+110110+// TestEnrollVerify_WithPublisherKicksAttestOAuth pins the new atomic flow:
111111+// once OAuth identity verification is wired (Publisher set), a successful
112112+// /enroll/verify must NOT render credentials inline. Instead it stashes the
113113+// credentials and 302s the user into the publish-OAuth round-trip. The
114114+// credentials are revealed only after the publish callback returns.
115115+func TestEnrollVerify_WithPublisherKicksAttestOAuth(t *testing.T) {
116116+ pub := &programmablePublisher{
117117+ startURL: "https://pds.example/oauth/authorize?atomic=1",
118118+ }
119119+ fake := &fakeAdminAPI{
120120+ enrollStatus: http.StatusOK,
121121+ enrollBody: `{
122122+ "did": "did:plc:atomic1111111111aaaa",
123123+ "apiKey": "atmos_atomic_key_xyz",
124124+ "dkim": {
125125+ "selector": "atmos20260501",
126126+ "rsaRecord": "v=DKIM1; k=rsa; p=...",
127127+ "edRecord": "v=DKIM1; k=ed25519; p=...",
128128+ "rsaDnsName": "atmos20260501r._domainkey.atomic.example",
129129+ "edDnsName": "atmos20260501e._domainkey.atomic.example"
130130+ },
131131+ "smtp": {"host": "smtp.atmos.email", "port": 587}
132132+ }`,
133133+ }
134134+ h := NewEnrollHandler(fake, nil)
135135+ h.SetPublisher(pub)
136136+137137+ form := "domain=atomic.example&token=tok123"
138138+ req := httptest.NewRequest(http.MethodPost, "/enroll/verify", strings.NewReader(form))
139139+ req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
140140+ w := httptest.NewRecorder()
141141+ h.ServeHTTP(w, req)
142142+143143+ if w.Code != http.StatusFound {
144144+ t.Fatalf("status = %d, want 302 (atomic-publish redirect); body=%q", w.Code, w.Body.String())
145145+ }
146146+ loc := w.Header().Get("Location")
147147+ if loc != pub.startURL {
148148+ t.Errorf("Location = %q, want %q (publish authorize URL)", loc, pub.startURL)
149149+ }
150150+ if pub.startCalled != 1 {
151151+ t.Errorf("Publisher.StartAuthFlow called %d times, want 1", pub.startCalled)
152152+ }
153153+ if pub.startOpts.ExpectedDID != "did:plc:atomic1111111111aaaa" {
154154+ t.Errorf("StartOptions.ExpectedDID = %q, want did:plc:atomic1111111111aaaa", pub.startOpts.ExpectedDID)
155155+ }
156156+ if pub.startOpts.Domain != "atomic.example" {
157157+ t.Errorf("StartOptions.Domain = %q, want atomic.example", pub.startOpts.Domain)
158158+ }
159159+ // Attestation payload must be an email.atmos.attestation record, not the
160160+ // enroll-auth sentinel (which is for identity verification, distinct flow).
161161+ att := string(pub.startOpts.Attestation)
162162+ if !strings.Contains(att, `email.atmos.attestation`) {
163163+ t.Errorf("StartOptions.Attestation should carry the lexicon record, got %q", att)
164164+ }
165165+ if !strings.Contains(att, `atomic.example`) {
166166+ t.Errorf("StartOptions.Attestation should carry the domain, got %q", att)
167167+ }
168168+ if !strings.Contains(att, `atmos20260501r`) || !strings.Contains(att, `atmos20260501e`) {
169169+ t.Errorf("StartOptions.Attestation should carry both DKIM selectors, got %q", att)
170170+ }
171171+ // The credentials are stashed for retrieval on the callback. We don't
172172+ // pin internal storage here — that's covered in TestAttestCallback_*.
173173+ // But the response body MUST NOT contain the API key (it's not
174174+ // rendered until after publish completes).
175175+ if strings.Contains(w.Body.String(), "atmos_atomic_key_xyz") {
176176+ t.Error("API key leaked into the redirect response body — credentials must not render before publish")
177177+ }
178178+}
179179+180180+// TestEnrollVerify_WithoutPublisherFallsBackToLegacy pins that older
181181+// deployments without OAuth still render credentials directly via
182182+// EnrollSuccess, since they have no publish-OAuth path to redirect into.
183183+func TestEnrollVerify_WithoutPublisherFallsBackToLegacy(t *testing.T) {
184184+ fake := &fakeAdminAPI{
185185+ enrollStatus: http.StatusOK,
186186+ enrollBody: `{
187187+ "did": "did:plc:legacy11111111111aaa",
188188+ "apiKey": "atmos_legacy_key",
189189+ "dkim": {
190190+ "selector": "atmos20260501",
191191+ "rsaRecord": "v=DKIM1; k=rsa; p=...",
192192+ "edRecord": "v=DKIM1; k=ed25519; p=...",
193193+ "rsaDnsName": "atmos20260501r._domainkey.legacy.example",
194194+ "edDnsName": "atmos20260501e._domainkey.legacy.example"
195195+ },
196196+ "smtp": {"host": "smtp.atmos.email", "port": 587}
197197+ }`,
198198+ }
199199+ h := NewEnrollHandler(fake, nil)
200200+ // Note: no SetPublisher call — Publisher is nil, OAuth not wired.
201201+202202+ form := "domain=legacy.example&token=tok123"
203203+ req := httptest.NewRequest(http.MethodPost, "/enroll/verify", strings.NewReader(form))
204204+ req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
205205+ w := httptest.NewRecorder()
206206+ h.ServeHTTP(w, req)
207207+208208+ if w.Code != http.StatusOK {
209209+ t.Fatalf("status = %d, want 200 (legacy direct render); body=%q", w.Code, w.Body.String())
210210+ }
211211+ if !strings.Contains(w.Body.String(), "atmos_legacy_key") {
212212+ t.Error("legacy path should render API key inline (no OAuth to redirect into)")
213213+ }
214214+}
215215+216216+// TestEnrollVerify_PublisherStartFailureFallsBackInline: when atomic flow
217217+// is configured but the OAuth handshake fails to start, the user still
218218+// needs their credentials. We MUST NOT silently lose them — render them
219219+// inline with a banner explaining the publish step is now manual.
220220+func TestEnrollVerify_PublisherStartFailureFallsBackInline(t *testing.T) {
221221+ pub := &programmablePublisher{
222222+ startErr: errors.New("oauth metadata fetch failed"),
223223+ }
224224+ fake := &fakeAdminAPI{
225225+ enrollStatus: http.StatusOK,
226226+ enrollBody: `{
227227+ "did": "did:plc:fallback11111111aaaa",
228228+ "apiKey": "atmos_fallback_key",
229229+ "dkim": {
230230+ "selector": "atmos20260501",
231231+ "rsaRecord": "v=DKIM1; k=rsa; p=...",
232232+ "edRecord": "v=DKIM1; k=ed25519; p=...",
233233+ "rsaDnsName": "atmos20260501r._domainkey.fallback.example",
234234+ "edDnsName": "atmos20260501e._domainkey.fallback.example"
235235+ },
236236+ "smtp": {"host": "smtp.atmos.email", "port": 587}
237237+ }`,
238238+ }
239239+ h := NewEnrollHandler(fake, nil)
240240+ h.SetPublisher(pub)
241241+242242+ form := "domain=fallback.example&token=tok123"
243243+ req := httptest.NewRequest(http.MethodPost, "/enroll/verify", strings.NewReader(form))
244244+ req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
245245+ w := httptest.NewRecorder()
246246+ h.ServeHTTP(w, req)
247247+248248+ if w.Code != http.StatusOK {
249249+ t.Fatalf("status = %d, want 200 (inline fallback render); body=%q", w.Code, w.Body.String())
250250+ }
251251+ body := w.Body.String()
252252+ if !strings.Contains(body, "atmos_fallback_key") {
253253+ t.Error("credentials must NOT be lost when OAuth start fails — render inline as fallback")
254254+ }
255255+ // The user can still publish manually via the existing button.
256256+ if !strings.Contains(body, `action="/enroll/attest/start"`) {
257257+ t.Error("inline fallback page should still expose the manual publish form")
258258+ }
259259+}
260260+261261+// --- /enroll/attest/callback flow tests (PR 2 / #234) ---
262262+263263+// TestAttestCallback_RendersCredentialsWhenStashed pins the post-publish
264264+// success path: when the wizard previously stashed credentials for this
265265+// (did, domain), the callback page MUST display them so the user sees their
266266+// API key for the first time. This is the exact moment richferro.com would
267267+// have seen credentials had the atomic flow been live.
268268+func TestAttestCallback_RendersCredentialsWhenStashed(t *testing.T) {
269269+ did := "did:plc:callback111111111aaaa"
270270+ domain := "callback.example"
271271+ attBytes, err := atpoauth.MarshalAttestation(map[string]any{
272272+ "$type": "email.atmos.attestation",
273273+ "domain": domain,
274274+ "dkimSelectors": []string{"atmos20260501r", "atmos20260501e"},
275275+ "relayMember": true,
276276+ "createdAt": "2026-05-01T00:00:00Z",
277277+ })
278278+ if err != nil {
279279+ t.Fatalf("MarshalAttestation: %v", err)
280280+ }
281281+ pub := &programmablePublisher{
282282+ completeSess: &fakeCompletedSession{
283283+ did: did,
284284+ domain: domain,
285285+ attestation: attBytes,
286286+ },
287287+ }
288288+ store := &stashAttestStore{}
289289+ attH := NewAttestHandler(pub, store)
290290+291291+ // Simulate the wizard having stashed the credentials when the
292292+ // atomic-publish path kicked the OAuth round-trip.
293293+ stash := newCredsStashForTest(t)
294294+ attH.SetEnrollCredentialsStash(stash)
295295+ stash.Stash(did, domain, EnrollCredentials{
296296+ APIKey: "atmos_callback_key",
297297+ SMTPHost: "smtp.atmos.email",
298298+ SMTPPort: 587,
299299+ DKIMSelector: "atmos20260501",
300300+ DKIMRSAName: "atmos20260501r._domainkey.callback.example",
301301+ DKIMRSARecord: "v=DKIM1; k=rsa; p=AAA",
302302+ DKIMEdName: "atmos20260501e._domainkey.callback.example",
303303+ DKIMEdRecord: "v=DKIM1; k=ed25519; p=BBB",
304304+ })
305305+306306+ mux := http.NewServeMux()
307307+ attH.RegisterRoutes(mux)
308308+309309+ req := httptest.NewRequest(http.MethodGet, "/enroll/attest/callback?code=x&state=y", nil)
310310+ w := httptest.NewRecorder()
311311+ mux.ServeHTTP(w, req)
312312+313313+ if w.Code != http.StatusOK {
314314+ t.Fatalf("status = %d, want 200; body=%q", w.Code, w.Body.String())
315315+ }
316316+ body := w.Body.String()
317317+ bodyLower := strings.ToLower(body)
318318+ // Case-insensitive: the masthead uses lowercase "attestation" by
319319+ // design ("Enrolled · attestation published"), and the lede phrases
320320+ // "is live on your PDS". Any of these signals confirms the publish
321321+ // confirmation copy is present.
322322+ if !strings.Contains(bodyLower, "attestation published") &&
323323+ !strings.Contains(bodyLower, "is live on your pds") {
324324+ t.Error("callback page missing publish-confirmation copy")
325325+ }
326326+ if !strings.Contains(body, "atmos_callback_key") {
327327+ t.Error("callback page MUST render the stashed API key — first time the user sees it")
328328+ }
329329+ if !strings.Contains(body, "smtp.atmos.email") {
330330+ t.Error("callback page should render SMTP host")
331331+ }
332332+ if !strings.Contains(body, "atmos20260501r._domainkey.callback.example") {
333333+ t.Error("callback page should render RSA DKIM DNS name")
334334+ }
335335+ if !strings.Contains(body, "atmos20260501e._domainkey.callback.example") {
336336+ t.Error("callback page should render Ed25519 DKIM DNS name")
337337+ }
338338+ // Cookie/stash must be one-shot: a second visit (e.g. reload) must
339339+ // not re-render the API key. We pin this via the stash; the same
340340+ // did+domain key is gone after Consume.
341341+ if _, ok := stash.Consume(did, domain); ok {
342342+ t.Error("stash entry should have been consumed by the callback render")
343343+ }
344344+ // PutRecord must have been called with the correct collection.
345345+ sess := pub.completeSess
346346+ if sess.putCalled != 1 {
347347+ t.Errorf("PutRecord called %d times, want 1", sess.putCalled)
348348+ }
349349+ if sess.putLastCol != "email.atmos.attestation" {
350350+ t.Errorf("PutRecord collection = %q, want email.atmos.attestation", sess.putLastCol)
351351+ }
352352+ // And the labeler-stamp store call must have happened.
353353+ if len(store.calls) == 0 {
354354+ t.Error("SetAttestationPublished must be called after successful publish")
355355+ }
356356+}
357357+358358+// TestAttestCallback_RendersFallbackWithoutStashed pins backwards-compat:
359359+// when no credentials were stashed (e.g., user came via /account/manage's
360360+// publish button per #235, not via the wizard), the callback renders the
361361+// existing minimal "attestation published" page.
362362+func TestAttestCallback_RendersFallbackWithoutStashed(t *testing.T) {
363363+ did := "did:plc:fallback11111111aaaa"
364364+ domain := "fallback.example"
365365+ attBytes, err := atpoauth.MarshalAttestation(map[string]any{
366366+ "$type": "email.atmos.attestation",
367367+ "domain": domain,
368368+ "dkimSelectors": []string{"atmos20260501r", "atmos20260501e"},
369369+ "relayMember": true,
370370+ "createdAt": "2026-05-01T00:00:00Z",
371371+ })
372372+ if err != nil {
373373+ t.Fatalf("MarshalAttestation: %v", err)
374374+ }
375375+ pub := &programmablePublisher{
376376+ completeSess: &fakeCompletedSession{
377377+ did: did,
378378+ domain: domain,
379379+ attestation: attBytes,
380380+ },
381381+ }
382382+ store := &stashAttestStore{}
383383+ attH := NewAttestHandler(pub, store)
384384+ // Stash IS wired but contains nothing for this (did, domain).
385385+ stash := newCredsStashForTest(t)
386386+ attH.SetEnrollCredentialsStash(stash)
387387+388388+ mux := http.NewServeMux()
389389+ attH.RegisterRoutes(mux)
390390+391391+ req := httptest.NewRequest(http.MethodGet, "/enroll/attest/callback?code=x&state=y", nil)
392392+ w := httptest.NewRecorder()
393393+ mux.ServeHTTP(w, req)
394394+395395+ if w.Code != http.StatusOK {
396396+ t.Fatalf("status = %d, want 200", w.Code)
397397+ }
398398+ body := w.Body.String()
399399+ // The fallback page must NOT render an API-key value or a credential
400400+ // box. (The phrase "API key" appears in a CSS comment in the shared
401401+ // publicLayout; matching that would be brittle, so we instead pin
402402+ // the actual rendered .credential block — present on the success
403403+ // page when credentials are stashed, absent here.)
404404+ if strings.Contains(body, `class="credential-label"`) {
405405+ t.Errorf("fallback page should not render a credential block when no credentials stashed; body had .credential-label")
406406+ }
407407+ if !strings.Contains(body, domain) {
408408+ t.Error("fallback page should include the domain")
409409+ }
410410+}
411411+412412+// TestAttestCallback_PublishFailureRendersRetryWithStashedCreds: when
413413+// PutRecord fails after the OAuth pair (e.g., PDS 5xx), the user is
414414+// already enrolled — we MUST render their credentials so they don't lose
415415+// them and surface a retry path that points at /account/manage where
416416+// the publish button (from #235) lives.
417417+func TestAttestCallback_PublishFailureRendersRetryWithStashedCreds(t *testing.T) {
418418+ did := "did:plc:retry111111111111aa"
419419+ domain := "retry.example"
420420+ attBytes, err := atpoauth.MarshalAttestation(map[string]any{
421421+ "$type": "email.atmos.attestation",
422422+ "domain": domain,
423423+ "dkimSelectors": []string{"atmos20260501r", "atmos20260501e"},
424424+ "relayMember": true,
425425+ "createdAt": "2026-05-01T00:00:00Z",
426426+ })
427427+ if err != nil {
428428+ t.Fatalf("MarshalAttestation: %v", err)
429429+ }
430430+ pub := &programmablePublisher{
431431+ completeSess: &fakeCompletedSession{
432432+ did: did,
433433+ domain: domain,
434434+ attestation: attBytes,
435435+ putErr: errors.New("pds 502 bad gateway"),
436436+ },
437437+ }
438438+ attH := NewAttestHandler(pub, &stashAttestStore{})
439439+ stash := newCredsStashForTest(t)
440440+ attH.SetEnrollCredentialsStash(stash)
441441+ stash.Stash(did, domain, EnrollCredentials{
442442+ APIKey: "atmos_retry_key",
443443+ SMTPHost: "smtp.atmos.email",
444444+ SMTPPort: 587,
445445+ DKIMSelector: "atmos20260501",
446446+ DKIMRSAName: "atmos20260501r._domainkey.retry.example",
447447+ DKIMRSARecord: "v=DKIM1; k=rsa; p=AAA",
448448+ DKIMEdName: "atmos20260501e._domainkey.retry.example",
449449+ DKIMEdRecord: "v=DKIM1; k=ed25519; p=BBB",
450450+ })
451451+452452+ mux := http.NewServeMux()
453453+ attH.RegisterRoutes(mux)
454454+455455+ req := httptest.NewRequest(http.MethodGet, "/enroll/attest/callback?code=x&state=y", nil)
456456+ w := httptest.NewRecorder()
457457+ mux.ServeHTTP(w, req)
458458+459459+ body := w.Body.String()
460460+ // We MUST render the credentials so the user can save them — they're
461461+ // already enrolled, just not yet published.
462462+ if !strings.Contains(body, "atmos_retry_key") {
463463+ t.Error("retry page MUST render the stashed API key — user is enrolled, can't lose creds")
464464+ }
465465+ // And the page must point them at /account/manage to retry the publish.
466466+ if !strings.Contains(body, "/account/manage") {
467467+ t.Error("retry page should link to /account/manage for self-service publish retry")
468468+ }
469469+}
470470+471471+// --- #236: soften credentials warning ---
472472+473473+// TestEnrollSuccess_WarningCopyDoesNotMentionReEnroll pins the new copy:
474474+// the loss-aversion "the only remedy is to re-enroll" framing is replaced
475475+// with a /recover/start (or /account) self-service recovery reference.
476476+//
477477+// Asserted via grep across the package's HTML output rather than against
478478+// templ source so that the manual-edit workaround for the templ parse
479479+// error is verified end-to-end.
480480+func TestEnrollSuccess_WarningCopyDoesNotMentionReEnroll(t *testing.T) {
481481+ // Render the page in legacy mode (no Publisher) — that's the path
482482+ // that still includes the publish button + warning copy.
483483+ fake := &fakeAdminAPI{
484484+ enrollStatus: http.StatusOK,
485485+ enrollBody: `{
486486+ "did": "did:plc:warn11111111111aaaa",
487487+ "apiKey": "atmos_warning_key",
488488+ "dkim": {
489489+ "selector": "atmos20260501",
490490+ "rsaRecord": "v=DKIM1; k=rsa; p=...",
491491+ "edRecord": "v=DKIM1; k=ed25519; p=...",
492492+ "rsaDnsName": "atmos20260501r._domainkey.warn.example",
493493+ "edDnsName": "atmos20260501e._domainkey.warn.example"
494494+ },
495495+ "smtp": {"host": "smtp.atmos.email", "port": 587}
496496+ }`,
497497+ }
498498+ h := NewEnrollHandler(fake, nil)
499499+ form := "domain=warn.example&token=tok123"
500500+ req := httptest.NewRequest(http.MethodPost, "/enroll/verify", strings.NewReader(form))
501501+ req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
502502+ w := httptest.NewRecorder()
503503+ h.ServeHTTP(w, req)
504504+505505+ if w.Code != http.StatusOK {
506506+ t.Fatalf("legacy status = %d, want 200; body=%q", w.Code, w.Body.String())
507507+ }
508508+ body := strings.ToLower(w.Body.String())
509509+ if strings.Contains(body, "the only remedy is to re-enroll") {
510510+ t.Error("warning copy still says 're-enroll' — soften per #236 to point at /recover")
511511+ }
512512+ if strings.Contains(body, "only remedy") {
513513+ t.Error("warning copy still uses loss-aversion 'only remedy' framing")
514514+ }
515515+ // New copy MUST reference the self-service recovery path.
516516+ if !strings.Contains(body, "/account") && !strings.Contains(body, "/recover") {
517517+ t.Error("warning copy should reference /account or /recover for self-service recovery")
518518+ }
519519+}
+181
internal/admin/ui/creds_stash.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package ui
44+55+// Atomic enroll+publish credential stash.
66+//
77+// At the end of the wizard the handler kicks the publish-OAuth round-trip
88+// instead of rendering the credentials page. The credentials would be lost
99+// across the OAuth redirect — except for this stash, which holds them
1010+// in-memory keyed by (DID, domain) until the post-publish callback fetches
1111+// them. One-shot semantics: Consume removes the entry, so a reload of the
1212+// callback URL can't replay the API key.
1313+//
1414+// Memory pressure is bounded the same way recovery tickets are: TTL +
1515+// background prune ticker + a hard cap. Real volume is tiny (one entry
1616+// per ongoing enrollment, lifetime ~30s typical) so the cap exists only
1717+// to bound abuse, not normal operation.
1818+1919+import (
2020+ "context"
2121+ "log"
2222+ "sync"
2323+ "time"
2424+)
2525+2626+// EnrollCredentials is the carry-through view-model the wizard stashes
2727+// when it kicks the publish-OAuth round-trip. Mirrors the subset of
2828+// templates.EnrollResult the callback page actually displays — keeping
2929+// it package-local avoids a cycle with the templates package and lets
3030+// us pass the data into a templates.EnrollResult at render time.
3131+type EnrollCredentials struct {
3232+ APIKey string
3333+ SMTPHost string
3434+ SMTPPort int
3535+ DKIMSelector string
3636+ DKIMRSAName string
3737+ DKIMRSARecord string
3838+ DKIMEdName string
3939+ DKIMEdRecord string
4040+}
4141+4242+// EnrollCredentialsStash is the surface AttestHandler reads on callback.
4343+// EnrollHandler implements both halves; AttestHandler depends only on
4444+// Consume. Splitting into an interface keeps the wiring testable without
4545+// pulling EnrollHandler into AttestHandler tests.
4646+type EnrollCredentialsStash interface {
4747+ Consume(did, domain string) (EnrollCredentials, bool)
4848+}
4949+5050+const (
5151+ credsStashTTL = 15 * time.Minute
5252+ credsStashCap = 10_000
5353+ credsStashPruneEvery = 60 * time.Second
5454+)
5555+5656+type credsStashEntry struct {
5757+ creds EnrollCredentials
5858+ expiry time.Time
5959+}
6060+6161+// credsStash is the in-memory map. Embedded in EnrollHandler so the
6262+// wizard's verify step and the attest callback both reach it via
6363+// h.creds*. Tests use newCredsStashForTest to construct one in
6464+// isolation when wiring against AttestHandler directly.
6565+type credsStash struct {
6666+ mu sync.Mutex
6767+ entries map[string]credsStashEntry
6868+ cap int
6969+ ttl time.Duration
7070+7171+ pruneCancel context.CancelFunc
7272+ closeOnce sync.Once
7373+}
7474+7575+func newCredsStash() *credsStash {
7676+ pruneCtx, pruneCancel := context.WithCancel(context.Background())
7777+ s := &credsStash{
7878+ entries: make(map[string]credsStashEntry),
7979+ cap: credsStashCap,
8080+ ttl: credsStashTTL,
8181+ pruneCancel: pruneCancel,
8282+ }
8383+ go s.runPruneTicker(pruneCtx, credsStashPruneEvery)
8484+ return s
8585+}
8686+8787+// newCredsStashForTest builds a stash without the background prune
8888+// ticker — tests deal with TTL by manipulating entry timestamps
8989+// directly. The t.Cleanup hook closes the stash so tests don't leak.
9090+func newCredsStashForTest(t interface{ Cleanup(func()) }) *credsStash {
9191+ s := &credsStash{
9292+ entries: make(map[string]credsStashEntry),
9393+ cap: credsStashCap,
9494+ ttl: credsStashTTL,
9595+ }
9696+ t.Cleanup(s.Close)
9797+ return s
9898+}
9999+100100+// Close stops the background prune goroutine. Idempotent.
101101+func (s *credsStash) Close() {
102102+ s.closeOnce.Do(func() {
103103+ if s.pruneCancel != nil {
104104+ s.pruneCancel()
105105+ }
106106+ })
107107+}
108108+109109+func credsKey(did, domain string) string { return did + "|" + domain }
110110+111111+// Stash records (creds) for (did, domain). Overwrites any existing
112112+// entry with the same key — last write wins, matching the user's mental
113113+// model that re-running the wizard supersedes a previous attempt.
114114+func (s *credsStash) Stash(did, domain string, creds EnrollCredentials) {
115115+ s.mu.Lock()
116116+ defer s.mu.Unlock()
117117+ now := time.Now()
118118+ if len(s.entries) >= s.cap {
119119+ // Try a single prune pass; if still over cap, refuse silently.
120120+ // The wizard caller falls back to inline render in that case.
121121+ for k, v := range s.entries {
122122+ if now.After(v.expiry) {
123123+ delete(s.entries, k)
124124+ }
125125+ }
126126+ if len(s.entries) >= s.cap {
127127+ log.Printf("creds_stash: cap exhausted (%d entries); refusing to stash for did_hash=%s", len(s.entries), HashForLog(did))
128128+ return
129129+ }
130130+ }
131131+ s.entries[credsKey(did, domain)] = credsStashEntry{
132132+ creds: creds,
133133+ expiry: now.Add(s.ttl),
134134+ }
135135+}
136136+137137+// Consume returns and DELETES the entry for (did, domain). Returns
138138+// (zero, false) if absent or expired. One-shot semantics — a reloaded
139139+// callback page can't replay the API key.
140140+func (s *credsStash) Consume(did, domain string) (EnrollCredentials, bool) {
141141+ s.mu.Lock()
142142+ defer s.mu.Unlock()
143143+ k := credsKey(did, domain)
144144+ e, ok := s.entries[k]
145145+ if !ok {
146146+ return EnrollCredentials{}, false
147147+ }
148148+ delete(s.entries, k)
149149+ if time.Now().After(e.expiry) {
150150+ return EnrollCredentials{}, false
151151+ }
152152+ return e.creds, true
153153+}
154154+155155+// runPruneTicker drops expired entries on a fixed cadence.
156156+func (s *credsStash) runPruneTicker(ctx context.Context, interval time.Duration) {
157157+ if interval <= 0 {
158158+ interval = credsStashPruneEvery
159159+ }
160160+ t := time.NewTicker(interval)
161161+ defer t.Stop()
162162+ for {
163163+ select {
164164+ case <-ctx.Done():
165165+ return
166166+ case <-t.C:
167167+ s.pruneExpired()
168168+ }
169169+ }
170170+}
171171+172172+func (s *credsStash) pruneExpired() {
173173+ s.mu.Lock()
174174+ defer s.mu.Unlock()
175175+ now := time.Now()
176176+ for k, v := range s.entries {
177177+ if now.After(v.expiry) {
178178+ delete(s.entries, k)
179179+ }
180180+ }
181181+}
+109-5
internal/admin/ui/enroll.go
···98989999 mu sync.Mutex
100100 tickets map[string]enrollAuthTicket
101101+102102+ // creds holds (DID, domain) -> credentials between handleVerify
103103+ // (which kicks the publish-OAuth round-trip) and the attest
104104+ // callback that actually renders them. Previously the credentials
105105+ // were rendered inline before publish, with predictable results
106106+ // when users bailed before clicking the publish button.
107107+ creds *credsStash
101108}
102109103110// NewEnrollHandler constructs a public enrollment UI that delegates the
104111// start/verify business logic to adminAPI (typically *admin.API). Pass
105112// resolver to enable handle→DID resolution at /enroll/resolve.
106113func NewEnrollHandler(adminAPI http.Handler, resolver HandleResolver) *EnrollHandler {
107107- h := &EnrollHandler{adminAPI: adminAPI, resolver: resolver, mux: http.NewServeMux(), tickets: make(map[string]enrollAuthTicket)}
114114+ h := &EnrollHandler{
115115+ adminAPI: adminAPI,
116116+ resolver: resolver,
117117+ mux: http.NewServeMux(),
118118+ tickets: make(map[string]enrollAuthTicket),
119119+ creds: newCredsStash(),
120120+ }
108121 h.mux.HandleFunc("/", h.handleMarketing)
109122 h.mux.HandleFunc("/enroll", h.handleLanding)
110123 h.mux.HandleFunc("/enroll/auth", h.handleAuth)
···137150// ownership before the domain enrollment form is shown.
138151func (h *EnrollHandler) SetPublisher(pub Publisher) {
139152 h.pub = pub
153153+}
154154+155155+// Consume implements EnrollCredentialsStash so the AttestHandler can pull
156156+// the stashed credentials on a successful publish callback. Returns
157157+// (zero, false) if the entry is absent or expired. One-shot.
158158+func (h *EnrollHandler) Consume(did, domain string) (EnrollCredentials, bool) {
159159+ if h.creds == nil {
160160+ return EnrollCredentials{}, false
161161+ }
162162+ return h.creds.Consume(did, domain)
163163+}
164164+165165+// Close stops the background credentials-stash prune ticker. Idempotent.
166166+// Wired into main.go's shutdown path so the goroutine exits cleanly when
167167+// the process is terminating.
168168+func (h *EnrollHandler) Close() {
169169+ if h.creds != nil {
170170+ h.creds.Close()
171171+ }
140172}
141173142174// SetAccountTicketIssuer wires the recovery handler so that verified
···564596565597 h.recordStep("enroll_success")
566598 log.Printf("enroll.public_success: did=%s domain=%s", er.DID, domain)
599599+600600+ // Atomic enroll+publish. When OAuth is wired, stash the
601601+ // credentials and kick the publish round-trip. The callback at
602602+ // /enroll/attest/callback consumes the stash and renders both the
603603+ // "attestation published" confirmation AND the credentials. This
604604+ // closes the funnel cliff that stranded richferro.com / self.surf:
605605+ // even if the user bails after seeing the credentials page, the
606606+ // attestation is already on the PDS.
607607+ if h.pub != nil && h.creds != nil {
608608+ if loc, ok := h.kickAtomicPublish(r.Context(), er.DID, domain, result); ok {
609609+ http.Redirect(w, r, loc, http.StatusFound)
610610+ return
611611+ }
612612+ // kickAtomicPublish returned false: OAuth start failed. We must
613613+ // not lose the credentials — fall back to inline render below.
614614+ // The user can retry via the manual button on EnrollSuccess.
615615+ }
616616+567617 w.Header().Set("Content-Type", "text/html; charset=utf-8")
568618 _ = templates.EnrollSuccess(result).Render(r.Context(), w)
619619+}
620620+621621+// kickAtomicPublish stashes the credentials for (did, domain) and starts
622622+// the publish-OAuth round-trip. On success returns the authorize URL the
623623+// caller should 302 to. On failure logs and returns ("", false) so the
624624+// caller can fall back to inline credential rendering.
625625+func (h *EnrollHandler) kickAtomicPublish(ctx context.Context, did, domain string, result templates.EnrollResult) (string, bool) {
626626+ // Build the lexicon record. Mirrors AttestHandler.handleStart so
627627+ // the canonical payload doesn't drift between code paths.
628628+ record := map[string]any{
629629+ "$type": "email.atmos.attestation",
630630+ "domain": domain,
631631+ "dkimSelectors": []string{
632632+ result.DKIM.Selector + "r",
633633+ result.DKIM.Selector + "e",
634634+ },
635635+ "relayMember": true,
636636+ "createdAt": time.Now().UTC().Format(time.RFC3339),
637637+ }
638638+ attBytes, err := atpoauth.MarshalAttestation(record)
639639+ if err != nil {
640640+ log.Printf("enroll.atomic_publish: did=%s domain=%s marshal_error=%v", did, domain, err)
641641+ return "", false
642642+ }
643643+644644+ startCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
645645+ defer cancel()
646646+ authorizeURL, state, err := h.pub.StartAuthFlow(startCtx, did, atpoauth.StartOptions{
647647+ ExpectedDID: did,
648648+ Domain: domain,
649649+ Attestation: attBytes,
650650+ })
651651+ if err != nil {
652652+ log.Printf("enroll.atomic_publish: did=%s domain=%s start_error=%v", did, domain, err)
653653+ return "", false
654654+ }
655655+656656+ // Stash AFTER OAuth start succeeds. If start fails the user falls
657657+ // back to inline render, where they get the credentials directly —
658658+ // no stale stash to leak. Stashing before start would race with
659659+ // the manual-publish path that POSTs the same fields.
660660+ h.creds.Stash(did, domain, EnrollCredentials{
661661+ APIKey: result.APIKey,
662662+ SMTPHost: result.SMTPHost,
663663+ SMTPPort: result.SMTPPort,
664664+ DKIMSelector: result.DKIM.Selector,
665665+ DKIMRSAName: result.DKIM.RSADNSName,
666666+ DKIMRSARecord: result.DKIM.RSARecord,
667667+ DKIMEdName: result.DKIM.EdDNSName,
668668+ DKIMEdRecord: result.DKIM.EdRecord,
669669+ })
670670+671671+ log.Printf("enroll.atomic_publish: did=%s domain=%s state=%s authorize", did, domain, state)
672672+ return authorizeURL, true
569673}
570674571675// handleAuth kicks off the OAuth flow to verify DID ownership before
···766870//
767871// Cookie + User-Agent are forwarded so the inner admin API can look up
768872// the enroll-auth ticket the public UI set after a successful AT Proto
769769-// OAuth round-trip — the central defense for #207.
873873+// OAuth round-trip — the central defense against DID spoofing.
770874//
771875// RemoteAddr is also forwarded so the admin API's per-IP enroll-start
772876// rate limiter sees the real public client IP. Without this, every
773877// public enrollment request would share a single rate-limit bucket and
774878// a single attacker could exhaust it for all legitimate users from any
775775-// IP — closes #211.
879879+// IP.
776880//
777881// This used to construct an httptest.NewRequest + httptest.ResponseRecorder
778778-// in the production call chain (#222). The dependency on net/http/httptest
779779-// from non-test code masked the rate-limiter bypass that became #211 and
882882+// in the production call chain. The dependency on net/http/httptest
883883+// from non-test code masked a rate-limiter bypass and
780884// made the call site inscrutable to readers expecting test-only types not
781885// to leak. We now use http.NewRequestWithContext + an in-package response
782886// writer (inMemoryResponseWriter) so the type signatures match the rest
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package ui
44+55+// End-to-end enrollment-funnel integration test (#237).
66+//
77+// Drives the full atomic-publish path through `/enroll/verify` → the
88+// publish-OAuth redirect → `/enroll/attest/callback`, asserting that a
99+// member who walks the wizard with OAuth wired ends up with an
1010+// attestation record published to the (faux) PDS AND with the relay's
1111+// SetAttestationPublished stamp call made — i.e., they would actually
1212+// receive labels.
1313+//
1414+// The earlier per-step tests in attest_atomic_test.go each pin half of
1515+// the contract; this one wires both halves together so a regression in
1616+// the stash key, the OAuth payload shape, or the callback render path
1717+// would surface as a single failing test rather than depending on a
1818+// reviewer to hold the funnel in their head. This is the realization
1919+// of the SMTP-smoke / enrollment-funnel scenario described in #228 for
2020+// the publish path specifically.
2121+//
2222+// Faux PDS: programmablePublisher (defined in attest_atomic_test.go) is
2323+// reused — its CompleteCallback returns a pre-configured fakeCompletedSession
2424+// whose PutRecord we assert against.
2525+//
2626+// Faux admin API: fakeAdminAPI (defined in enroll_test.go) returns a
2727+// realistic /admin/enroll response shape so handleVerify constructs an
2828+// EnrollResult with the credentials we expect in the post-publish page.
2929+3030+import (
3131+ "fmt"
3232+ "net/http"
3333+ "net/http/httptest"
3434+ "net/url"
3535+ "strings"
3636+ "testing"
3737+ "time"
3838+3939+ "atmosphere-mail/internal/atpoauth"
4040+)
4141+4242+func TestEnrollmentFunnel_AtomicPublish_EndToEnd(t *testing.T) {
4343+ did := "did:plc:funnelend2endaaaa"
4444+ domain := "funnel.example.com"
4545+ apiKey := "atmos_funnel_apikey_xyz"
4646+ rsaName := "atmos20260501r._domainkey.funnel.example.com"
4747+ edName := "atmos20260501e._domainkey.funnel.example.com"
4848+4949+ // Faux PDS: returns an authorize URL on StartAuthFlow, and on the
5050+ // subsequent CompleteCallback returns a session with matching
5151+ // DID + domain plus a non-empty attestation byte slice (so the
5252+ // callback handler treats it as a real publish, not enroll-auth).
5353+ attBytes, err := atpoauth.MarshalAttestation(map[string]any{
5454+ "$type": "email.atmos.attestation",
5555+ "domain": domain,
5656+ "dkimSelectors": []string{"atmos20260501r", "atmos20260501e"},
5757+ "relayMember": true,
5858+ "createdAt": time.Now().UTC().Format(time.RFC3339),
5959+ })
6060+ if err != nil {
6161+ t.Fatalf("marshal attestation: %v", err)
6262+ }
6363+ sess := &fakeCompletedSession{
6464+ did: did,
6565+ domain: domain,
6666+ attestation: attBytes,
6767+ }
6868+ pub := &programmablePublisher{
6969+ startURL: "https://faux-pds.example/oauth/authorize?atomic=1",
7070+ completeSess: sess,
7171+ }
7272+7373+ // Faux admin API: returns the credentials block the wizard's
7474+ // `/admin/enroll` proxy expects, keyed off the same DID/domain we
7575+ // drive the funnel with.
7676+ fakeAdmin := &fakeAdminAPI{
7777+ enrollStatus: http.StatusOK,
7878+ enrollBody: fmt.Sprintf(`{
7979+ "did": %q,
8080+ "apiKey": %q,
8181+ "dkim": {
8282+ "selector": "atmos20260501",
8383+ "rsaRecord": "v=DKIM1; k=rsa; p=AAA",
8484+ "edRecord": "v=DKIM1; k=ed25519; p=BBB",
8585+ "rsaDnsName": %q,
8686+ "edDnsName": %q
8787+ },
8888+ "smtp": {"host": "smtp.atmos.email", "port": 587}
8989+ }`, did, apiKey, rsaName, edName),
9090+ }
9191+9292+ // Wire the two handlers together — exactly as cmd/relay/main.go does
9393+ // in production. The integration here is the credentials stash:
9494+ // EnrollHandler stashes on /enroll/verify, AttestHandler consumes on
9595+ // /enroll/attest/callback.
9696+ enrollH := NewEnrollHandler(fakeAdmin, nil)
9797+ enrollH.SetPublisher(pub)
9898+ store := &stashAttestStore{}
9999+ attestH := NewAttestHandler(pub, store)
100100+ attestH.SetEnrollCredentialsStash(enrollH)
101101+102102+ // Outer mux: /enroll/attest/* routes to attestH (more specific
103103+ // pattern wins under stdlib's mux), everything else falls through
104104+ // to enrollH which has its own internal mux for /enroll/verify
105105+ // among others.
106106+ mux := http.NewServeMux()
107107+ attestH.RegisterRoutes(mux)
108108+ mux.Handle("/", enrollH)
109109+110110+ // --- Step 1: POST /enroll/verify (wizard final step) ---
111111+ //
112112+ // Pre-#234 this rendered credentials inline with an optional publish
113113+ // button. Post-#234 it must redirect into the publish OAuth and
114114+ // stash the credentials for callback retrieval.
115115+ form := url.Values{}
116116+ form.Set("domain", domain)
117117+ form.Set("token", "tok-funnel-1")
118118+ req := httptest.NewRequest(http.MethodPost, "/enroll/verify",
119119+ strings.NewReader(form.Encode()))
120120+ req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
121121+ rec := httptest.NewRecorder()
122122+ mux.ServeHTTP(rec, req)
123123+124124+ if rec.Code != http.StatusFound {
125125+ t.Fatalf("step 1 /enroll/verify: status = %d, want 302 (atomic publish redirect); body=%q",
126126+ rec.Code, rec.Body.String())
127127+ }
128128+ if loc := rec.Header().Get("Location"); loc != pub.startURL {
129129+ t.Errorf("step 1: redirect Location = %q, want %q (publish authorize URL)", loc, pub.startURL)
130130+ }
131131+ if strings.Contains(rec.Body.String(), apiKey) {
132132+ t.Error("step 1: API key leaked into redirect body — must not be revealed before publish completes")
133133+ }
134134+ if pub.startCalled != 1 {
135135+ t.Errorf("step 1: Publisher.StartAuthFlow called %d times, want 1", pub.startCalled)
136136+ }
137137+ // And the OAuth StartOptions MUST carry the lexicon attestation —
138138+ // not the enroll-auth sentinel. This is what proves we're on the
139139+ // publish path, not the identity-verify path.
140140+ if !strings.Contains(string(pub.startOpts.Attestation), "email.atmos.attestation") {
141141+ t.Errorf("step 1: StartOptions.Attestation should carry the lexicon record; got %q",
142142+ pub.startOpts.Attestation)
143143+ }
144144+ if pub.startOpts.Domain != domain {
145145+ t.Errorf("step 1: StartOptions.Domain = %q, want %q", pub.startOpts.Domain, domain)
146146+ }
147147+148148+ // --- Step 2: GET /enroll/attest/callback (publish OAuth completes) ---
149149+ //
150150+ // In production this is hit by the user's browser after they
151151+ // approve the OAuth consent on their PDS. The faux publisher
152152+ // returns the pre-configured session; the handler runs PutRecord,
153153+ // stamps the relay store, and renders the credentials page using
154154+ // the values stashed in step 1.
155155+ req = httptest.NewRequest(http.MethodGet, "/enroll/attest/callback?code=x&state=y", nil)
156156+ rec = httptest.NewRecorder()
157157+ mux.ServeHTTP(rec, req)
158158+159159+ if rec.Code != http.StatusOK {
160160+ t.Fatalf("step 2 /enroll/attest/callback: status = %d, want 200; body=%q",
161161+ rec.Code, rec.Body.String())
162162+ }
163163+ body := rec.Body.String()
164164+165165+ // Pin: PutRecord was called with the lexicon collection + domain rkey.
166166+ // This is THE assertion that catches the original #233 bug — pre-fix,
167167+ // the wizard's success page never POSTed to /enroll/attest/start, so
168168+ // PutRecord was never called for users who bailed.
169169+ if sess.putCalled != 1 {
170170+ t.Errorf("PutRecord called %d times, want 1 — funnel never made it to PDS write", sess.putCalled)
171171+ }
172172+ if sess.putLastCol != "email.atmos.attestation" {
173173+ t.Errorf("PutRecord collection = %q, want email.atmos.attestation", sess.putLastCol)
174174+ }
175175+ if sess.putLastRkey != domain {
176176+ t.Errorf("PutRecord rkey = %q, want %q", sess.putLastRkey, domain)
177177+ }
178178+179179+ // Pin: relay's SetAttestationPublished stamp call hit the store.
180180+ // This is what populates member_domains.attestation_rkey — the
181181+ // column that was empty for richferro.com / self.surf.
182182+ if len(store.calls) != 1 {
183183+ t.Fatalf("SetAttestationPublished called %d times, want 1; calls=%v",
184184+ len(store.calls), store.calls)
185185+ }
186186+ wantStoreCall := domain + ":" + domain
187187+ if store.calls[0] != wantStoreCall {
188188+ t.Errorf("SetAttestationPublished call = %q, want %q", store.calls[0], wantStoreCall)
189189+ }
190190+191191+ // Pin: the user actually sees their credentials for the first time
192192+ // on the post-publish page. If this fails, the stash wiring or the
193193+ // callback render is broken even if the data path is correct.
194194+ if !strings.Contains(body, apiKey) {
195195+ t.Error("post-publish page MUST render API key — first time the user sees it")
196196+ }
197197+ if !strings.Contains(body, rsaName) {
198198+ t.Error("post-publish page should render RSA DKIM DNS name")
199199+ }
200200+ if !strings.Contains(body, edName) {
201201+ t.Error("post-publish page should render Ed25519 DKIM DNS name")
202202+ }
203203+ if !strings.Contains(strings.ToLower(body), "attestation") {
204204+ t.Error("post-publish page should reference the attestation having been published")
205205+ }
206206+207207+ // Pin: stash is one-shot. A second hit to the callback URL would
208208+ // not be able to re-render the API key (browser reload, share-link
209209+ // copy, etc.) — Consume removes the entry on first read.
210210+ if creds, ok := enrollH.Consume(did, domain); ok {
211211+ t.Errorf("stash entry should have been consumed by the callback; got creds=%+v", creds)
212212+ }
213213+}
214214+215215+// TestEnrollmentFunnel_PublishFailure_PreservesCredentials_E2E pins the
216216+// failure-path contract for the same end-to-end flow. If the PDS
217217+// rejects the PutRecord (e.g., 502 bad gateway), the user is already
218218+// enrolled at this point — losing their credentials would force them to
219219+// hit the #235 self-service path with a fresh OAuth and rotate. We
220220+// preserve the credentials by rendering them on a retry page that
221221+// links to /account/manage.
222222+func TestEnrollmentFunnel_PublishFailure_PreservesCredentials_E2E(t *testing.T) {
223223+ did := "did:plc:funnelfail22222aaa"
224224+ domain := "fail.example.com"
225225+ apiKey := "atmos_fail_apikey"
226226+227227+ attBytes, err := atpoauth.MarshalAttestation(map[string]any{
228228+ "$type": "email.atmos.attestation",
229229+ "domain": domain,
230230+ "dkimSelectors": []string{"atmos20260501r", "atmos20260501e"},
231231+ "relayMember": true,
232232+ "createdAt": time.Now().UTC().Format(time.RFC3339),
233233+ })
234234+ if err != nil {
235235+ t.Fatalf("marshal attestation: %v", err)
236236+ }
237237+ sess := &fakeCompletedSession{
238238+ did: did,
239239+ domain: domain,
240240+ attestation: attBytes,
241241+ // Inject a PDS-side failure on PutRecord — same shape as a real
242242+ // 5xx from the PDS or a network blip.
243243+ putErr: fmt.Errorf("pds 502 bad gateway"),
244244+ }
245245+ pub := &programmablePublisher{
246246+ startURL: "https://faux-pds.example/oauth/authorize?atomic=1",
247247+ completeSess: sess,
248248+ }
249249+ fakeAdmin := &fakeAdminAPI{
250250+ enrollStatus: http.StatusOK,
251251+ enrollBody: fmt.Sprintf(`{
252252+ "did": %q,
253253+ "apiKey": %q,
254254+ "dkim": {
255255+ "selector": "atmos20260501",
256256+ "rsaRecord": "v=DKIM1; k=rsa; p=AAA",
257257+ "edRecord": "v=DKIM1; k=ed25519; p=BBB",
258258+ "rsaDnsName": "atmos20260501r._domainkey.fail.example.com",
259259+ "edDnsName": "atmos20260501e._domainkey.fail.example.com"
260260+ },
261261+ "smtp": {"host": "smtp.atmos.email", "port": 587}
262262+ }`, did, apiKey),
263263+ }
264264+ enrollH := NewEnrollHandler(fakeAdmin, nil)
265265+ enrollH.SetPublisher(pub)
266266+ attestH := NewAttestHandler(pub, &stashAttestStore{})
267267+ attestH.SetEnrollCredentialsStash(enrollH)
268268+269269+ mux := http.NewServeMux()
270270+ attestH.RegisterRoutes(mux)
271271+ mux.Handle("/", enrollH)
272272+273273+ form := url.Values{}
274274+ form.Set("domain", domain)
275275+ form.Set("token", "tok-fail-1")
276276+ req := httptest.NewRequest(http.MethodPost, "/enroll/verify",
277277+ strings.NewReader(form.Encode()))
278278+ req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
279279+ rec := httptest.NewRecorder()
280280+ mux.ServeHTTP(rec, req)
281281+ if rec.Code != http.StatusFound {
282282+ t.Fatalf("/enroll/verify: status = %d, want 302; body=%q", rec.Code, rec.Body.String())
283283+ }
284284+285285+ req = httptest.NewRequest(http.MethodGet, "/enroll/attest/callback?code=x&state=y", nil)
286286+ rec = httptest.NewRecorder()
287287+ mux.ServeHTTP(rec, req)
288288+289289+ body := rec.Body.String()
290290+ if !strings.Contains(body, apiKey) {
291291+ t.Error("publish-failure retry page MUST render API key — user is enrolled, can't lose creds")
292292+ }
293293+ if !strings.Contains(body, "/account/manage") {
294294+ t.Error("publish-failure retry page should link to /account/manage for self-service retry (#235)")
295295+ }
296296+}
-1
internal/admin/ui/events.go
···279279 return out
280280}
281281282282-283282// handleShadowVerdicts renders /admin/shadow-verdicts — the events stream
284283// pre-filtered to events whose labels_applied contains any "shadow:"
285284// label. This is the bake-in surface for new rules authored in
+1-1
internal/admin/ui/handlers.go
···130130 // Static assets and GETs are read-only; CSRF middleware short-
131131 // circuits them. State-changing requests (POST/PUT/PATCH/DELETE)
132132 // must carry HX-Request: true and an Origin/Referer matching the
133133- // operator-configured allowlist. See CRIT #151.
133133+ // operator-configured allowlist. See CRIT review.
134134 csrf := RequireCSRF(h.allowedOrigins, CSRFOptions{RequireHTMX: true})
135135 csrf(h.mux).ServeHTTP(w, r)
136136}
···2233package ui
4455-// Log-safe hashing for credential-shaped values (OAuth state tokens,
66-// recovery ticket IDs, etc.). Never log the raw value — log the prefix
77-// of sha256(value) so operators can correlate events across lines
88-// without exposing a credential. Returns "<empty>" for empty inputs so
99-// a blank value is still visually distinct in logs.
55+// Thin back-compat wrapper. The implementation moved to
66+// internal/loghash so non-UI packages (notably the labeler) can redact
77+// DIDs in logs without importing UI code. Existing
88+// ui.HashForLog call sites keep working unchanged.
1091110import (
1212- "crypto/sha256"
1313- "encoding/hex"
1111+ "atmosphere-mail/internal/loghash"
1412)
15131616-// hashLogPrefixLen is the number of hex chars emitted by HashForLog.
1717-// 16 hex chars = 64 bits of the SHA-256 digest — ample for operator
1818-// correlation across log lines, while still a one-way function.
1919-const hashLogPrefixLen = 16
2020-2121-// HashForLog returns a short, deterministic hex prefix of sha256(s)
2222-// suitable for log output. Empty input returns the sentinel "<empty>"
2323-// so blank values are legible rather than invisible.
1414+// HashForLog is preserved as a back-compat alias for loghash.ForLog.
1515+// New code outside internal/admin/ui should call loghash.ForLog
1616+// directly.
2417func HashForLog(s string) string {
2525- if s == "" {
2626- return "<empty>"
2727- }
2828- sum := sha256.Sum256([]byte(s))
2929- return hex.EncodeToString(sum[:])[:hashLogPrefixLen]
1818+ return loghash.ForLog(s)
3019}
+1-1
internal/admin/ui/inproc.go
···991010// adminProxyResponse captures the response from invoking the admin API
1111// in-process. Replaces *httptest.ResponseRecorder so test-only types stay
1212-// out of the production call chain (#222).
1212+// out of the production call chain.
1313//
1414// Field names mirror the legacy ResponseRecorder API (`Code`, `Body`) so
1515// callers that read `resp.Code` and `resp.Body.String()` keep working
+2-2
internal/admin/ui/metadata.go
···1717// caching bytes because responses are rare (once per PAR) and the atomic
1818// config snapshot is not worth the complexity of a cache invalidation path.
1919type MetadataHandler struct {
2020- client *atpoauth.Client
2121- clientURI string // optional — if non-empty, populates client_uri
2020+ client *atpoauth.Client
2121+ clientURI string // optional — if non-empty, populates client_uri
2222 clientName string
2323}
2424
+86-32
internal/admin/ui/recover.go
···9595 // update so the admin API can trigger email re-verification. Nil =
9696 // no-op (verification feature not wired).
9797 onContactEmailChanged func(ctx context.Context, domain, contactEmail string)
9898+ // labels, when set, is consulted on /account/manage to render the
9999+ // signed-in DID's current label state and to broaden the publish
100100+ // button condition. Nil = legacy behavior: publish button
101101+ // gated only on attestation_rkey emptiness.
102102+ labels LabelStatusQuerier
9810399104 mu sync.Mutex
100105 tickets map[string]recoveryTicket
···122127// email re-verification without the UI package importing admin.
123128func (h *RecoverHandler) SetContactEmailChangedHook(fn func(ctx context.Context, domain, contactEmail string)) {
124129 h.onContactEmailChanged = fn
130130+}
131131+132132+// SetLabelStatusQuerier wires the labeler-XRPC query used by
133133+// /account/manage to surface live label state. When set, the
134134+// page shows which of `verified-mail-operator` and `relay-member` the
135135+// labeler currently issues for the signed-in DID, plus a re-publish
136136+// affordance when labels are missing despite a published attestation
137137+// (a state today's DB-stamp gate misses entirely).
138138+func (h *RecoverHandler) SetLabelStatusQuerier(q LabelStatusQuerier) {
139139+ h.labels = q
125140}
126141127142// RecoverRegenerateFunc rotates the API key for (did, domain) and
···247262// noReferrerHeader sets Referrer-Policy: strict-origin-when-cross-origin
248263// on every /account/* response.
249264//
250250-// History: this was "no-referrer" in the original CRIT bundle (#152) to
265265+// History: this was "no-referrer" in the original CRIT bundle to
251266// keep the then-URL-embedded ticket from leaking via Referer. After the
252267// ticket moved to an HttpOnly cookie, the URL itself carries no secret,
253268// and "no-referrer" started causing real harm: browsers that see it on
254269// the landing page strip BOTH Origin and Referer from the subsequent
255270// form POST, which makes our CSRF middleware reject every same-origin
256256-// POST with "forbidden: origin not allowed" (#178).
271271+// POST with "forbidden: origin not allowed".
257272//
258273// "strict-origin-when-cross-origin" is the modern browser default:
259274// - same-origin requests get the full Referer (our CSRF check works)
···336351337352// handleLanding renders the entry form where the member enters the
338353// handle or DID they originally enrolled.
354354+//
355355+// If a valid recovery ticket cookie is already present, redirects to
356356+// /account/manage instead of re-prompting for sign-in. Without this
357357+// hop, navigating /account/manage → /account/deliverability → /account
358358+// (or any other path that lands back at the bare /account URL) dumps a
359359+// signed-in member back at the sign-in form, even though their cookie
360360+// is still valid.
361361+//
362362+// Invalid / expired cookies fall through to the form — never redirect-
363363+// loop, never silently consume the ticket.
339364func (h *RecoverHandler) handleLanding(w http.ResponseWriter, r *http.Request) {
340365 if r.Method != http.MethodGet {
341366 http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
342367 return
343368 }
369369+ if id, ok := recoveryTicketFromCookie(r); ok {
370370+ if _, ok := h.lookupTicket(id, r.UserAgent()); ok {
371371+ http.Redirect(w, r, "/account/manage", http.StatusFound)
372372+ return
373373+ }
374374+ }
344375 w.Header().Set("Content-Type", "text/html; charset=utf-8")
345376 _ = templates.RecoverLanding("").Render(r.Context(), w)
346377}
···390421 // Attestation deliberately nil.
391422 })
392423 if err != nil {
393393- // Audit #162: log detail server-side, surface a generic
424424+ // log detail server-side, surface a generic
394425 // message to the user. Upstream error strings can carry PDS
395426 // hostnames, network internals, and indigo-specific tokens
396427 // that don't belong in a browser.
···477508 return
478509 }
479510511511+ // Query the labeler for live label state. Nil querier or
512512+ // any error/empty result is rendered as "label status unavailable"
513513+ // in the template so we never hide the rest of the page on a
514514+ // transient labeler outage.
515515+ var labels []string
516516+ var labelsKnown bool
517517+ if h.labels != nil {
518518+ qctx, qcancel := context.WithTimeout(r.Context(), 3*time.Second)
519519+ ls, err := h.labels.QueryLabels(qctx, ticket.did)
520520+ qcancel()
521521+ if err == nil {
522522+ labels = ls
523523+ labelsKnown = true
524524+ } else {
525525+ log.Printf("recover.manage: did_hash=%s label_query_error=%v",
526526+ HashForLog(ticket.did), err)
527527+ }
528528+ }
529529+480530 w.Header().Set("Content-Type", "text/html; charset=utf-8")
481531 _ = templates.RecoverManage(templates.RecoverManageData{
482482- DID: ticket.did,
483483- Domain: ticket.domain,
484484- DKIMSelector: memberDomain.DKIMSelector,
485485- ContactEmail: memberDomain.ContactEmail,
486486- EmailVerified: memberDomain.EmailVerified,
487487- ExpiresAt: ticket.expiry.Format(time.RFC3339),
532532+ DID: ticket.did,
533533+ Domain: ticket.domain,
534534+ DKIMSelector: memberDomain.DKIMSelector,
535535+ ContactEmail: memberDomain.ContactEmail,
536536+ EmailVerified: memberDomain.EmailVerified,
537537+ AttestationPublished: memberDomain.AttestationRkey != "",
538538+ Labels: labels,
539539+ LabelsKnown: labelsKnown,
540540+ ExpiresAt: ticket.expiry.Format(time.RFC3339),
488541 }).Render(r.Context(), w)
489542}
490543···554607555608 w.Header().Set("Content-Type", "text/html; charset=utf-8")
556609 _ = templates.DeliverabilityPage(templates.DeliverabilityData{
557557- DID: ticket.did,
558558- Domain: ticket.domain,
559559- Status: member.Status,
560560- SuspendReason: member.SuspendReason,
561561- Sent14d: total,
562562- Bounced14d: bounced,
563563- Complaints14d: complaints,
564564- BounceRate: bounceRate,
565565- DailySends: daily,
566566- HourlyLimit: member.HourlyLimit,
567567- DailyLimit: member.DailyLimit,
568568- WarmingTier: warmingTier,
569569- WarmingLabel: warmingLabel,
610610+ DID: ticket.did,
611611+ Domain: ticket.domain,
612612+ Status: member.Status,
613613+ SuspendReason: member.SuspendReason,
614614+ Sent14d: total,
615615+ Bounced14d: bounced,
616616+ Complaints14d: complaints,
617617+ BounceRate: bounceRate,
618618+ DailySends: daily,
619619+ HourlyLimit: member.HourlyLimit,
620620+ DailyLimit: member.DailyLimit,
621621+ WarmingTier: warmingTier,
622622+ WarmingLabel: warmingLabel,
570623 }).Render(r.Context(), w)
571624}
572625···700753 return
701754 }
702755 email := strings.TrimSpace(r.FormValue("contact_email"))
703703- // Audit #156: empty is OK (unset); non-empty must parse as a
756756+ // empty is OK (unset); non-empty must parse as a
704757 // valid RFC 5322 address. net/mail.ParseAddress is stricter than
705758 // the old strings.Contains("@") check — it rejects
706759 // "not@valid@addr", bare domains, etc.
···773826 return
774827 }
775828 data := templates.RecoverManageData{
776776- DID: ticket.did,
777777- Domain: ticket.domain,
778778- DKIMSelector: memberDomain.DKIMSelector,
779779- ContactEmail: memberDomain.ContactEmail,
780780- EmailVerified: memberDomain.EmailVerified,
781781- ExpiresAt: ticket.expiry.Format(time.RFC3339),
782782- Message: message,
783783- MessageErr: isError,
829829+ DID: ticket.did,
830830+ Domain: ticket.domain,
831831+ DKIMSelector: memberDomain.DKIMSelector,
832832+ ContactEmail: memberDomain.ContactEmail,
833833+ EmailVerified: memberDomain.EmailVerified,
834834+ AttestationPublished: memberDomain.AttestationRkey != "",
835835+ ExpiresAt: ticket.expiry.Format(time.RFC3339),
836836+ Message: message,
837837+ MessageErr: isError,
784838 }
785839 w.Header().Set("Content-Type", "text/html; charset=utf-8")
786840 _ = templates.RecoverManage(data).Render(r.Context(), w)
···920974// lose their session under Strict, which is wrong. Lax still blocks
921975// cross-site POSTs (CSRF protection) but allows cookies on top-level
922976// GETs — which is exactly the navigation pattern recovery produces.
923923-// See #180.
977977+//
924978func setRecoveryCookie(w http.ResponseWriter, ticket string) {
925979 http.SetCookie(w, &http.Cookie{
926980 Name: RecoveryCookieName,
+413
internal/admin/ui/recover_test.go
···121121 }
122122}
123123124124+// TestRecover_LandingRedirectsWhenSignedIn covers #239: navigating back
125125+// to /account from any sub-page (e.g. /account/deliverability) must NOT
126126+// re-prompt for sign-in if the recovery cookie is still valid.
127127+func TestRecover_LandingRedirectsWhenSignedIn(t *testing.T) {
128128+ store := newRecoverTestStore(t)
129129+ did := "did:plc:landing1111111111111aa"
130130+ seedRecoverMember(t, store, did, "landing.example.com")
131131+132132+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
133133+ target := h.IssueRecoveryTicket(did, "landing.example.com")
134134+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
135135+136136+ mux := http.NewServeMux()
137137+ h.RegisterRoutes(mux)
138138+139139+ req := httptest.NewRequest(http.MethodGet, "/account", nil)
140140+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
141141+ rec := httptest.NewRecorder()
142142+ mux.ServeHTTP(rec, req)
143143+144144+ if rec.Code != http.StatusFound {
145145+ t.Fatalf("status = %d, want 302; body=%q", rec.Code, rec.Body.String())
146146+ }
147147+ if loc := rec.Header().Get("Location"); loc != "/account/manage" {
148148+ t.Errorf("redirect = %q, want /account/manage", loc)
149149+ }
150150+}
151151+152152+// TestRecover_LandingFallsThroughOnInvalidCookie guards against a redirect
153153+// loop on stale cookies: an invalid/expired ticket cookie must cause
154154+// /account to render the sign-in form, not redirect back to /account/manage
155155+// (which would itself bounce back to /account, looping).
156156+func TestRecover_LandingFallsThroughOnInvalidCookie(t *testing.T) {
157157+ h := NewRecoverHandler(&fakePublisher{}, newRecoverTestStore(t), "https://example.com", nil)
158158+ mux := http.NewServeMux()
159159+ h.RegisterRoutes(mux)
160160+161161+ req := httptest.NewRequest(http.MethodGet, "/account", nil)
162162+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: "ticket-that-was-never-issued"})
163163+ rec := httptest.NewRecorder()
164164+ mux.ServeHTTP(rec, req)
165165+166166+ if rec.Code != http.StatusOK {
167167+ t.Fatalf("status = %d, want 200; body=%q", rec.Code, rec.Body.String())
168168+ }
169169+ if !strings.Contains(rec.Body.String(), `action="/account/start"`) {
170170+ t.Error("stale-cookie landing should still render the sign-in form")
171171+ }
172172+}
173173+174174+// TestRecover_DeliverabilityHasSingleTopnav covers #239's second papercut:
175175+// /account/deliverability must not stack two `topnav` bars (the layout's
176176+// "← home" + a redundant "← Account" breadcrumb). A single nav bar is the
177177+// expected visual treatment.
178178+func TestRecover_DeliverabilityHasSingleTopnav(t *testing.T) {
179179+ store := newRecoverTestStore(t)
180180+ did := "did:plc:singlenav1111111111111"
181181+ domain := "singlenav.example.com"
182182+ seedRecoverMember(t, store, did, domain)
183183+184184+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
185185+ target := h.IssueRecoveryTicket(did, domain)
186186+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
187187+188188+ mux := http.NewServeMux()
189189+ h.RegisterRoutes(mux)
190190+191191+ req := httptest.NewRequest(http.MethodGet, "/account/deliverability", nil)
192192+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
193193+ rec := httptest.NewRecorder()
194194+ mux.ServeHTTP(rec, req)
195195+196196+ if rec.Code != http.StatusOK {
197197+ t.Fatalf("status = %d, want 200", rec.Code)
198198+ }
199199+ body := rec.Body.String()
200200+ if got := strings.Count(body, `class="topnav"`); got != 1 {
201201+ t.Errorf("deliverability topnav count = %d, want exactly 1 (publicLayout's only)", got)
202202+ }
203203+ // The contextual back-link is preserved as a non-stacked inline link.
204204+ if !strings.Contains(body, `href="/account/manage"`) {
205205+ t.Error("deliverability should still link back to /account/manage inline")
206206+ }
207207+}
208208+124209func TestRecover_StartLooksUpDIDAndRedirects(t *testing.T) {
125210 store := newRecoverTestStore(t)
126211 did := "did:plc:recover1111111111111aa"
···781866 t.Errorf("status = 200 — query-string ticket must not be accepted")
782867 }
783868}
869869+870870+// --- #235 self-service publish for stuck (enrolled-but-unpublished) members ---
871871+//
872872+// Real members richferro.com (2026-04-28) and self.surf (2026-04-30) finished
873873+// the enrollment wizard but never clicked the publish button on the credentials
874874+// page. Their member_domains rows have attestation_rkey='' so the labeler never
875875+// sees them. /account/manage must render a publish-attestation form for any
876876+// signed-in domain whose attestation_rkey is empty, posting the same fields
877877+// /enroll/attest/start already accepts so no new HTTP handler is needed.
878878+879879+func setRecoverDomainAttestation(t *testing.T, s *relaystore.Store, domain, rkey string) {
880880+ t.Helper()
881881+ if err := s.SetAttestationPublished(context.Background(), domain, rkey, time.Now().UTC()); err != nil {
882882+ t.Fatalf("SetAttestationPublished: %v", err)
883883+ }
884884+}
885885+886886+func TestRecover_ManageShowsPublishButtonForUnpublishedDomain(t *testing.T) {
887887+ store := newRecoverTestStore(t)
888888+ did := "did:plc:unpub111111111111111"
889889+ domain := "stuck.example.com"
890890+ seedRecoverMember(t, store, did, domain)
891891+ // Deliberately NOT publishing — attestation_rkey stays "" — this is
892892+ // the state the two real stuck members are in.
893893+894894+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
895895+ target := h.IssueRecoveryTicket(did, domain)
896896+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
897897+898898+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
899899+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
900900+ rec := httptest.NewRecorder()
901901+ mux := http.NewServeMux()
902902+ h.RegisterRoutes(mux)
903903+ mux.ServeHTTP(rec, req)
904904+905905+ if rec.Code != http.StatusOK {
906906+ t.Fatalf("status = %d, want 200", rec.Code)
907907+ }
908908+ body := rec.Body.String()
909909+ if !strings.Contains(body, `action="/enroll/attest/start"`) {
910910+ t.Error("manage page missing publish-attestation form for unpublished domain")
911911+ }
912912+ for _, want := range []string{
913913+ `name="did"`,
914914+ `name="domain"`,
915915+ `name="dkim_selector"`,
916916+ "atmos20260420", // dkim selector seeded by seedRecoverMember
917917+ domain,
918918+ did,
919919+ } {
920920+ if !strings.Contains(body, want) {
921921+ t.Errorf("manage page publish form missing %q", want)
922922+ }
923923+ }
924924+}
925925+926926+func TestRecover_ManageHidesPublishButtonForPublishedDomain(t *testing.T) {
927927+ store := newRecoverTestStore(t)
928928+ did := "did:plc:pubok11111111111111"
929929+ domain := "published.example.com"
930930+ seedRecoverMember(t, store, did, domain)
931931+ setRecoverDomainAttestation(t, store, domain, domain)
932932+933933+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
934934+ target := h.IssueRecoveryTicket(did, domain)
935935+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
936936+937937+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
938938+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
939939+ rec := httptest.NewRecorder()
940940+ mux := http.NewServeMux()
941941+ h.RegisterRoutes(mux)
942942+ mux.ServeHTTP(rec, req)
943943+944944+ if rec.Code != http.StatusOK {
945945+ t.Fatalf("status = %d, want 200", rec.Code)
946946+ }
947947+ body := rec.Body.String()
948948+ if strings.Contains(body, `action="/enroll/attest/start"`) {
949949+ t.Error("manage page should not show publish form when attestation already published")
950950+ }
951951+}
952952+953953+func TestRecover_ManagePublishButtonRendersOnlyForUnpublishedDomain_MultiDomain(t *testing.T) {
954954+ store := newRecoverTestStore(t)
955955+ did := "did:plc:multipub11111111111"
956956+ publishedDomain := "live.example.com"
957957+ stuckDomain := "stuck.example.com"
958958+ seedRecoverMember(t, store, did, publishedDomain)
959959+ addRecoverDomain(t, store, did, stuckDomain)
960960+ // Only the first one has attestation_rkey set; the second is stuck.
961961+ setRecoverDomainAttestation(t, store, publishedDomain, publishedDomain)
962962+963963+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
964964+ mux := http.NewServeMux()
965965+ h.RegisterRoutes(mux)
966966+967967+ // Sub-test 1: select the stuck domain → manage page shows publish form.
968968+ target := h.IssueRecoveryTicket(did, stuckDomain)
969969+ stuckTicket := strings.TrimPrefix(target, "/account/manage?ticket=")
970970+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
971971+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: stuckTicket})
972972+ rec := httptest.NewRecorder()
973973+ mux.ServeHTTP(rec, req)
974974+ if rec.Code != http.StatusOK {
975975+ t.Fatalf("stuck domain manage status = %d, want 200", rec.Code)
976976+ }
977977+ if !strings.Contains(rec.Body.String(), `action="/enroll/attest/start"`) {
978978+ t.Error("stuck domain manage page must show publish form")
979979+ }
980980+981981+ // Sub-test 2: select the published domain → manage page hides publish form.
982982+ target = h.IssueRecoveryTicket(did, publishedDomain)
983983+ pubTicket := strings.TrimPrefix(target, "/account/manage?ticket=")
984984+ req = httptest.NewRequest(http.MethodGet, "/account/manage", nil)
985985+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: pubTicket})
986986+ rec = httptest.NewRecorder()
987987+ mux.ServeHTTP(rec, req)
988988+ if rec.Code != http.StatusOK {
989989+ t.Fatalf("published domain manage status = %d, want 200", rec.Code)
990990+ }
991991+ if strings.Contains(rec.Body.String(), `action="/enroll/attest/start"`) {
992992+ t.Error("published domain manage page must not show publish form")
993993+ }
994994+}
995995+996996+// --- #240 label-state on /account/manage ---
997997+//
998998+// Pre-#240 the publish button was gated only on attestation_rkey. A
999999+// user whose attestation was published but whose DKIM TXT records were
10001000+// missing got no labels, no diagnostic, and no path forward. These
10011001+// tests pin the new contract: live label state from the labeler XRPC
10021002+// is surfaced on the manage page, and a re-publish button is offered
10031003+// when verified-mail-operator is missing despite a published
10041004+// attestation.
10051005+10061006+// fakeLabelStatusQuerier returns a pre-set list (or error) so tests can
10071007+// drive each label-state branch deterministically.
10081008+type fakeLabelStatusQuerier struct {
10091009+ labels []string
10101010+ err error
10111011+}
10121012+10131013+func (f *fakeLabelStatusQuerier) QueryLabels(ctx context.Context, did string) ([]string, error) {
10141014+ return f.labels, f.err
10151015+}
10161016+10171017+func TestRecover_ManageRendersLabelStatus_HappyPath(t *testing.T) {
10181018+ store := newRecoverTestStore(t)
10191019+ did := "did:plc:labelhappy11111111aa"
10201020+ domain := "happy.example.com"
10211021+ seedRecoverMember(t, store, did, domain)
10221022+ setRecoverDomainAttestation(t, store, domain, domain)
10231023+10241024+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
10251025+ h.SetLabelStatusQuerier(&fakeLabelStatusQuerier{
10261026+ labels: []string{"verified-mail-operator", "relay-member"},
10271027+ })
10281028+ target := h.IssueRecoveryTicket(did, domain)
10291029+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
10301030+10311031+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
10321032+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
10331033+ rec := httptest.NewRecorder()
10341034+ mux := http.NewServeMux()
10351035+ h.RegisterRoutes(mux)
10361036+ mux.ServeHTTP(rec, req)
10371037+10381038+ if rec.Code != http.StatusOK {
10391039+ t.Fatalf("status = %d, want 200", rec.Code)
10401040+ }
10411041+ body := rec.Body.String()
10421042+ if !strings.Contains(body, "Label status") {
10431043+ t.Error("manage page missing Label status section")
10441044+ }
10451045+ if !strings.Contains(body, "verified-mail-operator") || !strings.Contains(body, "✓ active") {
10461046+ t.Error("manage page should show verified-mail-operator as active")
10471047+ }
10481048+ if strings.Contains(body, `action="/enroll/attest/start"`) {
10491049+ t.Error("publish form should NOT show when both labels are active and attestation is published")
10501050+ }
10511051+}
10521052+10531053+func TestRecover_ManageShowsRepublishWhenLabelMissingDespitePublished(t *testing.T) {
10541054+ // The exact "silently broken" state #240 fixes: attestation_rkey is
10551055+ // set (DB stamp says we published), but the labeler hasn't issued
10561056+ // verified-mail-operator (typically because DKIM TXT is missing in
10571057+ // DNS). Without #240 the page shows nothing actionable.
10581058+ store := newRecoverTestStore(t)
10591059+ did := "did:plc:labelmiss111111111aa"
10601060+ domain := "missing.example.com"
10611061+ seedRecoverMember(t, store, did, domain)
10621062+ setRecoverDomainAttestation(t, store, domain, domain)
10631063+10641064+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
10651065+ h.SetLabelStatusQuerier(&fakeLabelStatusQuerier{
10661066+ labels: nil, // labeler reachable, no labels for this DID
10671067+ })
10681068+ target := h.IssueRecoveryTicket(did, domain)
10691069+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
10701070+10711071+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
10721072+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
10731073+ rec := httptest.NewRecorder()
10741074+ mux := http.NewServeMux()
10751075+ h.RegisterRoutes(mux)
10761076+ mux.ServeHTTP(rec, req)
10771077+10781078+ if rec.Code != http.StatusOK {
10791079+ t.Fatalf("status = %d, want 200", rec.Code)
10801080+ }
10811081+ body := rec.Body.String()
10821082+ if !strings.Contains(body, "missing") {
10831083+ t.Error("manage page should mark labels as missing")
10841084+ }
10851085+ // Re-publish form MUST be present even though attestation_rkey is
10861086+ // set — that's the #240 broadening.
10871087+ if !strings.Contains(body, `action="/enroll/attest/start"`) {
10881088+ t.Error("re-publish form should be present when labels are missing despite published attestation")
10891089+ }
10901090+ // Diagnostic copy should mention DKIM as the likely cause.
10911091+ if !strings.Contains(strings.ToLower(body), "dkim") {
10921092+ t.Error("manage page should mention DKIM as the likely cause when published attestation has no labels")
10931093+ }
10941094+}
10951095+10961096+func TestRecover_ManageHandlesUnreachableLabeler(t *testing.T) {
10971097+ // Labeler outage must not push users toward a republish that won't
10981098+ // help. Render "status unavailable" without prompting action.
10991099+ store := newRecoverTestStore(t)
11001100+ did := "did:plc:labelerdown111111aa"
11011101+ domain := "outage.example.com"
11021102+ seedRecoverMember(t, store, did, domain)
11031103+ setRecoverDomainAttestation(t, store, domain, domain)
11041104+11051105+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
11061106+ h.SetLabelStatusQuerier(&fakeLabelStatusQuerier{
11071107+ err: context.DeadlineExceeded,
11081108+ })
11091109+ target := h.IssueRecoveryTicket(did, domain)
11101110+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
11111111+11121112+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
11131113+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
11141114+ rec := httptest.NewRecorder()
11151115+ mux := http.NewServeMux()
11161116+ h.RegisterRoutes(mux)
11171117+ mux.ServeHTTP(rec, req)
11181118+11191119+ if rec.Code != http.StatusOK {
11201120+ t.Fatalf("status = %d, want 200", rec.Code)
11211121+ }
11221122+ body := rec.Body.String()
11231123+ if !strings.Contains(strings.ToLower(body), "unavailable") {
11241124+ t.Error("manage page should explicitly note when label status is unavailable")
11251125+ }
11261126+ // Don't aggressively show the re-publish form on labeler outage —
11271127+ // re-publish doesn't fix labeler unreachability.
11281128+ if strings.Contains(body, `action="/enroll/attest/start"`) {
11291129+ t.Error("re-publish form should not be shown when labeler is unreachable AND attestation is already published")
11301130+ }
11311131+}
11321132+11331133+func TestRecover_ManagePublishStillShowsForUnpublishedDomain_WithLabelQuerier(t *testing.T) {
11341134+ // Back-compat with #235: the original publish-when-rkey-empty path
11351135+ // still works even with a label querier wired. (The #240 broadening
11361136+ // only ADDS conditions; it doesn't remove the original.)
11371137+ store := newRecoverTestStore(t)
11381138+ did := "did:plc:labelunpub111111aaa"
11391139+ domain := "unpublished.example.com"
11401140+ seedRecoverMember(t, store, did, domain)
11411141+ // Deliberately NOT calling setRecoverDomainAttestation — rkey stays "".
11421142+11431143+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
11441144+ h.SetLabelStatusQuerier(&fakeLabelStatusQuerier{labels: nil})
11451145+ target := h.IssueRecoveryTicket(did, domain)
11461146+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
11471147+11481148+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
11491149+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
11501150+ rec := httptest.NewRecorder()
11511151+ mux := http.NewServeMux()
11521152+ h.RegisterRoutes(mux)
11531153+ mux.ServeHTTP(rec, req)
11541154+11551155+ body := rec.Body.String()
11561156+ if !strings.Contains(body, `action="/enroll/attest/start"`) {
11571157+ t.Error("publish form must show for unpublished domains regardless of label state")
11581158+ }
11591159+ if !strings.Contains(body, ">Publish attestation<") {
11601160+ t.Error("unpublished case should use 'Publish attestation' (not 'Re-publish') heading")
11611161+ }
11621162+}
11631163+11641164+func TestRecover_ManageWithoutLabelQuerier_BackCompat(t *testing.T) {
11651165+ // Pre-#240 deployments (or tests) without a label querier must
11661166+ // continue to work — no Label status section, publish gate falls
11671167+ // back to attestation_rkey-only.
11681168+ store := newRecoverTestStore(t)
11691169+ did := "did:plc:nolabelquerier1111aa"
11701170+ domain := "noquerier.example.com"
11711171+ seedRecoverMember(t, store, did, domain)
11721172+ setRecoverDomainAttestation(t, store, domain, domain)
11731173+11741174+ h := NewRecoverHandler(&fakePublisher{}, store, "https://example.com", nil)
11751175+ // No SetLabelStatusQuerier call — h.labels stays nil.
11761176+ target := h.IssueRecoveryTicket(did, domain)
11771177+ ticket := strings.TrimPrefix(target, "/account/manage?ticket=")
11781178+11791179+ req := httptest.NewRequest(http.MethodGet, "/account/manage", nil)
11801180+ req.AddCookie(&http.Cookie{Name: RecoveryCookieName, Value: ticket})
11811181+ rec := httptest.NewRecorder()
11821182+ mux := http.NewServeMux()
11831183+ h.RegisterRoutes(mux)
11841184+ mux.ServeHTTP(rec, req)
11851185+11861186+ if rec.Code != http.StatusOK {
11871187+ t.Fatalf("status = %d, want 200", rec.Code)
11881188+ }
11891189+ body := rec.Body.String()
11901190+ // Section header is still rendered (with "unavailable" copy) so
11911191+ // users get a consistent layout. The re-publish form must NOT
11921192+ // appear when we don't have label state to act on.
11931193+ if strings.Contains(body, `action="/enroll/attest/start"`) {
11941194+ t.Error("publish form must not appear when label state is unknown and attestation is published")
11951195+ }
11961196+}
+2-2
internal/admin/ui/review_queue.go
···60606161// handleList renders the review queue page. Two buckets:
6262//
6363-// primary: every member whose current status is "suspended"
6464-// recent: members with a "reactivated" review note in the last 7 days
6363+// primary: every member whose current status is "suspended"
6464+// recent: members with a "reactivated" review note in the last 7 days
6565//
6666// The primary bucket drives the count badge in the nav and is the
6767// focus of the workflow. Recent is shown below as context so ops can
+1-1
internal/admin/ui/sanitize.go
···33package ui
4455// Helpers for validating + defanging user-supplied values before they
66-// flow into log lines or the store. Audit #156 covers two gotchas:
66+// flow into log lines or the store. covers two gotchas:
77//
88// 1. Log injection via CRLF in a form field — `log.Printf("foo=%s",
99// val)` with val containing "\r\nFAKE:" produces a forged log line
+216
internal/admin/ui/templates/attest_published.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package templates
44+55+// Post-publish callback templates for the atomic enroll+publish flow.
66+//
77+// Hand-written templ.ComponentFunc values — same style as templates/recover.go —
88+// because the .templ source for /enroll has a pre-existing parse error around
99+// the inline JS at enroll.templ:627 that prevents `templ generate` from
1010+// running on this package. Mirroring recover.go's pattern keeps the
1111+// authoring style consistent and avoids touching the generated _templ.go
1212+// for unrelated functions.
1313+1414+import (
1515+ "context"
1616+ "fmt"
1717+ "html"
1818+ "io"
1919+ "strings"
2020+2121+ "github.com/a-h/templ"
2222+)
2323+2424+// AttestationPublishedData drives the post-callback page that combines
2525+// the "attestation published" confirmation with the just-revealed
2626+// credentials. It carries the same data EnrollResult does — duplicated
2727+// rather than reused so render code stays explicit about which fields
2828+// are needed (no surprise zero-values from a partially-populated
2929+// EnrollResult passed through OAuth round-trip stash).
3030+type AttestationPublishedData struct {
3131+ DID string
3232+ Domain string
3333+ APIKey string
3434+ SMTPHost string
3535+ SMTPPort int
3636+ DKIMSelector string
3737+ DKIMRSAName string
3838+ DKIMRSARecord string
3939+ DKIMEdName string
4040+ DKIMEdRecord string
4141+}
4242+4343+// EnrollAttestationCompleteWithCredentials is the new post-publish
4444+// landing page rendered by /enroll/attest/callback after a successful
4545+// PutRecord, when the wizard had stashed credentials for this (DID,
4646+// domain). Reveals the API key + DKIM TXT records here for the first
4747+// time. Previously this content lived on a pre-publish page that users
4848+// frequently bailed from before clicking publish.
4949+func EnrollAttestationCompleteWithCredentials(d AttestationPublishedData) templ.Component {
5050+ return templ.ComponentFunc(func(ctx context.Context, w io.Writer) error {
5151+ inner := templ.ComponentFunc(func(_ context.Context, w io.Writer) error {
5252+ var b strings.Builder
5353+ b.WriteString(`<h1 class="masthead masthead-sub">Enrolled · attestation published</h1>`)
5454+ fmt.Fprintf(&b, `<p class="lede">Your <code>email.atmos.attestation</code> record is live on your PDS, signed by <code>%s</code>. Save the API key below — this page is your only chance to copy it.</p>`,
5555+ html.EscapeString(d.DID))
5656+5757+ // API key — the only thing in a boxed credential card so it
5858+ // reads as the page's primary artifact.
5959+ b.WriteString(`<section class="section">`)
6060+ b.WriteString(`<span class="step-marker">credentials · shown once</span>`)
6161+ b.WriteString(`<h2>Your API key</h2>`)
6262+ b.WriteString(`<div class="credential">`)
6363+ b.WriteString(`<div class="credential-label">api key · shown once</div>`)
6464+ fmt.Fprintf(&b, `<pre><code id="atmos-api-key">%s</code></pre>`, html.EscapeString(d.APIKey))
6565+ b.WriteString(`<div class="credential-note">Acts as your SMTP password. We only store the hash. If you lose it, sign in at <a href="/account">Account</a> to rotate — re-enrollment is not required.</div>`)
6666+ b.WriteString(`</div>`)
6767+ b.WriteString(`</section>`)
6868+6969+ // SMTP submission.
7070+ b.WriteString(`<section class="section">`)
7171+ b.WriteString(`<h2>SMTP submission</h2>`)
7272+ b.WriteString(`<ul class="bullets">`)
7373+ fmt.Fprintf(&b, `<li>Host: <code>%s</code></li>`, html.EscapeString(d.SMTPHost))
7474+ fmt.Fprintf(&b, `<li>Port: <code>%d</code> (STARTTLS)</li>`, d.SMTPPort)
7575+ fmt.Fprintf(&b, `<li>Username: <code>%s</code></li>`, html.EscapeString(d.DID))
7676+ b.WriteString(`<li>Password: the API key above</li>`)
7777+ b.WriteString(`</ul>`)
7878+ b.WriteString(`</section>`)
7979+8080+ // DKIM.
8181+ b.WriteString(`<section class="section">`)
8282+ b.WriteString(`<h2>DKIM records to publish</h2>`)
8383+ fmt.Fprintf(&b, `<p class="section-lede">Add these two TXT records in DNS for <code>%s</code>. The labeler verifies them before issuing <code>verified-mail-operator</code>.</p>`,
8484+ html.EscapeString(d.Domain))
8585+ b.WriteString(`<div class="dns-block">`)
8686+ fmt.Fprintf(&b, `<div class="dns-block-label">%s</div>`, html.EscapeString(d.DKIMRSAName))
8787+ fmt.Fprintf(&b, `<pre>%s</pre>`, html.EscapeString(d.DKIMRSARecord))
8888+ b.WriteString(`</div>`)
8989+ b.WriteString(`<div class="dns-block">`)
9090+ fmt.Fprintf(&b, `<div class="dns-block-label">%s</div>`, html.EscapeString(d.DKIMEdName))
9191+ fmt.Fprintf(&b, `<pre>%s</pre>`, html.EscapeString(d.DKIMEdRecord))
9292+ b.WriteString(`</div>`)
9393+ b.WriteString(`</section>`)
9494+9595+ // SPF / DMARC.
9696+ b.WriteString(`<section class="section">`)
9797+ b.WriteString(`<h2>SPF and DMARC</h2>`)
9898+ b.WriteString(`<p class="section-lede">Recommended. Big-provider inboxes weight these heavily.</p>`)
9999+ b.WriteString(`<pre>@ TXT "v=spf1 ip4:87.99.138.77 -all"
100100+_dmarc TXT "v=DMARC1; p=reject; adkim=r; aspf=r; rua=mailto:postmaster@atmos.email"</pre>`)
101101+ b.WriteString(`</section>`)
102102+103103+ // What happens next.
104104+ b.WriteString(`<section class="section">`)
105105+ b.WriteString(`<span class="step-marker">what happens next</span>`)
106106+ b.WriteString(`<h2>Pending operator approval</h2>`)
107107+ b.WriteString(`<p class="section-lede">Your account exists but is <strong>not yet active</strong>. SMTP submission will reject with <code>535 5.7.8</code> until an operator approves the enrollment — usually within 24 hours.</p>`)
108108+ b.WriteString(`<ul class="bullets">`)
109109+ b.WriteString(`<li>The labeler reads your record and verifies DKIM in DNS.</li>`)
110110+ b.WriteString(`<li>If DKIM checks out, your DID gets <code>verified-mail-operator</code> and (if you opted in) <code>relay-member</code>.</li>`)
111111+ b.WriteString(`<li>To revoke: delete the atproto record from your PDS. The labeler reconciles on its next pass.</li>`)
112112+ b.WriteString(`<li>Lost the key later? Sign in at <a href="/account">Account</a> to rotate.</li>`)
113113+ b.WriteString(`</ul>`)
114114+ fmt.Fprintf(&b, `<p style="margin-top: 1.5rem;">Domain: <code>%s</code></p>`, html.EscapeString(d.Domain))
115115+ b.WriteString(`</section>`)
116116+117117+ _, err := io.WriteString(w, b.String())
118118+ return err
119119+ })
120120+ return publicLayout("Enrolled — "+d.Domain, false).Render(templ.WithChildren(ctx, inner), w)
121121+ })
122122+}
123123+124124+// EnrollAttestationRetryData drives the failure-path retry page when
125125+// the publish OAuth completed but PutRecord failed (e.g., PDS 5xx). The
126126+// member is enrolled but their attestation isn't on the PDS — we
127127+// surface their credentials here too so they don't lose them, and link
128128+// to /account/manage where the publish-attestation form (from)
129129+// lives so they can retry self-service.
130130+type EnrollAttestationRetryData struct {
131131+ DID string
132132+ Domain string
133133+ APIKey string
134134+ SMTPHost string
135135+ SMTPPort int
136136+ DKIMSelector string
137137+ DKIMRSAName string
138138+ DKIMRSARecord string
139139+ DKIMEdName string
140140+ DKIMEdRecord string
141141+ // PublishError is the user-facing summary of the publish failure.
142142+ // Kept short / non-sensitive — the detailed error goes to logs only.
143143+ PublishError string
144144+}
145145+146146+// EnrollAttestationRetry renders when /enroll/attest/callback received
147147+// the OAuth pair but the subsequent PutRecord rejected with a 5xx (or
148148+// any error). The user is enrolled — that step happened in the wizard
149149+// before publish — so credentials are still revealed; the only thing
150150+// missing is the on-PDS record, which they can retry from /account.
151151+func EnrollAttestationRetry(d EnrollAttestationRetryData) templ.Component {
152152+ return templ.ComponentFunc(func(ctx context.Context, w io.Writer) error {
153153+ inner := templ.ComponentFunc(func(_ context.Context, w io.Writer) error {
154154+ var b strings.Builder
155155+ b.WriteString(`<h1 class="masthead masthead-sub">Enrolled · attestation pending</h1>`)
156156+ b.WriteString(`<p class="lede">Your account is created and your credentials are below — but the attestation record didn't make it onto your PDS just now. Sign in at <a href="/account">Account</a> when you're ready to retry the publish step.</p>`)
157157+158158+ b.WriteString(`<div class="error-note" role="alert">`)
159159+ b.WriteString(`<strong>Publish failed:</strong> `)
160160+ if d.PublishError != "" {
161161+ b.WriteString(html.EscapeString(d.PublishError))
162162+ } else {
163163+ b.WriteString(`PDS rejected the record. This is usually transient — try again from /account in a few minutes.`)
164164+ }
165165+ b.WriteString(`</div>`)
166166+167167+ // Credentials.
168168+ b.WriteString(`<section class="section">`)
169169+ b.WriteString(`<span class="step-marker">credentials · shown once</span>`)
170170+ b.WriteString(`<h2>Your API key</h2>`)
171171+ b.WriteString(`<div class="credential">`)
172172+ b.WriteString(`<div class="credential-label">api key · shown once</div>`)
173173+ fmt.Fprintf(&b, `<pre><code>%s</code></pre>`, html.EscapeString(d.APIKey))
174174+ b.WriteString(`<div class="credential-note">Save this. We only store the hash. Lost it later? Sign in at <a href="/account">Account</a> to rotate.</div>`)
175175+ b.WriteString(`</div>`)
176176+ b.WriteString(`</section>`)
177177+178178+ // SMTP.
179179+ b.WriteString(`<section class="section">`)
180180+ b.WriteString(`<h2>SMTP submission</h2>`)
181181+ b.WriteString(`<ul class="bullets">`)
182182+ fmt.Fprintf(&b, `<li>Host: <code>%s</code></li>`, html.EscapeString(d.SMTPHost))
183183+ fmt.Fprintf(&b, `<li>Port: <code>%d</code> (STARTTLS)</li>`, d.SMTPPort)
184184+ fmt.Fprintf(&b, `<li>Username: <code>%s</code></li>`, html.EscapeString(d.DID))
185185+ b.WriteString(`<li>Password: the API key above</li>`)
186186+ b.WriteString(`</ul>`)
187187+ b.WriteString(`</section>`)
188188+189189+ // DKIM.
190190+ b.WriteString(`<section class="section">`)
191191+ b.WriteString(`<h2>DKIM records to publish</h2>`)
192192+ fmt.Fprintf(&b, `<p class="section-lede">Add these two TXT records for <code>%s</code> while you wait to retry the attestation.</p>`,
193193+ html.EscapeString(d.Domain))
194194+ b.WriteString(`<div class="dns-block">`)
195195+ fmt.Fprintf(&b, `<div class="dns-block-label">%s</div>`, html.EscapeString(d.DKIMRSAName))
196196+ fmt.Fprintf(&b, `<pre>%s</pre>`, html.EscapeString(d.DKIMRSARecord))
197197+ b.WriteString(`</div>`)
198198+ b.WriteString(`<div class="dns-block">`)
199199+ fmt.Fprintf(&b, `<div class="dns-block-label">%s</div>`, html.EscapeString(d.DKIMEdName))
200200+ fmt.Fprintf(&b, `<pre>%s</pre>`, html.EscapeString(d.DKIMEdRecord))
201201+ b.WriteString(`</div>`)
202202+ b.WriteString(`</section>`)
203203+204204+ // Retry CTA.
205205+ b.WriteString(`<section class="section">`)
206206+ b.WriteString(`<h2>Retry the publish step</h2>`)
207207+ b.WriteString(`<p class="section-lede">After saving the credentials above, sign in at <a href="/account">Account</a> — the publish-attestation button is exposed for any domain whose record isn't on the PDS yet.</p>`)
208208+ fmt.Fprintf(&b, `<p><a class="btn" href="/account/manage">Sign in to /account/manage</a></p>`)
209209+ b.WriteString(`</section>`)
210210+211211+ _, err := io.WriteString(w, b.String())
212212+ return err
213213+ })
214214+ return publicLayout("Enrolled — retry attestation", false).Render(templ.WithChildren(ctx, inner), w)
215215+ })
216216+}
+6-1
internal/admin/ui/templates/deliverability.go
···4343 inner := templ.ComponentFunc(func(_ context.Context, w io.Writer) error {
4444 var b strings.Builder
45454646- b.WriteString(`<nav class="topnav" aria-label="breadcrumb"><a href="/account" class="topnav-home">← Account</a></nav>`)
4646+ // Single masthead. The earlier topnav-stacked breadcrumb
4747+ // rendered atop publicLayout's own "← home" topnav, giving
4848+ // /account/deliverability a doubled-up header. Now
4949+ // the parent-link is rendered inline beneath the lede so
5050+ // there's exactly one horizontal nav band on the page.
4751 b.WriteString(`<h1 class="masthead masthead-sub">Deliverability</h1>`)
4852 fmt.Fprintf(&b, `<p class="lede">Sending reputation for <code>%s</code>.</p>`, html.EscapeString(d.Domain))
5353+ b.WriteString(`<p class="section-lede" style="margin-top: -0.5rem; margin-bottom: 1.25rem;"><a href="/account/manage">← Back to account</a></p>`)
49545055 // Status banner
5156 if d.Status == "suspended" {
+86-40
internal/admin/ui/templates/enroll.templ
···10701070 </section>
1071107110721072 <section class="section">
10731073- <h2>DKIM records to publish</h2>
10731073+ <h2>DNS records — required before sending</h2>
10741074+ <div class="error-note" role="alert">
10751075+ <strong>SMTP submission will reject until these records are live in DNS.</strong>
10761076+ There is no grace period — the relay verifies SPF and DKIM on every
10771077+ send attempt. Publish all records below before configuring your mail client.
10781078+ </div>
10791079+10801080+ <h3>DKIM</h3>
10741081 <p class="section-lede">
10751075- Add these two TXT records in DNS for <code>{ result.Domain }</code>.
10761076- The labeler verifies them before issuing <code>verified-mail-operator</code>.
10821082+ Add these two TXT records for <code>{ result.Domain }</code>.
10831083+ The labeler also verifies them before issuing <code>verified-mail-operator</code>.
10771084 </p>
1078108510791086 <div class="dns-block">
···10841091 <div class="dns-block-label">{ result.DKIM.EdDNSName }</div>
10851092 <pre>{ result.DKIM.EdRecord }</pre>
10861093 </div>
10871087- </section>
1088109410891089- <section class="section">
10901090- <h2>SPF and DMARC</h2>
10951095+ <h3>SPF and DMARC</h3>
10911096 <p class="section-lede">
10921092- Recommended. Big-provider inboxes weight these heavily.
10971097+ SPF is required. DMARC is strongly recommended — big-provider
10981098+ inboxes weight it heavily.
10931099 </p>
10941100 <pre>{ `@ TXT "v=spf1 ip4:87.99.138.77 -all"
10951101_dmarc TXT "v=DMARC1; p=reject; adkim=r; aspf=r; rua=mailto:postmaster@atmos.email"` }</pre>
···11201126 <strong>Copy your API key and DKIM records before clicking.</strong>
11211127 Publishing redirects you to your PDS and back to a confirmation
11221128 page — this page (with the credentials above) is not re-shown
11231123- afterwards, and we only store a hash of the key. If you lose
11241124- the key, the only remedy is to re-enroll.
11291129+ afterwards, and we only store a hash of the key. If you lose the
11301130+ key later, sign in at <a href="/account">Account</a> to rotate —
11311131+ re-enrollment is not required.
11251132 </div>
11261133 <form action="/enroll/attest/start" method="POST">
11271134 <input type="hidden" name="did" value={ result.DID }/>
···11411148 <span class="step-marker">Step five · what happens next</span>
11421149 <h2>Pending operator approval</h2>
11431150 <p class="section-lede">
11441144- Your account exists but is <strong>not yet active</strong>. SMTP
11451145- submission will reject with <code>535 5.7.8</code> until an
11461146- operator approves the enrollment — usually within 24 hours. The
11471147- manual gate is a shared-reputation safeguard, not a judgment of
11481148- you; it exists because one bad sender burns deliverability for
11491149- every other member on this relay.
11511151+ Your account exists but is <strong>not yet active</strong>. Two
11521152+ gates must pass before you can send:
11501153 </p>
11541154+ <ol class="bullets">
11551155+ <li>
11561156+ <strong>DNS verification</strong> — the relay checks SPF and DKIM
11571157+ on every send attempt. Publish the records above and allow a few
11581158+ minutes for propagation.
11591159+ </li>
11601160+ <li>
11611161+ <strong>Operator approval</strong> — typically within 24 hours. SMTP
11621162+ will reject with <code>535 5.7.8</code> until approved. The manual
11631163+ gate is a shared-reputation safeguard; it exists because one bad
11641164+ sender burns deliverability for every other member on this relay.
11651165+ </li>
11661166+ </ol>
11511167 <ul class="bullets">
11521152- <li>Publish the DKIM and (optionally) SPF/DMARC records above.</li>
11531153- <li>DNS propagation is usually minutes, occasionally an hour.</li>
11541168 <li>
11551169 Approval confirmation is sent to the operator's Matrix room
11561156- automatically. Once approved your next SMTP submission will
11571157- succeed — no ping from us required.
11701170+ automatically. Once both gates pass, your next SMTP submission
11711171+ will succeed — no ping from us required.
11581172 </li>
11591173 <li>
11601174 Questions, or enrollment stuck >24h?
···14761490 <span class="step-marker">§4 · Sharing</span>
14771491 <h2>Who else sees this</h2>
14781492 <p>
14791479- Send events and bounce outcomes are evaluated by our
14801480- internal Trust & Safety rules engine (Osprey) to
14811481- derive reputation labels (e.g. <code>highly_trusted</code>,
14821482- <code>auto_suspended</code>). Labels are published via an
14831483- atproto labeler and are intentionally public — any
14841484- consumer of the labeler can read them. We do not share
14851485- message content, recipient lists, or API keys with anyone.
14931493+ We publish a small set of <strong>public atproto labels</strong>
14941494+ about your DID via our cooperative labeler at
14951495+ <code>labeler.atmos.email</code>. Today that's
14961496+ <code>verified-mail-operator</code> and
14971497+ <code>relay-member</code>. These are signed, network-visible,
14981498+ and any atproto consumer can read them — intentionally so,
14991499+ since the point is to let third parties verify you're a
15001500+ cooperative member.
15011501+ </p>
15021502+ <p>
15031503+ Send events and bounce outcomes feed our internal Trust
15041504+ & Safety rules engine (Osprey), which derives
15051505+ operational reputation signals (e.g. <code>highly_trusted</code>,
15061506+ <code>auto_suspended</code>). These are
15071507+ <strong>internal-only</strong> — they drive throttling,
15081508+ warming, and SMTP-time enforcement, but they are not
15091509+ published as atproto labels and do not leave the relay's
15101510+ process boundary.
15111511+ </p>
15121512+ <p>
15131513+ We do not share message content, recipient lists, or API
15141514+ keys with anyone.
14861515 </p>
14871516 </section>
14881517···15771606 <span class="step-marker">§4 · Honor unsubscribes</span>
15781607 <h2>One-click unsubscribe</h2>
15791608 <p>
15801580- Every message sent through the relay carries RFC 8058
15811581- <code>List-Unsubscribe</code> and <code>List-Unsubscribe-Post</code>
15821582- headers. When a recipient triggers an unsubscribe, that
15831583- address is added to your suppression list and further
15841584- attempts to send to it will be quietly dropped. Attempting
15851585- to work around the suppression list — by re-enrolling the
15861586- same address under a variant, rotating domains, or
15871587- stripping the header — is a terminating offense.
16091609+ Every <em>bulk</em> message sent through the relay carries
16101610+ RFC 8058 <code>List-Unsubscribe</code> and
16111611+ <code>List-Unsubscribe-Post</code> headers. When a recipient
16121612+ triggers an unsubscribe, that address is added to your
16131613+ suppression list and further bulk attempts to send to it
16141614+ will be quietly dropped. Attempting to work around the
16151615+ suppression list — by re-enrolling the same address under a
16161616+ variant, rotating domains, or stripping the header — is a
16171617+ terminating offense.
16181618+ </p>
16191619+ <p>
16201620+ User-initiated transactional mail (login links, password
16211621+ resets, MFA codes, address verification) is exempt from
16221622+ both behaviors. Tag those messages with the
16231623+ <code>X-Atmos-Category</code> header
16241624+ (<code>login-link</code>, <code>password-reset</code>,
16251625+ <code>mfa-otp</code>, or <code>verification</code>) and
16261626+ the relay will skip the unsubscribe header and bypass the
16271627+ suppression list, so an accidental click on a previous
16281628+ message can't lock the recipient out of their own auth
16291629+ flow. Untagged mail defaults to <code>bulk</code> — the
16301630+ strict policy above applies.
15881631 </p>
15891632 </section>
15901633···16431686 tooling. atproto already provides the portable identity
16441687 primitive that other protocols still lack; email just
16451688 needed the plumbing to route around the reputation
16461646- bottleneck. The relay is MIT-licensed, the Osprey rules
16471647- live in the open, and the labeler feed is public, so
16481648- anyone with the source can audit how deliverability
16891689+ bottleneck. The relay is AGPL-3.0-licensed, the Osprey
16901690+ rules live in the open, and the labeler feed is public,
16911691+ so anyone with the source can audit how deliverability
16491692 decisions are made.
16501693 </p>
16511694 </section>
···16731716 and Ed25519) whose public keys you publish in DNS. The
16741717 relay signs outbound mail on your behalf, tracks
16751718 delivery and bounce outcomes, and emits those events to
16761676- a Trust & Safety rules engine (Osprey) that labels
16771677- reputation via an atproto labeler. Labels drive
16781678- throttling, warming, and suspension decisions.
17191719+ a Trust & Safety rules engine (Osprey). Osprey-derived
17201720+ signals drive throttling, warming, and suspension
17211721+ decisions internally, while a separate cooperative
17221722+ labeler publishes public atproto identity labels
17231723+ (<code>verified-mail-operator</code>, <code>relay-member</code>)
17241724+ on member DIDs.
16791725 </p>
16801726 </section>
16811727
+7-7
internal/admin/ui/templates/enroll_templ.go
···633633 if templ_7745c5c3_Err != nil {
634634 return templ_7745c5c3_Err
635635 }
636636- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 47, "</code></li><li>Password: the API key above</li></ul></section><section class=\"section\"><h2>DKIM records to publish</h2><p class=\"section-lede\">Add these two TXT records in DNS for <code>")
636636+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 47, "</code></li><li>Password: the API key above</li></ul></section><section class=\"section\"><h2>DNS records — required before sending</h2><div class=\"error-note\" role=\"alert\"><strong>SMTP submission will reject until these records are live in DNS.</strong> There is no grace period — the relay verifies SPF and DKIM on every send attempt. Publish all records below before configuring your mail client.</div><h3>DKIM</h3><p class=\"section-lede\">Add these two TXT records for <code>")
637637 if templ_7745c5c3_Err != nil {
638638 return templ_7745c5c3_Err
639639 }
···698698 if templ_7745c5c3_Err != nil {
699699 return templ_7745c5c3_Err
700700 }
701701- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 52, "</pre></div></section><section class=\"section\"><h2>SPF and DMARC</h2><p class=\"section-lede\">Recommended. Big-provider inboxes weight these heavily.</p><pre>")
701701+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 52, "</pre></div><h3>SPF and DMARC</h3><p class=\"section-lede\">SPF is required. DMARC is strongly recommended — big-provider inboxes weight it heavily.</p><pre>")
702702 if templ_7745c5c3_Err != nil {
703703 return templ_7745c5c3_Err
704704 }
···739739 return templ_7745c5c3_Err
740740 }
741741 } else {
742742- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 56, "<div class=\"error-note\" role=\"alert\"><strong>Copy your API key and DKIM records before clicking.</strong> Publishing redirects you to your PDS and back to a confirmation page — this page (with the credentials above) is not re-shown afterwards, and we only store a hash of the key. If you lose the key, the only remedy is to re-enroll.</div><form action=\"/enroll/attest/start\" method=\"POST\"><input type=\"hidden\" name=\"did\" value=\"")
742742+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 56, "<div class=\"error-note\" role=\"alert\"><strong>Copy your API key and DKIM records before clicking.</strong> Publishing redirects you to your PDS and back to a confirmation page — this page (with the credentials above) is not re-shown afterwards, and we only store a hash of the key. If you lose the key later, sign in at <a href=\"/account\">Account</a> to rotate — re-enrollment is not required.</div><form action=\"/enroll/attest/start\" method=\"POST\"><input type=\"hidden\" name=\"did\" value=\"")
743743 if templ_7745c5c3_Err != nil {
744744 return templ_7745c5c3_Err
745745 }
···783783 return templ_7745c5c3_Err
784784 }
785785 }
786786- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 60, "</section><section class=\"section\"><span class=\"step-marker\">Step five · what happens next</span><h2>Pending operator approval</h2><p class=\"section-lede\">Your account exists but is <strong>not yet active</strong>. SMTP submission will reject with <code>535 5.7.8</code> until an operator approves the enrollment — usually within 24 hours. The manual gate is a shared-reputation safeguard, not a judgment of you; it exists because one bad sender burns deliverability for every other member on this relay.</p><ul class=\"bullets\"><li>Publish the DKIM and (optionally) SPF/DMARC records above.</li><li>DNS propagation is usually minutes, occasionally an hour.</li><li>Approval confirmation is sent to the operator's Matrix room automatically. Once approved your next SMTP submission will succeed — no ping from us required.</li><li>Questions, or enrollment stuck >24h? <a href=\"https://bsky.app/profile/scottlanoue.com\">Contact the operator</a>.</li></ul></section><section class=\"section\"><h2>Verify once approved</h2><p class=\"section-lede\">Paste this into a terminal after approval lands. It sends a test message through the relay and prints the server response. Replace the destination address with somewhere you control.</p><pre>")
786786+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 60, "</section><section class=\"section\"><span class=\"step-marker\">Step five · what happens next</span><h2>Pending operator approval</h2><p class=\"section-lede\">Your account exists but is <strong>not yet active</strong>. Two gates must pass before you can send:</p><ol class=\"bullets\"><li><strong>DNS verification</strong> — the relay checks SPF and DKIM on every send attempt. Publish the records above and allow a few minutes for propagation.</li><li><strong>Operator approval</strong> — typically within 24 hours. SMTP will reject with <code>535 5.7.8</code> until approved. The manual gate is a shared-reputation safeguard; it exists because one bad sender burns deliverability for every other member on this relay.</li></ol><ul class=\"bullets\"><li>Approval confirmation is sent to the operator's Matrix room automatically. Once both gates pass, your next SMTP submission will succeed — no ping from us required.</li><li>Questions, or enrollment stuck >24h? <a href=\"https://bsky.app/profile/scottlanoue.com\">Contact the operator</a>.</li></ul></section><section class=\"section\"><h2>Verify once approved</h2><p class=\"section-lede\">Paste this into a terminal after approval lands. It sends a test message through the relay and prints the server response. Replace the destination address with somewhere you control.</p><pre>")
787787 if templ_7745c5c3_Err != nil {
788788 return templ_7745c5c3_Err
789789 }
···11091109 if templ_7745c5c3_Err != nil {
11101110 return templ_7745c5c3_Err
11111111 }
11121112- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 70, "</p><p class=\"lede\">Atmosphere Mail LLC operates the relay. Here is exactly what we collect, why, and for how long.</p><section class=\"section\"><span class=\"step-marker\">§1 · What we collect</span><h2>The data we hold</h2><ul class=\"bullets\"><li><strong>Your DID</strong> and registered sending domain(s).</li><li><strong>A salted hash of your API key</strong> — the plaintext key is only ever shown once, at enrollment.</li><li><strong>DKIM keypairs</strong> issued to your domain. Private keys are stored encrypted at rest and never leave our servers.</li><li><strong>Send logs</strong>: per-message sender DID, recipient address, From/To headers, timestamps, delivery status code, and bounce disposition. We do <em>not</em> store message bodies after handoff to the queue.</li><li><strong>Rate-limit counters</strong>: short-window send counts per DID used to enforce hourly and daily limits.</li><li><strong>Bounce records</strong>: inbound DSN classifications per DID so we can suspend senders with pathological bounce rates.</li><li><strong>Suppression list</strong>: recipients who used the one-click unsubscribe header, keyed per sender DID.</li><li><strong>IP addresses</strong> of SMTP clients, kept only in transient logs for abuse investigation and rotated out under the retention schedule below.</li></ul></section><section class=\"section\"><span class=\"step-marker\">§2 · What we do not collect</span><h2>Data we deliberately avoid</h2><p>We do not retain full message bodies past delivery. We do not set web tracking cookies, fingerprint browsers, or embed third-party analytics on any of our pages. We do not sell or rent member data to anyone, under any circumstances.</p></section><section class=\"section\"><span class=\"step-marker\">§3 · Retention</span><h2>How long we keep it</h2><ul class=\"bullets\"><li><strong>Terminal message logs</strong> (sent, bounced): 30 days, then purged.</li><li><strong>Rate-limit counters</strong>: 48 hours rolling window.</li><li><strong>Suppression entries</strong>: for the life of the member record — unsubscribes must persist.</li><li><strong>Member record</strong>: indefinitely while active; removed on request.</li></ul></section><section class=\"section\"><span class=\"step-marker\">§4 · Sharing</span><h2>Who else sees this</h2><p>Send events and bounce outcomes are evaluated by our internal Trust & Safety rules engine (Osprey) to derive reputation labels (e.g. <code>highly_trusted</code>, <code>auto_suspended</code>). Labels are published via an atproto labeler and are intentionally public — any consumer of the labeler can read them. We do not share message content, recipient lists, or API keys with anyone.</p></section><section class=\"section\"><span class=\"step-marker\">§5 · Your rights</span><h2>Access, correction, deletion</h2><p>You can fetch your member status and current labels via the API-key-authenticated <code>/member/status</code> endpoint. To correct or delete your member record, write to <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a> from a mailbox you can prove control of (or sign the request with your DID's signing key). We respond to verified requests within 14 days.</p></section><section class=\"section\"><span class=\"step-marker\">§6 · Security</span><h2>How we protect it</h2><p>API keys are stored as salted hashes. DKIM private keys are encrypted at rest. Host access is restricted to the LLC's operations team and uses hardware-keyed SSH. If we discover a breach that exposes member data we will notify affected members without undue delay.</p></section><section class=\"section\"><span class=\"step-marker\">§7 · Contact</span><h2>Reach us</h2><p>Atmosphere Mail LLC — <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a></p></section>")
11121112+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 70, "</p><p class=\"lede\">Atmosphere Mail LLC operates the relay. Here is exactly what we collect, why, and for how long.</p><section class=\"section\"><span class=\"step-marker\">§1 · What we collect</span><h2>The data we hold</h2><ul class=\"bullets\"><li><strong>Your DID</strong> and registered sending domain(s).</li><li><strong>A salted hash of your API key</strong> — the plaintext key is only ever shown once, at enrollment.</li><li><strong>DKIM keypairs</strong> issued to your domain. Private keys are stored encrypted at rest and never leave our servers.</li><li><strong>Send logs</strong>: per-message sender DID, recipient address, From/To headers, timestamps, delivery status code, and bounce disposition. We do <em>not</em> store message bodies after handoff to the queue.</li><li><strong>Rate-limit counters</strong>: short-window send counts per DID used to enforce hourly and daily limits.</li><li><strong>Bounce records</strong>: inbound DSN classifications per DID so we can suspend senders with pathological bounce rates.</li><li><strong>Suppression list</strong>: recipients who used the one-click unsubscribe header, keyed per sender DID.</li><li><strong>IP addresses</strong> of SMTP clients, kept only in transient logs for abuse investigation and rotated out under the retention schedule below.</li></ul></section><section class=\"section\"><span class=\"step-marker\">§2 · What we do not collect</span><h2>Data we deliberately avoid</h2><p>We do not retain full message bodies past delivery. We do not set web tracking cookies, fingerprint browsers, or embed third-party analytics on any of our pages. We do not sell or rent member data to anyone, under any circumstances.</p></section><section class=\"section\"><span class=\"step-marker\">§3 · Retention</span><h2>How long we keep it</h2><ul class=\"bullets\"><li><strong>Terminal message logs</strong> (sent, bounced): 30 days, then purged.</li><li><strong>Rate-limit counters</strong>: 48 hours rolling window.</li><li><strong>Suppression entries</strong>: for the life of the member record — unsubscribes must persist.</li><li><strong>Member record</strong>: indefinitely while active; removed on request.</li></ul></section><section class=\"section\"><span class=\"step-marker\">§4 · Sharing</span><h2>Who else sees this</h2><p>We publish a small set of <strong>public atproto labels</strong> about your DID via our cooperative labeler at <code>labeler.atmos.email</code>. Today that's <code>verified-mail-operator</code> and <code>relay-member</code>. These are signed, network-visible, and any atproto consumer can read them — intentionally so, since the point is to let third parties verify you're a cooperative member.</p><p>Send events and bounce outcomes feed our internal Trust & Safety rules engine (Osprey), which derives operational reputation signals (e.g. <code>highly_trusted</code>, <code>auto_suspended</code>). These are <strong>internal-only</strong> — they drive throttling, warming, and SMTP-time enforcement, but they are not published as atproto labels and do not leave the relay's process boundary.</p><p>We do not share message content, recipient lists, or API keys with anyone.</p></section><section class=\"section\"><span class=\"step-marker\">§5 · Your rights</span><h2>Access, correction, deletion</h2><p>You can fetch your member status and current labels via the API-key-authenticated <code>/member/status</code> endpoint. To correct or delete your member record, write to <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a> from a mailbox you can prove control of (or sign the request with your DID's signing key). We respond to verified requests within 14 days.</p></section><section class=\"section\"><span class=\"step-marker\">§6 · Security</span><h2>How we protect it</h2><p>API keys are stored as salted hashes. DKIM private keys are encrypted at rest. Host access is restricted to the LLC's operations team and uses hardware-keyed SSH. If we discover a breach that exposes member data we will notify affected members without undue delay.</p></section><section class=\"section\"><span class=\"step-marker\">§7 · Contact</span><h2>Reach us</h2><p>Atmosphere Mail LLC — <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a></p></section>")
11131113 if templ_7745c5c3_Err != nil {
11141114 return templ_7745c5c3_Err
11151115 }
···11721172 if templ_7745c5c3_Err != nil {
11731173 return templ_7745c5c3_Err
11741174 }
11751175- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 72, "</p><p class=\"lede\">Shared-IP email only works when every member sends responsibly. These rules are how we protect the pool's reputation on your behalf.</p><section class=\"section\"><span class=\"step-marker\">§1 · Your own mail only</span><h2>Send on your own behalf</h2><p>The relay is for mail originating from <em>you</em> — transactional, operational, or personal correspondence sent from the domain you enrolled. Do not resell relay credentials, relay mail for third parties, or use the service as a public-facing SMTP gateway.</p></section><section class=\"section\"><span class=\"step-marker\">§2 · No spam</span><h2>No unsolicited bulk mail</h2><p>You must have prior permission from every recipient. Scraped lists, purchased lists, and \"opt-out only\" mailing strategies are prohibited. We enforce volume caps, bounce rate thresholds, domain-spray detection, and velocity rules; crossing any of them will cost your DID its reputation labels and may trigger automatic suspension.</p></section><section class=\"section\"><span class=\"step-marker\">§3 · No abuse</span><h2>Prohibited content</h2><ul class=\"bullets\"><li>Phishing, credential harvesting, or impersonation of third parties.</li><li>Malware, ransomware, exploit payloads, or links to them.</li><li>Fraud, scams, illegal goods, or content that violates US federal or Washington state law.</li><li>Content targeting or harassing an individual, or inciting violence against a group.</li><li>Unauthorized use of another person's name, likeness, or identity.</li></ul></section><section class=\"section\"><span class=\"step-marker\">§4 · Honor unsubscribes</span><h2>One-click unsubscribe</h2><p>Every message sent through the relay carries RFC 8058 <code>List-Unsubscribe</code> and <code>List-Unsubscribe-Post</code> headers. When a recipient triggers an unsubscribe, that address is added to your suppression list and further attempts to send to it will be quietly dropped. Attempting to work around the suppression list — by re-enrolling the same address under a variant, rotating domains, or stripping the header — is a terminating offense.</p></section><section class=\"section\"><span class=\"step-marker\">§5 · Cooperate with investigations</span><h2>Abuse complaints</h2><p>If we receive an abuse report about mail from your DID we may ask you to explain it. Failure to respond within a reasonable window (48 hours by default) can result in suspension pending review. Report abuse by others to <a href=\"mailto:abuse@atmos.email\">abuse@atmos.email</a>.</p></section><section class=\"section\"><span class=\"step-marker\">§6 · Consequences</span><h2>What happens when you break the rules</h2><p>We apply the lightest intervention that fixes the problem. In order of increasing severity: a reputation label that throttles hourly volume; a temporary suspension pending operator review; permanent removal of the DID and its domains from the relay. Appeals go to <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a>.</p></section>")
11751175+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 72, "</p><p class=\"lede\">Shared-IP email only works when every member sends responsibly. These rules are how we protect the pool's reputation on your behalf.</p><section class=\"section\"><span class=\"step-marker\">§1 · Your own mail only</span><h2>Send on your own behalf</h2><p>The relay is for mail originating from <em>you</em> — transactional, operational, or personal correspondence sent from the domain you enrolled. Do not resell relay credentials, relay mail for third parties, or use the service as a public-facing SMTP gateway.</p></section><section class=\"section\"><span class=\"step-marker\">§2 · No spam</span><h2>No unsolicited bulk mail</h2><p>You must have prior permission from every recipient. Scraped lists, purchased lists, and \"opt-out only\" mailing strategies are prohibited. We enforce volume caps, bounce rate thresholds, domain-spray detection, and velocity rules; crossing any of them will cost your DID its reputation labels and may trigger automatic suspension.</p></section><section class=\"section\"><span class=\"step-marker\">§3 · No abuse</span><h2>Prohibited content</h2><ul class=\"bullets\"><li>Phishing, credential harvesting, or impersonation of third parties.</li><li>Malware, ransomware, exploit payloads, or links to them.</li><li>Fraud, scams, illegal goods, or content that violates US federal or Washington state law.</li><li>Content targeting or harassing an individual, or inciting violence against a group.</li><li>Unauthorized use of another person's name, likeness, or identity.</li></ul></section><section class=\"section\"><span class=\"step-marker\">§4 · Honor unsubscribes</span><h2>One-click unsubscribe</h2><p>Every <em>bulk</em> message sent through the relay carries RFC 8058 <code>List-Unsubscribe</code> and <code>List-Unsubscribe-Post</code> headers. When a recipient triggers an unsubscribe, that address is added to your suppression list and further bulk attempts to send to it will be quietly dropped. Attempting to work around the suppression list — by re-enrolling the same address under a variant, rotating domains, or stripping the header — is a terminating offense.</p><p>User-initiated transactional mail (login links, password resets, MFA codes, address verification) is exempt from both behaviors. Tag those messages with the <code>X-Atmos-Category</code> header (<code>login-link</code>, <code>password-reset</code>, <code>mfa-otp</code>, or <code>verification</code>) and the relay will skip the unsubscribe header and bypass the suppression list, so an accidental click on a previous message can't lock the recipient out of their own auth flow. Untagged mail defaults to <code>bulk</code> — the strict policy above applies.</p></section><section class=\"section\"><span class=\"step-marker\">§5 · Cooperate with investigations</span><h2>Abuse complaints</h2><p>If we receive an abuse report about mail from your DID we may ask you to explain it. Failure to respond within a reasonable window (48 hours by default) can result in suspension pending review. Report abuse by others to <a href=\"mailto:abuse@atmos.email\">abuse@atmos.email</a>.</p></section><section class=\"section\"><span class=\"step-marker\">§6 · Consequences</span><h2>What happens when you break the rules</h2><p>We apply the lightest intervention that fixes the problem. In order of increasing severity: a reputation label that throttles hourly volume; a temporary suspension pending operator review; permanent removal of the DID and its domains from the relay. Appeals go to <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a>.</p></section>")
11761176 if templ_7745c5c3_Err != nil {
11771177 return templ_7745c5c3_Err
11781178 }
···12351235 if templ_7745c5c3_Err != nil {
12361236 return templ_7745c5c3_Err
12371237 }
12381238- templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 74, "</a> — a Washington-based software developer working on open-source infrastructure for the atproto ecosystem.</p><p>Freedom in software comes from open source and shared tooling. atproto already provides the portable identity primitive that other protocols still lack; email just needed the plumbing to route around the reputation bottleneck. The relay is MIT-licensed, the Osprey rules live in the open, and the labeler feed is public, so anyone with the source can audit how deliverability decisions are made.</p></section><section class=\"section\"><span class=\"step-marker\">§2 · The entity</span><h2>Who's on the contract</h2><p>The relay is operated by <strong>Atmosphere Mail LLC</strong>, a Washington State limited liability company formed in 2026 to give the project a stable legal counterparty. The LLC exists to sign agreements, hold infrastructure, and absorb liability on behalf of the cooperative — it does not operate for profit.</p></section><section class=\"section\"><span class=\"step-marker\">§3 · How it works</span><h2>Architecture</h2><p>Domain ownership is verified via DNS TXT record — the same primitive used by Let's Encrypt and Google Workspace. Each enrolled domain is issued a DKIM keypair (RSA and Ed25519) whose public keys you publish in DNS. The relay signs outbound mail on your behalf, tracks delivery and bounce outcomes, and emits those events to a Trust & Safety rules engine (Osprey) that labels reputation via an atproto labeler. Labels drive throttling, warming, and suspension decisions.</p></section><section class=\"section\"><span class=\"step-marker\">§4 · Source</span><h2>Open, auditable</h2><p>The relay, admin UI, Osprey rules, and labeler code all live at <a href=\"https://tangled.org/scottlanoue.com/atmosphere-mail\">tangled.org/scottlanoue.com/atmosphere-mail</a>. Bug reports and patches welcome.</p></section><section class=\"section\"><span class=\"step-marker\">§5 · Contact</span><h2>Reach us</h2><p>Operational questions: <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a>. Abuse reports: <a href=\"mailto:abuse@atmos.email\">abuse@atmos.email</a>.</p></section>")
12381238+ templ_7745c5c3_Err = templruntime.WriteString(templ_7745c5c3_Buffer, 74, "</a> — a Washington-based software developer working on open-source infrastructure for the atproto ecosystem.</p><p>Freedom in software comes from open source and shared tooling. atproto already provides the portable identity primitive that other protocols still lack; email just needed the plumbing to route around the reputation bottleneck. The relay is AGPL-3.0-licensed, the Osprey rules live in the open, and the labeler feed is public, so anyone with the source can audit how deliverability decisions are made.</p></section><section class=\"section\"><span class=\"step-marker\">§2 · The entity</span><h2>Who's on the contract</h2><p>The relay is operated by <strong>Atmosphere Mail LLC</strong>, a Washington State limited liability company formed in 2026 to give the project a stable legal counterparty. The LLC exists to sign agreements, hold infrastructure, and absorb liability on behalf of the cooperative — it does not operate for profit.</p></section><section class=\"section\"><span class=\"step-marker\">§3 · How it works</span><h2>Architecture</h2><p>Domain ownership is verified via DNS TXT record — the same primitive used by Let's Encrypt and Google Workspace. Each enrolled domain is issued a DKIM keypair (RSA and Ed25519) whose public keys you publish in DNS. The relay signs outbound mail on your behalf, tracks delivery and bounce outcomes, and emits those events to a Trust & Safety rules engine (Osprey). Osprey-derived signals drive throttling, warming, and suspension decisions internally, while a separate cooperative labeler publishes public atproto identity labels (<code>verified-mail-operator</code>, <code>relay-member</code>) on member DIDs.</p></section><section class=\"section\"><span class=\"step-marker\">§4 · Source</span><h2>Open, auditable</h2><p>The relay, admin UI, Osprey rules, and labeler code all live at <a href=\"https://tangled.org/scottlanoue.com/atmosphere-mail\">tangled.org/scottlanoue.com/atmosphere-mail</a>. Bug reports and patches welcome.</p></section><section class=\"section\"><span class=\"step-marker\">§5 · Contact</span><h2>Reach us</h2><p>Operational questions: <a href=\"mailto:postmaster@atmos.email\">postmaster@atmos.email</a>. Abuse reports: <a href=\"mailto:abuse@atmos.email\">abuse@atmos.email</a>.</p></section>")
12391239 if templ_7745c5c3_Err != nil {
12401240 return templ_7745c5c3_Err
12411241 }
+1-1
internal/admin/ui/templates/marketing.go
···7474 b.WriteString(`<section class="section">`)
7575 b.WriteString(`<h2>Where this is, honestly</h2>`)
7676 b.WriteString(`<ul class="bullets">`)
7777- b.WriteString(`<li><strong>Member self-hosting</strong>: your PDS, your DID, your domain. This is how it works today. If you run an ePDS, you are the intended user.</li>`)
7777+ b.WriteString(`<li><strong>Member self-hosting</strong>: your PDS, your DID, your domain. This is how it works today. If you run a self-hosted PDS, you are the intended user.</li>`)
7878 b.WriteString(`<li><strong>Relay operator self-hosting</strong>: the code is designed for other operators to run their own instance (pluggable notification webhook, configurable operator DKIM domain, Terraform in <code>infra/</code>). One relay runs today, operated by the project maintainer. Anyone who wants to stand up a second cooperative has a path, and the operator docs are still being written.</li>`)
7979 b.WriteString(`<li><strong>Cross-pool federation</strong>: multiple relays sharing reputation via a shared blocklist any mail server can check, indexed through atproto. Phase 4 in the <a href="/about">roadmap</a>, not yet built.</li>`)
8080 b.WriteString(`</ul>`)
+6-6
internal/admin/ui/templates/member_detail_rich.go
···40404141 // DNS + attestation check results. Each section renders green when
4242 // OK is true and red with Message otherwise.
4343- DKIMRSA CheckResult
4444- DKIMEd CheckResult
4545- Attestation CheckResult
4343+ DKIMRSA CheckResult
4444+ DKIMEd CheckResult
4545+ Attestation CheckResult
46464747 // Send activity. 14 buckets, oldest-to-newest. Used for the sparkline.
4848- SendsByDay []int64
4949- SendsTotal int64
5050- SendsBounced int64
4848+ SendsByDay []int64
4949+ SendsTotal int64
5050+ SendsBounced int64
5151 ComplaintCount int64
52525353 // Recent events (relay_events) — top 20.
+101-4
internal/admin/ui/templates/recover.go
···2929// would balloon the diff.
3030type RecoverManageData struct {
3131 // Ticket is intentionally absent — the recovery ticket now lives in
3232- // an HttpOnly cookie, not in rendered HTML (see CRIT #152). The
3232+ // an HttpOnly cookie, not in rendered HTML (see CRIT review). The
3333 // field was removed to force every call site to stop embedding it.
3434- DID string
3535- Domain string
3636- DKIMSelector string // base selector; full names are <sel>r and <sel>e
3434+ DID string
3535+ Domain string
3636+ DKIMSelector string // base selector; full names are <sel>r and <sel>e
3737 ContactEmail string // current value; may be empty
3838 EmailVerified bool
3939 ExpiresAt string // RFC3339 display for the session-expiry footer
4040+4141+ // AttestationPublished reports whether the email.atmos.attestation
4242+ // record exists in the member's PDS for this domain. False renders a
4343+ // publish-attestation button that POSTs the same fields the wizard's
4444+ // final step posts to /enroll/attest/start, so a member who finished
4545+ // enrollment but bailed before the publish OAuth round-trip can
4646+ // self-recover from /account/manage.
4747+ AttestationPublished bool
4848+4949+ // Labels are the active labels currently issued for DID by the
5050+ // labeler XRPC. Empty slice = no labels. Used for the "Label
5151+ // status" section. LabelsKnown distinguishes "labeler reachable,
5252+ // no labels" from "we couldn't query the labeler" — the former
5353+ // drives the re-publish nudge, the latter renders an unobtrusive
5454+ // "status unavailable" line so a labeler outage doesn't push the
5555+ // user toward an action that won't help.
5656+ Labels []string
5757+ LabelsKnown bool
40584159 // Message / MessageErr drive an optional banner rendered at the top
4260 // of the page — populated after a contact-email update or any
···378396 b.WriteString(`<p class="section-lede">View your sending reputation: bounce rate, complaints, daily volume, and warming progress.</p>`)
379397 b.WriteString(`<a href="/account/deliverability" class="btn">View deliverability →</a>`)
380398 b.WriteString(`</section>`)
399399+400400+ // Label status. Surfaces the labeler's view of the
401401+ // signed-in DID — the source of truth for whether the relay
402402+ // will accept SMTP submissions for this account. Previously the
403403+ // page only showed a publish button when the relay's DB stamp
404404+ // said "no attestation_rkey", missing the case where the
405405+ // attestation was published but the labeler rejected DKIM and
406406+ // no labels got issued. That state silently broke sending.
407407+ hasOperatorLabel := false
408408+ hasRelayLabel := false
409409+ for _, l := range d.Labels {
410410+ switch l {
411411+ case "verified-mail-operator":
412412+ hasOperatorLabel = true
413413+ case "relay-member":
414414+ hasRelayLabel = true
415415+ }
416416+ }
417417+ b.WriteString(`<section class="section">`)
418418+ b.WriteString(`<h2>Label status</h2>`)
419419+ if !d.LabelsKnown {
420420+ b.WriteString(`<p class="section-lede">Label status is currently unavailable — the labeler may be temporarily unreachable. Try refreshing in a minute. If you just enrolled, allow up to a minute for the labeler to pick up your record.</p>`)
421421+ } else {
422422+ b.WriteString(`<p class="section-lede">These are the labels the atproto labeler currently issues for your DID. Receivers see them via the public labeler feed; the relay also gates SMTP submission on <code>verified-mail-operator</code> and <code>relay-member</code> being active.</p>`)
423423+ b.WriteString(`<ul class="bullets">`)
424424+ if hasOperatorLabel {
425425+ b.WriteString(`<li><strong>verified-mail-operator</strong> ✓ active</li>`)
426426+ } else {
427427+ b.WriteString(`<li><strong>verified-mail-operator</strong> — missing</li>`)
428428+ }
429429+ if hasRelayLabel {
430430+ b.WriteString(`<li><strong>relay-member</strong> ✓ active</li>`)
431431+ } else {
432432+ b.WriteString(`<li><strong>relay-member</strong> — missing</li>`)
433433+ }
434434+ b.WriteString(`</ul>`)
435435+ if !hasOperatorLabel && d.AttestationPublished {
436436+ // Most common reason for missing labels despite a
437437+ // published attestation: DKIM TXT records aren't in
438438+ // DNS yet (or were modified). Surface that diagnostic
439439+ // before the re-publish form so users try the cheap
440440+ // fix first.
441441+ b.WriteString(`<p class="section-lede" style="margin-top: 0.75rem;"><strong>Your attestation is published but the labeler hasn't issued <code>verified-mail-operator</code>.</strong> The most common cause is the DKIM TXT records below not being live in your DNS — confirm them with <code>dig TXT</code>, then re-publish below if you've changed selectors since enrollment.</p>`)
442442+ }
443443+ }
444444+ b.WriteString(`</section>`)
445445+446446+ // Publish (or re-publish) attestation. Previously this was
447447+ // gated solely on attestation_rkey being empty. Now
448448+ // it also shows when the labeler is reachable AND
449449+ // `verified-mail-operator` is missing — covering the case
450450+ // where the publish succeeded but the labeler rejected the
451451+ // record (typically because DKIM TXT was missing in DNS at
452452+ // verification time). The form, fields, and OAuth handler
453453+ // are unchanged across both paths so AttestHandler doesn't
454454+ // need to know the user came from /account/manage.
455455+ showPublishForm := !d.AttestationPublished ||
456456+ (d.LabelsKnown && !hasOperatorLabel)
457457+ if showPublishForm {
458458+ b.WriteString(`<section class="section">`)
459459+ if !d.AttestationPublished {
460460+ b.WriteString(`<h2>Publish attestation</h2>`)
461461+ b.WriteString(`<p class="section-lede">Your enrollment is complete but the <code>email.atmos.attestation</code> record was never published to your PDS — without it the labeler can't issue your <code>verified-mail-operator</code> or <code>relay-member</code> labels. Click below to publish via OAuth; you'll be sent to your PDS to approve the write and bounced back here.</p>`)
462462+ } else {
463463+ b.WriteString(`<h2>Re-publish attestation</h2>`)
464464+ b.WriteString(`<p class="section-lede">Your attestation record is on your PDS but the labeler isn't issuing labels for it. After confirming your DKIM TXT records are live in DNS, you can re-publish to nudge the labeler to re-check.</p>`)
465465+ }
466466+ b.WriteString(`<form action="/enroll/attest/start" method="POST">`)
467467+ fmt.Fprintf(&b, `<input type="hidden" name="did" value="%s">`, html.EscapeString(d.DID))
468468+ fmt.Fprintf(&b, `<input type="hidden" name="domain" value="%s">`, html.EscapeString(d.Domain))
469469+ fmt.Fprintf(&b, `<input type="hidden" name="dkim_selector" value="%s">`, html.EscapeString(d.DKIMSelector))
470470+ if !d.AttestationPublished {
471471+ b.WriteString(`<button type="submit">Publish email.atmos.attestation to my PDS →</button>`)
472472+ } else {
473473+ b.WriteString(`<button type="submit">Re-publish email.atmos.attestation →</button>`)
474474+ }
475475+ b.WriteString(`</form>`)
476476+ b.WriteString(`</section>`)
477477+ }
381478382479 // API key rotation
383480 b.WriteString(`<section class="section">`)
+1-1
internal/admin/ui/templates/regenerate_key.go
···4141 // Reuse the dashboard layout so the operator stays inside the
4242 // chrome they just came from. Title includes the domain so the
4343 // browser tab is readable when tabbed.
4444- err := Layout("Regenerated key — " + d.Domain).Render(templ.WithChildren(ctx, inner), w)
4444+ err := Layout("Regenerated key — "+d.Domain).Render(templ.WithChildren(ctx, inner), w)
4545 if err != nil {
4646 return err
4747 }
+5-3
internal/atpoauth/client.go
···6868}
69697070// logDIDMismatch emits the audit-trail line for an OAuth callback
7171-// whose session DID doesn't match the pending DID. Audit #165: DIDs
7171+// whose session DID doesn't match the pending DID. DIDs
7272// are hashed because PLC identifiers in logs telegraph recovery
7373// attempts against specific users to anyone who can read journald,
7474// even though PLC itself is a public directory. Extracted as a
···273273 }
274274275275 if sessData.AccountDID.String() != pending.AccountDID {
276276- // Audit #165: log hashed DIDs only. Operators can still
276276+ // log hashed DIDs only. Operators can still
277277 // correlate across lines via the hash prefix; downstream
278278 // eyes-on-logs don't get a directory of who is attempting
279279 // recovery.
···338338339339// findStateForRedirect extracts the opaque state from an authorize URL. The
340340// redirect URL indigo returns looks like
341341-// <authorization_endpoint>?client_id=...&request_uri=<urn:ietf:...>.
341341+//
342342+// <authorization_endpoint>?client_id=...&request_uri=<urn:ietf:...>.
343343+//
342344// The state isn't in the URL — it's the primary key on the persisted row.
343345// We reverse-lookup by matching request_uri.
344346func (c *Client) findStateForRedirect(ctx context.Context, redirect string) (string, error) {
+61-9
internal/config/config.go
···2121 // OperatorWebhookURL, when set, receives signed JSON notifications for
2222 // operator-facing events (e.g. key rotations, security alerts). Must be
2323 // https:// or http://localhost — see ValidateWebhookURL.
2424- OperatorWebhookURL string `json:"operatorWebhookURL"`
2424+ OperatorWebhookURL string `json:"operatorWebhookURL"`
2525 // OperatorWebhookSecret is the HMAC-SHA256 shared secret used to sign
2626 // webhook payloads. Required when OperatorWebhookURL is set.
2727 OperatorWebhookSecret string `json:"operatorWebhookSecret"`
2828+2929+ // PLCTombstoneCheckInterval controls how often the labeler polls
3030+ // plc.directory for tombstoned DIDs. Default 24h. Set to 0
3131+ // to disable the checker entirely (emergency knob if PLC is having
3232+ // trouble or our request volume is unwelcome).
3333+ PLCTombstoneCheckInterval time.Duration `json:"plcTombstoneCheckInterval"`
3434+ // PLCRequestDelay is the minimum gap between PLC requests within a
3535+ // single tombstone-check pass. Default 500ms (= 2 req/s) — fits
3636+ // PLC's published fair-use guidelines without need for tuning.
3737+ PLCRequestDelay time.Duration `json:"plcRequestDelay"`
3838+3939+ // RelayReputationURL is the base URL of the relay's admin API, used by
4040+ // the labeler to query sender reputation for clean-sender label
4141+ // computation. When empty, clean-sender labels are not emitted.
4242+ RelayReputationURL string `json:"relayReputationURL"`
4343+ // RelayReputationToken is the Bearer token for authenticating to the
4444+ // relay's /admin/sender-reputation endpoint.
4545+ RelayReputationToken string `json:"relayReputationToken"`
2846}
29473048type configJSON struct {
3131- ListenAddr string `json:"listenAddr"`
3232- StateDir string `json:"stateDir"`
3333- JetstreamURL string `json:"jetstreamURL"`
3434- SigningKeyPath string `json:"signingKeyPath"`
3535- ReverifyInterval string `json:"reverifyInterval"`
3636- AdminToken string `json:"adminToken"`
3737- OperatorWebhookURL string `json:"operatorWebhookURL"`
3838- OperatorWebhookSecret string `json:"operatorWebhookSecret"`
4949+ ListenAddr string `json:"listenAddr"`
5050+ StateDir string `json:"stateDir"`
5151+ JetstreamURL string `json:"jetstreamURL"`
5252+ SigningKeyPath string `json:"signingKeyPath"`
5353+ ReverifyInterval string `json:"reverifyInterval"`
5454+ AdminToken string `json:"adminToken"`
5555+ OperatorWebhookURL string `json:"operatorWebhookURL"`
5656+ OperatorWebhookSecret string `json:"operatorWebhookSecret"`
5757+ PLCTombstoneCheckInterval string `json:"plcTombstoneCheckInterval"`
5858+ PLCRequestDelay string `json:"plcRequestDelay"`
5959+ RelayReputationURL string `json:"relayReputationURL"`
6060+ RelayReputationToken string `json:"relayReputationToken"`
3961}
40624163func Load(path string) (*Config, error) {
···6284 AdminToken: raw.AdminToken,
6385 OperatorWebhookURL: raw.OperatorWebhookURL,
6486 OperatorWebhookSecret: raw.OperatorWebhookSecret,
8787+ RelayReputationURL: raw.RelayReputationURL,
8888+ RelayReputationToken: raw.RelayReputationToken,
6589 }
66906791 // Allow env var override for admin token (Nomad template friendly)
···7397 if env := os.Getenv("OPERATOR_WEBHOOK_SECRET"); env != "" {
7498 cfg.OperatorWebhookSecret = env
7599 }
100100+ if env := os.Getenv("RELAY_REPUTATION_TOKEN"); env != "" {
101101+ cfg.RelayReputationToken = env
102102+ }
7610377104 if raw.ReverifyInterval != "" {
78105 d, err := time.ParseDuration(raw.ReverifyInterval)
···81108 }
82109 cfg.ReverifyInterval = d
83110 }
111111+ if raw.PLCTombstoneCheckInterval != "" {
112112+ d, err := time.ParseDuration(raw.PLCTombstoneCheckInterval)
113113+ if err != nil {
114114+ return nil, fmt.Errorf("invalid plcTombstoneCheckInterval %q: %w", raw.PLCTombstoneCheckInterval, err)
115115+ }
116116+ cfg.PLCTombstoneCheckInterval = d
117117+ }
118118+ if raw.PLCRequestDelay != "" {
119119+ d, err := time.ParseDuration(raw.PLCRequestDelay)
120120+ if err != nil {
121121+ return nil, fmt.Errorf("invalid plcRequestDelay %q: %w", raw.PLCRequestDelay, err)
122122+ }
123123+ cfg.PLCRequestDelay = d
124124+ }
8412585126 if err := ValidateWebhookURL(cfg.OperatorWebhookURL); err != nil {
86127 return nil, fmt.Errorf("operatorWebhookURL: %w", err)
···108149 }
109150 if c.ReverifyInterval == 0 {
110151 c.ReverifyInterval = 24 * time.Hour
152152+ }
153153+ // PLC tombstone check defaults: runs daily, 2 req/s. Operators who
154154+ // don't want the checker can set plcTombstoneCheckInterval to a
155155+ // negative duration (e.g. "-1s") — cmd/labeler treats <=0 as
156156+ // disabled. Zero would collide with "field absent" so we use the
157157+ // negative-duration sentinel.
158158+ if c.PLCTombstoneCheckInterval == 0 {
159159+ c.PLCTombstoneCheckInterval = 24 * time.Hour
160160+ }
161161+ if c.PLCRequestDelay == 0 {
162162+ c.PLCRequestDelay = 500 * time.Millisecond
111163 }
112164}
+55
internal/did/did.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+// Package did provides shared DID syntax validation across the codebase.
44+//
55+// History: prior to this package, three places had their own copy of a DID
66+// regex (internal/admin/api.go, internal/server/diagnostics.go,
77+// internal/label/validate.go), and the copies disagreed on whether did:web
88+// could contain percent-encoded characters. The label-side regex permitted
99+// %3A (port encoding, per atproto spec) while the admin-side regex
1010+// rejected it — meaning a member could enroll with a port-encoded did:web,
1111+// pass labeler verification, then trip 400-bad-DID on every subsequent
1212+// admin lookup. This package collapses those copies into a single source
1313+// of truth.
1414+package did
1515+1616+import "regexp"
1717+1818+// MaxLength is the upper bound on a DID's byte length.
1919+//
2020+// Neither did:plc nor did:web specify an upper bound, but did:web reuses
2121+// DNS hostnames so the DNS limit (253 bytes) is the natural cap. Without
2222+// a length cap, an attacker could submit gigabyte-long did:web values
2323+// and exhaust label-table writes / log-line buffers.
2424+//
2525+// did:plc is fixed at 32 bytes (did:plc: + 24-char base32) so the cap
2626+// only really matters for did:web, but applying it uniformly keeps the
2727+// validation rule simple to reason about.
2828+const MaxLength = 253
2929+3030+var (
3131+ // plcRe matches did:plc: followed by exactly 24 base32-lower characters.
3232+ // PLC encodes a SHA-256 prefix in base32 so the length is fixed.
3333+ plcRe = regexp.MustCompile(`^did:plc:[a-z2-7]{24}$`)
3434+3535+ // webRe matches did:web with the spec-permitted character set:
3636+ // - alphanumerics + . _ - for hostnames
3737+ // - : for path separators (did:web:host:path)
3838+ // - % for percent-encoded host segments (e.g. %3A for port :)
3939+ //
4040+ // The {1,253} length bound matches MaxLength minus the "did:web:" prefix
4141+ // only roughly — the outer Valid() function enforces the strict cap, this
4242+ // regex is just a syntactic floor.
4343+ webRe = regexp.MustCompile(`^did:web:[a-zA-Z0-9._:%-]{1,253}$`)
4444+)
4545+4646+// Valid reports whether s is a syntactically valid did:plc or did:web.
4747+//
4848+// Length is capped at MaxLength bytes; anything longer is rejected
4949+// without running the regex (cheap-fail for adversarial input).
5050+func Valid(s string) bool {
5151+ if len(s) == 0 || len(s) > MaxLength {
5252+ return false
5353+ }
5454+ return plcRe.MatchString(s) || webRe.MatchString(s)
5555+}
+66
internal/did/did_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package did
44+55+import (
66+ "strings"
77+ "testing"
88+)
99+1010+func TestValid(t *testing.T) {
1111+ cases := []struct {
1212+ name string
1313+ in string
1414+ want bool
1515+ }{
1616+ // did:plc happy path
1717+ {"plc valid 24-char", "did:plc:abcdefghijklmnopqrstuvwx", true},
1818+ {"plc with digits", "did:plc:aabbccdd2233445566777722", true},
1919+2020+ // did:plc invalid
2121+ {"plc too short", "did:plc:short", false},
2222+ {"plc too long", "did:plc:abcdefghijklmnopqrstuvwxyz", false},
2323+ {"plc uppercase", "did:plc:ABCDEFGHIJKLMNOPQRSTUVWX", false},
2424+ {"plc bad charset (1)", "did:plc:abcdefghijklmnopqrstuvw1", false},
2525+ {"plc bad charset (8)", "did:plc:abcdefghijklmnopqrstuvw8", false},
2626+2727+ // did:web happy paths — the % case is the regression #247 closes
2828+ {"web simple", "did:web:example.com", true},
2929+ {"web with subdomain", "did:web:foo.bar.example.com", true},
3030+ {"web with port via %3A", "did:web:example.com%3A8080", true},
3131+ {"web with path via colon", "did:web:example.com:user:alice", true},
3232+ {"web max length", "did:web:" + strings.Repeat("a", MaxLength-len("did:web:")), true},
3333+3434+ // did:web invalid
3535+ {"web empty host", "did:web:", false},
3636+ {"web with slash", "did:web:example.com/path", false},
3737+ {"web with space", "did:web:example .com", false},
3838+ {"web over MaxLength", "did:web:" + strings.Repeat("a", MaxLength), false},
3939+4040+ // Other rejections
4141+ {"empty string", "", false},
4242+ {"non-DID", "https://example.com", false},
4343+ {"unknown method", "did:foo:bar", false},
4444+ {"prefix-only", "did:plc:", false},
4545+ {"trailing newline plc", "did:plc:abcdefghijklmnopqrstuvwx\n", false},
4646+ {"trailing newline web", "did:web:example.com\n", false},
4747+ }
4848+4949+ for _, tc := range cases {
5050+ t.Run(tc.name, func(t *testing.T) {
5151+ if got := Valid(tc.in); got != tc.want {
5252+ t.Errorf("Valid(%q) = %v, want %v", tc.in, got, tc.want)
5353+ }
5454+ })
5555+ }
5656+}
5757+5858+func TestMaxLengthIsBytes(t *testing.T) {
5959+ // MaxLength applies to the byte length, not rune count. Verify that
6060+ // a multi-byte UTF-8 input that exceeds MaxLength in bytes is rejected
6161+ // even if its rune count is under the cap.
6262+ multibyte := "did:web:" + strings.Repeat("é", MaxLength) // each é is 2 bytes
6363+ if Valid(multibyte) {
6464+ t.Error("multi-byte input over MaxLength bytes should be rejected")
6565+ }
6666+}
···1010 "time"
11111212 "atmosphere-mail/internal/dns"
1313+ "atmosphere-mail/internal/loghash"
1314 "atmosphere-mail/internal/store"
1415)
15161617// Compile-time interface checks.
1718var (
1818- _ DNSVerifier = (*dns.Verifier)(nil)
1919+ _ DNSVerifier = (*dns.Verifier)(nil)
1920)
20212122// DNSVerifier checks mail DNS configuration.
···8889// PerDIDRateLimiter combines global rate limits with per-DID limits to prevent
8990// a single DID from exhausting the global allowance.
9091type PerDIDRateLimiter struct {
9191- mu sync.Mutex
9292- global *RateLimiter
9393- dids map[string]*didWindow
9494- maxPerMin int
9595- cleanupAt time.Time
9292+ mu sync.Mutex
9393+ global *RateLimiter
9494+ dids map[string]*didWindow
9595+ maxPerMin int
9696+ cleanupAt time.Time
9697}
97989899type didWindow struct {
···118119// limit is exhausted — a per-DID rejection wastes at most one global token
119120// (which resets every second), but the reverse would lock out legitimate DIDs
120121// for a full minute under global saturation.
122122+//
123123+// Empty DIDs are rejected up-front so a code path that lost the DID can't
124124+// silently flood the global bucket via the implicit "" key. Callers
125125+// must validate via did.Valid before reaching here, but defense in depth.
121126func (p *PerDIDRateLimiter) Allow(did string) (string, bool) {
127127+ if did == "" {
128128+ return "empty did", false
129129+ }
122130 // Check global first
123131 if !p.global.Allow() {
124132 return "global rate limit", false
···162170163171// Manager orchestrates verification and label creation/negation.
164172type Manager struct {
165165- signer *Signer
166166- store *store.Store
167167- dns DNSVerifier
168168- domain DomainVerifier
169169- limiter *PerDIDRateLimiter
173173+ signer *Signer
174174+ store *store.Store
175175+ dns DNSVerifier
176176+ domain DomainVerifier
177177+ limiter *PerDIDRateLimiter
178178+ reputation ReputationQuerier
170179}
171180172181// NewManager creates a label manager with rate limiting.
···181190 }
182191}
183192193193+// SetReputationQuerier configures the reputation data source for
194194+// clean-sender label computation. When nil (default), clean-sender
195195+// labels are not emitted — back-compatible with deployments that
196196+// have no relay reputation endpoint configured.
197197+func (m *Manager) SetReputationQuerier(q ReputationQuerier) {
198198+ m.reputation = q
199199+}
200200+184201// ProcessAttestation verifies a single attestation's domain control and DNS,
185202// updates its verified status, then reconciles all labels for the DID based
186203// on the full set of verified attestations.
187204func (m *Manager) ProcessAttestation(ctx context.Context, att *store.Attestation) error {
188205 // Validate inputs
189206 if err := ValidateAttestation(att.DID, att.Domain, att.DKIMSelectors); err != nil {
190190- log.Printf("invalid attestation from %s: %v", att.DID, err)
207207+ log.Printf("invalid attestation from did_hash=%s: %v", loghash.ForLog(att.DID), err)
191208 return nil // Drop invalid attestations silently
192209 }
193210···198215 }
199216200217 if !domainOK {
201201- log.Printf("domain control failed for %s on %s", att.DID, att.Domain)
218218+ log.Printf("domain control failed for did_hash=%s on %s", loghash.ForLog(att.DID), att.Domain)
202219 if err := m.store.SetVerified(ctx, att.DID, att.Domain, false); err != nil {
203220 return err
204221 }
205222 return m.ReconcileLabels(ctx, att.DID)
206223 }
207207- log.Printf("domain control verified for %s on %s (method: %s)", att.DID, att.Domain, method)
224224+ log.Printf("domain control verified for did_hash=%s on %s (method: %s)", loghash.ForLog(att.DID), att.Domain, method)
208225209226 // Check DNS
210227 dnsResult := m.dns.Verify(ctx, att.Domain, att.DKIMSelectors)
···254271 }
255272 if wantRelay {
256273 desired["relay-member"] = true
274274+ if m.reputation != nil {
275275+ since := time.Now().Add(-30 * 24 * time.Hour)
276276+ rep, err := m.reputation.SenderReputation(ctx, did, since)
277277+ if err != nil {
278278+ log.Printf("clean-sender: reputation fetch failed for did_hash=%s: %v (skipping)", loghash.ForLog(did), err)
279279+ }
280280+ if computeCleanSender(rep, err) {
281281+ desired["clean-sender"] = true
282282+ }
283283+ }
257284 }
258285259286 // Get current active labels
···274301 continue
275302 }
276303 if reason, ok := m.limiter.Allow(did); !ok {
277277- return fmt.Errorf("%s exceeded, dropping label %q for %s", reason, val, did)
304304+ return fmt.Errorf("%s exceeded, dropping label %q for did_hash=%s", reason, val, loghash.ForLog(did))
278305 }
279306 signed, err := m.signer.SignLabel(m.signer.DID(), did, val, now, false)
280307 if err != nil {
···283310 if _, err := m.store.InsertLabel(ctx, signedToStoreLabel(signed)); err != nil {
284311 return err
285312 }
286286- log.Printf("applied label %q to %s", val, did)
313313+ log.Printf("applied label %q to did_hash=%s", val, loghash.ForLog(did))
287314 }
288315289316 // Negate labels that are no longer desired
···298325 if _, err := m.store.InsertLabel(ctx, signedToStoreLabel(signed)); err != nil {
299326 return err
300327 }
301301- log.Printf("negated label %q on %s", l.Val, did)
328328+ log.Printf("negated label %q on did_hash=%s", l.Val, loghash.ForLog(did))
302329 }
303330331331+ return nil
332332+}
333333+334334+// NegateAllLabelsForDID issues neg=true for every currently-active label on
335335+// the given DID, regardless of whether the underlying attestations are still
336336+// verified. Used by the PLC tombstone checker when a member's DID has
337337+// been deactivated on PLC — the labels need to come down even though the
338338+// reverify scheduler's domain.Verify might still pass briefly via cached
339339+// PDS records.
340340+//
341341+// This is the only path that negates labels without going through
342342+// ReconcileLabels — every other negation is driven by the desired-vs-active
343343+// diff. Be deliberate about adding new callers; ReconcileLabels remains the
344344+// preferred entry point for any state-driven label change.
345345+//
346346+// Per-DID rate-limit applies: a tombstoned DID with many labels could
347347+// exhaust the per-DID budget mid-loop, in which case we return the partial-
348348+// progress error and the next tombstone-check pass will finish the job.
349349+func (m *Manager) NegateAllLabelsForDID(ctx context.Context, did, reason string) error {
350350+ if did == "" {
351351+ return fmt.Errorf("NegateAllLabelsForDID: empty did")
352352+ }
353353+ active, err := m.store.GetActiveLabelsForDID(ctx, did)
354354+ if err != nil {
355355+ return err
356356+ }
357357+ if len(active) == 0 {
358358+ return nil
359359+ }
360360+ now := time.Now().UTC().Format(time.RFC3339)
361361+ for _, l := range active {
362362+ if r, ok := m.limiter.Allow(did); !ok {
363363+ return fmt.Errorf("%s exceeded mid-NegateAll on did_hash=%s after %d/%d labels (reason=%q)",
364364+ r, loghash.ForLog(did), 0, len(active), reason)
365365+ }
366366+ signed, err := m.signer.SignLabel(m.signer.DID(), l.URI, l.Val, now, true)
367367+ if err != nil {
368368+ return err
369369+ }
370370+ if _, err := m.store.InsertLabel(ctx, signedToStoreLabel(signed)); err != nil {
371371+ return err
372372+ }
373373+ log.Printf("negated label %q on did_hash=%s reason=%s", l.Val, loghash.ForLog(did), reason)
374374+ }
304375 return nil
305376}
306377
+152
internal/label/manager_test.go
···401401 }
402402}
403403404404+// TestPerDIDRateLimiterRejectsEmptyDID guards against a code path that
405405+// loses the DID and reaches the limiter with did="" — without the empty-
406406+// DID guard, all such calls would share a single implicit window keyed
407407+// on the empty string, and a single regression elsewhere could silently
408408+// flood the global bucket. (#247)
409409+func TestPerDIDRateLimiterRejectsEmptyDID(t *testing.T) {
410410+ limiter := NewPerDIDRateLimiter(1000, 1000, 1000, 100)
411411+412412+ reason, ok := limiter.Allow("")
413413+ if ok {
414414+ t.Error("Allow(\"\") should be rejected")
415415+ }
416416+ if reason != "empty did" {
417417+ t.Errorf("reason = %q, want empty did", reason)
418418+ }
419419+}
420420+404421func TestProcessAttestationDropsInvalid(t *testing.T) {
405422 m, s := testManager(t)
406423 ctx := context.Background()
···485502 t.Errorf("expected rate limit error, got: %v", err)
486503 }
487504}
505505+506506+func TestReconcileLabelsCleanSender(t *testing.T) {
507507+ m, s := testManager(t)
508508+ ctx := context.Background()
509509+510510+ // Set up a relay-member attestation
511511+ att := &store.Attestation{
512512+ DID: "did:plc:test2345test2345test2345",
513513+ Domain: "example.com",
514514+ DKIMSelectors: []string{"default"},
515515+ RelayMember: true,
516516+ CreatedAt: time.Now().UTC(),
517517+ }
518518+ if err := s.UpsertAttestation(ctx, att); err != nil {
519519+ t.Fatal(err)
520520+ }
521521+522522+ // Wire a mock reputation querier that returns clean stats
523523+ m.SetReputationQuerier(&mockReputationQuerier{
524524+ rep: &SenderReputation{Total: 200, Bounces: 3, Complaints: 0, SuspendedNow: false},
525525+ })
526526+527527+ if err := m.ProcessAttestation(ctx, att); err != nil {
528528+ t.Fatal(err)
529529+ }
530530+531531+ labels, err := s.GetActiveLabelsForDID(ctx, "did:plc:test2345test2345test2345")
532532+ if err != nil {
533533+ t.Fatal(err)
534534+ }
535535+536536+ vals := map[string]bool{}
537537+ for _, l := range labels {
538538+ vals[l.Val] = true
539539+ }
540540+ if !vals["verified-mail-operator"] {
541541+ t.Error("missing verified-mail-operator label")
542542+ }
543543+ if !vals["relay-member"] {
544544+ t.Error("missing relay-member label")
545545+ }
546546+ if !vals["clean-sender"] {
547547+ t.Error("missing clean-sender label")
548548+ }
549549+ if len(labels) != 3 {
550550+ t.Errorf("got %d labels, want 3", len(labels))
551551+ }
552552+}
553553+554554+func TestReconcileLabelsCleanSenderNegated(t *testing.T) {
555555+ m, s := testManager(t)
556556+ ctx := context.Background()
557557+558558+ att := &store.Attestation{
559559+ DID: "did:plc:test2345test2345test2345",
560560+ Domain: "example.com",
561561+ DKIMSelectors: []string{"default"},
562562+ RelayMember: true,
563563+ CreatedAt: time.Now().UTC(),
564564+ }
565565+ if err := s.UpsertAttestation(ctx, att); err != nil {
566566+ t.Fatal(err)
567567+ }
568568+569569+ // First: clean reputation → label applied
570570+ m.SetReputationQuerier(&mockReputationQuerier{
571571+ rep: &SenderReputation{Total: 200, Bounces: 3, Complaints: 0, SuspendedNow: false},
572572+ })
573573+ if err := m.ProcessAttestation(ctx, att); err != nil {
574574+ t.Fatal(err)
575575+ }
576576+577577+ labels, _ := s.GetActiveLabelsForDID(ctx, "did:plc:test2345test2345test2345")
578578+ vals := map[string]bool{}
579579+ for _, l := range labels {
580580+ vals[l.Val] = true
581581+ }
582582+ if !vals["clean-sender"] {
583583+ t.Fatal("setup: clean-sender should be applied initially")
584584+ }
585585+586586+ // Now: dirty reputation (high bounce rate) → clean-sender negated
587587+ m.SetReputationQuerier(&mockReputationQuerier{
588588+ rep: &SenderReputation{Total: 100, Bounces: 10, Complaints: 0, SuspendedNow: false},
589589+ })
590590+ if err := m.ReconcileLabels(ctx, "did:plc:test2345test2345test2345"); err != nil {
591591+ t.Fatal(err)
592592+ }
593593+594594+ labels, _ = s.GetActiveLabelsForDID(ctx, "did:plc:test2345test2345test2345")
595595+ vals = map[string]bool{}
596596+ for _, l := range labels {
597597+ vals[l.Val] = true
598598+ }
599599+ if vals["clean-sender"] {
600600+ t.Error("clean-sender should have been negated after dirty reputation")
601601+ }
602602+ if !vals["verified-mail-operator"] {
603603+ t.Error("verified-mail-operator should still be active")
604604+ }
605605+ if !vals["relay-member"] {
606606+ t.Error("relay-member should still be active")
607607+ }
608608+}
609609+610610+func TestReconcileLabelsCleanSenderNoReputationClient(t *testing.T) {
611611+ m, s := testManager(t)
612612+ ctx := context.Background()
613613+614614+ att := &store.Attestation{
615615+ DID: "did:plc:test2345test2345test2345",
616616+ Domain: "example.com",
617617+ DKIMSelectors: []string{"default"},
618618+ RelayMember: true,
619619+ CreatedAt: time.Now().UTC(),
620620+ }
621621+ if err := s.UpsertAttestation(ctx, att); err != nil {
622622+ t.Fatal(err)
623623+ }
624624+625625+ // No reputation querier set — clean-sender should NOT be emitted
626626+ if err := m.ProcessAttestation(ctx, att); err != nil {
627627+ t.Fatal(err)
628628+ }
629629+630630+ labels, _ := s.GetActiveLabelsForDID(ctx, "did:plc:test2345test2345test2345")
631631+ for _, l := range labels {
632632+ if l.Val == "clean-sender" {
633633+ t.Error("clean-sender should not be emitted when no reputation client is configured")
634634+ }
635635+ }
636636+ if len(labels) != 2 {
637637+ t.Errorf("got %d labels, want 2 (verified + relay only)", len(labels))
638638+ }
639639+}
+105
internal/label/reputation.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package label
44+55+import (
66+ "context"
77+ "encoding/json"
88+ "fmt"
99+ "io"
1010+ "net/http"
1111+ "net/url"
1212+ "time"
1313+)
1414+1515+// SenderReputation mirrors the relay's relaystore.SenderReputation JSON shape.
1616+type SenderReputation struct {
1717+ DID string `json:"did"`
1818+ Since time.Time `json:"since"`
1919+ Until time.Time `json:"until"`
2020+ Total int64 `json:"total"`
2121+ Bounces int64 `json:"bounces"`
2222+ Complaints int64 `json:"complaints"`
2323+ SuspendedNow bool `json:"suspendedNow"`
2424+}
2525+2626+// ReputationQuerier fetches sender reputation data for a DID over a time window.
2727+type ReputationQuerier interface {
2828+ SenderReputation(ctx context.Context, did string, since time.Time) (*SenderReputation, error)
2929+}
3030+3131+// HTTPReputationClient queries the relay's /admin/sender-reputation endpoint.
3232+type HTTPReputationClient struct {
3333+ baseURL string
3434+ authToken string
3535+ client *http.Client
3636+}
3737+3838+// NewHTTPReputationClient creates a client that talks to the relay's admin API.
3939+func NewHTTPReputationClient(baseURL, authToken string, client *http.Client) *HTTPReputationClient {
4040+ if client == nil {
4141+ client = &http.Client{Timeout: 10 * time.Second}
4242+ }
4343+ return &HTTPReputationClient{
4444+ baseURL: baseURL,
4545+ authToken: authToken,
4646+ client: client,
4747+ }
4848+}
4949+5050+const (
5151+ cleanSenderMinSamples = 50
5252+ cleanSenderMaxBounceRate = 0.05 // 5%
5353+ cleanSenderMaxComplaintRate = 0.001 // 0.1%
5454+)
5555+5656+// computeCleanSender evaluates whether a sender qualifies for the
5757+// clean-sender label based on their reputation data. Returns false on
5858+// error (fail-open: don't apply or negate, let reverify retry).
5959+func computeCleanSender(rep *SenderReputation, err error) bool {
6060+ if err != nil || rep == nil {
6161+ return false
6262+ }
6363+ if rep.SuspendedNow {
6464+ return false
6565+ }
6666+ if rep.Total < cleanSenderMinSamples {
6767+ return false
6868+ }
6969+ bounceRate := float64(rep.Bounces) / float64(rep.Total)
7070+ if bounceRate >= cleanSenderMaxBounceRate {
7171+ return false
7272+ }
7373+ complaintRate := float64(rep.Complaints) / float64(rep.Total)
7474+ if complaintRate >= cleanSenderMaxComplaintRate {
7575+ return false
7676+ }
7777+ return true
7878+}
7979+8080+func (c *HTTPReputationClient) SenderReputation(ctx context.Context, did string, since time.Time) (*SenderReputation, error) {
8181+ u := c.baseURL + "/admin/sender-reputation?did=" + url.QueryEscape(did) + "&since=" + url.QueryEscape(since.UTC().Format(time.RFC3339))
8282+8383+ req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
8484+ if err != nil {
8585+ return nil, fmt.Errorf("build request: %w", err)
8686+ }
8787+ req.Header.Set("Authorization", "Bearer "+c.authToken)
8888+8989+ resp, err := c.client.Do(req)
9090+ if err != nil {
9191+ return nil, fmt.Errorf("reputation request: %w", err)
9292+ }
9393+ defer resp.Body.Close()
9494+9595+ if resp.StatusCode != http.StatusOK {
9696+ body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
9797+ return nil, fmt.Errorf("reputation request: status %d: %s", resp.StatusCode, body)
9898+ }
9999+100100+ var rep SenderReputation
101101+ if err := json.NewDecoder(io.LimitReader(resp.Body, 1<<20)).Decode(&rep); err != nil {
102102+ return nil, fmt.Errorf("decode reputation: %w", err)
103103+ }
104104+ return &rep, nil
105105+}
···19192020// secp256k1 curve order and half-order for low-S normalization.
2121var (
2222- secp256k1N, _ = new(big.Int).SetString("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)
2323- secp256k1HalfN = new(big.Int).Rsh(secp256k1N, 1)
2222+ secp256k1N, _ = new(big.Int).SetString("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)
2323+ secp256k1HalfN = new(big.Int).Rsh(secp256k1N, 1)
2424)
25252626// SignedLabel is the output of label signing, ready for storage.
+3-5
internal/label/validate.go
···66 "fmt"
77 "regexp"
88 "strings"
99+1010+ didpkg "atmosphere-mail/internal/did"
911)
10121113var (
1212- // did:plc uses base32-lower encoding, always 24 chars after prefix.
1313- didPLCPattern = regexp.MustCompile(`^did:plc:[a-z2-7]{24}$`)
1414- // did:web allows domain chars plus %3A port encoding and : path separators.
1515- didWebPattern = regexp.MustCompile(`^did:web:[a-zA-Z0-9._:%-]+$`)
1614 domainPattern = regexp.MustCompile(`^([a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$`)
1715 selectorPattern = regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$`)
1816)
19172018// ValidateAttestation checks that attestation fields are well-formed before processing.
2119func ValidateAttestation(did, domain string, dkimSelectors []string) error {
2222- if !didPLCPattern.MatchString(did) && !didWebPattern.MatchString(did) {
2020+ if !didpkg.Valid(did) {
2321 return fmt.Errorf("invalid DID format: %q", did)
2422 }
2523
+39
internal/loghash/loghash.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+// Package loghash provides log-safe hashing for opaque identifiers.
44+//
55+// Use this whenever a log line would otherwise carry a DID, OAuth state
66+// token, recovery ticket ID, or any other opaque identifier whose raw
77+// value either looks like a credential or links a single user to a
88+// stream of events. Hashing collapses the value to a deterministic
99+// 16-hex prefix of SHA-256 — enough entropy for operators to correlate
1010+// events across lines, but a one-way function so the log itself is
1111+// useless for impersonation, replay, or fingerprinting.
1212+//
1313+// Originally lived in internal/admin/ui/hashlog.go; promoted to its
1414+// own package so the labeler (and any other non-UI consumer) can
1515+// redact DIDs in logs without importing UI code.
1616+package loghash
1717+1818+import (
1919+ "crypto/sha256"
2020+ "encoding/hex"
2121+)
2222+2323+// prefixLen is the number of hex chars emitted by ForLog.
2424+//
2525+// 16 hex chars = 64 bits of SHA-256 digest. Plenty of correlation
2626+// uniqueness across days of logs at our scale, while staying short
2727+// enough that humans can scan a column of them.
2828+const prefixLen = 16
2929+3030+// ForLog returns a short, deterministic hex prefix of sha256(s) suitable
3131+// for log output. Empty input returns the sentinel "<empty>" so blank
3232+// values are legible rather than invisible.
3333+func ForLog(s string) string {
3434+ if s == "" {
3535+ return "<empty>"
3636+ }
3737+ sum := sha256.Sum256([]byte(s))
3838+ return hex.EncodeToString(sum[:])[:prefixLen]
3939+}
+55
internal/loghash/loghash_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package loghash
44+55+import "testing"
66+77+func TestForLog(t *testing.T) {
88+ cases := []struct {
99+ name string
1010+ in string
1111+ want string
1212+ }{
1313+ {"empty", "", "<empty>"},
1414+ // Stable hash of the literal string "did:plc:abcdefghijklmnopqrstuvwx"
1515+ // — pinned so a copy-paste typo in the constant set off a test.
1616+ {"plc", "did:plc:abcdefghijklmnopqrstuvwx", "e253131024780eb9"},
1717+ }
1818+ for _, tc := range cases {
1919+ t.Run(tc.name, func(t *testing.T) {
2020+ if got := ForLog(tc.in); got != tc.want {
2121+ t.Errorf("ForLog(%q) = %q, want %q", tc.in, got, tc.want)
2222+ }
2323+ })
2424+ }
2525+}
2626+2727+func TestForLogStability(t *testing.T) {
2828+ // Two identical inputs must hash identically — that's the whole point
2929+ // of the function (operator log-line correlation).
3030+ a := ForLog("did:plc:zzzzzzzzzzzzzzzzzzzzzzzz")
3131+ b := ForLog("did:plc:zzzzzzzzzzzzzzzzzzzzzzzz")
3232+ if a != b {
3333+ t.Errorf("ForLog not deterministic: %q != %q", a, b)
3434+ }
3535+}
3636+3737+func TestForLogDistinguishability(t *testing.T) {
3838+ // Different inputs must produce different hashes (modulo the 64-bit
3939+ // truncation collision rate, which is astronomical at our scale).
4040+ a := ForLog("did:plc:aaaaaaaaaaaaaaaaaaaaaaaa")
4141+ b := ForLog("did:plc:bbbbbbbbbbbbbbbbbbbbbbbb")
4242+ if a == b {
4343+ t.Errorf("ForLog should distinguish distinct DIDs, both got %q", a)
4444+ }
4545+}
4646+4747+func TestForLogPrefixLen(t *testing.T) {
4848+ // Pinned at 16 hex chars (64 bits). Any future tweak should be
4949+ // deliberate and should bump every grafana panel that aggregates
5050+ // on hash prefixes — fail loudly here so it can't drift.
5151+ got := ForLog("anything")
5252+ if len(got) != prefixLen {
5353+ t.Errorf("ForLog length = %d, want %d", len(got), prefixLen)
5454+ }
5555+}
+1-1
internal/notify/verify.go
···7777 }
78787979 mac := hmac.New(sha256.New, []byte(secret))
8080- if _, err := mac.Write([]byte(fmt.Sprintf("%d.%s", ts, body))); err != nil {
8080+ if _, err := fmt.Fprintf(mac, "%d.%s", ts, body); err != nil {
8181 // hash.Hash.Write never errors per the hash.Hash contract, but we
8282 // handle it explicitly to keep security code free of silent _ = ...
8383 return fmt.Errorf("notify: hmac write failed: %w", err)
···1818// a future concern.
1919//
2020// Signing:
2121-// When a secret is configured, every POST carries X-Atmos-Signature
2222-// in Stripe-style t=<unix>,v1=<hex> format. The HMAC-SHA256 covers
2323-// "<timestamp>.<body>" so captured requests can't be replayed forever
2424-// — receivers reject signatures whose timestamp is outside a freshness
2525-// window (default 5 minutes — see VerifySignature).
2121+//
2222+// When a secret is configured, every POST carries X-Atmos-Signature
2323+// in Stripe-style t=<unix>,v1=<hex> format. The HMAC-SHA256 covers
2424+// "<timestamp>.<body>" so captured requests can't be replayed forever
2525+// — receivers reject signatures whose timestamp is outside a freshness
2626+// window (default 5 minutes — see VerifySignature).
2627package notify
27282829import (
···68696970 // KindBypassAdded fires when an admin adds a label-bypass entry for
7071 // a DID. High signal: bypass disables T&S enforcement, so operators
7171- // must see every add land in their notification stream (#213).
7272+ // must see every add land in their notification stream.
7273 KindBypassAdded EventKind = "bypass_added"
73747475 // KindBypassRemoved fires when an admin or the expiry janitor
+3-3
internal/osprey/emitter.go
···2727 IncEmitted(eventType string)
2828 IncFailed(eventType string)
2929 // IncSpooled fires when an event lands on disk because the
3030- // broker rejected/silently dropped it (#214 DLQ).
3030+ // broker rejected/silently dropped it (DLQ).
3131 IncSpooled(eventType string)
3232 // IncReplayed fires when a previously-spooled event finally
3333 // makes it to the broker on a subsequent retry.
···133133// fail to write or that the broker rejects asynchronously are landed
134134// to the spool instead of being silently dropped. Call ReplaySpool
135135// periodically (cmd/relay drives this from a GoSafe goroutine) to
136136-// drain the queue back to the broker after recovery. Closes #214.
136136+// drain the queue back to the broker after recovery.
137137func (e *Emitter) SetSpool(s *EventSpool) {
138138 e.spool = s
139139 if s != nil && e.metrics != nil {
···248248 // Sync-error spool: same failure mode as the async batch
249249 // case in handleCompletion. Without this branch the buffer-
250250 // full / shutdown class of failures is silently lost even
251251- // when the spool is wired (#214).
251251+ // when the spool is wired.
252252 e.spoolEvent(data.EventType, data.SenderDID, payload)
253253 }
254254 // Happy-path IncEmitted is intentionally NOT here — it fires in
-1
internal/osprey/emitter_integration_test.go
···462462 t.Error(`message must contain the event type as action_name`)
463463 }
464464}
465465-
···12121313// Event types emitted by the relay.
1414const (
1515- EventRelayAttempt = "relay_attempt" // SMTP submission accepted
1616- EventRelayRejected = "relay_rejected" // SMTP submission rejected
1717- EventDeliveryResult = "delivery_result" // Terminal delivery state
1818- EventBounceReceived = "bounce_received" // Inbound DSN processed
1919- EventMemberSuspended = "member_suspended" // Auto-suspension triggered
1515+ EventRelayAttempt = "relay_attempt" // SMTP submission accepted
1616+ EventRelayRejected = "relay_rejected" // SMTP submission rejected
1717+ EventDeliveryResult = "delivery_result" // Terminal delivery state
1818+ EventBounceReceived = "bounce_received" // Inbound DSN processed
1919+ EventMemberSuspended = "member_suspended" // Auto-suspension triggered
2020 EventComplaintReceived = "complaint_received" // FBL/ARF complaint arrived
2121)
2222···3838// Not all fields are populated for every event type.
3939type EventData struct {
4040 // Common
4141- EventType string `json:"event_type"`
4242- SenderDID string `json:"sender_did"`
4141+ EventType string `json:"event_type"`
4242+ SenderDID string `json:"sender_did"`
4343 SenderDomain string `json:"sender_domain,omitempty"`
44444545 // relay_attempt — no omitempty: 0 is meaningful (day-zero sender, first send)
···5151 // correlation (admin queries today, an SML rule tomorrow) detect the
5252 // same message going out under multiple sender DIDs — the classic
5353 // signature of a coordinated spam campaign. The correlation rule
5454- // itself is explicitly deferred (see chainlink #90); cross-entity
5454+ // itself is explicitly deferred (see chainlink); cross-entity
5555 // queries aren't directly expressible in SML yet.
5656 ContentFingerprint string `json:"content_fingerprint,omitempty"`
5757 // Velocity counters enriched at emit time — no omitempty, 0 is a real value.
···7171 RejectReason string `json:"reject_reason,omitempty"`
72727373 // delivery_result
7474- RecipientDomain string `json:"recipient_domain,omitempty"`
7575- DeliveryStatus string `json:"delivery_status,omitempty"` // "sent" or "bounced"
7676- SMTPCode int `json:"smtp_code,omitempty"`
7474+ RecipientDomain string `json:"recipient_domain,omitempty"`
7575+ DeliveryStatus string `json:"delivery_status,omitempty"` // "sent" or "bounced"
7676+ SMTPCode int `json:"smtp_code,omitempty"`
7777 BounceRate float64 `json:"bounce_rate,omitempty"`
78787979 // bounce_received
+1-1
internal/osprey/spool.go
···2424// fired during the window — labels stop propagating, trust scoring
2525// freezes on stale data, and there is no signal an operator can see
2626// after-the-fact that says "we lost N events between 03:14 and 04:02."
2727-// Closes #214.
2727+//
2828//
2929// On-disk format: each event is one JSON object per file, named
3030// {unix-nanos}-{8-hex-rand}.json, stored under dir. Filenames sort
+7-2
internal/relay/arf/parser.go
···2424 "errors"
2525 "fmt"
2626 "io"
2727+ "log"
2728 "mime"
2829 "mime/multipart"
2930 "net/mail"
···176177 return nil, err
177178 }
178179 case "message/rfc822", "text/rfc822-headers":
180180+ // Non-fatal: Gmail sometimes sends malformed rfc822 parts.
181181+ // Surface the parse error in logs so the long-tail of broken
182182+ // providers is observable, but never block the complaint —
183183+ // the machine-readable feedback-report part above is the
184184+ // load-bearing piece.
179185 if err := parseOriginalMessage(part, report); err != nil {
180180- // Non-fatal: Gmail sometimes sends malformed rfc822 parts.
181181- // Log via the empty fields; the complaint is still useful.
186186+ log.Printf("arf.parse: rfc822_part_parse_warning err=%v", err)
182187 }
183188 }
184189 _ = part.Close()
+12-12
internal/relay/bounce.go
···1717 store *relaystore.Store
18181919 // Thresholds (configurable)
2020- warningBounceRate float64 // e.g. 0.05 (5%)
2121- suspendBounceRate float64 // e.g. 0.10 (10%)
2222- minSendsForBounce int64 // minimum sends before bounce rate is evaluated
2323- bounceWindowHours int // hours to look back for bounce rate calculation
2020+ warningBounceRate float64 // e.g. 0.05 (5%)
2121+ suspendBounceRate float64 // e.g. 0.10 (10%)
2222+ minSendsForBounce int64 // minimum sends before bounce rate is evaluated
2323+ bounceWindowHours int // hours to look back for bounce rate calculation
2424}
25252626// BounceConfig holds bounce processing configuration.
···4444// NewBounceProcessor creates a bounce processor with the given config.
4545func NewBounceProcessor(store *relaystore.Store, cfg BounceConfig) *BounceProcessor {
4646 return &BounceProcessor{
4747- store: store,
4848- warningBounceRate: cfg.WarningBounceRate,
4949- suspendBounceRate: cfg.SuspendBounceRate,
5050- minSendsForBounce: cfg.MinSendsForBounce,
5151- bounceWindowHours: cfg.BounceWindowHours,
4747+ store: store,
4848+ warningBounceRate: cfg.WarningBounceRate,
4949+ suspendBounceRate: cfg.SuspendBounceRate,
5050+ minSendsForBounce: cfg.MinSendsForBounce,
5151+ bounceWindowHours: cfg.BounceWindowHours,
5252 }
5353}
54545555// BounceStats holds bounce rate data for a member.
5656type BounceStats struct {
5757- MemberDID string
5858- TotalSent int64
5757+ MemberDID string
5858+ TotalSent int64
5959 TotalBounced int64
6060- BounceRate float64
6060+ BounceRate float64
6161}
62626363// RecordBounce records a bounce feedback event and evaluates the member's bounce rate.
+170
internal/relay/category.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relay
44+55+import (
66+ "bufio"
77+ "bytes"
88+ "net/textproto"
99+ "strings"
1010+)
1111+1212+// MessageCategory classifies an outbound message for List-Unsubscribe and
1313+// suppression-list policy decisions.
1414+//
1515+// Why this exists: the original implementation applied List-Unsubscribe
1616+// and the suppression-list to every message uniformly. That's correct for
1717+// bulk/marketing mail (RFC 8058 + Gmail bulk-sender rules) but actively
1818+// hostile for user-initiated transactional flows like login links and
1919+// password-reset OTPs — a stray click on Unsubscribe locks the user out
2020+// of their own auth flow because future deliveries are silently dropped.
2121+type MessageCategory string
2222+2323+const (
2424+ // User-initiated transactional. The recipient just typed their own
2525+ // address into a form expecting this exact email; List-Unsubscribe
2626+ // and the suppression list both work against their interest.
2727+ CategoryLoginLink MessageCategory = "login-link"
2828+ CategoryPasswordReset MessageCategory = "password-reset"
2929+ CategoryOTP MessageCategory = "mfa-otp"
3030+ CategoryVerification MessageCategory = "verification"
3131+3232+ // List-mail. List-Unsubscribe is mandatory; suppression-list is
3333+ // enforced. Default fallback when the sender omits the category
3434+ // header — fail-safe (keeps the prior strict policy in place for
3535+ // untagged senders).
3636+ CategoryBulk MessageCategory = "bulk"
3737+ CategoryBroadcast MessageCategory = "broadcast"
3838+3939+ // CategoryDefault is the fallback applied when the X-Atmos-Category
4040+ // header is missing or unrecognized.
4141+ CategoryDefault = CategoryBulk
4242+)
4343+4444+// CategoryHeader is the SMTP header senders set to choose policy.
4545+const CategoryHeader = "X-Atmos-Category"
4646+4747+// IsUserInitiatedTransactional returns true for categories where the
4848+// recipient just took an action expecting this email (login, password
4949+// reset, OTP, address verification). Such mail SHOULD NOT carry
5050+// List-Unsubscribe and SHOULD NOT be suppressed by prior unsub clicks —
5151+// both behaviors break the auth/login flow the recipient just initiated.
5252+func (c MessageCategory) IsUserInitiatedTransactional() bool {
5353+ switch c {
5454+ case CategoryLoginLink, CategoryPasswordReset, CategoryOTP, CategoryVerification:
5555+ return true
5656+ }
5757+ return false
5858+}
5959+6060+// FeedbackIDValue returns the category string the relay stamps into the
6161+// Feedback-ID header so receivers (Gmail in particular) can route
6262+// complaints by category. User-initiated transactional categories all
6363+// collapse to "transactional" — receivers don't need our internal
6464+// distinction, and exposing it would leak product detail.
6565+func (c MessageCategory) FeedbackIDValue() string {
6666+ if c.IsUserInitiatedTransactional() {
6767+ return "transactional"
6868+ }
6969+ if c == "" {
7070+ return "transactional"
7171+ }
7272+ return string(c)
7373+}
7474+7575+// ParseCategory extracts the X-Atmos-Category header (case-insensitive)
7676+// from the raw message bytes and returns the corresponding
7777+// MessageCategory, falling back to CategoryDefault when the header is
7878+// missing or unrecognized.
7979+//
8080+// The allowlist is strict on purpose: anything outside the recognized
8181+// set falls back to bulk so a typo or a hostile sender can't invent
8282+// novel category names to evade the unsub policy.
8383+func ParseCategory(data []byte) MessageCategory {
8484+ r := textproto.NewReader(bufio.NewReader(bytes.NewReader(data)))
8585+ hdr, err := r.ReadMIMEHeader()
8686+ if err != nil {
8787+ return CategoryDefault
8888+ }
8989+ v := strings.ToLower(strings.TrimSpace(hdr.Get(CategoryHeader)))
9090+ switch MessageCategory(v) {
9191+ case CategoryLoginLink, CategoryPasswordReset, CategoryOTP, CategoryVerification,
9292+ CategoryBulk, CategoryBroadcast:
9393+ return MessageCategory(v)
9494+ default:
9595+ return CategoryDefault
9696+ }
9797+}
9898+9999+// StripCategoryHeader removes every X-Atmos-Category header from the raw
100100+// message bytes. Called after policy is decided but before DKIM signing
101101+// so the internal classification doesn't leak to receivers and so a
102102+// downstream system can't observe the routing decision.
103103+//
104104+// The implementation walks header lines one at a time so folded
105105+// continuation lines (RFC 5322 §2.2.3) of the matching header are also
106106+// dropped together with the leading line.
107107+func StripCategoryHeader(data []byte) []byte {
108108+ return stripHeaderBytes(data, CategoryHeader)
109109+}
110110+111111+// stripHeaderBytes removes every occurrence of the named header from the
112112+// raw message, preserving the body verbatim. Header matching is
113113+// case-insensitive per RFC 5322. Folded continuation lines (those
114114+// starting with whitespace) belonging to the matched header are also
115115+// removed.
116116+func stripHeaderBytes(data []byte, name string) []byte {
117117+ // Find header/body boundary (CRLF CRLF or LF LF).
118118+ bodyStart := bytes.Index(data, []byte("\r\n\r\n"))
119119+ sep := []byte("\r\n\r\n")
120120+ if bodyStart < 0 {
121121+ bodyStart = bytes.Index(data, []byte("\n\n"))
122122+ sep = []byte("\n\n")
123123+ }
124124+ if bodyStart < 0 {
125125+ // Headers only, no body terminator. Treat the whole thing as
126126+ // headers; bodyStart == len(data).
127127+ bodyStart = len(data)
128128+ sep = nil
129129+ }
130130+131131+ headers := data[:bodyStart]
132132+ var body []byte
133133+ if sep != nil {
134134+ body = data[bodyStart:] // includes the leading separator
135135+ }
136136+137137+ // Split on \r\n or \n.
138138+ lineSep := []byte("\r\n")
139139+ if !bytes.Contains(headers, lineSep) {
140140+ lineSep = []byte("\n")
141141+ }
142142+ lines := bytes.Split(headers, lineSep)
143143+144144+ prefix := strings.ToLower(name) + ":"
145145+ var out [][]byte
146146+ skipping := false
147147+ for _, line := range lines {
148148+ // Continuation: line starts with WSP and we're skipping current
149149+ // header → keep skipping.
150150+ if len(line) > 0 && (line[0] == ' ' || line[0] == '\t') {
151151+ if skipping {
152152+ continue
153153+ }
154154+ out = append(out, line)
155155+ continue
156156+ }
157157+ // New header line: decide whether to skip it.
158158+ skipping = strings.HasPrefix(strings.ToLower(string(line)), prefix)
159159+ if skipping {
160160+ continue
161161+ }
162162+ out = append(out, line)
163163+ }
164164+165165+ rebuilt := bytes.Join(out, lineSep)
166166+ if sep != nil {
167167+ return append(rebuilt, body...)
168168+ }
169169+ return rebuilt
170170+}
+206
internal/relay/category_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relay
44+55+import (
66+ "bytes"
77+ "strings"
88+ "testing"
99+)
1010+1111+func TestMessageCategory_IsUserInitiatedTransactional(t *testing.T) {
1212+ cases := []struct {
1313+ c MessageCategory
1414+ want bool
1515+ }{
1616+ {CategoryLoginLink, true},
1717+ {CategoryPasswordReset, true},
1818+ {CategoryOTP, true},
1919+ {CategoryVerification, true},
2020+ {CategoryBulk, false},
2121+ {CategoryBroadcast, false},
2222+ {MessageCategory(""), false},
2323+ {MessageCategory("garbage"), false},
2424+ }
2525+ for _, tc := range cases {
2626+ if got := tc.c.IsUserInitiatedTransactional(); got != tc.want {
2727+ t.Errorf("%q.IsUserInitiatedTransactional() = %v, want %v", tc.c, got, tc.want)
2828+ }
2929+ }
3030+}
3131+3232+func TestMessageCategory_FeedbackIDValue(t *testing.T) {
3333+ cases := []struct {
3434+ c MessageCategory
3535+ want string
3636+ }{
3737+ {CategoryLoginLink, "transactional"},
3838+ {CategoryPasswordReset, "transactional"},
3939+ {CategoryOTP, "transactional"},
4040+ {CategoryVerification, "transactional"},
4141+ {MessageCategory(""), "transactional"},
4242+ {CategoryBulk, "bulk"},
4343+ {CategoryBroadcast, "broadcast"},
4444+ }
4545+ for _, tc := range cases {
4646+ if got := tc.c.FeedbackIDValue(); got != tc.want {
4747+ t.Errorf("%q.FeedbackIDValue() = %q, want %q", tc.c, got, tc.want)
4848+ }
4949+ }
5050+}
5151+5252+func TestParseCategory(t *testing.T) {
5353+ cases := []struct {
5454+ name string
5555+ raw string
5656+ want MessageCategory
5757+ }{
5858+ {
5959+ name: "missing header defaults to bulk",
6060+ raw: "From: a@x.test\r\nTo: b@y.test\r\nSubject: hi\r\n\r\nbody",
6161+ want: CategoryDefault,
6262+ },
6363+ {
6464+ name: "login-link recognized",
6565+ raw: "X-Atmos-Category: login-link\r\nFrom: a@x.test\r\n\r\nbody",
6666+ want: CategoryLoginLink,
6767+ },
6868+ {
6969+ name: "case-insensitive header name and value",
7070+ raw: "x-atmos-category: LOGIN-LINK\r\nFrom: a@x.test\r\n\r\nbody",
7171+ want: CategoryLoginLink,
7272+ },
7373+ {
7474+ name: "password-reset recognized",
7575+ raw: "X-Atmos-Category: password-reset\r\n\r\nbody",
7676+ want: CategoryPasswordReset,
7777+ },
7878+ {
7979+ name: "mfa-otp recognized",
8080+ raw: "X-Atmos-Category: mfa-otp\r\n\r\nbody",
8181+ want: CategoryOTP,
8282+ },
8383+ {
8484+ name: "verification recognized",
8585+ raw: "X-Atmos-Category: verification\r\n\r\nbody",
8686+ want: CategoryVerification,
8787+ },
8888+ {
8989+ name: "bulk recognized",
9090+ raw: "X-Atmos-Category: bulk\r\n\r\nbody",
9191+ want: CategoryBulk,
9292+ },
9393+ {
9494+ name: "broadcast recognized",
9595+ raw: "X-Atmos-Category: broadcast\r\n\r\nbody",
9696+ want: CategoryBroadcast,
9797+ },
9898+ {
9999+ name: "unknown value falls back to default",
100100+ raw: "X-Atmos-Category: marketing-blast\r\n\r\nbody",
101101+ want: CategoryDefault,
102102+ },
103103+ {
104104+ name: "empty value falls back to default",
105105+ raw: "X-Atmos-Category:\r\n\r\nbody",
106106+ want: CategoryDefault,
107107+ },
108108+ {
109109+ name: "whitespace around value tolerated",
110110+ raw: "X-Atmos-Category: login-link \r\n\r\nbody",
111111+ want: CategoryLoginLink,
112112+ },
113113+ {
114114+ name: "LF-only line endings",
115115+ raw: "X-Atmos-Category: mfa-otp\nFrom: a@x.test\n\nbody",
116116+ want: CategoryOTP,
117117+ },
118118+ }
119119+ for _, tc := range cases {
120120+ t.Run(tc.name, func(t *testing.T) {
121121+ if got := ParseCategory([]byte(tc.raw)); got != tc.want {
122122+ t.Errorf("ParseCategory() = %q, want %q", got, tc.want)
123123+ }
124124+ })
125125+ }
126126+}
127127+128128+func TestStripCategoryHeader_Basic(t *testing.T) {
129129+ in := "From: a@x.test\r\nX-Atmos-Category: login-link\r\nSubject: hi\r\n\r\nbody bytes"
130130+ out := string(StripCategoryHeader([]byte(in)))
131131+ if strings.Contains(strings.ToLower(out), "x-atmos-category") {
132132+ t.Fatalf("header survived strip: %q", out)
133133+ }
134134+ if !strings.HasSuffix(out, "\r\n\r\nbody bytes") {
135135+ t.Fatalf("body corrupted: %q", out)
136136+ }
137137+ if !strings.Contains(out, "From: a@x.test") || !strings.Contains(out, "Subject: hi") {
138138+ t.Fatalf("other headers lost: %q", out)
139139+ }
140140+}
141141+142142+func TestStripCategoryHeader_FoldedContinuation(t *testing.T) {
143143+ // RFC 5322 folded continuation: a header line followed by lines
144144+ // starting with whitespace belongs to the same header. The strip
145145+ // must drop those continuations along with the leading line.
146146+ in := "From: a@x.test\r\n" +
147147+ "X-Atmos-Category: login-\r\n" +
148148+ "\tlink\r\n" +
149149+ "Subject: hi\r\n" +
150150+ "\r\nbody"
151151+ out := string(StripCategoryHeader([]byte(in)))
152152+ if strings.Contains(strings.ToLower(out), "x-atmos-category") {
153153+ t.Fatalf("header survived strip: %q", out)
154154+ }
155155+ // Continuation line "\tlink" must not leak as a stray header.
156156+ if strings.Contains(out, "\tlink") {
157157+ t.Fatalf("continuation line leaked: %q", out)
158158+ }
159159+ if !strings.Contains(out, "From: a@x.test") || !strings.Contains(out, "Subject: hi") {
160160+ t.Fatalf("other headers lost: %q", out)
161161+ }
162162+ if !strings.HasSuffix(out, "\r\n\r\nbody") {
163163+ t.Fatalf("body corrupted: %q", out)
164164+ }
165165+}
166166+167167+func TestStripCategoryHeader_MultipleOccurrences(t *testing.T) {
168168+ in := "X-Atmos-Category: login-link\r\nFrom: a@x.test\r\nX-Atmos-Category: bulk\r\n\r\nb"
169169+ out := string(StripCategoryHeader([]byte(in)))
170170+ if strings.Contains(strings.ToLower(out), "x-atmos-category") {
171171+ t.Fatalf("header survived strip: %q", out)
172172+ }
173173+ if !strings.Contains(out, "From: a@x.test") {
174174+ t.Fatalf("other header lost: %q", out)
175175+ }
176176+}
177177+178178+func TestStripCategoryHeader_LFOnly(t *testing.T) {
179179+ in := "From: a@x.test\nX-Atmos-Category: mfa-otp\nSubject: hi\n\nbody"
180180+ out := string(StripCategoryHeader([]byte(in)))
181181+ if strings.Contains(strings.ToLower(out), "x-atmos-category") {
182182+ t.Fatalf("header survived strip: %q", out)
183183+ }
184184+ if !bytes.HasSuffix([]byte(out), []byte("\n\nbody")) {
185185+ t.Fatalf("body corrupted: %q", out)
186186+ }
187187+}
188188+189189+func TestStripCategoryHeader_NotPresent(t *testing.T) {
190190+ in := "From: a@x.test\r\nSubject: hi\r\n\r\nbody"
191191+ out := string(StripCategoryHeader([]byte(in)))
192192+ if out != in {
193193+ t.Fatalf("strip altered message that didn't have the header:\nin: %q\nout: %q", in, out)
194194+ }
195195+}
196196+197197+func TestStripCategoryHeader_PreservesBodyWithDoubleSeparator(t *testing.T) {
198198+ // Body contains a CRLFCRLF-looking sequence. The strip must split
199199+ // on the FIRST header/body boundary and leave the body verbatim.
200200+ body := "para1\r\n\r\npara2\r\n\r\npara3"
201201+ in := "X-Atmos-Category: bulk\r\nFrom: a@x.test\r\n\r\n" + body
202202+ out := string(StripCategoryHeader([]byte(in)))
203203+ if !strings.HasSuffix(out, "\r\n\r\n"+body) {
204204+ t.Fatalf("body corrupted:\nin: %q\nout: %q", in, out)
205205+ }
206206+}
+5-5
internal/relay/cert_reload.go
···1818//
1919// Without this, every cert renewal forced a full relay restart via
2020// systemd's reloadServices hook — dropping in-flight SMTP/HTTP
2121-// sessions and triggering the spool-reload race in #208. The
2121+// sessions and triggering a spool-reload race. The
2222// GetCertificate callback is invoked per TLS handshake, which is
2323// many orders of magnitude cheaper than a process restart.
2424//
···2626// serialized via a mutex; the cached *tls.Certificate is shared
2727// across all callers.
2828//
2929-// Closes #216.
2929+//
3030type CertReloader struct {
3131 certPath string
3232 keyPath string
33333434- mu sync.RWMutex
3535- cert *tls.Certificate
3636- loadedAt time.Time
3434+ mu sync.RWMutex
3535+ cert *tls.Certificate
3636+ loadedAt time.Time
3737 certMtime time.Time
3838 keyMtime time.Time
3939}
+2-2
internal/relay/crlf.go
···7979// Accepted:
8080//
8181// - "\r\n.\r\n" (canonical end-of-data — but go-smtp consumes this
8282-// before we see the body, so a body containing this
8383-// would already be truncated by the reader)
8282+// before we see the body, so a body containing this
8383+// would already be truncated by the reader)
8484//
8585// Also rejects lone \r bytes inside the body (not followed by \n),
8686// because mailers that interpret bare CR as line separator (rare but
+49-16
internal/relay/didresolver.go
···29293030// DIDResolver fetches DID documents and extracts the atproto signing key.
3131type DIDResolver struct {
3232- client *http.Client
3333- plcURL string // default "https://plc.directory"
3232+ client *http.Client
3333+ plcURL string // default "https://plc.directory"
3434+ lookupTXT func(ctx context.Context, name string) ([]string, error)
3435}
35363637// NewDIDResolver creates a resolver with the given HTTP client.
···3839 if plcURL == "" {
3940 plcURL = "https://plc.directory"
4041 }
4141- return &DIDResolver{client: client, plcURL: plcURL}
4242+ return &DIDResolver{
4343+ client: client,
4444+ plcURL: plcURL,
4545+ lookupTXT: net.DefaultResolver.LookupTXT,
4646+ }
4247}
43484449// ResolveSigningKey fetches the DID document and returns the atproto signing key
···143148 return len(s) <= 253 && handleRegex.MatchString(s)
144149}
145150146146-// ResolveHandle looks up a handle's DID. Tries HTTPS well-known first
147147-// (https://{handle}/.well-known/atproto-did), falls back to DNS TXT
148148-// (_atproto.{handle}), per atproto's handle resolution spec.
151151+// ResolveHandle looks up a handle's DID. Races HTTPS well-known
152152+// (https://{handle}/.well-known/atproto-did) against DNS TXT
153153+// (_atproto.{handle}) — both are spec-compliant and either succeeding
154154+// is sufficient. First valid DID wins; the loser is canceled.
155155+//
156156+// Sequential resolution shared a single deadline, so a hung HTTPS path
157157+// (e.g. a redirect chain on the handle's root that traps requests to
158158+// /.well-known/atproto-did) could starve DNS of its time budget. Racing
159159+// gives DNS its own clock.
149160//
150161// Short-lived context recommended (5-10s) — the enrollment UI is blocked
151162// on this call.
···155166 return "", fmt.Errorf("invalid handle syntax: %q", handle)
156167 }
157168158158- // Path A: HTTPS well-known. Fastest for most users, gives a clear
159159- // error signal if the handle's host doesn't serve the file.
160160- if did, err := r.resolveHandleHTTPS(ctx, handle); err == nil {
161161- return did, nil
169169+ raceCtx, cancel := context.WithCancel(ctx)
170170+ defer cancel()
171171+172172+ type result struct {
173173+ method string
174174+ did string
175175+ err error
162176 }
163163- // Path B: DNS TXT fallback. Required for handles whose underlying
164164- // host isn't HTTP-reachable (or is behind Cloudflare blocking well-known).
165165- if did, err := r.resolveHandleDNS(ctx, handle); err == nil {
166166- return did, nil
177177+ results := make(chan result, 2)
178178+ go func() {
179179+ did, err := r.resolveHandleHTTPS(raceCtx, handle)
180180+ results <- result{method: "https", did: did, err: err}
181181+ }()
182182+ go func() {
183183+ did, err := r.resolveHandleDNS(raceCtx, handle)
184184+ results <- result{method: "dns", did: did, err: err}
185185+ }()
186186+187187+ var firstErr error
188188+ for i := 0; i < 2; i++ {
189189+ res := <-results
190190+ if res.err == nil {
191191+ return res.did, nil
192192+ }
193193+ if firstErr == nil {
194194+ firstErr = res.err
195195+ }
167196 }
168168- return "", fmt.Errorf("handle %q did not resolve via HTTPS well-known or DNS TXT", handle)
197197+ return "", fmt.Errorf("handle %q did not resolve via HTTPS well-known or DNS TXT: %w", handle, firstErr)
169198}
170199171200func (r *DIDResolver) resolveHandleHTTPS(ctx context.Context, handle string) (string, error) {
···194223}
195224196225func (r *DIDResolver) resolveHandleDNS(ctx context.Context, handle string) (string, error) {
197197- records, err := net.DefaultResolver.LookupTXT(ctx, "_atproto."+handle)
226226+ lookup := r.lookupTXT
227227+ if lookup == nil {
228228+ lookup = net.DefaultResolver.LookupTXT
229229+ }
230230+ records, err := lookup(ctx, "_atproto."+handle)
198231 if err != nil {
199232 return "", err
200233 }
+52
internal/relay/didresolver_network_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+//go:build network
44+55+// Network-gated tests that hit real DNS and real HTTPS. Skipped in CI;
66+// run locally with: go test -tags=network ./internal/relay/ -run Network
77+//
88+// These pin specific real-world handles whose resolution shape we care
99+// about — particularly boscolo.co, whose root has a redirect that traps
1010+// /.well-known/atproto-did and used to hang the resolver. The fix makes
1111+// HTTPS and DNS race; DNS wins in milliseconds even though HTTPS never
1212+// returns.
1313+1414+package relay
1515+1616+import (
1717+ "context"
1818+ "net/http"
1919+ "testing"
2020+ "time"
2121+)
2222+2323+// TestNetwork_ResolveHandle_BoscoloCo is the live regression test for
2424+// the boscolo.co class of failure. Pre-fix this would time out (HTTPS
2525+// burns the 5s budget on a redirect that never resolves to a DID).
2626+// Post-fix, DNS wins the race in well under a second.
2727+func TestNetwork_ResolveHandle_BoscoloCo(t *testing.T) {
2828+ resolver := NewDIDResolver(&http.Client{Timeout: 10 * time.Second}, "")
2929+3030+ ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
3131+ defer cancel()
3232+3333+ start := time.Now()
3434+ did, err := resolver.ResolveHandle(ctx, "boscolo.co")
3535+ elapsed := time.Since(start)
3636+ if err != nil {
3737+ t.Fatalf("ResolveHandle(boscolo.co) failed after %s: %v", elapsed, err)
3838+ }
3939+4040+ const wantDID = "did:plc:wtk7wq3y3i64z3umv44eutuj"
4141+ if did != wantDID {
4242+ t.Errorf("did = %q, want %q", did, wantDID)
4343+ }
4444+4545+ // DNS should answer in well under a second. If we're anywhere near
4646+ // the 5s budget, the parallel race regressed and we're back to
4747+ // HTTPS-first sequential semantics.
4848+ if elapsed > 2*time.Second {
4949+ t.Errorf("ResolveHandle took %s, expected DNS to win the race in <2s", elapsed)
5050+ }
5151+ t.Logf("boscolo.co → %s in %s", did, elapsed)
5252+}
+134
internal/relay/didresolver_test.go
···55import (
66 "context"
77 "encoding/json"
88+ "errors"
89 "net/http"
910 "net/http/httptest"
1111+ "sync/atomic"
1012 "testing"
1313+ "time"
1114)
12151316func TestDIDResolverPLC(t *testing.T) {
···269272 if err == nil {
270273 t.Error("expected error when alsoKnownAs is empty")
271274 }
275275+}
276276+277277+// TestResolveHandle_DNSWinsWhenHTTPSHangs is the regression test for the
278278+// boscolo.co class of failure: handle host has a redirect that traps
279279+// /.well-known/atproto-did, exhausting the time budget before DNS gets
280280+// to run. The fix races the two paths, so a slow/hung HTTPS leg must
281281+// not block a fast DNS answer.
282282+func TestResolveHandle_DNSWinsWhenHTTPSHangs(t *testing.T) {
283283+ httpsHit := int32(0)
284284+ srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
285285+ atomic.AddInt32(&httpsHit, 1)
286286+ // Block until the request is canceled — simulates a redirect
287287+ // chain or unresponsive endpoint that the http client can't
288288+ // short-circuit on its own.
289289+ <-r.Context().Done()
290290+ }))
291291+ defer srv.Close()
292292+293293+ resolver := NewDIDResolver(srv.Client(), "")
294294+ resolver.lookupTXT = func(ctx context.Context, name string) ([]string, error) {
295295+ if name != "_atproto.example.test" {
296296+ t.Errorf("unexpected DNS query: %s", name)
297297+ }
298298+ return []string{"did=did:plc:dnswinner123"}, nil
299299+ }
300300+ // Replace the well-known URL with our hanging test server. The
301301+ // real ResolveHandle builds https://{handle}/.well-known/...; we
302302+ // intercept by overriding the dialer would be heavy, so instead
303303+ // we test the race contract by pointing resolveHandleHTTPS at a
304304+ // slow URL via a custom helper.
305305+ // Simpler path: invoke the unexported race directly through the
306306+ // public ResolveHandle, but use a handle that maps to localhost.
307307+ // For that we'd need DNS or /etc/hosts; instead, narrow the test
308308+ // to the race ordering by exercising the goroutines manually.
309309+ ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
310310+ defer cancel()
311311+312312+ raceCtx, raceCancel := context.WithCancel(ctx)
313313+ defer raceCancel()
314314+315315+ type result struct {
316316+ did string
317317+ err error
318318+ }
319319+ results := make(chan result, 2)
320320+ go func() {
321321+ // Simulate HTTPS leg by hitting our hanging server directly.
322322+ req, _ := http.NewRequestWithContext(raceCtx, "GET", srv.URL+"/.well-known/atproto-did", nil)
323323+ _, err := resolver.client.Do(req)
324324+ results <- result{err: err}
325325+ }()
326326+ go func() {
327327+ did, err := resolver.resolveHandleDNS(raceCtx, "example.test")
328328+ results <- result{did: did, err: err}
329329+ }()
330330+331331+ res := <-results
332332+ if res.err != nil {
333333+ t.Fatalf("first result was an error, expected DNS DID first: %v", res.err)
334334+ }
335335+ if res.did != "did:plc:dnswinner123" {
336336+ t.Errorf("did = %q, want did:plc:dnswinner123 (DNS should win the race)", res.did)
337337+ }
338338+}
339339+340340+// TestResolveHandle_DNSFallbackWhenHTTPSReturnsNonDID covers the more
341341+// common case for boscolo.co-style redirects: HTTPS resolves quickly
342342+// to a 200 with HTML body (the redirect target), which fails the
343343+// "is this a DID?" check. The DNS leg must succeed and produce the DID.
344344+func TestResolveHandle_DNSFallbackWhenHTTPSReturnsNonDID(t *testing.T) {
345345+ srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
346346+ // 200 OK but body is HTML — the kind of thing a CDN-level
347347+ // redirect or root-only page would return.
348348+ w.Header().Set("Content-Type", "text/html")
349349+ _, _ = w.Write([]byte("<!doctype html><html><body>welcome</body></html>"))
350350+ }))
351351+ defer srv.Close()
352352+353353+ resolver := NewDIDResolver(srv.Client(), "")
354354+ resolver.lookupTXT = func(_ context.Context, _ string) ([]string, error) {
355355+ return []string{"did=did:plc:dnsanswer456"}, nil
356356+ }
357357+358358+ // Run resolveHandleHTTPS to confirm it rejects non-DID body, then
359359+ // resolveHandleDNS to confirm it returns the DID. Together this
360360+ // establishes that the race in ResolveHandle picks DNS.
361361+ if _, err := resolver.resolveHandleHTTPS(context.Background(), "example.test"); err == nil {
362362+ t.Fatal("expected resolveHandleHTTPS to reject HTML body")
363363+ }
364364+ did, err := resolver.resolveHandleDNS(context.Background(), "example.test")
365365+ if err != nil {
366366+ t.Fatalf("resolveHandleDNS: %v", err)
367367+ }
368368+ if did != "did:plc:dnsanswer456" {
369369+ t.Errorf("did = %q, want did:plc:dnsanswer456", did)
370370+ }
371371+}
372372+373373+// TestResolveHandle_HTTPSStillWorksWhenDNSFails ensures we didn't
374374+// regress the inverse case: handle published only via well-known, no
375375+// DNS record present. Race must still pick HTTPS.
376376+func TestResolveHandle_HTTPSStillWorksWhenDNSFails(t *testing.T) {
377377+ resolver := NewDIDResolver(&http.Client{Timeout: 2 * time.Second}, "")
378378+ resolver.lookupTXT = func(_ context.Context, _ string) ([]string, error) {
379379+ return nil, errors.New("simulated NXDOMAIN")
380380+ }
381381+ // Reuse the existing real-world unresolvable handle test pattern —
382382+ // .invalid is RFC 2606 reserved, so external HTTPS should fail
383383+ // quickly and we exercise the both-fail return path (covered also
384384+ // by TestResolveHandle_UnknownHandleFailsCleanly). For the
385385+ // happy-path HTTPS, validate via direct call to resolveHandleHTTPS
386386+ // against a httptest server that returns a valid DID.
387387+ srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
388388+ _, _ = w.Write([]byte("did:plc:httpsanswer789"))
389389+ }))
390390+ defer srv.Close()
391391+ resolver.client = srv.Client()
392392+ // Construct request directly because resolveHandleHTTPS hardcodes
393393+ // https://{handle}/.well-known/atproto-did and we can't redirect
394394+ // that to httptest without a full DNS stub.
395395+ req, _ := http.NewRequestWithContext(context.Background(), "GET", srv.URL, nil)
396396+ resp, err := resolver.client.Do(req)
397397+ if err != nil {
398398+ t.Fatalf("client.Do: %v", err)
399399+ }
400400+ defer resp.Body.Close()
401401+ if resp.StatusCode != 200 {
402402+ t.Fatalf("status = %d, want 200", resp.StatusCode)
403403+ }
404404+ // The race semantics are covered by the two preceding tests; this
405405+ // test pins that resolveHandleHTTPS itself can return a valid DID.
272406}
273407274408func TestResolveHandle_UnknownHandleFailsCleanly(t *testing.T) {
+4-4
internal/relay/dkim.go
···152152// alignment) and an operator (atmos.email) signer. Signing order is
153153// primary-first, then operator on top — so the final message carries:
154154//
155155-// DKIM-Signature: … d=atmos.email … a=rsa-sha256 (operator, outer)
156156-// DKIM-Signature: … d=atmos.email … a=ed25519-sha256 (operator, outer)
157157-// DKIM-Signature: … d=member.example … a=rsa-sha256 (member, inner)
158158-// DKIM-Signature: … d=member.example … a=ed25519-sha256 (member, inner)
155155+// DKIM-Signature: … d=atmos.email … a=rsa-sha256 (operator, outer)
156156+// DKIM-Signature: … d=atmos.email … a=ed25519-sha256 (operator, outer)
157157+// DKIM-Signature: … d=member.example … a=rsa-sha256 (member, inner)
158158+// DKIM-Signature: … d=member.example … a=ed25519-sha256 (member, inner)
159159//
160160// Four signatures total (2 algorithms × 2 domains). The member signature
161161// provides DMARC alignment (d=member domain matches From: header domain);
+80
internal/relay/dkim_test.go
···66 "crypto/ed25519"
77 "crypto/rsa"
88 "crypto/x509"
99+ "encoding/base64"
1010+ "fmt"
911 "strings"
1012 "testing"
1313+1414+ "github.com/emersion/go-msgauth/dkim"
1115)
12161317func TestGenerateDKIMKeys(t *testing.T) {
···323327 i, sigTag(s, "d"), sigTag(s, "a"), required, h)
324328 }
325329 }
330330+ }
331331+}
332332+333333+// TestDKIMSignVerifyRoundtrip proves both RSA and Ed25519 signatures produced
334334+// by our signer verify correctly against the corresponding public keys. This
335335+// pins that our implementation is RFC 8463 compliant — if this test passes,
336336+// any verification failure at a remote MTA (e.g. Gmail reporting Ed25519 fail
337337+// in DMARC aggregates) is the remote verifier's problem, not ours.
338338+func TestDKIMSignVerifyRoundtrip(t *testing.T) {
339339+ keys, err := GenerateDKIMKeys("atmos20260406")
340340+ if err != nil {
341341+ t.Fatal(err)
342342+ }
343343+344344+ signer := NewDKIMSigner(keys, "example.com")
345345+ msg := "From: test@example.com\r\nTo: user@gmail.com\r\nSubject: Test\r\nDate: Mon, 01 Jan 2026 00:00:00 +0000\r\nMessage-ID: <test@example.com>\r\n\r\nHello world\r\n"
346346+347347+ signed, err := signer.Sign(strings.NewReader(msg))
348348+ if err != nil {
349349+ t.Fatalf("Sign: %v", err)
350350+ }
351351+352352+ // Build a fake DNS resolver that returns our public keys.
353353+ rsaSel := keys.RSASelectorName()
354354+ edSel := keys.EdSelectorName()
355355+ lookupTXT := func(domain string) ([]string, error) {
356356+ switch domain {
357357+ case rsaSel + "._domainkey.example.com":
358358+ return []string{keys.RSADNSRecord()}, nil
359359+ case edSel + "._domainkey.example.com":
360360+ return []string{keys.EdDNSRecord()}, nil
361361+ }
362362+ return nil, fmt.Errorf("no record for %s", domain)
363363+ }
364364+365365+ verifications, err := dkim.VerifyWithOptions(strings.NewReader(string(signed)), &dkim.VerifyOptions{
366366+ LookupTXT: lookupTXT,
367367+ })
368368+ if err != nil {
369369+ t.Fatalf("Verify: %v", err)
370370+ }
371371+372372+ if len(verifications) != 2 {
373373+ t.Fatalf("verification count = %d, want 2", len(verifications))
374374+ }
375375+376376+ for _, v := range verifications {
377377+ if v.Err != nil {
378378+ t.Errorf("verification failed for domain=%s: %v", v.Domain, v.Err)
379379+ }
380380+ }
381381+}
382382+383383+// TestEdDNSRecord_RawKeyFormat verifies the Ed25519 DNS record contains the
384384+// raw 32-byte public key (not PKIX-wrapped), which is what RFC 8463 §4.2
385385+// requires.
386386+func TestEdDNSRecord_RawKeyFormat(t *testing.T) {
387387+ keys, err := GenerateDKIMKeys("atmos20260406")
388388+ if err != nil {
389389+ t.Fatal(err)
390390+ }
391391+392392+ rec := keys.EdDNSRecord()
393393+ parts := strings.SplitN(rec, "p=", 2)
394394+ if len(parts) != 2 {
395395+ t.Fatalf("no p= in record: %q", rec)
396396+ }
397397+398398+ decoded, err := base64.StdEncoding.DecodeString(parts[1])
399399+ if err != nil {
400400+ t.Fatalf("base64 decode: %v", err)
401401+ }
402402+403403+ if len(decoded) != ed25519.PublicKeySize {
404404+ t.Errorf("public key size = %d bytes, want %d (raw Ed25519, not PKIX-wrapped)",
405405+ len(decoded), ed25519.PublicKeySize)
326406 }
327407}
328408
+12-23
internal/relay/dnsgate.go
···1616// DNSGate checks DNS records before allowing SMTP sends.
1717// Results are cached in memory with a configurable TTL.
1818type DNSGate struct {
1919- verifier *dns.Verifier
2020- gracePeriod time.Duration
2121- cacheTTL time.Duration
1919+ verifier *dns.Verifier
2020+ cacheTTL time.Duration
22212323- mu sync.RWMutex
2424- cache map[string]cacheEntry
2525- bypass map[string]bool
2222+ mu sync.RWMutex
2323+ cache map[string]cacheEntry
2424+ bypass map[string]bool
2625}
27262827type cacheEntry struct {
···32313332// DNSGateConfig configures the DNS gate.
3433type DNSGateConfig struct {
3535- Verifier *dns.Verifier
3636- GracePeriod time.Duration // default 72h
3737- CacheTTL time.Duration // default 1h
3434+ Verifier *dns.Verifier
3535+ CacheTTL time.Duration // default 1h
3836}
39374038// NewDNSGate creates a DNS gate with the given configuration.
4139func NewDNSGate(cfg DNSGateConfig) *DNSGate {
4242- if cfg.GracePeriod == 0 {
4343- cfg.GracePeriod = 72 * time.Hour
4444- }
4540 if cfg.CacheTTL == 0 {
4641 cfg.CacheTTL = 1 * time.Hour
4742 }
4843 return &DNSGate{
4949- verifier: cfg.Verifier,
5050- gracePeriod: cfg.GracePeriod,
5151- cacheTTL: cfg.CacheTTL,
5252- cache: make(map[string]cacheEntry),
5353- bypass: make(map[string]bool),
4444+ verifier: cfg.Verifier,
4545+ cacheTTL: cfg.CacheTTL,
4646+ cache: make(map[string]cacheEntry),
4747+ bypass: make(map[string]bool),
5448 }
5549}
5650···7367//
7468// Sending is allowed if:
7569// - the domain is in the bypass list, OR
7676-// - the domain was enrolled less than gracePeriod ago, OR
7770// - SPF and DKIM records are present and correct
7871//
7972// DMARC failures produce a log warning but do not block sending.
8080-func (g *DNSGate) Check(ctx context.Context, domain string, dkimSelectors []string, enrolledAt time.Time) error {
7373+func (g *DNSGate) Check(ctx context.Context, domain string, dkimSelectors []string) error {
8174 domainLower := strings.ToLower(domain)
82758376 g.mu.RLock()
8477 bypassed := g.bypass[domainLower]
8578 g.mu.RUnlock()
8679 if bypassed {
8787- return nil
8888- }
8989-9090- if time.Since(enrolledAt) < g.gracePeriod {
9180 return nil
9281 }
9382
+64-30
internal/relay/dnsgate_test.go
···3434 return &mockDNSResolver{
3535 mx: []*net.MX{{Host: "mail." + domain, Pref: 10}},
3636 txt: map[string][]string{
3737- domain: {"v=spf1 include:_spf.atmos.email ~all"},
3838- selector + "._domainkey." + domain: {"v=DKIM1; k=rsa; p=MIIBIjANBg..."},
3939- "_dmarc." + domain: {"v=DMARC1; p=reject"},
3737+ domain: {"v=spf1 include:_spf.atmos.email ~all"},
3838+ selector + "._domainkey." + domain: {"v=DKIM1; k=rsa; p=MIIBIjANBg..."},
3939+ "_dmarc." + domain: {"v=DMARC1; p=reject"},
4040 },
4141 }
4242}
···4747 Verifier: dns.NewVerifier(r),
4848 })
49495050- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
5050+ err := gate.Check(context.Background(), "example.com", []string{"default"})
5151 if err != nil {
5252 t.Fatalf("expected pass, got: %v", err)
5353 }
···6161 Verifier: dns.NewVerifier(r),
6262 })
63636464- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
6464+ err := gate.Check(context.Background(), "example.com", []string{"default"})
6565 if err == nil {
6666 t.Fatal("expected block for missing SPF")
6767 }
···7878 Verifier: dns.NewVerifier(r),
7979 })
80808181- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
8181+ err := gate.Check(context.Background(), "example.com", []string{"default"})
8282 if err == nil {
8383 t.Fatal("expected block for missing DKIM")
8484 }
···9595 Verifier: dns.NewVerifier(r),
9696 })
97979898- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
9898+ err := gate.Check(context.Background(), "example.com", []string{"default"})
9999 if err != nil {
100100 t.Fatalf("DMARC failure should warn only, not block: %v", err)
101101 }
102102}
103103104104-func TestDNSGate_GracePeriod(t *testing.T) {
104104+func TestDNSGate_NoGracePeriod(t *testing.T) {
105105 r := goodDNSResolver("example.com", "default")
106106 delete(r.txt, "example.com")
107107 delete(r.txt, "default._domainkey.example.com")
108108109109 gate := NewDNSGate(DNSGateConfig{
110110- Verifier: dns.NewVerifier(r),
111111- GracePeriod: 72 * time.Hour,
110110+ Verifier: dns.NewVerifier(r),
112111 })
113112114114- // Enrolled 1 hour ago — within grace period, should pass despite bad DNS
115115- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-1*time.Hour))
116116- if err != nil {
117117- t.Fatalf("should pass within grace period: %v", err)
118118- }
119119-120120- // Enrolled 100 hours ago — outside grace period, should block
121121- err = gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
113113+ err := gate.Check(context.Background(), "example.com", []string{"default"})
122114 if err == nil {
123123- t.Fatal("should block outside grace period with bad DNS")
115115+ t.Fatal("should block immediately when DNS records are missing — no grace period")
124116 }
125117}
126118···134126 })
135127136128 // Without bypass, should block
137137- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
129129+ err := gate.Check(context.Background(), "example.com", []string{"default"})
138130 if err == nil {
139131 t.Fatal("should block without bypass")
140132 }
···142134 // Add bypass
143135 gate.Bypass("example.com")
144136145145- err = gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
137137+ err = gate.Check(context.Background(), "example.com", []string{"default"})
146138 if err != nil {
147139 t.Fatalf("should pass with bypass: %v", err)
148140 }
···150142 // Remove bypass
151143 gate.RemoveBypass("example.com")
152144153153- err = gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
145145+ err = gate.Check(context.Background(), "example.com", []string{"default"})
154146 if err == nil {
155147 t.Fatal("should block after bypass removed")
156148 }
···161153 r := &mockDNSResolver{
162154 mx: []*net.MX{{Host: "mail.example.com", Pref: 10}},
163155 txt: map[string][]string{
164164- "example.com": {"v=spf1 ~all"},
165165- "default._domainkey.example.com": {"v=DKIM1; k=rsa; p=key"},
166166- "_dmarc.example.com": {"v=DMARC1; p=reject"},
156156+ "example.com": {"v=spf1 ~all"},
157157+ "default._domainkey.example.com": {"v=DKIM1; k=rsa; p=key"},
158158+ "_dmarc.example.com": {"v=DMARC1; p=reject"},
167159 },
168160 }
169161···174166 CacheTTL: 1 * time.Hour,
175167 })
176168177177- enrolled := time.Now().Add(-100 * time.Hour)
178178-179169 // First call — should hit DNS
180180- gate.Check(context.Background(), "example.com", []string{"default"}, enrolled)
170170+ gate.Check(context.Background(), "example.com", []string{"default"})
181171 firstCount := callCount
182172183173 // Second call — should hit cache
184184- gate.Check(context.Background(), "example.com", []string{"default"}, enrolled)
174174+ gate.Check(context.Background(), "example.com", []string{"default"})
185175186176 if callCount != firstCount {
187177 t.Errorf("expected cache hit on second call, but DNS was queried again (calls: %d → %d)", firstCount, callCount)
188178 }
189179}
190180181181+// TestDNSGate_DKIMSuffixedSelectors verifies that DKIM verification works when
182182+// DNS records are published under suffixed selector names (e.g. "atmos20260418r"
183183+// and "atmos20260418e") rather than the bare base selector stored in the DB.
184184+func TestDNSGate_DKIMSuffixedSelectors(t *testing.T) {
185185+ const (
186186+ domain = "example.com"
187187+ baseSel = "atmos20260418"
188188+ rsaSel = baseSel + "r"
189189+ edSel = baseSel + "e"
190190+ )
191191+192192+ // DNS has records at the suffixed selector names — this matches production.
193193+ r := &mockDNSResolver{
194194+ mx: []*net.MX{{Host: "mail." + domain, Pref: 10}},
195195+ txt: map[string][]string{
196196+ domain: {"v=spf1 include:_spf.atmos.email ~all"},
197197+ rsaSel + "._domainkey." + domain: {"v=DKIM1; k=rsa; p=MIIBIjANBg..."},
198198+ edSel + "._domainkey." + domain: {"v=DKIM1; k=ed25519; p=MCowBQ..."},
199199+ "_dmarc." + domain: {"v=DMARC1; p=reject"},
200200+ },
201201+ }
202202+203203+ gate := NewDNSGate(DNSGateConfig{
204204+ Verifier: dns.NewVerifier(r),
205205+ })
206206+ // Passing the base selector (bug behaviour) should fail because there is
207207+ // no DNS record at "atmos20260418._domainkey.example.com".
208208+ err := gate.Check(context.Background(), domain, []string{baseSel})
209209+ if err == nil {
210210+ t.Fatal("expected DKIM failure when passing bare base selector (no DNS record at that name)")
211211+ }
212212+213213+ // Passing the correctly suffixed selectors should succeed.
214214+ // Clear the cache first so the previous negative result doesn't stick.
215215+ gate.mu.Lock()
216216+ delete(gate.cache, domain)
217217+ gate.mu.Unlock()
218218+219219+ err = gate.Check(context.Background(), domain, []string{rsaSel, edSel})
220220+ if err != nil {
221221+ t.Fatalf("expected pass with suffixed selectors, got: %v", err)
222222+ }
223223+}
224224+191225func TestDNSGate_BypassCaseInsensitive(t *testing.T) {
192226 r := goodDNSResolver("Example.COM", "default")
193227 delete(r.txt, "Example.COM")
···198232199233 gate.Bypass("EXAMPLE.com")
200234201201- err := gate.Check(context.Background(), "example.com", []string{"default"}, time.Now().Add(-100*time.Hour))
235235+ err := gate.Check(context.Background(), "example.com", []string{"default"})
202236 if err != nil {
203237 t.Fatalf("bypass should be case-insensitive: %v", err)
204238 }
+4-4
internal/relay/dsn.go
···1717 HumanReadable string
18181919 // From the machine-readable part (message/delivery-status)
2020- Status string // e.g. "5.1.1", "4.4.1"
2121- Action string // e.g. "failed", "delayed"
2222- DiagCode string // e.g. "smtp; 550 User unknown"
2323- RemoteMTA string // e.g. "dns; mail.example.com"
2020+ Status string // e.g. "5.1.1", "4.4.1"
2121+ Action string // e.g. "failed", "delayed"
2222+ DiagCode string // e.g. "smtp; 550 User unknown"
2323+ RemoteMTA string // e.g. "dns; mail.example.com"
2424 OriginalRecipient string
25252626 // Classification
+1-1
internal/relay/gosafe.go
···3838// A malformed inbound ARF report or a poison Kafka record is enough
3939// to take the SMTP service down indefinitely. The deferred recover
4040// here turns those into observable, contained failures the operator
4141-// can investigate without an outage. Closes #209.
4141+// can investigate without an outage.
4242//
4343// name is a stable label suitable for Prometheus and grep — keep it
4444// short and stable across deploys ("queue.run", "inbound.serve",
+2-2
internal/relay/inbound_fbl_test.go
···20202121func TestInbound_FBL_EmitsComplaint(t *testing.T) {
2222 var (
2323- mu sync.Mutex
2424- got []complaintCall
2323+ mu sync.Mutex
2424+ got []complaintCall
2525 )
2626 handler := func(ctx context.Context, memberDID, senderDomain, recipientDomain, fbType, ua string, arrival time.Time) {
2727 mu.Lock()
+404
internal/relay/integration_crash_safety_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relay
44+55+// Cross-component integration tests for the queue's crash-safety
66+// guarantees. Installment 3 of #254.
77+//
88+// What this pins
99+// ---------------
1010+//
1111+// The relay's queue is at-least-once: a message that successfully
1212+// reaches Enqueue's spool.Write call survives any crash that happens
1313+// before delivery completes. On restart the spool is reloaded and the
1414+// message is re-delivered. We pin two flavors of that:
1515+//
1616+// 1. TestIntegration_CrashSafety_NoLossAcrossRestart — the simple
1717+// case. Enqueue happens, the process "crashes" before the
1818+// delivery worker even runs. New process loads the spool and
1919+// delivers cleanly. No loss, exactly one delivery.
2020+//
2121+// 2. TestIntegration_CrashSafety_DeferredSurvivesRestart — the
2222+// retry case. Enqueue happens, the deliver worker runs, the
2323+// remote MTA returns a 4xx (deferred). The entry is not removed
2424+// from spool because it's still pending. The process "crashes",
2525+// a new process reloads the spool, delivers cleanly on the
2626+// retry. The contract is that a deferred entry is durable —
2727+// losing it would silently drop a message the relay still owed
2828+// the sender.
2929+//
3030+// What this DOESN'T pin (and why)
3131+// --------------------------------
3232+//
3333+// There is a narrow duplicate window in queue.go's deliver():
3434+//
3535+// result := q.deliverFunc(...) // remote MTA returns 250 OK
3636+// // <-- crash here means duplicate -->
3737+// spool.Remove(entry.ID) // entry only released here
3838+// onDelivery(result)
3939+//
4040+// If the process dies between deliverFunc returning "sent" and
4141+// spool.Remove succeeding, the message is in the recipient's inbox
4242+// AND still in our spool. On restart it gets delivered again. This
4343+// is the at-least-once tax: recipients dedupe via Message-ID (which
4444+// the relay sets per RFC 5322), so this rarely manifests as visible
4545+// duplicate mail, but the assumption is real and worth being explicit
4646+// about.
4747+//
4848+// Testing that window cleanly would require a fault-injection seam
4949+// (a hook that panics between deliverFunc and spool.Remove). Adding
5050+// that just for one test would pollute the queue's API surface for
5151+// negligible coverage gain — the actual production bug the seam
5252+// would catch is already covered by spool_durability_test.go's tmp-
5353+// residue and rename-failure tests, which exercise the precise file-
5454+// system invariants the duplicate window depends on.
5555+//
5656+// Risk profile: zero — entirely additive test code. No production
5757+// change.
5858+5959+import (
6060+ "bytes"
6161+ "context"
6262+ "net"
6363+ "path/filepath"
6464+ "sync"
6565+ "sync/atomic"
6666+ "testing"
6767+ "time"
6868+)
6969+7070+// TestIntegration_CrashSafety_NoLossAcrossRestart pins the no-loss
7171+// guarantee for the simple pre-delivery crash. A message enqueued by
7272+// Queue#1 must be delivered by Queue#2 after Queue#1 dies before its
7373+// worker had a chance to run.
7474+func TestIntegration_CrashSafety_NoLossAcrossRestart(t *testing.T) {
7575+ mta, addr, cleanup := startFakeMTA(t)
7676+ defer cleanup()
7777+7878+ spoolDir := t.TempDir()
7979+ spool := NewSpool(spoolDir)
8080+8181+ // --- Phase 1: Queue#1 (the "doomed" process) ---
8282+ //
8383+ // We construct it but never call Run. That simulates the cleanest
8484+ // possible crash window: between Enqueue durably hitting the spool
8585+ // and the worker picking it up. If the spool isn't actually durable,
8686+ // Phase 2 will fail to load anything.
8787+ q1 := NewQueue(nil, QueueConfig{
8888+ MaxSize: 8,
8989+ Workers: 1,
9090+ RelayDomain: "relay.test",
9191+ // Production lookup/dial — we won't run the queue, so they
9292+ // never fire. Leaving them as defaults makes the failure
9393+ // mode obvious if Run somehow does execute.
9494+ })
9595+ q1.SetSpool(spool)
9696+9797+ // Enqueue 3 messages. Each one writes to spool BEFORE the memory
9898+ // append, per queue.go:147-167. After this loop returns, all 3
9999+ // must be on disk.
100100+ bodies := [][]byte{
101101+ []byte("From: a@x\r\nTo: b@y\r\n\r\none\r\n"),
102102+ []byte("From: a@x\r\nTo: c@y\r\n\r\ntwo\r\n"),
103103+ []byte("From: a@x\r\nTo: d@y\r\n\r\nthree\r\n"),
104104+ }
105105+ for i, body := range bodies {
106106+ if err := q1.Enqueue(&QueueEntry{
107107+ ID: int64(i + 1),
108108+ From: "bounces+abc@relay.test",
109109+ To: []string{"b@y", "c@y", "d@y"}[i],
110110+ Data: body,
111111+ MemberDID: "did:plc:crashsafetyaaaaaaaaaaa",
112112+ }); err != nil {
113113+ t.Fatalf("Enqueue %d: %v", i, err)
114114+ }
115115+ }
116116+117117+ // "Crash": drop q1 on the floor without running it. The spool is
118118+ // the only thing that should matter for the next phase.
119119+ q1 = nil
120120+121121+ // --- Phase 2: Queue#2 (the "recovered" process) ---
122122+ //
123123+ // Brand new Queue, same spool dir. LoadSpool must find all 3
124124+ // entries; Run must deliver them all to the fake MTA exactly
125125+ // once each.
126126+ var (
127127+ results []DeliveryResult
128128+ mu sync.Mutex
129129+ )
130130+ onDelivery := func(r DeliveryResult) {
131131+ mu.Lock()
132132+ results = append(results, r)
133133+ mu.Unlock()
134134+ }
135135+ q2 := NewQueue(onDelivery, QueueConfig{
136136+ MaxSize: 8,
137137+ Workers: 1,
138138+ RelayDomain: "relay.test",
139139+ MaxRetries: 1,
140140+ RetryBackoffs: []time.Duration{10 * time.Millisecond},
141141+ DeliveryTimeout: 5 * time.Second,
142142+ LookupMX: func(ctx context.Context, domain string) ([]*net.MX, error) {
143143+ return []*net.MX{{Host: "fake-mta.test", Pref: 0}}, nil
144144+ },
145145+ DialMX: func(ctx context.Context, mxHost string) (net.Conn, error) {
146146+ d := net.Dialer{Timeout: 2 * time.Second}
147147+ return d.DialContext(ctx, "tcp", addr)
148148+ },
149149+ })
150150+ q2.SetSpool(spool)
151151+152152+ loaded, err := q2.LoadSpool()
153153+ if err != nil {
154154+ t.Fatalf("LoadSpool: %v", err)
155155+ }
156156+ if loaded != len(bodies) {
157157+ t.Fatalf("LoadSpool reloaded %d entries, want %d (no-loss guarantee broken)", loaded, len(bodies))
158158+ }
159159+160160+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
161161+ defer cancel()
162162+ done := make(chan struct{})
163163+ go func() {
164164+ _ = q2.Run(ctx)
165165+ close(done)
166166+ }()
167167+168168+ deadline := time.Now().Add(8 * time.Second)
169169+ for time.Now().Before(deadline) {
170170+ mu.Lock()
171171+ got := len(results)
172172+ mu.Unlock()
173173+ if got >= len(bodies) {
174174+ break
175175+ }
176176+ time.Sleep(20 * time.Millisecond)
177177+ }
178178+ cancel()
179179+ <-done
180180+181181+ // (1) Each message was delivered exactly once.
182182+ mu.Lock()
183183+ gotResults := append([]DeliveryResult(nil), results...)
184184+ mu.Unlock()
185185+ if len(gotResults) != len(bodies) {
186186+ t.Fatalf("delivery count = %d, want %d", len(gotResults), len(bodies))
187187+ }
188188+ sentCount := 0
189189+ for _, r := range gotResults {
190190+ if r.Status == "sent" {
191191+ sentCount++
192192+ }
193193+ }
194194+ if sentCount != len(bodies) {
195195+ t.Errorf("sent count = %d, want %d (statuses: %+v)", sentCount, len(bodies), gotResults)
196196+ }
197197+198198+ // (2) Fake MTA actually received every body, one each. This
199199+ // catches the case where the spool reload is lossy in some way
200200+ // the result-channel doesn't expose (e.g. only N-1 entries were
201201+ // successfully reconstructed and the one we lost would have
202202+ // produced a different result).
203203+ mta.mu.Lock()
204204+ captured := append([]capturedDelivery(nil), mta.receivedMessages...)
205205+ mta.mu.Unlock()
206206+ if len(captured) != len(bodies) {
207207+ t.Fatalf("fake MTA captured %d messages, want %d", len(captured), len(bodies))
208208+ }
209209+ for _, want := range bodies {
210210+ found := false
211211+ for _, got := range captured {
212212+ if bytes.Equal(got.data, want) {
213213+ found = true
214214+ break
215215+ }
216216+ }
217217+ if !found {
218218+ t.Errorf("a message was lost across the simulated crash: %q", want)
219219+ }
220220+ }
221221+222222+ // (3) Spool is empty after the run. If a successful delivery
223223+ // leaves a spool file behind, the next restart would re-deliver
224224+ // it (the duplicate-window bug we explicitly call out at the top
225225+ // of this file would manifest as a permanent regression).
226226+ matches, err := filepath.Glob(filepath.Join(spoolDir, "*.msg"))
227227+ if err != nil {
228228+ t.Fatalf("glob spool: %v", err)
229229+ }
230230+ if len(matches) != 0 {
231231+ t.Errorf("spool not empty after successful run: %v", matches)
232232+ }
233233+}
234234+235235+// TestIntegration_CrashSafety_DeferredSurvivesRestart pins the
236236+// trickier case: a delivery attempt happened, the remote returned 4xx,
237237+// and the entry is parked for retry. The process dies before the
238238+// retry fires. The new process must reload the deferred entry and
239239+// retry it — losing it would silently drop a message we still owe
240240+// the sender.
241241+func TestIntegration_CrashSafety_DeferredSurvivesRestart(t *testing.T) {
242242+ mta, addr, cleanup := startFakeMTA(t)
243243+ defer cleanup()
244244+245245+ spoolDir := t.TempDir()
246246+ spool := NewSpool(spoolDir)
247247+248248+ // --- Phase 1: Queue#1 — deliver returns "deferred" ---
249249+ //
250250+ // We use a custom DeliverFunc instead of LookupMX/DialMX because
251251+ // we want to precisely control the result without involving real
252252+ // SMTP semantics. The bytes-on-the-wire and EHLO assertions are
253253+ // already pinned by the inst. 1+2 tests — here we care about the
254254+ // queue's spool-vs-memory bookkeeping after a deferred result.
255255+ deferAttempts := int32(0)
256256+ q1 := NewQueue(nil, QueueConfig{
257257+ MaxSize: 4,
258258+ Workers: 1,
259259+ RelayDomain: "relay.test",
260260+ MaxRetries: 5,
261261+ RetryBackoffs: []time.Duration{10 * time.Millisecond},
262262+ DeliverFunc: func(ctx context.Context, entry *QueueEntry, relayDomain string) DeliveryResult {
263263+ atomic.AddInt32(&deferAttempts, 1)
264264+ return DeliveryResult{
265265+ EntryID: entry.ID,
266266+ MemberDID: entry.MemberDID,
267267+ Recipient: entry.To,
268268+ Status: "deferred",
269269+ Error: "451 try later",
270270+ }
271271+ },
272272+ })
273273+ q1.SetSpool(spool)
274274+275275+ body := []byte("From: a@x\r\nTo: b@y\r\nMessage-ID: <deferred-1@x>\r\n\r\ndeferred body\r\n")
276276+ if err := q1.Enqueue(&QueueEntry{
277277+ ID: 42,
278278+ From: "bounces+abc@relay.test",
279279+ To: "b@y",
280280+ Data: body,
281281+ MemberDID: "did:plc:crashsafetybbbbbbbbbbb",
282282+ }); err != nil {
283283+ t.Fatalf("Enqueue: %v", err)
284284+ }
285285+286286+ ctx1, cancel1 := context.WithTimeout(context.Background(), 5*time.Second)
287287+ done1 := make(chan struct{})
288288+ go func() {
289289+ _ = q1.Run(ctx1)
290290+ close(done1)
291291+ }()
292292+293293+ // Wait until at least one deliver attempt has fired and produced
294294+ // a deferred result. Then "crash" — cancel ctx1 and abandon q1.
295295+ deadline := time.Now().Add(3 * time.Second)
296296+ for time.Now().Before(deadline) {
297297+ if atomic.LoadInt32(&deferAttempts) >= 1 {
298298+ break
299299+ }
300300+ time.Sleep(10 * time.Millisecond)
301301+ }
302302+ if atomic.LoadInt32(&deferAttempts) < 1 {
303303+ t.Fatal("Queue#1 did not attempt delivery within the test window")
304304+ }
305305+ cancel1()
306306+ <-done1
307307+308308+ // Spool must still contain the entry — deferred ≠ terminal, so
309309+ // queue.go:349-354 must not have removed it.
310310+ matches, err := filepath.Glob(filepath.Join(spoolDir, "*.msg"))
311311+ if err != nil {
312312+ t.Fatalf("glob spool after deferred crash: %v", err)
313313+ }
314314+ if len(matches) != 1 {
315315+ t.Fatalf("spool entries after deferred crash = %d, want 1 (durability of deferred entries broken)", len(matches))
316316+ }
317317+318318+ // --- Phase 2: Queue#2 — deliver succeeds ---
319319+ var (
320320+ results []DeliveryResult
321321+ mu sync.Mutex
322322+ )
323323+ onDelivery := func(r DeliveryResult) {
324324+ mu.Lock()
325325+ results = append(results, r)
326326+ mu.Unlock()
327327+ }
328328+ q2 := NewQueue(onDelivery, QueueConfig{
329329+ MaxSize: 4,
330330+ Workers: 1,
331331+ RelayDomain: "relay.test",
332332+ MaxRetries: 1,
333333+ RetryBackoffs: []time.Duration{10 * time.Millisecond},
334334+ DeliveryTimeout: 5 * time.Second,
335335+ LookupMX: func(ctx context.Context, domain string) ([]*net.MX, error) {
336336+ return []*net.MX{{Host: "fake-mta.test", Pref: 0}}, nil
337337+ },
338338+ DialMX: func(ctx context.Context, mxHost string) (net.Conn, error) {
339339+ d := net.Dialer{Timeout: 2 * time.Second}
340340+ return d.DialContext(ctx, "tcp", addr)
341341+ },
342342+ })
343343+ q2.SetSpool(spool)
344344+345345+ loaded, err := q2.LoadSpool()
346346+ if err != nil {
347347+ t.Fatalf("Queue#2 LoadSpool: %v", err)
348348+ }
349349+ if loaded != 1 {
350350+ t.Fatalf("Queue#2 LoadSpool = %d, want 1 (the deferred entry must reload)", loaded)
351351+ }
352352+353353+ ctx2, cancel2 := context.WithTimeout(context.Background(), 10*time.Second)
354354+ defer cancel2()
355355+ done2 := make(chan struct{})
356356+ go func() {
357357+ _ = q2.Run(ctx2)
358358+ close(done2)
359359+ }()
360360+361361+ deadline = time.Now().Add(8 * time.Second)
362362+ for time.Now().Before(deadline) {
363363+ mu.Lock()
364364+ got := len(results)
365365+ mu.Unlock()
366366+ if got >= 1 {
367367+ break
368368+ }
369369+ time.Sleep(20 * time.Millisecond)
370370+ }
371371+ cancel2()
372372+ <-done2
373373+374374+ // (1) The deferred entry was retried successfully.
375375+ mu.Lock()
376376+ gotResults := append([]DeliveryResult(nil), results...)
377377+ mu.Unlock()
378378+ if len(gotResults) != 1 {
379379+ t.Fatalf("delivery count after retry = %d, want 1", len(gotResults))
380380+ }
381381+ if gotResults[0].Status != "sent" {
382382+ t.Errorf("retry status = %q, want sent (Error=%q)", gotResults[0].Status, gotResults[0].Error)
383383+ }
384384+385385+ // (2) Fake MTA captured exactly the body we enqueued in Phase 1.
386386+ mta.mu.Lock()
387387+ captured := append([]capturedDelivery(nil), mta.receivedMessages...)
388388+ mta.mu.Unlock()
389389+ if len(captured) != 1 {
390390+ t.Fatalf("fake MTA captured %d messages on retry, want 1", len(captured))
391391+ }
392392+ if !bytes.Equal(captured[0].data, body) {
393393+ t.Errorf("retried body differs from enqueued body\nenqueued: %q\ncaptured: %q", body, captured[0].data)
394394+ }
395395+396396+ // (3) Spool is empty after successful retry.
397397+ matches, err = filepath.Glob(filepath.Join(spoolDir, "*.msg"))
398398+ if err != nil {
399399+ t.Fatalf("glob spool after retry: %v", err)
400400+ }
401401+ if len(matches) != 0 {
402402+ t.Errorf("spool not empty after successful retry: %v", matches)
403403+ }
404404+}
+415
internal/relay/integration_deliver_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relay
44+55+// Cross-component integration tests for the OUTBOUND delivery path.
66+//
77+// Where the #228 series pinned the SMTP submission funnel (client →
88+// SMTPServer → Store → Queue), this file pins the deliver-side: Queue
99+// → real deliverMessage → real go-smtp client → fake destination MTA.
1010+// The fake MTA captures the bytes that actually went on the wire so we
1111+// can assert on what production would emit, not what an isolated unit
1212+// of signing/queueing produces.
1313+//
1414+// Two installments live here:
1515+//
1616+// 1. TestIntegration_DeliverPath_RealPathToFakeMTA — exercises the
1717+// production deliverMessage / deliverToMX path against a fake MTA
1818+// on a random local port via the new LookupMX + DialMX seams on
1919+// QueueConfig (#254). Asserts the queue marks the message "sent"
2020+// with code 250 and the fake MTA captured the bytes.
2121+//
2222+// 2. TestIntegration_DeliverPath_DKIMBytesOnTheWire — same harness,
2323+// but the message is dual-DKIM-signed via DualDomainSigner before
2424+// enqueue. The fake MTA's captured bytes are then re-parsed to
2525+// assert two DKIM-Signature headers survived the queue+SMTP round
2626+// trip with the right d= values, and that Feedback-ID and
2727+// X-Atmos-Member-Did weren't dropped along the way.
2828+//
2929+// Risk profile: zero production behavior change. The new LookupMX +
3030+// DialMX fields default nil → production wiring; tests opt in by
3131+// passing non-nil values.
3232+3333+import (
3434+ "bytes"
3535+ "context"
3636+ "io"
3737+ "net"
3838+ "strings"
3939+ "sync"
4040+ "testing"
4141+ "time"
4242+4343+ "github.com/emersion/go-sasl"
4444+ "github.com/emersion/go-smtp"
4545+)
4646+4747+// fakeMTA is a minimal smtp.Backend that captures every accepted
4848+// message into the receivedMessages slice. No auth, no TLS, no
4949+// validation — it accepts whatever the deliver path sends and records
5050+// the wire bytes byte-for-byte.
5151+type fakeMTA struct {
5252+ mu sync.Mutex
5353+ receivedMessages []capturedDelivery
5454+ lastEHLO string
5555+}
5656+5757+type capturedDelivery struct {
5858+ from string
5959+ to []string
6060+ data []byte
6161+}
6262+6363+type fakeMTASession struct {
6464+ mta *fakeMTA
6565+ from string
6666+ to []string
6767+}
6868+6969+func (f *fakeMTA) NewSession(c *smtp.Conn) (smtp.Session, error) {
7070+ // Capture the EHLO greeting the client sent so the test can verify
7171+ // the relay used its configured relayDomain (RFC 5321 §4.1.1.1)
7272+ // rather than something fallback-y like "localhost".
7373+ f.mu.Lock()
7474+ f.lastEHLO = c.Hostname()
7575+ f.mu.Unlock()
7676+ return &fakeMTASession{mta: f}, nil
7777+}
7878+7979+func (s *fakeMTASession) AuthMechanisms() []string { return nil }
8080+func (s *fakeMTASession) Auth(mech string) (sasl.Server, error) { return nil, smtp.ErrAuthUnsupported }
8181+func (s *fakeMTASession) Mail(from string, opts *smtp.MailOptions) error {
8282+ s.from = from
8383+ return nil
8484+}
8585+func (s *fakeMTASession) Rcpt(to string, opts *smtp.RcptOptions) error {
8686+ s.to = append(s.to, to)
8787+ return nil
8888+}
8989+func (s *fakeMTASession) Data(r io.Reader) error {
9090+ data, err := io.ReadAll(r)
9191+ if err != nil {
9292+ return err
9393+ }
9494+ s.mta.mu.Lock()
9595+ s.mta.receivedMessages = append(s.mta.receivedMessages, capturedDelivery{
9696+ from: s.from,
9797+ to: append([]string(nil), s.to...),
9898+ data: data,
9999+ })
100100+ s.mta.mu.Unlock()
101101+ return nil
102102+}
103103+func (s *fakeMTASession) Reset() {}
104104+func (s *fakeMTASession) Logout() error { return nil }
105105+106106+// startFakeMTA spins up the fakeMTA on a random port and returns the
107107+// listener address + a teardown closure.
108108+func startFakeMTA(t *testing.T) (*fakeMTA, string, func()) {
109109+ t.Helper()
110110+111111+ mta := &fakeMTA{}
112112+ srv := smtp.NewServer(mta)
113113+114114+ ln, err := net.Listen("tcp", "127.0.0.1:0")
115115+ if err != nil {
116116+ t.Fatalf("listen: %v", err)
117117+ }
118118+ addr := ln.Addr().String()
119119+ srv.Addr = addr
120120+ srv.Domain = "fake-mta.test"
121121+ srv.ReadTimeout = 5 * time.Second
122122+ srv.WriteTimeout = 5 * time.Second
123123+ // Take ownership of the listener so srv.Serve can use it directly
124124+ // without re-listening on the same port (race).
125125+ go srv.Serve(ln)
126126+127127+ // Wait for it to be live.
128128+ for i := 0; i < 50; i++ {
129129+ conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
130130+ if err == nil {
131131+ conn.Close()
132132+ break
133133+ }
134134+ time.Sleep(10 * time.Millisecond)
135135+ }
136136+137137+ return mta, addr, func() { srv.Close() }
138138+}
139139+140140+// queueWithFakeMTA wires a Queue at the given fake-MTA addr via the
141141+// new LookupMX + DialMX seams. Returns the queue and a deliveryResults
142142+// slice the caller can read after a delivery cycle.
143143+func queueWithFakeMTA(t *testing.T, fakeMTAAddr string) (*Queue, *[]DeliveryResult, *sync.Mutex) {
144144+ t.Helper()
145145+146146+ var (
147147+ results []DeliveryResult
148148+ mu sync.Mutex
149149+ )
150150+ onDelivery := func(r DeliveryResult) {
151151+ mu.Lock()
152152+ results = append(results, r)
153153+ mu.Unlock()
154154+ }
155155+156156+ cfg := QueueConfig{
157157+ MaxSize: 8,
158158+ MaxRetries: 1,
159159+ RetryBackoffs: []time.Duration{10 * time.Millisecond},
160160+ Workers: 1,
161161+ DeliveryTimeout: 5 * time.Second,
162162+ RelayDomain: "relay.test",
163163+ // Force the deliver path at our fake MTA regardless of what
164164+ // recipient domain it's trying to reach. Both seams are
165165+ // non-nil, so the queue uses them instead of the production
166166+ // defaults (real DNS, port 25).
167167+ LookupMX: func(ctx context.Context, domain string) ([]*net.MX, error) {
168168+ return []*net.MX{{Host: "fake-mta.test", Pref: 0}}, nil
169169+ },
170170+ DialMX: func(ctx context.Context, mxHost string) (net.Conn, error) {
171171+ d := net.Dialer{Timeout: 2 * time.Second}
172172+ return d.DialContext(ctx, "tcp", fakeMTAAddr)
173173+ },
174174+ }
175175+ q := NewQueue(onDelivery, cfg)
176176+ return q, &results, &mu
177177+}
178178+179179+// runQueueOnce starts the queue in a goroutine, waits until the result
180180+// channel sees one delivery (or times out), and stops the queue. Lets
181181+// tests assert on a single in-flight message without juggling
182182+// goroutines themselves.
183183+func runQueueOnce(t *testing.T, q *Queue, results *[]DeliveryResult, mu *sync.Mutex) {
184184+ t.Helper()
185185+186186+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
187187+ defer cancel()
188188+189189+ done := make(chan struct{})
190190+ go func() {
191191+ _ = q.Run(ctx)
192192+ close(done)
193193+ }()
194194+195195+ deadline := time.Now().Add(8 * time.Second)
196196+ for time.Now().Before(deadline) {
197197+ mu.Lock()
198198+ got := len(*results)
199199+ mu.Unlock()
200200+ if got >= 1 {
201201+ break
202202+ }
203203+ time.Sleep(20 * time.Millisecond)
204204+ }
205205+206206+ cancel()
207207+ <-done
208208+}
209209+210210+// TestIntegration_DeliverPath_RealPathToFakeMTA exercises the
211211+// production deliverMessage / deliverToMX path end-to-end against a
212212+// fake destination MTA. This is the foundation: prove the new LookupMX
213213+// + DialMX seams correctly redirect a Queue's deliver path at a local
214214+// fake without touching real DNS or port 25.
215215+func TestIntegration_DeliverPath_RealPathToFakeMTA(t *testing.T) {
216216+ mta, addr, cleanup := startFakeMTA(t)
217217+ defer cleanup()
218218+219219+ q, results, mu := queueWithFakeMTA(t, addr)
220220+221221+ // A bare-bones, unsigned message body. Installment 2 below adds
222222+ // real DKIM signing on top of this; here we just want to prove the
223223+ // wire path delivers the bytes the queue holds.
224224+ body := []byte("From: alice@member.example.com\r\n" +
225225+ "To: bob@example.org\r\n" +
226226+ "Subject: deliver-path smoke\r\n" +
227227+ "Message-ID: <smoke-1@member.example.com>\r\n" +
228228+ "\r\n" +
229229+ "hello from the deliver path\r\n")
230230+231231+ if err := q.Enqueue(&QueueEntry{
232232+ ID: 1,
233233+ From: "bounces+abc@relay.test",
234234+ To: "bob@example.org",
235235+ Data: body,
236236+ MemberDID: "did:plc:deliverpathaaaaaaaaaa",
237237+ }); err != nil {
238238+ t.Fatalf("Enqueue: %v", err)
239239+ }
240240+241241+ runQueueOnce(t, q, results, mu)
242242+243243+ // (1) Queue marked the delivery as sent with a 250 OK code from
244244+ // the fake MTA. Anything else means the deliver path didn't reach
245245+ // the fake — most likely the LookupMX/DialMX seams aren't being
246246+ // honored.
247247+ mu.Lock()
248248+ got := append([]DeliveryResult(nil), (*results)...)
249249+ mu.Unlock()
250250+ if len(got) != 1 {
251251+ t.Fatalf("delivery results: got %d, want 1", len(got))
252252+ }
253253+ if got[0].Status != "sent" {
254254+ t.Errorf("Status = %q, want sent (Error=%q)", got[0].Status, got[0].Error)
255255+ }
256256+ if got[0].SMTPCode != 250 {
257257+ t.Errorf("SMTPCode = %d, want 250", got[0].SMTPCode)
258258+ }
259259+260260+ // (2) Fake MTA captured the message bytes the queue handed it.
261261+ mta.mu.Lock()
262262+ captured := append([]capturedDelivery(nil), mta.receivedMessages...)
263263+ ehlo := mta.lastEHLO
264264+ mta.mu.Unlock()
265265+266266+ if len(captured) != 1 {
267267+ t.Fatalf("fake MTA captured %d messages, want 1", len(captured))
268268+ }
269269+ if captured[0].from != "bounces+abc@relay.test" {
270270+ t.Errorf("captured from = %q, want bounces+abc@relay.test", captured[0].from)
271271+ }
272272+ if len(captured[0].to) != 1 || captured[0].to[0] != "bob@example.org" {
273273+ t.Errorf("captured to = %v, want [bob@example.org]", captured[0].to)
274274+ }
275275+ if !bytes.Equal(captured[0].data, body) {
276276+ t.Errorf("captured body bytes differ from enqueued bytes\nenqueued: %q\ncaptured: %q", body, captured[0].data)
277277+ }
278278+279279+ // (3) The relay's EHLO greeting must be its configured relayDomain
280280+ // (RFC 5321 §4.1.1.1) — not "localhost", not the recipient MX
281281+ // hostname. This is the kind of regression that silently torches
282282+ // reverse-DNS-strict providers.
283283+ if ehlo != "relay.test" {
284284+ t.Errorf("EHLO greeting = %q, want relay.test", ehlo)
285285+ }
286286+}
287287+288288+// TestIntegration_DeliverPath_DKIMBytesOnTheWire is the high-value
289289+// installment: pin the actual production output that goes over SMTP
290290+// against a real DKIM verifier, against a fake MTA. Catches drift in
291291+// header canonicalization, signing order, dual-DKIM emission, and any
292292+// queue/transport step that mangles the bytes between sign and send.
293293+//
294294+// Distinct from dkim_test.go (which tests the signer in isolation):
295295+// this test signs through the same path the real onAccept uses, then
296296+// drops the signed bytes into the Queue, then captures what the fake
297297+// MTA actually receives, and verifies on those captured bytes.
298298+func TestIntegration_DeliverPath_DKIMBytesOnTheWire(t *testing.T) {
299299+ memberDomain := "member.example.com"
300300+ memberKeys, err := GenerateDKIMKeys("atmos20260504")
301301+ if err != nil {
302302+ t.Fatalf("GenerateDKIMKeys (member): %v", err)
303303+ }
304304+ operatorKeys, err := GenerateDKIMKeys("atmos20260504")
305305+ if err != nil {
306306+ t.Fatalf("GenerateDKIMKeys (operator): %v", err)
307307+ }
308308+ signer := NewDualDomainSigner(memberKeys, operatorKeys, memberDomain, "atmos.email")
309309+310310+ preSign := "From: alice@" + memberDomain + "\r\n" +
311311+ "To: bob@example.org\r\n" +
312312+ "Subject: dkim-bytes-on-the-wire\r\n" +
313313+ "Message-ID: <wire-1@" + memberDomain + ">\r\n" +
314314+ "Feedback-ID: did-deliverpathaaaaaaaaaa:" + memberDomain + ":atmos:1\r\n" +
315315+ "X-Atmos-Member-Did: did:plc:deliverpathaaaaaaaaaa\r\n" +
316316+ "\r\n" +
317317+ "the bytes that go on the wire are the bytes we assert on\r\n"
318318+319319+ signed, err := signer.Sign(strings.NewReader(preSign))
320320+ if err != nil {
321321+ t.Fatalf("DualDomainSigner.Sign: %v", err)
322322+ }
323323+324324+ mta, addr, cleanup := startFakeMTA(t)
325325+ defer cleanup()
326326+327327+ q, results, mu := queueWithFakeMTA(t, addr)
328328+329329+ if err := q.Enqueue(&QueueEntry{
330330+ ID: 1,
331331+ From: "bounces+abc@atmos.email",
332332+ To: "bob@example.org",
333333+ Data: signed,
334334+ MemberDID: "did:plc:deliverpathaaaaaaaaaa",
335335+ }); err != nil {
336336+ t.Fatalf("Enqueue: %v", err)
337337+ }
338338+339339+ runQueueOnce(t, q, results, mu)
340340+341341+ mu.Lock()
342342+ got := append([]DeliveryResult(nil), (*results)...)
343343+ mu.Unlock()
344344+ if len(got) != 1 || got[0].Status != "sent" {
345345+ t.Fatalf("delivery results: %+v", got)
346346+ }
347347+348348+ mta.mu.Lock()
349349+ captured := append([]capturedDelivery(nil), mta.receivedMessages...)
350350+ mta.mu.Unlock()
351351+ if len(captured) != 1 {
352352+ t.Fatalf("fake MTA captured %d, want 1", len(captured))
353353+ }
354354+ wire := captured[0].data
355355+356356+ // (1) Two DKIM-Signature headers survived the wire path.
357357+ sigs := parseDKIMSignatures(t, wire)
358358+ if len(sigs) < 2 {
359359+ t.Fatalf("DKIM-Signature count on wire = %d, want >= 2 (signatures: %+v)", len(sigs), sigs)
360360+ }
361361+362362+ // (2) One signature has d=<member-domain> for DMARC alignment;
363363+ // another has d=atmos.email for pool-FBL routing. Order isn't
364364+ // strictly fixed — check both are present rather than which slot.
365365+ var sawMember, sawPool bool
366366+ for _, sig := range sigs {
367367+ if dkimTagContains(sig, "d=", memberDomain) {
368368+ sawMember = true
369369+ }
370370+ if dkimTagContains(sig, "d=", "atmos.email") {
371371+ sawPool = true
372372+ }
373373+ }
374374+ if !sawMember {
375375+ t.Errorf("no DKIM signature with d=%s on the wire (sigs: %+v)", memberDomain, sigs)
376376+ }
377377+ if !sawPool {
378378+ t.Errorf("no DKIM signature with d=atmos.email on the wire (sigs: %+v)", sigs)
379379+ }
380380+381381+ // (3) Headers we care about for cooperative attribution must
382382+ // survive the queue + transport. If Feedback-ID or
383383+ // X-Atmos-Member-Did get stripped en route, complaint reports
384384+ // route to the wrong place (or nowhere).
385385+ wireStr := string(wire)
386386+ if !strings.Contains(wireStr, "Feedback-ID:") {
387387+ t.Error("Feedback-ID header missing from wire bytes")
388388+ }
389389+ if !strings.Contains(wireStr, "X-Atmos-Member-Did: did:plc:deliverpathaaaaaaaaaa") {
390390+ t.Error("X-Atmos-Member-Did header missing or rewritten on the wire")
391391+ }
392392+393393+ // (4) Body bytes are intact end-to-end.
394394+ if !strings.Contains(wireStr, "the bytes that go on the wire are the bytes we assert on") {
395395+ t.Error("body content lost between signer and wire")
396396+ }
397397+}
398398+399399+// dkimTagContains reports whether the given DKIM-Signature tag/value
400400+// list (the unfolded right-hand side of "DKIM-Signature: ...") includes
401401+// the named tag with the wanted value. e.g. dkimTagContains(sig, "d=",
402402+// "atmos.email") returns true for "v=1; a=rsa-sha256; d=atmos.email; ...".
403403+func dkimTagContains(sig, tag, want string) bool {
404404+ for _, part := range strings.Split(sig, ";") {
405405+ p := strings.TrimSpace(part)
406406+ if !strings.HasPrefix(p, tag) {
407407+ continue
408408+ }
409409+ val := strings.TrimSpace(strings.TrimPrefix(p, tag))
410410+ if val == want {
411411+ return true
412412+ }
413413+ }
414414+ return false
415415+}
+34
internal/relay/integration_helpers_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relay
44+55+// Test helpers shared across the integration_*_test.go suite. The
66+// helpers live in package relay (not a separate testing package)
77+// because they need to be visible to every _test.go file in this
88+// directory and don't need to be reused outside it.
99+//
1010+// History: each integration test was written self-contained during
1111+// the #228 / #254 series — risk-minimization while the harness was
1212+// being built. Now that the harness is settled, deduplicating the
1313+// store-open boilerplate (#256) saves ~30 lines without making any
1414+// individual test harder to read.
1515+1616+import (
1717+ "testing"
1818+1919+ "atmosphere-mail/internal/relaystore"
2020+)
2121+2222+// setupIntegrationStore opens an in-memory relaystore and registers a
2323+// cleanup hook. Returns the live store. Replaces the previously-inlined
2424+// New + nil-check + defer Close pattern that appeared in every
2525+// integration test in this package.
2626+func setupIntegrationStore(t *testing.T) *relaystore.Store {
2727+ t.Helper()
2828+ store, err := relaystore.New(":memory:")
2929+ if err != nil {
3030+ t.Fatalf("relaystore.New: %v", err)
3131+ }
3232+ t.Cleanup(func() { _ = store.Close() })
3333+ return store
3434+}
+1134
internal/relay/integration_smoke_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relay
44+55+// Cross-component integration smoke test for the SMTP-submit path.
66+//
77+// This is the first installment of #228 (parent of #217's eventual
88+// cmd/relay refactor). It wires real Store + RateLimiter + Queue
99+// + SMTPServer together — the same wiring main() builds — and proves
1010+// that an SMTP submission lands in both the store AND the queue.
1111+//
1212+// The point is not to reimplement main()'s onAccept (that has 250+
1313+// lines of suppression / DKIM / Osprey policy / partial-delivery
1414+// aggregation logic, all unit-tested in their own files). The point
1515+// is to establish a tripwire for the WIRING: if any of the cross-
1616+// component contracts drift (Queue.Enqueue's signature, MemberLookupFunc's
1717+// signature, OnAcceptFunc's parameter list), this test breaks loudly
1818+// rather than silently changing main()'s behavior.
1919+//
2020+// Subsequent #228 PRs will:
2121+// - layer in suppression-list checks
2222+// - swap the fake delivery for a real test SMTP target
2323+// - add the partial-delivery aggregation assertion
2424+// - cover admin enroll-approval → SMTP-AUTH-with-new-credentials
2525+//
2626+// Risk profile: zero — entirely additive, no production code touched.
2727+2828+import (
2929+ "context"
3030+ "fmt"
3131+ gosmtp "net/smtp"
3232+ "strings"
3333+ "sync"
3434+ "testing"
3535+ "time"
3636+3737+ "atmosphere-mail/internal/relaystore"
3838+)
3939+4040+// TestIntegration_SMTPSubmit_Smoke asserts that one SMTP submission
4141+// flows all the way through: SMTP AUTH → MAIL/RCPT → DATA → onAccept
4242+// closure → Store.InsertMessage → Queue.Enqueue. No real delivery —
4343+// the queue is constructed but never Run'd.
4444+func TestIntegration_SMTPSubmit_Smoke(t *testing.T) {
4545+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
4646+ defer cancel()
4747+4848+ // --- Store: real, in-memory ---
4949+ store := setupIntegrationStore(t)
5050+5151+ apiKey := "atmos_smoke_apikey_xyz123"
5252+ apiKeyHash, err := HashAPIKey(apiKey)
5353+ if err != nil {
5454+ t.Fatalf("hash key: %v", err)
5555+ }
5656+5757+ did := "did:plc:smoketestaaaaaaaaaaaaaa"
5858+ domain := "smoke.example.com"
5959+ now := time.Now().UTC()
6060+6161+ if err := store.InsertMember(ctx, &relaystore.Member{
6262+ DID: did,
6363+ Status: relaystore.StatusActive,
6464+ HourlyLimit: 100,
6565+ DailyLimit: 1000,
6666+ CreatedAt: now,
6767+ UpdatedAt: now,
6868+ DIDVerified: true,
6969+ }); err != nil {
7070+ t.Fatalf("InsertMember: %v", err)
7171+ }
7272+ if err := store.InsertMemberDomain(ctx, &relaystore.MemberDomain{
7373+ DID: did,
7474+ Domain: domain,
7575+ APIKeyHash: apiKeyHash,
7676+ DKIMSelector: "atmos20260502",
7777+ // DKIM keys are NOT NULL per schema but the smoke test's
7878+ // onAccept doesn't sign, so any non-empty bytes satisfy
7979+ // the constraint without having to generate real keys.
8080+ DKIMRSAPriv: []byte("placeholder-rsa-not-used-in-smoke-test"),
8181+ DKIMEdPriv: []byte("placeholder-ed25519-not-used-in-smoke-test"),
8282+ CreatedAt: now,
8383+ }); err != nil {
8484+ t.Fatalf("InsertMemberDomain: %v", err)
8585+ }
8686+8787+ // --- Rate limiter: real, configured to permit ---
8888+ rateLimiter := NewRateLimiter(store, RateLimiterConfig{
8989+ DefaultHourlyLimit: 100,
9090+ DefaultDailyLimit: 1000,
9191+ // GlobalPerMinute defaults to 0 = block everything.
9292+ // Set generously high — this test sends one message.
9393+ GlobalPerMinute: 1000,
9494+ })
9595+9696+ // --- Queue: real, never Run() ---
9797+ // Tests below assert on HasCapacity to prove Enqueue happened.
9898+ // Capturing into a slice would also work but HasCapacity is the
9999+ // public contract main() relies on for batch pre-checks (#226).
100100+ const queueMaxSize = 8
101101+ var deliveryResults []DeliveryResult
102102+ var deliveryMu sync.Mutex
103103+ queue := NewQueue(func(r DeliveryResult) {
104104+ deliveryMu.Lock()
105105+ deliveryResults = append(deliveryResults, r)
106106+ deliveryMu.Unlock()
107107+ }, QueueConfig{MaxSize: queueMaxSize, RelayDomain: "relay.test"})
108108+109109+ // --- Lookup, sendCheck, onAccept: mimic main()'s wiring ---
110110+111111+ lookup := func(ctx context.Context, lookupDID string) (*MemberWithDomains, error) {
112112+ m, err := store.GetMember(ctx, lookupDID)
113113+ if err != nil || m == nil {
114114+ return nil, err
115115+ }
116116+ domains, err := store.ListMemberDomains(ctx, lookupDID)
117117+ if err != nil {
118118+ return nil, err
119119+ }
120120+ di := make([]DomainInfo, 0, len(domains))
121121+ for _, d := range domains {
122122+ di = append(di, DomainInfo{
123123+ Domain: d.Domain,
124124+ APIKeyHash: d.APIKeyHash,
125125+ })
126126+ }
127127+ return &MemberWithDomains{
128128+ DID: m.DID,
129129+ Status: m.Status,
130130+ HourlyLimit: m.HourlyLimit,
131131+ DailyLimit: m.DailyLimit,
132132+ SendCount: m.SendCount,
133133+ CreatedAt: m.CreatedAt,
134134+ Domains: di,
135135+ }, nil
136136+ }
137137+138138+ sendCheck := func(ctx context.Context, member *AuthMember, from, to string) error {
139139+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
140140+ }
141141+142142+ // Recording onAccept: mimics the "happy path" middle of main()'s
143143+ // onAccept — capacity check, persist, enqueue. Strips the
144144+ // suppression / DKIM / Osprey policy / partial-delivery branches
145145+ // since each has its own dedicated test in the relay package.
146146+ var enqueuedIDs []int64
147147+ var enqueueMu sync.Mutex
148148+ onAccept := func(member *AuthMember, from string, to []string, data []byte) error {
149149+ if !queue.HasCapacity(len(to)) {
150150+ return fmt.Errorf("451 queue full")
151151+ }
152152+ for _, recipient := range to {
153153+ msgID, err := store.InsertMessage(context.Background(), &relaystore.Message{
154154+ MemberDID: member.DID,
155155+ FromAddr: from,
156156+ ToAddr: recipient,
157157+ MessageID: "",
158158+ Status: relaystore.MsgQueued,
159159+ CreatedAt: time.Now().UTC(),
160160+ })
161161+ if err != nil {
162162+ return fmt.Errorf("InsertMessage: %w", err)
163163+ }
164164+ if err := queue.Enqueue(&QueueEntry{
165165+ ID: msgID,
166166+ From: from,
167167+ To: recipient,
168168+ Data: data,
169169+ MemberDID: member.DID,
170170+ }); err != nil {
171171+ return fmt.Errorf("Enqueue: %w", err)
172172+ }
173173+ enqueueMu.Lock()
174174+ enqueuedIDs = append(enqueuedIDs, msgID)
175175+ enqueueMu.Unlock()
176176+ }
177177+ return nil
178178+ }
179179+180180+ // --- SMTP server: real, on a random port ---
181181+ _, addr, cleanup := testSMTPServer(t, lookup, sendCheck, onAccept)
182182+ defer cleanup()
183183+184184+ // --- Drive: one SMTP submission ---
185185+ c, err := gosmtp.Dial(addr)
186186+ if err != nil {
187187+ t.Fatalf("dial: %v", err)
188188+ }
189189+ defer c.Close()
190190+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
191191+ if err := c.Auth(auth); err != nil {
192192+ t.Fatalf("Auth: %v", err)
193193+ }
194194+ if err := c.Mail("alice@" + domain); err != nil {
195195+ t.Fatalf("Mail: %v", err)
196196+ }
197197+ if err := c.Rcpt("bob@example.org"); err != nil {
198198+ t.Fatalf("Rcpt: %v", err)
199199+ }
200200+ w, err := c.Data()
201201+ if err != nil {
202202+ t.Fatalf("Data: %v", err)
203203+ }
204204+ body := fmt.Sprintf(
205205+ "From: alice@%s\r\nTo: bob@example.org\r\nSubject: smoke\r\n\r\nintegration smoke test body\r\n",
206206+ domain,
207207+ )
208208+ if _, err := fmt.Fprint(w, body); err != nil {
209209+ t.Fatalf("write body: %v", err)
210210+ }
211211+ if err := w.Close(); err != nil {
212212+ t.Fatalf("close data: %v", err)
213213+ }
214214+ if err := c.Quit(); err != nil {
215215+ t.Fatalf("quit: %v", err)
216216+ }
217217+218218+ // --- Assertions: traverse the whole wiring contract ---
219219+220220+ // (1) onAccept fired exactly once for the single recipient.
221221+ enqueueMu.Lock()
222222+ gotEnqueues := len(enqueuedIDs)
223223+ gotID := int64(-1)
224224+ if gotEnqueues > 0 {
225225+ gotID = enqueuedIDs[0]
226226+ }
227227+ enqueueMu.Unlock()
228228+ if gotEnqueues != 1 {
229229+ t.Fatalf("onAccept enqueued %d times, want 1", gotEnqueues)
230230+ }
231231+ if gotID <= 0 {
232232+ t.Errorf("InsertMessage returned id %d, want > 0", gotID)
233233+ }
234234+235235+ // (2) Store has the persisted Message row matching the InsertMessage
236236+ // id captured from onAccept. We don't have a ListMessagesForMember
237237+ // surface, but the enqueuedIDs[0] came from store.InsertMessage so
238238+ // looking it back up is the exact round-trip.
239239+ msg, err := store.GetMessage(ctx, gotID)
240240+ if err != nil {
241241+ t.Fatalf("GetMessage(%d): %v", gotID, err)
242242+ }
243243+ if msg == nil {
244244+ t.Fatalf("GetMessage(%d) returned nil — row not persisted", gotID)
245245+ }
246246+ if msg.MemberDID != did {
247247+ t.Errorf("stored MemberDID=%q, want %q", msg.MemberDID, did)
248248+ }
249249+ if msg.ToAddr != "bob@example.org" {
250250+ t.Errorf("stored ToAddr=%q, want bob@example.org", msg.ToAddr)
251251+ }
252252+ if msg.FromAddr != "alice@"+domain {
253253+ t.Errorf("stored FromAddr=%q, want alice@%s", msg.FromAddr, domain)
254254+ }
255255+ if msg.Status != relaystore.MsgQueued {
256256+ t.Errorf("stored Status=%q, want %q", msg.Status, relaystore.MsgQueued)
257257+ }
258258+259259+ // (3) Queue has consumed one slot of capacity. We never Run() the
260260+ // queue, so the entry is parked in q.entries waiting for the
261261+ // scheduler — proven by HasCapacity reporting one fewer slot.
262262+ if !queue.HasCapacity(queueMaxSize - 1) {
263263+ t.Error("queue should still have queueMaxSize-1 capacity after one Enqueue")
264264+ }
265265+ if queue.HasCapacity(queueMaxSize) {
266266+ t.Error("queue should NOT report full capacity after one Enqueue — entry not parked")
267267+ }
268268+}
269269+270270+// TestIntegration_SMTPSubmit_SuppressionDropsRecipient extends the smoke
271271+// test with the suppression-list filtering behavior main() implements
272272+// at lines 648-681 of cmd/relay/main.go: drop unsubscribed recipients
273273+// silently before persistence/enqueue, but keep the rest of the batch
274274+// flowing.
275275+//
276276+// Setup difference from the smoke test: pre-insert one suppression for
277277+// blocked@example.org, then RCPT TO both addresses. The clean recipient
278278+// must round-trip into store + queue; the suppressed one must not.
279279+//
280280+// This is installment 2 of #228. Self-contained setup (no helper
281281+// extraction across tests) keeps the risk profile additive and isolated.
282282+func TestIntegration_SMTPSubmit_SuppressionDropsRecipient(t *testing.T) {
283283+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
284284+ defer cancel()
285285+286286+ store := setupIntegrationStore(t)
287287+288288+ apiKey := "atmos_supptest_apikey_xyz123"
289289+ apiKeyHash, err := HashAPIKey(apiKey)
290290+ if err != nil {
291291+ t.Fatalf("hash key: %v", err)
292292+ }
293293+294294+ did := "did:plc:supptestaaaaaaaaaaaaaaa"
295295+ domain := "supp.example.com"
296296+ now := time.Now().UTC()
297297+298298+ if err := store.InsertMember(ctx, &relaystore.Member{
299299+ DID: did,
300300+ Status: relaystore.StatusActive,
301301+ HourlyLimit: 100,
302302+ DailyLimit: 1000,
303303+ CreatedAt: now,
304304+ UpdatedAt: now,
305305+ DIDVerified: true,
306306+ }); err != nil {
307307+ t.Fatalf("InsertMember: %v", err)
308308+ }
309309+ if err := store.InsertMemberDomain(ctx, &relaystore.MemberDomain{
310310+ DID: did,
311311+ Domain: domain,
312312+ APIKeyHash: apiKeyHash,
313313+ DKIMSelector: "atmos20260502",
314314+ DKIMRSAPriv: []byte("placeholder-rsa-not-used-in-suppression-test"),
315315+ DKIMEdPriv: []byte("placeholder-ed25519-not-used-in-suppression-test"),
316316+ CreatedAt: now,
317317+ }); err != nil {
318318+ t.Fatalf("InsertMemberDomain: %v", err)
319319+ }
320320+321321+ // Pre-insert the suppression we'll exercise. The "test-fixture"
322322+ // source string is a sentinel — production sources are
323323+ // "list-unsubscribe", "fbl-arf", "operator-manual", etc.
324324+ if err := store.InsertSuppression(ctx, did, "blocked@example.org", "test-fixture"); err != nil {
325325+ t.Fatalf("InsertSuppression: %v", err)
326326+ }
327327+328328+ rateLimiter := NewRateLimiter(store, RateLimiterConfig{
329329+ DefaultHourlyLimit: 100,
330330+ DefaultDailyLimit: 1000,
331331+ GlobalPerMinute: 1000,
332332+ })
333333+334334+ const queueMaxSize = 8
335335+ queue := NewQueue(func(r DeliveryResult) {}, QueueConfig{
336336+ MaxSize: queueMaxSize,
337337+ RelayDomain: "relay.test",
338338+ })
339339+340340+ lookup := func(ctx context.Context, lookupDID string) (*MemberWithDomains, error) {
341341+ m, err := store.GetMember(ctx, lookupDID)
342342+ if err != nil || m == nil {
343343+ return nil, err
344344+ }
345345+ domains, err := store.ListMemberDomains(ctx, lookupDID)
346346+ if err != nil {
347347+ return nil, err
348348+ }
349349+ di := make([]DomainInfo, 0, len(domains))
350350+ for _, d := range domains {
351351+ di = append(di, DomainInfo{Domain: d.Domain, APIKeyHash: d.APIKeyHash})
352352+ }
353353+ return &MemberWithDomains{
354354+ DID: m.DID,
355355+ Status: m.Status,
356356+ HourlyLimit: m.HourlyLimit,
357357+ DailyLimit: m.DailyLimit,
358358+ SendCount: m.SendCount,
359359+ CreatedAt: m.CreatedAt,
360360+ Domains: di,
361361+ }, nil
362362+ }
363363+364364+ sendCheck := func(ctx context.Context, member *AuthMember, from, to string) error {
365365+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
366366+ }
367367+368368+ // Recording onAccept that mirrors main()'s suppression filtering:
369369+ // for each recipient, IsSuppressed → drop silently; otherwise
370370+ // persist + enqueue. If the resulting deliverable list is empty,
371371+ // the SMTP submission gets a 550 (matches main() lines 667-674).
372372+ var enqueuedTo []string
373373+ var droppedTo []string
374374+ var enqueueMu sync.Mutex
375375+ onAccept := func(member *AuthMember, from string, to []string, data []byte) error {
376376+ var deliverable []string
377377+ for _, r := range to {
378378+ supp, err := store.IsSuppressed(context.Background(), member.DID, r)
379379+ if err != nil {
380380+ // Fail-open mirror: a DB error shouldn't block legit sends.
381381+ deliverable = append(deliverable, r)
382382+ continue
383383+ }
384384+ if supp {
385385+ enqueueMu.Lock()
386386+ droppedTo = append(droppedTo, r)
387387+ enqueueMu.Unlock()
388388+ continue
389389+ }
390390+ deliverable = append(deliverable, r)
391391+ }
392392+ if len(deliverable) == 0 {
393393+ return fmt.Errorf("550 all recipients suppressed")
394394+ }
395395+ if !queue.HasCapacity(len(deliverable)) {
396396+ return fmt.Errorf("451 queue full")
397397+ }
398398+ for _, r := range deliverable {
399399+ msgID, err := store.InsertMessage(context.Background(), &relaystore.Message{
400400+ MemberDID: member.DID,
401401+ FromAddr: from,
402402+ ToAddr: r,
403403+ Status: relaystore.MsgQueued,
404404+ CreatedAt: time.Now().UTC(),
405405+ })
406406+ if err != nil {
407407+ return fmt.Errorf("InsertMessage: %w", err)
408408+ }
409409+ if err := queue.Enqueue(&QueueEntry{
410410+ ID: msgID,
411411+ From: from,
412412+ To: r,
413413+ Data: data,
414414+ MemberDID: member.DID,
415415+ }); err != nil {
416416+ return fmt.Errorf("Enqueue: %w", err)
417417+ }
418418+ enqueueMu.Lock()
419419+ enqueuedTo = append(enqueuedTo, r)
420420+ enqueueMu.Unlock()
421421+ }
422422+ return nil
423423+ }
424424+425425+ _, addr, cleanup := testSMTPServer(t, lookup, sendCheck, onAccept)
426426+ defer cleanup()
427427+428428+ // Submit one message addressed to BOTH a suppressed and a clean
429429+ // recipient. The SMTP server collects all RCPT TOs first, then fires
430430+ // onAccept with the full slice — that's where suppression filtering
431431+ // happens, mirroring main()'s position in the pipeline.
432432+ c, err := gosmtp.Dial(addr)
433433+ if err != nil {
434434+ t.Fatalf("dial: %v", err)
435435+ }
436436+ defer c.Close()
437437+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
438438+ if err := c.Auth(auth); err != nil {
439439+ t.Fatalf("Auth: %v", err)
440440+ }
441441+ if err := c.Mail("alice@" + domain); err != nil {
442442+ t.Fatalf("Mail: %v", err)
443443+ }
444444+ if err := c.Rcpt("blocked@example.org"); err != nil {
445445+ t.Fatalf("Rcpt blocked: %v", err)
446446+ }
447447+ if err := c.Rcpt("clean@example.org"); err != nil {
448448+ t.Fatalf("Rcpt clean: %v", err)
449449+ }
450450+ w, err := c.Data()
451451+ if err != nil {
452452+ t.Fatalf("Data: %v", err)
453453+ }
454454+ body := fmt.Sprintf(
455455+ "From: alice@%s\r\nTo: clean@example.org\r\nSubject: suppression test\r\n\r\nbody\r\n",
456456+ domain,
457457+ )
458458+ if _, err := fmt.Fprint(w, body); err != nil {
459459+ t.Fatalf("write body: %v", err)
460460+ }
461461+ if err := w.Close(); err != nil {
462462+ t.Fatalf("close data: %v", err)
463463+ }
464464+ if err := c.Quit(); err != nil {
465465+ t.Fatalf("quit: %v", err)
466466+ }
467467+468468+ enqueueMu.Lock()
469469+ gotEnqueued := append([]string(nil), enqueuedTo...)
470470+ gotDropped := append([]string(nil), droppedTo...)
471471+ enqueueMu.Unlock()
472472+473473+ if len(gotEnqueued) != 1 || gotEnqueued[0] != "clean@example.org" {
474474+ t.Errorf("enqueued=%v, want [clean@example.org]", gotEnqueued)
475475+ }
476476+ if len(gotDropped) != 1 || gotDropped[0] != "blocked@example.org" {
477477+ t.Errorf("dropped=%v, want [blocked@example.org]", gotDropped)
478478+ }
479479+480480+ // Queue capacity proves only one slot was used (the clean one).
481481+ if !queue.HasCapacity(queueMaxSize - 1) {
482482+ t.Error("queue should have queueMaxSize-1 capacity (only one Enqueue)")
483483+ }
484484+ if queue.HasCapacity(queueMaxSize) {
485485+ t.Error("queue should NOT report full capacity — clean recipient was enqueued")
486486+ }
487487+}
488488+489489+// TestIntegration_SMTPSubmit_AllSuppressedRejects covers the
490490+// boundary case where every RCPT TO has an active suppression.
491491+// main() returns 550 in this case (cmd/relay/main.go lines 667-674):
492492+// dropping all recipients silently would surprise the sender, so
493493+// we explicitly reject with a clear error.
494494+func TestIntegration_SMTPSubmit_AllSuppressedRejects(t *testing.T) {
495495+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
496496+ defer cancel()
497497+498498+ store := setupIntegrationStore(t)
499499+500500+ apiKey := "atmos_allsupp_apikey_xyz123"
501501+ apiKeyHash, _ := HashAPIKey(apiKey)
502502+ did := "did:plc:allsuppaaaaaaaaaaaaaaaa"
503503+ domain := "allsupp.example.com"
504504+ now := time.Now().UTC()
505505+506506+ if err := store.InsertMember(ctx, &relaystore.Member{
507507+ DID: did, Status: relaystore.StatusActive,
508508+ HourlyLimit: 100, DailyLimit: 1000,
509509+ CreatedAt: now, UpdatedAt: now, DIDVerified: true,
510510+ }); err != nil {
511511+ t.Fatalf("InsertMember: %v", err)
512512+ }
513513+ if err := store.InsertMemberDomain(ctx, &relaystore.MemberDomain{
514514+ DID: did, Domain: domain, APIKeyHash: apiKeyHash,
515515+ DKIMSelector: "atmos20260502",
516516+ DKIMRSAPriv: []byte("placeholder-rsa"),
517517+ DKIMEdPriv: []byte("placeholder-ed25519"),
518518+ CreatedAt: now,
519519+ }); err != nil {
520520+ t.Fatalf("InsertMemberDomain: %v", err)
521521+ }
522522+ if err := store.InsertSuppression(ctx, did, "only@example.org", "test-fixture"); err != nil {
523523+ t.Fatalf("InsertSuppression: %v", err)
524524+ }
525525+526526+ rateLimiter := NewRateLimiter(store, RateLimiterConfig{
527527+ DefaultHourlyLimit: 100, DefaultDailyLimit: 1000, GlobalPerMinute: 1000,
528528+ })
529529+ queue := NewQueue(func(DeliveryResult) {}, QueueConfig{MaxSize: 8, RelayDomain: "relay.test"})
530530+531531+ lookup := func(ctx context.Context, lookupDID string) (*MemberWithDomains, error) {
532532+ m, _ := store.GetMember(ctx, lookupDID)
533533+ if m == nil {
534534+ return nil, nil
535535+ }
536536+ domains, _ := store.ListMemberDomains(ctx, lookupDID)
537537+ di := make([]DomainInfo, 0, len(domains))
538538+ for _, d := range domains {
539539+ di = append(di, DomainInfo{Domain: d.Domain, APIKeyHash: d.APIKeyHash})
540540+ }
541541+ return &MemberWithDomains{
542542+ DID: m.DID, Status: m.Status,
543543+ HourlyLimit: m.HourlyLimit, DailyLimit: m.DailyLimit,
544544+ SendCount: m.SendCount, CreatedAt: m.CreatedAt,
545545+ Domains: di,
546546+ }, nil
547547+ }
548548+549549+ sendCheck := func(ctx context.Context, member *AuthMember, from, to string) error {
550550+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
551551+ }
552552+553553+ // Same suppression-aware onAccept as the prior test — copy-pasted
554554+ // rather than refactored into a helper to keep this PR's risk
555555+ // surface narrow. Subsequent #228 installments may consolidate.
556556+ onAccept := func(member *AuthMember, from string, to []string, data []byte) error {
557557+ var deliverable []string
558558+ for _, r := range to {
559559+ supp, err := store.IsSuppressed(context.Background(), member.DID, r)
560560+ if err == nil && supp {
561561+ continue
562562+ }
563563+ deliverable = append(deliverable, r)
564564+ }
565565+ if len(deliverable) == 0 {
566566+ return fmt.Errorf("550 all recipients suppressed")
567567+ }
568568+ for _, r := range deliverable {
569569+ msgID, err := store.InsertMessage(context.Background(), &relaystore.Message{
570570+ MemberDID: member.DID, FromAddr: from, ToAddr: r,
571571+ Status: relaystore.MsgQueued, CreatedAt: time.Now().UTC(),
572572+ })
573573+ if err != nil {
574574+ return fmt.Errorf("InsertMessage: %w", err)
575575+ }
576576+ if err := queue.Enqueue(&QueueEntry{
577577+ ID: msgID, From: from, To: r, Data: data, MemberDID: member.DID,
578578+ }); err != nil {
579579+ return fmt.Errorf("Enqueue: %w", err)
580580+ }
581581+ }
582582+ return nil
583583+ }
584584+585585+ _, addr, cleanup := testSMTPServer(t, lookup, sendCheck, onAccept)
586586+ defer cleanup()
587587+588588+ c, err := gosmtp.Dial(addr)
589589+ if err != nil {
590590+ t.Fatalf("dial: %v", err)
591591+ }
592592+ defer c.Close()
593593+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
594594+ if err := c.Auth(auth); err != nil {
595595+ t.Fatalf("Auth: %v", err)
596596+ }
597597+ if err := c.Mail("alice@" + domain); err != nil {
598598+ t.Fatalf("Mail: %v", err)
599599+ }
600600+ if err := c.Rcpt("only@example.org"); err != nil {
601601+ t.Fatalf("Rcpt: %v", err)
602602+ }
603603+ w, err := c.Data()
604604+ if err != nil {
605605+ t.Fatalf("Data: %v", err)
606606+ }
607607+ if _, err := fmt.Fprintf(w, "From: alice@%s\r\nTo: only@example.org\r\nSubject: x\r\n\r\nbody\r\n", domain); err != nil {
608608+ t.Fatalf("write: %v", err)
609609+ }
610610+ // The error surfaces at w.Close() — that's when the SMTP server
611611+ // has all of DATA, calls onAccept, and gets back the 550.
612612+ closeErr := w.Close()
613613+ if closeErr == nil {
614614+ t.Fatal("Data close should have errored — all recipients suppressed")
615615+ }
616616+ if !strings.Contains(closeErr.Error(), "550") {
617617+ t.Errorf("close error = %q, want 550 status", closeErr.Error())
618618+ }
619619+620620+ // Queue should be untouched — no Enqueue was called.
621621+ if !queue.HasCapacity(8) {
622622+ t.Error("queue should still have full capacity (no Enqueue should have happened)")
623623+ }
624624+}
625625+626626+// TestIntegration_SMTPSubmit_MultiRecipient covers the happy path of
627627+// the per-recipient delivery loop introduced for #226: a single SMTP
628628+// submission with three RCPT TO addresses must produce three
629629+// store rows and three queue entries, and the aggregator's contract
630630+// (succeeded=3, failed=0, retryAll=false) implies the SMTP DATA
631631+// command succeeds with one 250 reply.
632632+//
633633+// This is installment 3 of #228, paired with the capacity pre-check
634634+// test below — together they pin the two aggregator-contract paths
635635+// the smoke + suppression tests don't reach.
636636+func TestIntegration_SMTPSubmit_MultiRecipient(t *testing.T) {
637637+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
638638+ defer cancel()
639639+640640+ store := setupIntegrationStore(t)
641641+642642+ apiKey := "atmos_multirecip_apikey_xyz"
643643+ apiKeyHash, _ := HashAPIKey(apiKey)
644644+ did := "did:plc:multirecipaaaaaaaaaaaaa"
645645+ domain := "multi.example.com"
646646+ now := time.Now().UTC()
647647+648648+ if err := store.InsertMember(ctx, &relaystore.Member{
649649+ DID: did, Status: relaystore.StatusActive,
650650+ HourlyLimit: 100, DailyLimit: 1000,
651651+ CreatedAt: now, UpdatedAt: now, DIDVerified: true,
652652+ }); err != nil {
653653+ t.Fatalf("InsertMember: %v", err)
654654+ }
655655+ if err := store.InsertMemberDomain(ctx, &relaystore.MemberDomain{
656656+ DID: did, Domain: domain, APIKeyHash: apiKeyHash,
657657+ DKIMSelector: "atmos20260502",
658658+ DKIMRSAPriv: []byte("placeholder-rsa"),
659659+ DKIMEdPriv: []byte("placeholder-ed25519"),
660660+ CreatedAt: now,
661661+ }); err != nil {
662662+ t.Fatalf("InsertMemberDomain: %v", err)
663663+ }
664664+665665+ rateLimiter := NewRateLimiter(store, RateLimiterConfig{
666666+ DefaultHourlyLimit: 100, DefaultDailyLimit: 1000, GlobalPerMinute: 1000,
667667+ })
668668+669669+ const queueMaxSize = 16
670670+ queue := NewQueue(func(DeliveryResult) {}, QueueConfig{
671671+ MaxSize: queueMaxSize, RelayDomain: "relay.test",
672672+ })
673673+674674+ lookup := func(ctx context.Context, lookupDID string) (*MemberWithDomains, error) {
675675+ m, _ := store.GetMember(ctx, lookupDID)
676676+ if m == nil {
677677+ return nil, nil
678678+ }
679679+ domains, _ := store.ListMemberDomains(ctx, lookupDID)
680680+ di := make([]DomainInfo, 0, len(domains))
681681+ for _, d := range domains {
682682+ di = append(di, DomainInfo{Domain: d.Domain, APIKeyHash: d.APIKeyHash})
683683+ }
684684+ return &MemberWithDomains{
685685+ DID: m.DID, Status: m.Status,
686686+ HourlyLimit: m.HourlyLimit, DailyLimit: m.DailyLimit,
687687+ SendCount: m.SendCount, CreatedAt: m.CreatedAt,
688688+ Domains: di,
689689+ }, nil
690690+ }
691691+692692+ sendCheck := func(ctx context.Context, member *AuthMember, from, to string) error {
693693+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
694694+ }
695695+696696+ // onAccept emits one RecipientOutcome per recipient and runs the
697697+ // aggregator at the end — exactly the shape main() (lines 822-841)
698698+ // uses to decide whether to return success, partial-failure, or
699699+ // retry-all. We capture the aggregator's output so the test can
700700+ // assert all three return values, not just the side-effects.
701701+ var aggSucceeded, aggFailed int
702702+ var aggRetryAll bool
703703+ onAccept := func(member *AuthMember, from string, to []string, data []byte) error {
704704+ if !queue.HasCapacity(len(to)) {
705705+ return fmt.Errorf("451 queue full")
706706+ }
707707+ var outcomes []RecipientOutcome
708708+ for _, r := range to {
709709+ out := RecipientOutcome{Recipient: r}
710710+ msgID, err := store.InsertMessage(context.Background(), &relaystore.Message{
711711+ MemberDID: member.DID, FromAddr: from, ToAddr: r,
712712+ Status: relaystore.MsgQueued, CreatedAt: time.Now().UTC(),
713713+ })
714714+ if err != nil {
715715+ out.Err = fmt.Errorf("InsertMessage: %w", err)
716716+ outcomes = append(outcomes, out)
717717+ continue
718718+ }
719719+ out.MsgID = msgID
720720+ if err := queue.Enqueue(&QueueEntry{
721721+ ID: msgID, From: from, To: r, Data: data, MemberDID: member.DID,
722722+ }); err != nil {
723723+ out.Err = fmt.Errorf("Enqueue: %w", err)
724724+ outcomes = append(outcomes, out)
725725+ continue
726726+ }
727727+ outcomes = append(outcomes, out)
728728+ }
729729+ s, f, retryAll, _ := AggregateRecipientOutcomes(outcomes)
730730+ aggSucceeded, aggFailed, aggRetryAll = s, f, retryAll
731731+ if retryAll {
732732+ return fmt.Errorf("451 all recipients failed")
733733+ }
734734+ return nil
735735+ }
736736+737737+ _, addr, cleanup := testSMTPServer(t, lookup, sendCheck, onAccept)
738738+ defer cleanup()
739739+740740+ c, err := gosmtp.Dial(addr)
741741+ if err != nil {
742742+ t.Fatalf("dial: %v", err)
743743+ }
744744+ defer c.Close()
745745+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
746746+ if err := c.Auth(auth); err != nil {
747747+ t.Fatalf("Auth: %v", err)
748748+ }
749749+ if err := c.Mail("alice@" + domain); err != nil {
750750+ t.Fatalf("Mail: %v", err)
751751+ }
752752+ for _, rcpt := range []string{"r1@example.org", "r2@example.org", "r3@example.org"} {
753753+ if err := c.Rcpt(rcpt); err != nil {
754754+ t.Fatalf("Rcpt %s: %v", rcpt, err)
755755+ }
756756+ }
757757+ w, err := c.Data()
758758+ if err != nil {
759759+ t.Fatalf("Data: %v", err)
760760+ }
761761+ if _, err := fmt.Fprintf(w, "From: alice@%s\r\nSubject: multi\r\n\r\nbody\r\n", domain); err != nil {
762762+ t.Fatalf("write: %v", err)
763763+ }
764764+ if err := w.Close(); err != nil {
765765+ t.Fatalf("close: %v", err)
766766+ }
767767+ if err := c.Quit(); err != nil {
768768+ t.Fatalf("quit: %v", err)
769769+ }
770770+771771+ if aggSucceeded != 3 {
772772+ t.Errorf("aggregator succeeded=%d, want 3", aggSucceeded)
773773+ }
774774+ if aggFailed != 0 {
775775+ t.Errorf("aggregator failed=%d, want 0", aggFailed)
776776+ }
777777+ if aggRetryAll {
778778+ t.Error("aggregator retryAll should be false when all recipients succeed")
779779+ }
780780+781781+ // Three queue slots consumed.
782782+ if !queue.HasCapacity(queueMaxSize - 3) {
783783+ t.Errorf("queue should have queueMaxSize-3 (%d) capacity remaining", queueMaxSize-3)
784784+ }
785785+ if queue.HasCapacity(queueMaxSize - 2) {
786786+ t.Error("queue should NOT report queueMaxSize-2 capacity — three slots used")
787787+ }
788788+}
789789+790790+// TestIntegration_SMTPSubmit_CapacityPreCheckRejectsBatch covers the
791791+// boundary that #226 closed: when the per-batch HasCapacity pre-check
792792+// fails, the WHOLE submission must be rejected with a transient error
793793+// before any recipient is persisted. Without this gate, a partial loop
794794+// could enqueue M of N recipients then 451, the client retries, and
795795+// the M succeeded recipients receive duplicates.
796796+func TestIntegration_SMTPSubmit_CapacityPreCheckRejectsBatch(t *testing.T) {
797797+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
798798+ defer cancel()
799799+800800+ store := setupIntegrationStore(t)
801801+802802+ apiKey := "atmos_capacity_apikey_xyz"
803803+ apiKeyHash, _ := HashAPIKey(apiKey)
804804+ did := "did:plc:capacityaaaaaaaaaaaaaa"
805805+ domain := "capacity.example.com"
806806+ now := time.Now().UTC()
807807+808808+ if err := store.InsertMember(ctx, &relaystore.Member{
809809+ DID: did, Status: relaystore.StatusActive,
810810+ HourlyLimit: 100, DailyLimit: 1000,
811811+ CreatedAt: now, UpdatedAt: now, DIDVerified: true,
812812+ }); err != nil {
813813+ t.Fatalf("InsertMember: %v", err)
814814+ }
815815+ if err := store.InsertMemberDomain(ctx, &relaystore.MemberDomain{
816816+ DID: did, Domain: domain, APIKeyHash: apiKeyHash,
817817+ DKIMSelector: "atmos20260502",
818818+ DKIMRSAPriv: []byte("placeholder-rsa"),
819819+ DKIMEdPriv: []byte("placeholder-ed25519"),
820820+ CreatedAt: now,
821821+ }); err != nil {
822822+ t.Fatalf("InsertMemberDomain: %v", err)
823823+ }
824824+825825+ rateLimiter := NewRateLimiter(store, RateLimiterConfig{
826826+ DefaultHourlyLimit: 100, DefaultDailyLimit: 1000, GlobalPerMinute: 1000,
827827+ })
828828+829829+ // Tight queue: maxSize=2 cannot accommodate the 3 recipients
830830+ // we'll submit. The pre-check must fire and reject the batch
831831+ // before any persistence happens.
832832+ const queueMaxSize = 2
833833+ queue := NewQueue(func(DeliveryResult) {}, QueueConfig{
834834+ MaxSize: queueMaxSize, RelayDomain: "relay.test",
835835+ })
836836+837837+ lookup := func(ctx context.Context, lookupDID string) (*MemberWithDomains, error) {
838838+ m, _ := store.GetMember(ctx, lookupDID)
839839+ if m == nil {
840840+ return nil, nil
841841+ }
842842+ domains, _ := store.ListMemberDomains(ctx, lookupDID)
843843+ di := make([]DomainInfo, 0, len(domains))
844844+ for _, d := range domains {
845845+ di = append(di, DomainInfo{Domain: d.Domain, APIKeyHash: d.APIKeyHash})
846846+ }
847847+ return &MemberWithDomains{
848848+ DID: m.DID, Status: m.Status,
849849+ HourlyLimit: m.HourlyLimit, DailyLimit: m.DailyLimit,
850850+ SendCount: m.SendCount, CreatedAt: m.CreatedAt,
851851+ Domains: di,
852852+ }, nil
853853+ }
854854+855855+ sendCheck := func(ctx context.Context, member *AuthMember, from, to string) error {
856856+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
857857+ }
858858+859859+ // onAccept with the same pre-check pattern as main(). Returning
860860+ // 451 before any InsertMessage means the store stays empty even
861861+ // though the SMTP RCPT phase already accepted the recipients.
862862+ var insertCalled int
863863+ onAccept := func(member *AuthMember, from string, to []string, data []byte) error {
864864+ if !queue.HasCapacity(len(to)) {
865865+ return fmt.Errorf("451 queue full")
866866+ }
867867+ // Should never reach this branch in this test.
868868+ insertCalled++
869869+ for _, r := range to {
870870+ if _, err := store.InsertMessage(context.Background(), &relaystore.Message{
871871+ MemberDID: member.DID, FromAddr: from, ToAddr: r,
872872+ Status: relaystore.MsgQueued, CreatedAt: time.Now().UTC(),
873873+ }); err != nil {
874874+ return err
875875+ }
876876+ }
877877+ return nil
878878+ }
879879+880880+ _, addr, cleanup := testSMTPServer(t, lookup, sendCheck, onAccept)
881881+ defer cleanup()
882882+883883+ c, err := gosmtp.Dial(addr)
884884+ if err != nil {
885885+ t.Fatalf("dial: %v", err)
886886+ }
887887+ defer c.Close()
888888+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
889889+ if err := c.Auth(auth); err != nil {
890890+ t.Fatalf("Auth: %v", err)
891891+ }
892892+ if err := c.Mail("alice@" + domain); err != nil {
893893+ t.Fatalf("Mail: %v", err)
894894+ }
895895+ // Three RCPT TOs against a queue with capacity for two.
896896+ for _, rcpt := range []string{"r1@example.org", "r2@example.org", "r3@example.org"} {
897897+ if err := c.Rcpt(rcpt); err != nil {
898898+ t.Fatalf("Rcpt %s: %v", rcpt, err)
899899+ }
900900+ }
901901+ w, err := c.Data()
902902+ if err != nil {
903903+ t.Fatalf("Data: %v", err)
904904+ }
905905+ if _, err := fmt.Fprintf(w, "From: alice@%s\r\nSubject: x\r\n\r\nbody\r\n", domain); err != nil {
906906+ t.Fatalf("write: %v", err)
907907+ }
908908+ closeErr := w.Close()
909909+ if closeErr == nil {
910910+ t.Fatal("Data close should have errored — pre-check fails on capacity")
911911+ }
912912+ if !strings.Contains(closeErr.Error(), "451") {
913913+ t.Errorf("close error = %q, want 451 status", closeErr.Error())
914914+ }
915915+916916+ // CRITICAL: no persistence must have occurred. This is the
917917+ // invariant that prevents the #226 duplicate-delivery scenario:
918918+ // rejecting after partial persistence + retry would dupe.
919919+ if insertCalled != 0 {
920920+ t.Errorf("InsertMessage path entered %d times — must be 0 when pre-check fails", insertCalled)
921921+ }
922922+ if !queue.HasCapacity(queueMaxSize) {
923923+ t.Error("queue should still report full capacity — no Enqueue should have happened")
924924+ }
925925+}
926926+927927+// TestIntegration_QueueDispatchesViaDeliverFunc exercises the
928928+// Queue.Run() lifecycle end-to-end: SMTP submit → onAccept enqueues
929929+// → Queue.Run() worker picks it up → injected DeliverFunc fires →
930930+// onDelivery callback receives the result.
931931+//
932932+// Without QueueConfig.DeliverFunc (#228 installment 4), this path
933933+// could only be tested by mocking DNS or running a real fake SMTP
934934+// at the MX-lookup edge. The injection point is a production-side
935935+// addition: nil DeliverFunc keeps the existing deliverMessage call,
936936+// non-nil swaps it. This test sets a fake to capture the entry and
937937+// asserts the full Enqueue → dispatch → onDelivery loop fires.
938938+func TestIntegration_QueueDispatchesViaDeliverFunc(t *testing.T) {
939939+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
940940+ defer cancel()
941941+942942+ store := setupIntegrationStore(t)
943943+944944+ apiKey := "atmos_dispatch_apikey_xyz"
945945+ apiKeyHash, _ := HashAPIKey(apiKey)
946946+ did := "did:plc:dispatchaaaaaaaaaaaaaa"
947947+ domain := "dispatch.example.com"
948948+ now := time.Now().UTC()
949949+950950+ if err := store.InsertMember(ctx, &relaystore.Member{
951951+ DID: did, Status: relaystore.StatusActive,
952952+ HourlyLimit: 100, DailyLimit: 1000,
953953+ CreatedAt: now, UpdatedAt: now, DIDVerified: true,
954954+ }); err != nil {
955955+ t.Fatalf("InsertMember: %v", err)
956956+ }
957957+ if err := store.InsertMemberDomain(ctx, &relaystore.MemberDomain{
958958+ DID: did, Domain: domain, APIKeyHash: apiKeyHash,
959959+ DKIMSelector: "atmos20260502",
960960+ DKIMRSAPriv: []byte("placeholder-rsa"),
961961+ DKIMEdPriv: []byte("placeholder-ed25519"),
962962+ CreatedAt: now,
963963+ }); err != nil {
964964+ t.Fatalf("InsertMemberDomain: %v", err)
965965+ }
966966+967967+ rateLimiter := NewRateLimiter(store, RateLimiterConfig{
968968+ DefaultHourlyLimit: 100, DefaultDailyLimit: 1000, GlobalPerMinute: 1000,
969969+ })
970970+971971+ // Injected DeliverFunc: capture every entry the queue worker
972972+ // dispatches, return a synthetic "sent" result so the entry
973973+ // reaches a terminal state instead of getting requeued.
974974+ var dispatched []*QueueEntry
975975+ var dispatchedMu sync.Mutex
976976+ dispatchSignal := make(chan struct{}, 8)
977977+ fakeDeliver := func(ctx context.Context, entry *QueueEntry, relayDomain string) DeliveryResult {
978978+ dispatchedMu.Lock()
979979+ dispatched = append(dispatched, entry)
980980+ dispatchedMu.Unlock()
981981+ dispatchSignal <- struct{}{}
982982+ return DeliveryResult{
983983+ EntryID: entry.ID,
984984+ MemberDID: entry.MemberDID,
985985+ Recipient: entry.To,
986986+ Status: "sent",
987987+ SMTPCode: 250,
988988+ }
989989+ }
990990+991991+ // onDelivery callback: capture the terminal-status results so we
992992+ // can assert the queue's lifecycle reached the final reporting step.
993993+ var delivered []DeliveryResult
994994+ var deliveredMu sync.Mutex
995995+ deliveredSignal := make(chan struct{}, 8)
996996+ queue := NewQueue(func(r DeliveryResult) {
997997+ deliveredMu.Lock()
998998+ delivered = append(delivered, r)
999999+ deliveredMu.Unlock()
10001000+ deliveredSignal <- struct{}{}
10011001+ }, QueueConfig{
10021002+ MaxSize: 8,
10031003+ RelayDomain: "relay.test",
10041004+ DeliverFunc: fakeDeliver,
10051005+ Workers: 1,
10061006+ DeliveryTimeout: 2 * time.Second,
10071007+ })
10081008+10091009+ // Run the queue worker in a background goroutine. It blocks on
10101010+ // q.notify, which Enqueue signals — same path production uses.
10111011+ queueCtx, queueCancel := context.WithCancel(ctx)
10121012+ queueDone := make(chan struct{})
10131013+ go func() {
10141014+ _ = queue.Run(queueCtx)
10151015+ close(queueDone)
10161016+ }()
10171017+ defer func() { queueCancel(); <-queueDone }()
10181018+10191019+ lookup := func(ctx context.Context, lookupDID string) (*MemberWithDomains, error) {
10201020+ m, _ := store.GetMember(ctx, lookupDID)
10211021+ if m == nil {
10221022+ return nil, nil
10231023+ }
10241024+ domains, _ := store.ListMemberDomains(ctx, lookupDID)
10251025+ di := make([]DomainInfo, 0, len(domains))
10261026+ for _, d := range domains {
10271027+ di = append(di, DomainInfo{Domain: d.Domain, APIKeyHash: d.APIKeyHash})
10281028+ }
10291029+ return &MemberWithDomains{
10301030+ DID: m.DID, Status: m.Status,
10311031+ HourlyLimit: m.HourlyLimit, DailyLimit: m.DailyLimit,
10321032+ SendCount: m.SendCount, CreatedAt: m.CreatedAt,
10331033+ Domains: di,
10341034+ }, nil
10351035+ }
10361036+10371037+ sendCheck := func(ctx context.Context, member *AuthMember, from, to string) error {
10381038+ return rateLimiter.Check(ctx, member.DID, member.HourlyLimit, member.DailyLimit)
10391039+ }
10401040+10411041+ onAccept := func(member *AuthMember, from string, to []string, data []byte) error {
10421042+ for _, r := range to {
10431043+ msgID, err := store.InsertMessage(context.Background(), &relaystore.Message{
10441044+ MemberDID: member.DID, FromAddr: from, ToAddr: r,
10451045+ Status: relaystore.MsgQueued, CreatedAt: time.Now().UTC(),
10461046+ })
10471047+ if err != nil {
10481048+ return fmt.Errorf("InsertMessage: %w", err)
10491049+ }
10501050+ if err := queue.Enqueue(&QueueEntry{
10511051+ ID: msgID, From: from, To: r, Data: data, MemberDID: member.DID,
10521052+ }); err != nil {
10531053+ return fmt.Errorf("Enqueue: %w", err)
10541054+ }
10551055+ }
10561056+ return nil
10571057+ }
10581058+10591059+ _, addr, cleanup := testSMTPServer(t, lookup, sendCheck, onAccept)
10601060+ defer cleanup()
10611061+10621062+ c, err := gosmtp.Dial(addr)
10631063+ if err != nil {
10641064+ t.Fatalf("dial: %v", err)
10651065+ }
10661066+ defer c.Close()
10671067+ auth := gosmtp.PlainAuth("", did, apiKey, "127.0.0.1")
10681068+ if err := c.Auth(auth); err != nil {
10691069+ t.Fatalf("Auth: %v", err)
10701070+ }
10711071+ if err := c.Mail("alice@" + domain); err != nil {
10721072+ t.Fatalf("Mail: %v", err)
10731073+ }
10741074+ if err := c.Rcpt("bob@example.org"); err != nil {
10751075+ t.Fatalf("Rcpt: %v", err)
10761076+ }
10771077+ w, err := c.Data()
10781078+ if err != nil {
10791079+ t.Fatalf("Data: %v", err)
10801080+ }
10811081+ if _, err := fmt.Fprintf(w, "From: alice@%s\r\nTo: bob@example.org\r\nSubject: dispatch\r\n\r\nbody\r\n", domain); err != nil {
10821082+ t.Fatalf("write: %v", err)
10831083+ }
10841084+ if err := w.Close(); err != nil {
10851085+ t.Fatalf("close: %v", err)
10861086+ }
10871087+ if err := c.Quit(); err != nil {
10881088+ t.Fatalf("quit: %v", err)
10891089+ }
10901090+10911091+ // Wait for the queue worker to dispatch (DeliverFunc fires) and
10921092+ // then for onDelivery to receive the terminal result. Both should
10931093+ // happen within a couple of seconds — the queue's internal timer
10941094+ // is 30s but Enqueue's q.notify signal wakes processReady
10951095+ // immediately.
10961096+ select {
10971097+ case <-dispatchSignal:
10981098+ case <-time.After(5 * time.Second):
10991099+ t.Fatal("DeliverFunc was never called within 5s")
11001100+ }
11011101+ select {
11021102+ case <-deliveredSignal:
11031103+ case <-time.After(5 * time.Second):
11041104+ t.Fatal("onDelivery was never called within 5s")
11051105+ }
11061106+11071107+ dispatchedMu.Lock()
11081108+ gotDispatched := len(dispatched)
11091109+ var gotEntryRecipient string
11101110+ if gotDispatched > 0 {
11111111+ gotEntryRecipient = dispatched[0].To
11121112+ }
11131113+ dispatchedMu.Unlock()
11141114+ if gotDispatched != 1 {
11151115+ t.Errorf("DeliverFunc fired %d times, want 1", gotDispatched)
11161116+ }
11171117+ if gotEntryRecipient != "bob@example.org" {
11181118+ t.Errorf("dispatched recipient=%q, want bob@example.org", gotEntryRecipient)
11191119+ }
11201120+11211121+ deliveredMu.Lock()
11221122+ gotDelivered := len(delivered)
11231123+ var gotStatus string
11241124+ if gotDelivered > 0 {
11251125+ gotStatus = delivered[0].Status
11261126+ }
11271127+ deliveredMu.Unlock()
11281128+ if gotDelivered != 1 {
11291129+ t.Errorf("onDelivery fired %d times, want 1", gotDelivered)
11301130+ }
11311131+ if gotStatus != "sent" {
11321132+ t.Errorf("delivered status=%q, want sent", gotStatus)
11331133+ }
11341134+}
+11-11
internal/relay/memberhash_cache.go
···1212// process-local cache. The previous implementation rebuilt the cache from the
1313// full members table on every miss, so a sender pumping random VERP local
1414// parts at port 25 could trigger an O(N) full-table scan per inbound message
1515-// and DoS the relay. See #218.
1515+// and DoS the relay.
1616//
1717// This cache adds two defenses:
1818//
···4343// MemberHashMetrics is the narrow metrics surface used by MemberHashCache.
4444// Implementations record counts to Prometheus; nil-safe in tests.
4545type MemberHashMetrics interface {
4646- IncMemberHashHit() // positive cache hit
4747- IncMemberHashNegHit() // negative cache hit (DoS short-circuit)
4848- IncMemberHashMiss() // confirmed miss after rebuild
4949- IncMemberHashRebuild() // a rebuild ran
4646+ IncMemberHashHit() // positive cache hit
4747+ IncMemberHashNegHit() // negative cache hit (DoS short-circuit)
4848+ IncMemberHashMiss() // confirmed miss after rebuild
4949+ IncMemberHashRebuild() // a rebuild ran
5050 IncMemberHashRebuildSkip() // rebuild rate-limited
5151 SetMemberHashSize(positive, negative int)
5252}
···231231232232type noopMemberHashMetrics struct{}
233233234234-func (noopMemberHashMetrics) IncMemberHashHit() {}
235235-func (noopMemberHashMetrics) IncMemberHashNegHit() {}
236236-func (noopMemberHashMetrics) IncMemberHashMiss() {}
237237-func (noopMemberHashMetrics) IncMemberHashRebuild() {}
238238-func (noopMemberHashMetrics) IncMemberHashRebuildSkip() {}
239239-func (noopMemberHashMetrics) SetMemberHashSize(_ int, _ int) {}
234234+func (noopMemberHashMetrics) IncMemberHashHit() {}
235235+func (noopMemberHashMetrics) IncMemberHashNegHit() {}
236236+func (noopMemberHashMetrics) IncMemberHashMiss() {}
237237+func (noopMemberHashMetrics) IncMemberHashRebuild() {}
238238+func (noopMemberHashMetrics) IncMemberHashRebuildSkip() {}
239239+func (noopMemberHashMetrics) SetMemberHashSize(_ int, _ int) {}
···5252 // Without this, a relay restart followed by an Osprey outage
5353 // allows every new DID to send unsuspended for the duration of
5454 // the outage — even DIDs Osprey would have flagged on a healthy
5555- // query. Closes #215.
5555+ // query.
5656 failClosedOnColdCache bool
57575858 // coldCacheRecorder counts fail-open vs fail-closed decisions on
···121121// — a regression from the legacy fail-open behavior, deliberately
122122// chosen because the cold-cache+outage window is exactly when an
123123// attacker can register a new DID and burn reputation before Osprey
124124-// labels arrive (#215). Operators can opt back into fail-open via
124124+// labels arrive. Operators can opt back into fail-open via
125125// SetFailClosedOnColdCache(false) if the security tradeoff doesn't
126126// match their environment.
127127func NewOspreyEnforcer(apiURL string, client *http.Client) *OspreyEnforcer {
···152152// SetSnapshotPath enables on-disk cache persistence. Snapshots are
153153// written periodically by Snapshot() and read by LoadSnapshot() on
154154// startup so a relay restart doesn't reset the cache to empty —
155155-// which is the load-bearing concern for #215. Pass an empty string
155155+// which is the load-bearing concern for cold-cache safety. Pass an empty string
156156// to disable.
157157func (e *OspreyEnforcer) SetSnapshotPath(path string) {
158158 e.snapshotPath = path
···254254// cached, that cached label set is used.
255255//
256256// Cold cache + Osprey unreachable: returns ErrOspreyColdCache when
257257-// failClosedOnColdCache is true (default — closes #215). Operators
257257+// failClosedOnColdCache is true (default). Operators
258258// who prefer the legacy fail-open behavior can call
259259-// SetFailClosedOnColdCache(false), which restores the pre-#215 path
259259+// SetFailClosedOnColdCache(false), which restores the previous path
260260// of returning defaultPolicy with no error.
261261func (e *OspreyEnforcer) GetPolicy(ctx context.Context, did string) (*LabelPolicy, error) {
262262 labels, _, err := e.activeLabelsFor(ctx, did)
···310310 return entry.activeLabels, true, nil
311311 }
312312 // Cold cache + Osprey unreachable. Default behavior is now
313313- // fail-closed (#215): without this branch, a relay restart
313313+ // fail-closed: without this branch, a relay restart
314314 // during an Osprey outage would let attackers send unsuspended
315315 // for the duration of the outage. Operators who need the
316316 // legacy fail-open semantics opt in via SetFailClosedOnColdCache.
+3-3
internal/relay/ospreyenforce_test.go
···523523 // Defensive: Osprey rules will accrue new labels over time. Unknown
524524 // labels must not break policy derivation — they're just ignored.
525525 p := policyFromLabels(map[string]struct{}{
526526- "unknown_label_1": {},
527527- "another_unknown_future_label_from_osprey": {},
526526+ "unknown_label_1": {},
527527+ "another_unknown_future_label_from_osprey": {},
528528 })
529529 if p.Suspended || p.SkipWarming || p.HourlyLimitMultiplier != 1.0 {
530530 t.Errorf("unknown labels must not affect policy, got %+v", p)
···561561 {"auto_suspended", false},
562562 {"highly_trusted", false},
563563 {"", false},
564564- {"shadow", false}, // missing colon
564564+ {"shadow", false}, // missing colon
565565 {"shadow_suspended", false}, // underscore, not prefix
566566 }
567567 for _, tc := range cases {
+4-4
internal/relay/publicrouter.go
···4343// (smtp.atmos.email) goes to infraHandler — unsubscribe + healthz. Redirect
4444// hosts emit a 301 to RedirectTo + the original request URI.
4545type PublicRouter struct {
4646- routes map[string]HostRoute
4747- siteHandler http.Handler
4848- infraHandler http.Handler
4949- fallback http.Handler // used when Host is not in routes
4646+ routes map[string]HostRoute
4747+ siteHandler http.Handler
4848+ infraHandler http.Handler
4949+ fallback http.Handler // used when Host is not in routes
5050}
51515252// NewPublicRouter constructs a router.
···1616 "time"
1717)
18181919+// Tunables that previously appeared as bare literals scattered through
2020+// the queue. Pulling them up to named constants makes the operational
2121+// behavior obvious from the top of the file and keeps duplicated values
2222+// in sync (e.g. the dialer timeout used to live both in NewQueue's
2323+// default DialMX and in deliverMessage's fallback closure).
2424+const (
2525+ // queueHousekeepingInterval is how long Run() waits between
2626+ // housekeeping ticks when nothing has been Enqueue'd. Drives the
2727+ // retry sweep — a lower value retries deferred entries sooner at
2828+ // the cost of busier loops; a higher one delays recovery after a
2929+ // transient remote MTA outage. 30s is the historical value.
3030+ queueHousekeepingInterval = 30 * time.Second
3131+3232+ // defaultMXDialTimeout caps the TCP connect to a single MX host.
3333+ // Production never overrides this; tests inject their own dialer
3434+ // via QueueConfig.DialMX, so this only governs the production
3535+ // default closure in NewQueue and deliverMessage.
3636+ defaultMXDialTimeout = 30 * time.Second
3737+3838+ // defaultDeliveryTimeout caps the entire deliver-to-MX cycle for
3939+ // a single recipient (DNS + dial + EHLO + STARTTLS + MAIL/RCPT/DATA).
4040+ // Crosses TCP and TLS handshakes so it has to be generous; 2m is
4141+ // long enough for high-latency providers but short enough that a
4242+ // hung deliver can't permanently wedge a worker slot.
4343+ defaultDeliveryTimeout = 2 * time.Minute
4444+)
4545+1946// QueueEntry represents a message waiting for delivery.
2047type QueueEntry struct {
2148 ID int64
···4067// OnDeliveryFunc is called after each delivery attempt.
4168type OnDeliveryFunc func(result DeliveryResult)
42697070+// DeliverFunc is the per-entry delivery dispatcher. Production wires
7171+// this to the package-internal deliverMessage (real MX lookup + SMTP);
7272+// integration tests inject a fake that records the entry without
7373+// touching the network. The relayDomain is forwarded so the real path
7474+// can use it as the EHLO hostname per RFC 5321 §4.1.1.1.
7575+//
7676+// Default (when QueueConfig.DeliverFunc is nil): the existing
7777+// deliverMessage call. Setting a non-nil value swaps it out — any
7878+// production caller that doesn't set the field keeps the original
7979+// behavior. See queue_deliver_inject_test.go for the test pattern.
8080+type DeliverFunc func(ctx context.Context, entry *QueueEntry, relayDomain string) DeliveryResult
8181+4382// Queue manages outbound message delivery with retries.
4483type Queue struct {
4584 mu sync.Mutex
···4786 notify chan struct{}
48874988 onDelivery OnDeliveryFunc
5050- spool *Spool // optional — if set, messages are persisted to disk
5151- relayDomain string // EHLO hostname (e.g. "atmos.email")
8989+ deliverFunc DeliverFunc
9090+ spool *Spool // optional — if set, messages are persisted to disk
9191+ relayDomain string // EHLO hostname (e.g. "atmos.email")
5292 metrics *Metrics // optional — nil-safe
53935494 maxRetries int
···66106 RelayDomain string // EHLO hostname for outbound delivery (e.g. "atmos.email")
67107 Workers int // concurrent delivery workers (default 5)
68108 DeliveryTimeout time.Duration // per-delivery timeout (default 2m)
109109+ // DeliverFunc, when non-nil, overrides the default per-entry
110110+ // delivery dispatcher. Production leaves this nil — the queue
111111+ // falls back to the package-internal deliverMessage which does
112112+ // real MX lookup + SMTP. Integration tests inject a fake that
113113+ // records the entry and returns a synthetic DeliveryResult so
114114+ // the test doesn't have to mock DNS or run a fake SMTP server
115115+ // at the edge of the queue worker (installment 4).
116116+ DeliverFunc DeliverFunc
117117+ // LookupMX, when non-nil, replaces the default
118118+ // net.DefaultResolver.LookupMX call inside the production deliver
119119+ // path. Production leaves this nil. Tests inject a resolver that
120120+ // returns a fixed MX (e.g. "test.local") so the deliver path can
121121+ // be exercised against a fake MTA without real DNS.
122122+ LookupMX func(ctx context.Context, domain string) ([]*net.MX, error)
123123+ // DialMX, when non-nil, replaces the default tcp dialer that
124124+ // connects to "<mxHost>:25" inside deliverToMX. Production leaves
125125+ // this nil. Tests inject a dialer that returns a connection to a
126126+ // fake MTA on a random local port regardless of the requested
127127+ // mxHost. Pair with LookupMX to exercise the real deliverMessage
128128+ // path against a fake server.
129129+ DialMX func(ctx context.Context, mxHost string) (net.Conn, error)
69130}
7013171132// DefaultQueueConfig returns sensible defaults for the delivery queue.
···81142 },
82143 MaxSize: 10000,
83144 Workers: 5,
8484- DeliveryTimeout: 2 * time.Minute,
145145+ DeliveryTimeout: defaultDeliveryTimeout,
85146 }
86147}
87148···99160 }
100161 timeout := cfg.DeliveryTimeout
101162 if timeout <= 0 {
102102- timeout = 2 * time.Minute
163163+ timeout = defaultDeliveryTimeout
164164+ }
165165+ // Resolve MX lookup + dialer to the production defaults when the
166166+ // caller didn't override them. Tests inject these to redirect the
167167+ // real deliver path at a fake MTA.
168168+ lookupMX := cfg.LookupMX
169169+ if lookupMX == nil {
170170+ lookupMX = net.DefaultResolver.LookupMX
171171+ }
172172+ dialMX := cfg.DialMX
173173+ if dialMX == nil {
174174+ dialMX = func(ctx context.Context, mxHost string) (net.Conn, error) {
175175+ d := net.Dialer{Timeout: defaultMXDialTimeout}
176176+ return d.DialContext(ctx, "tcp", mxHost+":25")
177177+ }
178178+ }
179179+ // Default DeliverFunc is the production deliver path with the
180180+ // resolved MX lookup + dialer baked in. Tests can still bypass
181181+ // the whole thing by setting cfg.DeliverFunc directly.
182182+ deliverFn := cfg.DeliverFunc
183183+ if deliverFn == nil {
184184+ deliverFn = func(ctx context.Context, entry *QueueEntry, relayDomain string) DeliveryResult {
185185+ return deliverMessageWith(ctx, entry, relayDomain, lookupMX, dialMX)
186186+ }
103187 }
104188 return &Queue{
105189 notify: make(chan struct{}, 1),
106190 onDelivery: onDelivery,
191191+ deliverFunc: deliverFn,
107192 relayDomain: cfg.RelayDomain,
108193 maxRetries: cfg.MaxRetries,
109194 maxSize: cfg.MaxSize,
···167252168253// LoadSpool reloads any messages from the spool directory into the queue.
169254// Call this once at startup, before Run.
255255+//
256256+// Pokes q.notify so the next Run loop picks the entries up immediately
257257+// instead of waiting on the 30s housekeeping timer. Without this kick,
258258+// every cold start delays processing of recovered messages by up to
259259+// 30s — fine for normal restarts, painful when the spool is large and
260260+// the operator just bounced the relay to clear an incident.
170261func (q *Queue) LoadSpool() (int, error) {
171262 if q.spool == nil {
172263 return 0, nil
···181272 q.entries = append(q.entries, e)
182273 }
183274 q.mu.Unlock()
275275+276276+ if len(entries) > 0 {
277277+ // Non-blocking notify so reloaded entries are picked up by the
278278+ // next Run iteration rather than the 30s timer.
279279+ select {
280280+ case q.notify <- struct{}{}:
281281+ default:
282282+ }
283283+ }
284284+184285 return len(entries), nil
185286}
186287···204305205306// Run processes the queue until the context is cancelled.
206307func (q *Queue) Run(ctx context.Context) error {
207207- timer := time.NewTimer(30 * time.Second)
308308+ timer := time.NewTimer(queueHousekeepingInterval)
208309 defer timer.Stop()
209310 for {
210311 select {
···217318 default:
218319 }
219320 }
220220- timer.Reset(30 * time.Second)
321321+ timer.Reset(queueHousekeepingInterval)
221322 q.processReady(ctx)
222323 case <-timer.C:
223223- timer.Reset(30 * time.Second)
324324+ timer.Reset(queueHousekeepingInterval)
224325 q.processReady(ctx)
225326 }
226327 }
···278379func (q *Queue) deliver(ctx context.Context, entry *QueueEntry) {
279380 deliverCtx, cancel := context.WithTimeout(ctx, q.deliveryTimeout)
280381 defer cancel()
281281- result := deliverMessage(deliverCtx, entry, q.relayDomain)
382382+ result := q.deliverFunc(deliverCtx, entry, q.relayDomain)
282383 entry.Attempts++
283384284385 if q.metrics != nil {
···330431 }
331432}
332433333333-// deliverMessage attempts direct MX delivery of a single message.
334334-// relayDomain is used as the EHLO hostname per RFC 5321 §4.1.1.1.
434434+// deliverMessage is the production deliver path with default MX lookup
435435+// (net.DefaultResolver) and TCP dial to port 25. Kept as a thin wrapper
436436+// over deliverMessageWith for callers that don't need to inject seams
437437+// (forwarder.go, opmail.go).
335438func deliverMessage(ctx context.Context, entry *QueueEntry, relayDomain string) DeliveryResult {
439439+ return deliverMessageWith(
440440+ ctx, entry, relayDomain,
441441+ net.DefaultResolver.LookupMX,
442442+ func(ctx context.Context, mxHost string) (net.Conn, error) {
443443+ d := net.Dialer{Timeout: defaultMXDialTimeout}
444444+ return d.DialContext(ctx, "tcp", mxHost+":25")
445445+ },
446446+ )
447447+}
448448+449449+// deliverMessageWith is the production deliver path, parameterized on
450450+// the MX lookup and TCP dialer it uses. Production wires these to
451451+// net.DefaultResolver.LookupMX and a tcp dialer to "<mxHost>:25"; tests
452452+// can swap them to redirect the real deliver path at a fake MTA on a
453453+// random local port. relayDomain is sent as the EHLO hostname
454454+// per RFC 5321 §4.1.1.1.
455455+func deliverMessageWith(
456456+ ctx context.Context,
457457+ entry *QueueEntry,
458458+ relayDomain string,
459459+ lookupMX func(ctx context.Context, domain string) ([]*net.MX, error),
460460+ dialMX func(ctx context.Context, mxHost string) (net.Conn, error),
461461+) DeliveryResult {
336462 result := DeliveryResult{EntryID: entry.ID, MemberDID: entry.MemberDID, Recipient: entry.To}
337463338464 // Extract recipient domain
···346472 domain := parts[1]
347473348474 // Look up MX records
349349- mxRecords, err := net.DefaultResolver.LookupMX(ctx, domain)
475475+ mxRecords, err := lookupMX(ctx, domain)
350476 if err != nil {
351477 result.Status = "deferred"
352478 result.Error = fmt.Sprintf("MX lookup failed: %v", err)
···362488 var lastErr error
363489 for _, mx := range mxRecords {
364490 host := strings.TrimSuffix(mx.Host, ".")
365365- code, err := deliverToMX(ctx, host, entry.From, entry.To, entry.Data, relayDomain)
491491+ code, err := deliverToMX(ctx, host, entry.From, entry.To, entry.Data, relayDomain, dialMX)
366492 if err == nil {
367493 result.Status = "sent"
368494 result.SMTPCode = code
···389515390516// deliverToMX connects to a single MX host and delivers the message.
391517// relayDomain is sent as the EHLO hostname per RFC 5321 §4.1.1.1.
392392-// Returns the SMTP response code and any error.
393393-func deliverToMX(ctx context.Context, mxHost, from, to string, data []byte, relayDomain string) (int, error) {
394394- dialer := net.Dialer{Timeout: 30 * time.Second}
395395- conn, err := dialer.DialContext(ctx, "tcp", mxHost+":25")
518518+// Returns the SMTP response code and any error. dialMX must produce a
519519+// connection already pointed at the destination MX (production wires
520520+// this to a tcp dialer to "<mxHost>:25").
521521+func deliverToMX(
522522+ ctx context.Context,
523523+ mxHost, from, to string,
524524+ data []byte,
525525+ relayDomain string,
526526+ dialMX func(ctx context.Context, mxHost string) (net.Conn, error),
527527+) (int, error) {
528528+ conn, err := dialMX(ctx, mxHost)
396529 if err != nil {
397530 return 0, fmt.Errorf("connect to %s: %w", mxHost, err)
398531 }
+3-4
internal/relay/queue_test.go
···136136 {errors.New("421 service not available"), 421},
137137 {errors.New("250 OK"), 250},
138138 {errors.New("no code here"), 0},
139139- {errors.New("5xx bad"), 0}, // non-digit in position 1
140140- {errors.New("55"), 0}, // too short
141141- {errors.New(""), 0}, // empty
139139+ {errors.New("5xx bad"), 0}, // non-digit in position 1
140140+ {errors.New("55"), 0}, // too short
141141+ {errors.New(""), 0}, // empty
142142 {errors.New("123 some msg"), 123},
143143 }
144144···174174 t.Error("spool file for rejected entry should have been removed")
175175 }
176176}
177177-
+19-16
internal/relay/smtp.go
···8585 sendCheck SendCheckFunc
8686 onAccept OnAcceptFunc
8787 domain string
8888- metrics *Metrics // optional — nil-safe
8989- dnsGate *DNSGate // optional — nil-safe
8888+ metrics *Metrics // optional — nil-safe
8989+ dnsGate *DNSGate // optional — nil-safe
9090}
91919292// SetMetrics attaches Prometheus metrics to the SMTP server. Nil-safe.
···233233 }
234234235235 if s.server.dnsGate != nil {
236236- var selectors []string
237237- if matched.DKIMSelector != "" {
238238- selectors = append(selectors, matched.DKIMSelector)
239239- }
240240- if err := s.server.dnsGate.Check(context.Background(), matched.Domain, selectors, matched.CreatedAt); err != nil {
241241- log.Printf("smtp.auth: did=%q domain=%s ip=%q success=false failure_reason=dns_verification error=%v", username, matched.Domain, s.conn.Hostname(), err)
242242- authFail()
243243- return &smtp.SMTPError{
244244- Code: 451,
245245- EnhancedCode: smtp.EnhancedCode{4, 7, 0},
246246- Message: "DNS verification failed — configure SPF and DKIM records for " + matched.Domain + " and retry",
247247- }
236236+ var selectors []string
237237+ if matched.DKIMSelector != "" {
238238+ selectors = append(selectors, matched.DKIMSelector+"r", matched.DKIMSelector+"e")
239239+ }
240240+ if err := s.server.dnsGate.Check(context.Background(), matched.Domain, selectors); err != nil {
241241+ log.Printf("smtp.auth: did=%q domain=%s ip=%q success=false failure_reason=dns_verification error=%v", username, matched.Domain, s.conn.Hostname(), err)
242242+ authFail()
243243+ return &smtp.SMTPError{
244244+ Code: 451,
245245+ EnhancedCode: smtp.EnhancedCode{4, 7, 0},
246246+ Message: "DNS verification failed — configure SPF and DKIM records for " + matched.Domain + " and retry",
248247 }
249248 }
249249+ }
250250251251- if mwd.Status == relaystore.StatusSuspended {
251251+ if mwd.Status == relaystore.StatusSuspended {
252252 log.Printf("smtp.auth: did=%q ip=%q success=false failure_reason=suspended", username, s.conn.Hostname())
253253 authFail()
254254 return &smtp.SMTPError{
···500500 r := textproto.NewReader(bufio.NewReader(strings.NewReader(string(data))))
501501 header, err := r.ReadMIMEHeader()
502502 if err != nil {
503503- return fmt.Errorf("From header domain must match %s", memberDomain)
503503+ // Lowercase per Go convention (ST1005); also wrap the real
504504+ // parse error so failures are debuggable instead of getting
505505+ // reported as a domain-alignment problem.
506506+ return fmt.Errorf("read MIME header: %w", err)
504507 }
505508506509 if header.Get("From") == "" {
+24-8
internal/relay/smtp_test.go
···217217 }
218218219219 var mu sync.Mutex
220220- var accepted []struct{ from string; to []string; data []byte }
220220+ var accepted []struct {
221221+ from string
222222+ to []string
223223+ data []byte
224224+ }
221225222226 accept := func(member *AuthMember, from string, to []string, data []byte) error {
223227 mu.Lock()
224224- accepted = append(accepted, struct{ from string; to []string; data []byte }{from, to, data})
228228+ accepted = append(accepted, struct {
229229+ from string
230230+ to []string
231231+ data []byte
232232+ }{from, to, data})
225233 mu.Unlock()
226234 return nil
227235 }
···784792 hash, _ := HashAPIKey(apiKey)
785793 lookup := func(ctx context.Context, did string) (*MemberWithDomains, error) {
786794 return &MemberWithDomains{
787787- DID: did,
788788- Status: relaystore.StatusActive,
795795+ DID: did,
796796+ Status: relaystore.StatusActive,
789797 Domains: []DomainInfo{{Domain: "example.com", APIKeyHash: hash}},
790798 }, nil
791799 }
···794802 from string
795803 to []string
796804 }
805805+ var accepted sync.WaitGroup
806806+ accepted.Add(1)
797807 accept := func(member *AuthMember, from string, to []string, data []byte) error {
798808 captured.mu.Lock()
799809 captured.from = from
800810 captured.to = append([]string(nil), to...)
801811 captured.mu.Unlock()
812812+ accepted.Done()
802813 return nil
803814 }
804815 _, addr, cleanup := testSMTPServer(t, lookup, nil, accept)
···820831 r.send(t, "DATA\r\n")
821832 r.send(t, "From: noreply@example.com\r\nTo: user@gmail.com\r\nSubject: x\r\n\r\nbody\r\n.\r\n")
822833834834+ accepted.Wait()
823835 captured.mu.Lock()
824836 defer captured.mu.Unlock()
825837 // Injection guard: the MAIL FROM value must be exactly what was in
···849861 hash, _ := HashAPIKey(apiKey)
850862 lookup := func(ctx context.Context, did string) (*MemberWithDomains, error) {
851863 return &MemberWithDomains{
852852- DID: did,
853853- Status: relaystore.StatusActive,
864864+ DID: did,
865865+ Status: relaystore.StatusActive,
854866 Domains: []DomainInfo{{Domain: "example.com", APIKeyHash: hash}},
855867 }, nil
856868 }
···858870 mu sync.Mutex
859871 to []string
860872 }
873873+ var accepted sync.WaitGroup
874874+ accepted.Add(1)
861875 accept := func(member *AuthMember, from string, to []string, data []byte) error {
862876 captured.mu.Lock()
863877 captured.to = append([]string(nil), to...)
864878 captured.mu.Unlock()
879879+ accepted.Done()
865880 return nil
866881 }
867882 _, addr, cleanup := testSMTPServer(t, lookup, nil, accept)
···882897 r.send(t, "DATA\r\n")
883898 r.send(t, "From: noreply@example.com\r\nTo: user@gmail.com\r\nSubject: x\r\n\r\nbody\r\n.\r\n")
884899900900+ accepted.Wait()
885901 captured.mu.Lock()
886902 defer captured.mu.Unlock()
887903 for _, to := range captured.to {
···910926 hash, _ := HashAPIKey(apiKey)
911927 lookup := func(ctx context.Context, did string) (*MemberWithDomains, error) {
912928 return &MemberWithDomains{
913913- DID: did,
914914- Status: relaystore.StatusActive,
929929+ DID: did,
930930+ Status: relaystore.StatusActive,
915931 Domains: []DomainInfo{{Domain: "example.com", APIKeyHash: hash}},
916932 }, nil
917933 }
+6-1
internal/relay/spool.go
···88 "log"
99 "os"
1010 "path/filepath"
1111+ "sort"
1112 "strings"
1213)
1314···4142// a message that Write claimed to persist. Without these fsyncs the
4243// rename can appear to succeed but be reordered behind a crash,
4344// leaving either a zero-length file or no file at all when the kernel
4444-// replays the journal — exactly the orphan case (#208) that produces
4545+// replays the journal — exactly the orphan case that produces
4546// duplicate-delivery on SMTP retry.
4647func (s *Spool) Write(entry *QueueEntry) error {
4748 se := spoolEntry{
···186187 Attempts: se.Attempts,
187188 })
188189 }
190190+191191+ sort.Slice(result, func(i, j int) bool {
192192+ return result[i].ID < result[j].ID
193193+ })
189194190195 return result, nil
191196}
···143143// Unsubscriber bundles the signing key + store for the HTTP handler and
144144// the outbound-header helpers.
145145type Unsubscriber struct {
146146- key []byte
147147- store SuppressionStore
146146+ key []byte
147147+ store SuppressionStore
148148 // BaseURL is the public URL prefix used in List-Unsubscribe headers,
149149 // e.g. "https://smtp.atmos.email". No trailing slash.
150150 BaseURL string
···181181// Handler returns an http.Handler that serves GET /u/{token} and POST /u/{token}.
182182//
183183// POST: RFC 8058 one-click unsubscribe. Body is ignored (per spec, MUAs send
184184-// "List-Unsubscribe=One-Click" but we accept any body). Records suppression,
185185-// returns 200 with minimal text/plain body.
184184+//
185185+// "List-Unsubscribe=One-Click" but we accept any body). Records suppression,
186186+// returns 200 with minimal text/plain body.
187187+//
186188// GET: Human-facing confirmation page. Records suppression on click and
187187-// returns a minimal HTML confirmation.
189189+//
190190+// returns a minimal HTML confirmation.
191191+//
188192// Invalid/expired tokens return 404 (to avoid leaking whether the signing
189193// key has changed).
190194//
+9-9
internal/relay/unsubscribe_test.go
···141141 now := time.Now()
142142143143 cases := []string{
144144- "", // empty
145145- "no-dot-separator", // missing signature separator
146146- "bad!chars.!!", // invalid base64
147147- "....", // empty components
148148- ".justsig", // empty payload
144144+ "", // empty
145145+ "no-dot-separator", // missing signature separator
146146+ "bad!chars.!!", // invalid base64
147147+ "....", // empty components
148148+ ".justsig", // empty payload
149149 }
150150 for _, c := range cases {
151151 if _, err := VerifyUnsubToken(key, c, now); err == nil {
···264264 u := NewUnsubscriber(key, store, "https://smtp.atmos.email")
265265266266 cases := []string{
267267- "/u/", // empty
268268- "/u/malformed", // no dot separator
269269- "/u/AAAA.BBBB", // valid b64 but bad sig
270270- "/u/some/nested/path", // extra slashes
267267+ "/u/", // empty
268268+ "/u/malformed", // no dot separator
269269+ "/u/AAAA.BBBB", // valid b64 but bad sig
270270+ "/u/some/nested/path", // extra slashes
271271 }
272272 for _, path := range cases {
273273 req := httptest.NewRequest(http.MethodPost, path, nil)
+2-2
internal/relay/warmup_scheduler.go
···1111)
12121313// MemberWarmupCandidate carries the per-member info the scheduler needs
1414-// to make a fair selection (#219). DID is required; CreatedAt is used to
1414+// to make a fair selection. DID is required; CreatedAt is used to
1515// boost newly-enrolled members so they reach mailbox-provider visibility
1616// faster than long-tenured ones who already have a sending history.
1717type MemberWarmupCandidate struct {
···2727// Selection is rotation-fair (every eligible member gets warmed up
2828// before any one repeats) with a tiebreaker that prefers newly-enrolled
2929// members so a long-tenured member can't crowd out new enrollees on
3030-// the first iteration through the pool. See #219.
3030+// the first iteration through the pool.
3131type WarmupScheduler struct {
3232 sender *WarmupSender
3333 listCandidates func(ctx context.Context) ([]MemberWarmupCandidate, error)
+2-2
internal/relaystore/bypass_audit_test.go
···7272}
73737474// TestListBypassDIDs_KeepsLegacyPermanent — entries migrated from the
7575-// pre-#213 schema have expires_at='' and represent already-deployed
7575+// pre-#213 schema have expires_at=” and represent already-deployed
7676// permanent bypasses. We must not retroactively evict them on the
7777// migration; an operator has to convert them by re-adding with expiry.
7878func TestListBypassDIDs_KeepsLegacyPermanent(t *testing.T) {
···141141142142// TestPurgeExpiredBypassDIDs_LeavesLegacyAlone confirms the
143143// grandfather invariant — even mid-purge, legacy permanent entries
144144-// (expires_at='') remain.
144144+// (expires_at=”) remain.
145145func TestPurgeExpiredBypassDIDs_LeavesLegacyAlone(t *testing.T) {
146146 s := testStore(t)
147147 ctx := context.Background()
+2-2
internal/relaystore/inbound_messages.go
···2424const (
2525 InboundClassBounceDSN = "bounce-dsn"
2626 InboundClassFBLARF = "fbl-arf"
2727- InboundClassOperator = "operator" // postmaster@ / abuse@ — operator-monitored
2828- InboundClassReply = "reply" // forwarded human reply to a member inbox
2727+ InboundClassOperator = "operator" // postmaster@ / abuse@ — operator-monitored
2828+ InboundClassReply = "reply" // forwarded human reply to a member inbox
2929 InboundClassSRSBounce = "srs-bounce"
3030 InboundClassUnknown = "unknown"
3131)
+1-1
internal/relaystore/observability.go
···9494// before any error escapes; InUse near MaxOpenConns means the
9595// next caller will wait. Combined with BusyErrorClassify on the
9696// hot writers, this gives operators a complete picture without
9797-// touching every callsite. Closes #210.
9797+// touching every callsite.
9898func (s *Store) SampleStats() PoolStats {
9999 st := s.db.Stats()
100100 return PoolStats{
+2-2
internal/relaystore/pending_notifications.go
···1515// dead-letter table rather than crashing the worker, so forgetting to
1616// wire one up is loud but non-fatal.
1717const (
1818- NotificationKindWelcome = "welcome"
1919- NotificationKindKeyRegenerated = "key_regenerated"
1818+ NotificationKindWelcome = "welcome"
1919+ NotificationKindKeyRegenerated = "key_regenerated"
2020 NotificationKindFBLComplaint = "fbl_complaint"
2121 NotificationKindEmailVerification = "email_verification"
2222)
+75
internal/relaystore/pii.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relaystore
44+55+import (
66+ "crypto/aes"
77+ "crypto/cipher"
88+ "crypto/rand"
99+ "encoding/base64"
1010+ "fmt"
1111+ "strings"
1212+)
1313+1414+const piiPrefix = "ENC:"
1515+1616+// PIIKey is a 32-byte AES-256 key used for column-level encryption of PII
1717+// fields (contact_email). When nil, values are stored and returned as
1818+// plaintext (dev/test backward compat).
1919+type PIIKey []byte
2020+2121+// EncryptPII encrypts a plaintext value with AES-256-GCM. Returns a string
2222+// prefixed with "ENC:" followed by base64(nonce + ciphertext + tag).
2323+// If key is nil or value is empty, returns value unchanged.
2424+func EncryptPII(key PIIKey, value string) (string, error) {
2525+ if len(key) == 0 || value == "" {
2626+ return value, nil
2727+ }
2828+ block, err := aes.NewCipher(key)
2929+ if err != nil {
3030+ return "", fmt.Errorf("pii: new cipher: %w", err)
3131+ }
3232+ gcm, err := cipher.NewGCM(block)
3333+ if err != nil {
3434+ return "", fmt.Errorf("pii: new gcm: %w", err)
3535+ }
3636+ nonce := make([]byte, gcm.NonceSize())
3737+ if _, err := rand.Read(nonce); err != nil {
3838+ return "", fmt.Errorf("pii: rand nonce: %w", err)
3939+ }
4040+ ciphertext := gcm.Seal(nonce, nonce, []byte(value), nil)
4141+ return piiPrefix + base64.StdEncoding.EncodeToString(ciphertext), nil
4242+}
4343+4444+// DecryptPII decrypts a value produced by EncryptPII. If the value doesn't
4545+// have the "ENC:" prefix (plaintext or legacy row), it's returned as-is.
4646+// If key is nil, returns value unchanged (graceful degradation).
4747+func DecryptPII(key PIIKey, value string) (string, error) {
4848+ if !strings.HasPrefix(value, piiPrefix) {
4949+ return value, nil
5050+ }
5151+ if len(key) == 0 {
5252+ return "", fmt.Errorf("pii: encrypted value but no key configured")
5353+ }
5454+ raw, err := base64.StdEncoding.DecodeString(value[len(piiPrefix):])
5555+ if err != nil {
5656+ return "", fmt.Errorf("pii: base64 decode: %w", err)
5757+ }
5858+ block, err := aes.NewCipher(key)
5959+ if err != nil {
6060+ return "", fmt.Errorf("pii: new cipher: %w", err)
6161+ }
6262+ gcm, err := cipher.NewGCM(block)
6363+ if err != nil {
6464+ return "", fmt.Errorf("pii: new gcm: %w", err)
6565+ }
6666+ nonceSize := gcm.NonceSize()
6767+ if len(raw) < nonceSize {
6868+ return "", fmt.Errorf("pii: ciphertext too short")
6969+ }
7070+ plaintext, err := gcm.Open(nil, raw[:nonceSize], raw[nonceSize:], nil)
7171+ if err != nil {
7272+ return "", fmt.Errorf("pii: decrypt: %w", err)
7373+ }
7474+ return string(plaintext), nil
7575+}
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package relaystore
44+55+import (
66+ "context"
77+ "database/sql"
88+ "fmt"
99+ "time"
1010+)
1111+1212+// SenderReputation aggregates a member's send / bounce / complaint counts
1313+// over a rolling window, plus the current suspension state. It feeds the
1414+// labeler's clean-sender computation and gives operators an
1515+// at-a-glance view of any member's deliverability posture.
1616+type SenderReputation struct {
1717+ DID string `json:"did"`
1818+ Since time.Time `json:"since"`
1919+ Until time.Time `json:"until"`
2020+ Total int64 `json:"total"` // delivery_result + relay_rejected
2121+ Bounces int64 `json:"bounces"` // bounce_received
2222+ Complaints int64 `json:"complaints"` // FBL/ARF complaints attributed to this DID
2323+ SuspendedNow bool `json:"suspendedNow"` // members.status == 'suspended'
2424+}
2525+2626+// SenderReputation returns the per-DID rollup for events with
2727+// event_timestamp >= since. The Until field is set to time.Now() at the
2828+// moment of the call so callers can pin the window for downstream use.
2929+//
3030+// The DID is not validated here — callers should pass a syntactically
3131+// valid did:plc / did:web string. An unknown DID returns a zero-count
3232+// rollup (Total=0, Bounces=0, Complaints=0, SuspendedNow=false), not an
3333+// error: that is the same shape as a known member who has not sent in
3434+// the window, and the caller can decide what to do.
3535+func (s *Store) SenderReputation(ctx context.Context, did string, since time.Time) (*SenderReputation, error) {
3636+ until := time.Now().UTC()
3737+ rep := &SenderReputation{
3838+ DID: did,
3939+ Since: since.UTC(),
4040+ Until: until,
4141+ }
4242+4343+ sinceStr := formatTime(since.UTC())
4444+4545+ // Total + Bounces from relay_events. One scan is enough since the
4646+ // counts are cheap and we want both anyway; using two queries keeps
4747+ // the WHERE clauses readable and the indexes well-used
4848+ // (idx_relay_events_sender_did + the action_name secondary index).
4949+ if err := s.db.QueryRowContext(ctx,
5050+ `SELECT COUNT(*) FROM relay_events
5151+ WHERE sender_did = ? AND event_timestamp >= ?
5252+ AND action_name IN ('delivery_result','relay_rejected')`,
5353+ did, sinceStr,
5454+ ).Scan(&rep.Total); err != nil {
5555+ return nil, fmt.Errorf("count total events: %w", err)
5656+ }
5757+5858+ if err := s.db.QueryRowContext(ctx,
5959+ `SELECT COUNT(*) FROM relay_events
6060+ WHERE sender_did = ? AND event_timestamp >= ?
6161+ AND action_name = 'bounce_received'`,
6262+ did, sinceStr,
6363+ ).Scan(&rep.Bounces); err != nil {
6464+ return nil, fmt.Errorf("count bounce events: %w", err)
6565+ }
6666+6767+ // Complaints from inbound_messages (FBL/ARF). The classification
6868+ // constant matches InboundClassFBLARF in inbound_messages.go; using
6969+ // the literal here avoids a circular import-free constant export.
7070+ if err := s.db.QueryRowContext(ctx,
7171+ `SELECT COUNT(*) FROM inbound_messages
7272+ WHERE member_did = ? AND received_at >= ?
7373+ AND classification = ?`,
7474+ did, sinceStr, InboundClassFBLARF,
7575+ ).Scan(&rep.Complaints); err != nil {
7676+ return nil, fmt.Errorf("count complaints: %w", err)
7777+ }
7878+7979+ // Suspension state. A missing member row is not a SQL error — it
8080+ // just means we have no record of this DID in members, treat it as
8181+ // not suspended (the labeler will then evaluate purely on send
8282+ // volume, which is correct).
8383+ var status string
8484+ err := s.db.QueryRowContext(ctx,
8585+ `SELECT status FROM members WHERE did = ?`, did,
8686+ ).Scan(&status)
8787+ switch {
8888+ case err == sql.ErrNoRows:
8989+ rep.SuspendedNow = false
9090+ case err != nil:
9191+ return nil, fmt.Errorf("read member status: %w", err)
9292+ default:
9393+ rep.SuspendedNow = status == StatusSuspended
9494+ }
9595+9696+ return rep, nil
9797+}
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package scheduler
44+55+import (
66+ "context"
77+ "errors"
88+ "fmt"
99+ "io"
1010+ "log"
1111+ "net/http"
1212+ "net/url"
1313+ "strings"
1414+ "sync/atomic"
1515+ "time"
1616+1717+ "atmosphere-mail/internal/label"
1818+ "atmosphere-mail/internal/loghash"
1919+ "atmosphere-mail/internal/store"
2020+)
2121+2222+// TombstoneChecker periodically polls plc.directory for the current status
2323+// of every did:plc that has at least one active label. Tombstoned DIDs
2424+// (per the PLC #plc_tombstone op) get all of their labels negated.
2525+//
2626+// Why this exists: previously our labels stayed live indefinitely
2727+// once issued. If a member retired their atproto identity on PLC after
2828+// being labeled, our `verified-mail-operator` and `relay-member` labels
2929+// would continue to vouch for a non-existent account. The reverify
3030+// scheduler couldn't catch this because domain.Verify can pass briefly
3131+// via cached PDS records even after the source DID is gone.
3232+//
3333+// did:web DIDs are skipped — they're not on PLC, and their lifecycle is
3434+// already covered by the existing reverify path (the .well-known
3535+// document either resolves or it doesn't).
3636+//
3737+// Rate-limiting: PLC publishes fair-use guidelines suggesting on the
3838+// order of 2-3 req/s. We default to 500ms between requests (2 req/s)
3939+// with a configurable knob for ops to tune.
4040+type TombstoneChecker struct {
4141+ manager *label.Manager
4242+ store *store.Store
4343+ client *http.Client
4444+ plcURL string
4545+ interval time.Duration
4646+ delay time.Duration
4747+4848+ // Atomic counters exposed via Stats() — read by the labeler's
4949+ // /metrics handler. Names match the Prometheus convention used by
5050+ // the rest of the codebase.
5151+ checksOK atomic.Int64
5252+ checksTombstoned atomic.Int64
5353+ checksErr atomic.Int64
5454+ lastRunUnix atomic.Int64 // Unix seconds; 0 if never run
5555+}
5656+5757+// TombstoneStats is a snapshot of the checker's counters for observability.
5858+type TombstoneStats struct {
5959+ ChecksOK int64
6060+ ChecksTombstoned int64
6161+ ChecksErr int64
6262+ LastRunAt time.Time // zero value if never run
6363+}
6464+6565+// NewTombstoneChecker constructs a checker.
6666+//
6767+// plcURL: e.g. "https://plc.directory" (no trailing slash). Tests inject
6868+// an httptest.Server URL.
6969+// interval: how often to run the full pass. 24h is sensible for production.
7070+// requestDelay: minimum gap between PLC requests within a single pass,
7171+// for fair-use compliance. 500ms = 2 req/s.
7272+func NewTombstoneChecker(manager *label.Manager, st *store.Store, plcURL string, interval, requestDelay time.Duration) *TombstoneChecker {
7373+ return &TombstoneChecker{
7474+ manager: manager,
7575+ store: st,
7676+ client: &http.Client{Timeout: 30 * time.Second},
7777+ plcURL: strings.TrimRight(plcURL, "/"),
7878+ interval: interval,
7979+ delay: requestDelay,
8080+ }
8181+}
8282+8383+// Run starts the periodic loop. Blocks until ctx is cancelled. Returns
8484+// ctx.Err() on cancellation; logs (does not return) per-pass errors.
8585+func (t *TombstoneChecker) Run(ctx context.Context) error {
8686+ ticker := time.NewTicker(t.interval)
8787+ defer ticker.Stop()
8888+8989+ for {
9090+ select {
9191+ case <-ctx.Done():
9292+ return ctx.Err()
9393+ case <-ticker.C:
9494+ if err := t.RunOnce(ctx); err != nil {
9595+ log.Printf("plc-tombstone: pass error: %v", err)
9696+ }
9797+ }
9898+ }
9999+}
100100+101101+// RunOnce executes a single pass over all labeled did:plc DIDs.
102102+//
103103+// Errors at the per-DID level are logged and counted; only outermost
104104+// fatal errors (e.g. store unavailable) bubble up. This matches the
105105+// reverify scheduler's robustness: a transient PLC outage on one DID
106106+// shouldn't abort the whole sweep.
107107+func (t *TombstoneChecker) RunOnce(ctx context.Context) error {
108108+ defer t.lastRunUnix.Store(time.Now().Unix())
109109+110110+ atts, err := t.store.ListAttestations(ctx)
111111+ if err != nil {
112112+ return fmt.Errorf("list attestations: %w", err)
113113+ }
114114+115115+ // Distinct did:plc set. did:web is skipped — see package doc.
116116+ seen := make(map[string]struct{}, len(atts))
117117+ for _, a := range atts {
118118+ if !strings.HasPrefix(a.DID, "did:plc:") {
119119+ continue
120120+ }
121121+ seen[a.DID] = struct{}{}
122122+ }
123123+124124+ for did := range seen {
125125+ select {
126126+ case <-ctx.Done():
127127+ return ctx.Err()
128128+ default:
129129+ }
130130+131131+ status, err := t.checkDID(ctx, did)
132132+ switch {
133133+ case err != nil:
134134+ t.checksErr.Add(1)
135135+ log.Printf("plc-tombstone: check did_hash=%s: %v", loghash.ForLog(did), err)
136136+ case status == statusTombstoned:
137137+ t.checksTombstoned.Add(1)
138138+ log.Printf("plc-tombstone: detected tombstone did_hash=%s — negating all labels", loghash.ForLog(did))
139139+ if err := t.manager.NegateAllLabelsForDID(ctx, did, "plc_tombstone"); err != nil {
140140+ log.Printf("plc-tombstone: negate did_hash=%s: %v", loghash.ForLog(did), err)
141141+ }
142142+ default:
143143+ t.checksOK.Add(1)
144144+ }
145145+146146+ // Fair-use rate limit between PLC requests. Skipped on the
147147+ // last DID via the loop's natural exit.
148148+ select {
149149+ case <-ctx.Done():
150150+ return ctx.Err()
151151+ case <-time.After(t.delay):
152152+ }
153153+ }
154154+155155+ return nil
156156+}
157157+158158+type plcStatus int
159159+160160+const (
161161+ statusOK plcStatus = iota
162162+ statusTombstoned
163163+)
164164+165165+// checkDID issues a single PLC lookup for the given DID. Returns:
166166+// - (statusOK, nil) on HTTP 200
167167+// - (statusTombstoned, nil) on HTTP 410 Gone (the canonical PLC
168168+// tombstone signal — the directory returns
169169+// 410 with a body containing the tombstone
170170+// op for any DID that's been retired)
171171+// - (_, err) on network error, 5xx after retries, or
172172+// unexpected status code
173173+//
174174+// 4xx (other than 410) is reported as an error rather than treated as
175175+// tombstone — those usually indicate a malformed DID or a PLC API change
176176+// rather than a real deactivation, and labels should NOT come down on
177177+// guesses.
178178+func (t *TombstoneChecker) checkDID(ctx context.Context, did string) (plcStatus, error) {
179179+ const maxAttempts = 3
180180+ backoff := 1 * time.Second
181181+182182+ var lastErr error
183183+ for attempt := 1; attempt <= maxAttempts; attempt++ {
184184+ req, err := http.NewRequestWithContext(ctx, http.MethodGet, t.plcURL+"/"+url.PathEscape(did), nil)
185185+ if err != nil {
186186+ return 0, fmt.Errorf("build request: %w", err)
187187+ }
188188+ req.Header.Set("User-Agent", "atmosphere-mail-labeler/1 (+https://atmospheremail.com)")
189189+190190+ resp, err := t.client.Do(req)
191191+ if err != nil {
192192+ lastErr = err
193193+ if attempt < maxAttempts {
194194+ select {
195195+ case <-ctx.Done():
196196+ return 0, ctx.Err()
197197+ case <-time.After(backoff):
198198+ backoff *= 2
199199+ continue
200200+ }
201201+ }
202202+ return 0, fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
203203+ }
204204+205205+ // Drain + close body even when we're going to discard.
206206+ _, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 1<<20))
207207+ resp.Body.Close()
208208+209209+ switch resp.StatusCode {
210210+ case http.StatusOK:
211211+ return statusOK, nil
212212+ case http.StatusGone:
213213+ return statusTombstoned, nil
214214+ default:
215215+ if resp.StatusCode >= 500 && attempt < maxAttempts {
216216+ select {
217217+ case <-ctx.Done():
218218+ return 0, ctx.Err()
219219+ case <-time.After(backoff):
220220+ backoff *= 2
221221+ continue
222222+ }
223223+ }
224224+ return 0, fmt.Errorf("plc returned status %d", resp.StatusCode)
225225+ }
226226+ }
227227+ return 0, errors.New("unreachable")
228228+}
229229+230230+// Stats returns a snapshot of the checker's counters for the
231231+// labeler's /metrics endpoint.
232232+func (t *TombstoneChecker) Stats() TombstoneStats {
233233+ last := t.lastRunUnix.Load()
234234+ var when time.Time
235235+ if last > 0 {
236236+ when = time.Unix(last, 0).UTC()
237237+ }
238238+ return TombstoneStats{
239239+ ChecksOK: t.checksOK.Load(),
240240+ ChecksTombstoned: t.checksTombstoned.Load(),
241241+ ChecksErr: t.checksErr.Load(),
242242+ LastRunAt: when,
243243+ }
244244+}
+285
internal/scheduler/plc_tombstone_test.go
···11+// SPDX-License-Identifier: AGPL-3.0-or-later
22+33+package scheduler
44+55+import (
66+ "context"
77+ "net/http"
88+ "net/http/httptest"
99+ "strings"
1010+ "sync/atomic"
1111+ "testing"
1212+ "time"
1313+1414+ "atmosphere-mail/internal/label"
1515+ "atmosphere-mail/internal/store"
1616+)
1717+1818+// plcFixture is a minimal stand-in for plc.directory's GET /{did}
1919+// endpoint. Per-DID responses are configured up-front; the handler
2020+// records every request so tests can assert call counts.
2121+type plcFixture struct {
2222+ responses map[string]int // did -> http status to return
2323+ calls atomic.Int64
2424+}
2525+2626+func newPLCFixture(responses map[string]int) *plcFixture {
2727+ return &plcFixture{responses: responses}
2828+}
2929+3030+func (f *plcFixture) ServeHTTP(w http.ResponseWriter, r *http.Request) {
3131+ f.calls.Add(1)
3232+ // Path is "/<did>" — strip the leading slash.
3333+ did := strings.TrimPrefix(r.URL.Path, "/")
3434+ status, ok := f.responses[did]
3535+ if !ok {
3636+ http.Error(w, "did not configured in fixture", http.StatusNotFound)
3737+ return
3838+ }
3939+ w.WriteHeader(status)
4040+ w.Write([]byte("{}\n"))
4141+}
4242+4343+func newTestManager(t *testing.T) (*label.Manager, *store.Store) {
4444+ t.Helper()
4545+ s, err := store.New(":memory:")
4646+ if err != nil {
4747+ t.Fatal(err)
4848+ }
4949+ t.Cleanup(func() { s.Close() })
5050+5151+ signer := newSigner(t)
5252+ mgr := label.NewManager(signer, s, passDNS(), passDomain())
5353+ return mgr, s
5454+}
5555+5656+// seedLabeled inserts an attestation for did/domain and pushes it
5757+// through ProcessAttestation so a real label exists. Returns the
5858+// number of active labels created.
5959+func seedLabeled(t *testing.T, ctx context.Context, mgr *label.Manager, s *store.Store, did, domain string) int {
6060+ t.Helper()
6161+ att := &store.Attestation{
6262+ DID: did,
6363+ Domain: domain,
6464+ DKIMSelectors: []string{"default"},
6565+ CreatedAt: time.Now().UTC(),
6666+ }
6767+ if err := s.UpsertAttestation(ctx, att); err != nil {
6868+ t.Fatal(err)
6969+ }
7070+ if err := mgr.ProcessAttestation(ctx, att); err != nil {
7171+ t.Fatal(err)
7272+ }
7373+ labels, err := s.GetActiveLabelsForDID(ctx, did)
7474+ if err != nil {
7575+ t.Fatal(err)
7676+ }
7777+ return len(labels)
7878+}
7979+8080+// TestTombstoneChecker_NegatesOn410 is the core happy-path: a labeled
8181+// DID returns 410 Gone from the fixture (the PLC tombstone signal),
8282+// and the checker negates all of its active labels.
8383+func TestTombstoneChecker_NegatesOn410(t *testing.T) {
8484+ ctx := context.Background()
8585+ mgr, s := newTestManager(t)
8686+8787+ did := "did:plc:tombstoneaaaaaaaaaaaaaaa"
8888+ if n := seedLabeled(t, ctx, mgr, s, did, "tombstone.example.com"); n == 0 {
8989+ t.Fatal("setup: expected at least 1 active label")
9090+ }
9191+9292+ fixture := newPLCFixture(map[string]int{did: http.StatusGone})
9393+ srv := httptest.NewServer(fixture)
9494+ defer srv.Close()
9595+9696+ checker := NewTombstoneChecker(mgr, s, srv.URL, time.Hour, 1*time.Millisecond)
9797+ if err := checker.RunOnce(ctx); err != nil {
9898+ t.Fatalf("RunOnce: %v", err)
9999+ }
100100+101101+ stats := checker.Stats()
102102+ if stats.ChecksTombstoned != 1 {
103103+ t.Errorf("ChecksTombstoned = %d, want 1", stats.ChecksTombstoned)
104104+ }
105105+ if stats.ChecksOK != 0 {
106106+ t.Errorf("ChecksOK = %d, want 0", stats.ChecksOK)
107107+ }
108108+ if stats.LastRunAt.IsZero() {
109109+ t.Error("LastRunAt should be set after RunOnce")
110110+ }
111111+112112+ labels, err := s.GetActiveLabelsForDID(ctx, did)
113113+ if err != nil {
114114+ t.Fatal(err)
115115+ }
116116+ if len(labels) != 0 {
117117+ t.Errorf("got %d active labels, want 0 after tombstone", len(labels))
118118+ }
119119+}
120120+121121+// TestTombstoneChecker_KeepsOn200 guards against false positives: a
122122+// healthy DID (200) must NOT have its labels touched.
123123+func TestTombstoneChecker_KeepsOn200(t *testing.T) {
124124+ ctx := context.Background()
125125+ mgr, s := newTestManager(t)
126126+127127+ did := "did:plc:healthyaaaaaaaaaaaaaaaa3"
128128+ beforeCount := seedLabeled(t, ctx, mgr, s, did, "healthy.example.com")
129129+ if beforeCount == 0 {
130130+ t.Fatal("setup: expected at least 1 active label")
131131+ }
132132+133133+ fixture := newPLCFixture(map[string]int{did: http.StatusOK})
134134+ srv := httptest.NewServer(fixture)
135135+ defer srv.Close()
136136+137137+ checker := NewTombstoneChecker(mgr, s, srv.URL, time.Hour, 1*time.Millisecond)
138138+ if err := checker.RunOnce(ctx); err != nil {
139139+ t.Fatalf("RunOnce: %v", err)
140140+ }
141141+142142+ stats := checker.Stats()
143143+ if stats.ChecksOK != 1 {
144144+ t.Errorf("ChecksOK = %d, want 1", stats.ChecksOK)
145145+ }
146146+ if stats.ChecksTombstoned != 0 {
147147+ t.Errorf("ChecksTombstoned = %d, want 0", stats.ChecksTombstoned)
148148+ }
149149+150150+ labels, err := s.GetActiveLabelsForDID(ctx, did)
151151+ if err != nil {
152152+ t.Fatal(err)
153153+ }
154154+ if len(labels) != beforeCount {
155155+ t.Errorf("got %d active labels, want %d (200 must not negate)", len(labels), beforeCount)
156156+ }
157157+}
158158+159159+// TestTombstoneChecker_SkipsDIDWeb proves did:web DIDs never hit PLC.
160160+// PLC has no record of did:web identities, so polling them would just
161161+// generate noise and burn rate-limit budget.
162162+func TestTombstoneChecker_SkipsDIDWeb(t *testing.T) {
163163+ ctx := context.Background()
164164+ mgr, s := newTestManager(t)
165165+166166+ did := "did:web:webonly.example.com"
167167+ seedLabeled(t, ctx, mgr, s, did, "webonly.example.com")
168168+169169+ // Fixture returns 410 for everything — but the checker should
170170+ // never call it since the DID is did:web.
171171+ fixture := newPLCFixture(map[string]int{did: http.StatusGone})
172172+ srv := httptest.NewServer(fixture)
173173+ defer srv.Close()
174174+175175+ checker := NewTombstoneChecker(mgr, s, srv.URL, time.Hour, 1*time.Millisecond)
176176+ if err := checker.RunOnce(ctx); err != nil {
177177+ t.Fatalf("RunOnce: %v", err)
178178+ }
179179+180180+ if got := fixture.calls.Load(); got != 0 {
181181+ t.Errorf("PLC was called %d times for did:web, want 0", got)
182182+ }
183183+184184+ labels, err := s.GetActiveLabelsForDID(ctx, did)
185185+ if err != nil {
186186+ t.Fatal(err)
187187+ }
188188+ if len(labels) == 0 {
189189+ t.Error("did:web labels should be untouched by the tombstone checker")
190190+ }
191191+}
192192+193193+// TestTombstoneChecker_5xxIsErrorNotTombstone is the safety-critical
194194+// case: PLC having a bad day (503, 504) must NOT be misread as a
195195+// tombstone. Negating live members on a transient PLC outage would
196196+// be a serious operator-trust failure.
197197+func TestTombstoneChecker_5xxIsErrorNotTombstone(t *testing.T) {
198198+ ctx := context.Background()
199199+ mgr, s := newTestManager(t)
200200+201201+ did := "did:plc:plcdownaaaaaaaaaaaaaaaaa"
202202+ beforeCount := seedLabeled(t, ctx, mgr, s, did, "plcdown.example.com")
203203+204204+ // Always-503 fixture; checker should retry up to maxAttempts then
205205+ // give up and count the result as an error, not a tombstone.
206206+ fixture := &alwaysStatusFixture{status: http.StatusServiceUnavailable}
207207+ srv := httptest.NewServer(fixture)
208208+ defer srv.Close()
209209+210210+ checker := NewTombstoneChecker(mgr, s, srv.URL, time.Hour, 1*time.Millisecond)
211211+ // Shrink the retry budget so the test runs fast — we don't need
212212+ // to verify the exponential ladder, just that the final outcome is
213213+ // an error and labels stay live.
214214+ checker.client = newFastRetryClient()
215215+216216+ if err := checker.RunOnce(ctx); err != nil {
217217+ t.Fatalf("RunOnce should not fail on per-DID error: %v", err)
218218+ }
219219+220220+ stats := checker.Stats()
221221+ if stats.ChecksErr != 1 {
222222+ t.Errorf("ChecksErr = %d, want 1", stats.ChecksErr)
223223+ }
224224+ if stats.ChecksTombstoned != 0 {
225225+ t.Errorf("ChecksTombstoned = %d, want 0 (5xx must NOT be misread as tombstone)", stats.ChecksTombstoned)
226226+ }
227227+228228+ labels, err := s.GetActiveLabelsForDID(ctx, did)
229229+ if err != nil {
230230+ t.Fatal(err)
231231+ }
232232+ if len(labels) != beforeCount {
233233+ t.Errorf("got %d active labels after 5xx, want %d preserved", len(labels), beforeCount)
234234+ }
235235+}
236236+237237+// TestTombstoneChecker_4xxIsErrorNotTombstone is the same guard for
238238+// non-410 4xx codes. A 400/404 from PLC could mean "we changed the API"
239239+// or "your DID was malformed" — either way, NOT a tombstone signal.
240240+func TestTombstoneChecker_4xxIsErrorNotTombstone(t *testing.T) {
241241+ ctx := context.Background()
242242+ mgr, s := newTestManager(t)
243243+244244+ did := "did:plc:misshapeaaaaaaaaaaaaaaaa"
245245+ seedLabeled(t, ctx, mgr, s, did, "misshape.example.com")
246246+247247+ fixture := newPLCFixture(map[string]int{did: http.StatusBadRequest})
248248+ srv := httptest.NewServer(fixture)
249249+ defer srv.Close()
250250+251251+ checker := NewTombstoneChecker(mgr, s, srv.URL, time.Hour, 1*time.Millisecond)
252252+ if err := checker.RunOnce(ctx); err != nil {
253253+ t.Fatalf("RunOnce: %v", err)
254254+ }
255255+256256+ stats := checker.Stats()
257257+ if stats.ChecksErr != 1 {
258258+ t.Errorf("ChecksErr = %d, want 1", stats.ChecksErr)
259259+ }
260260+ if stats.ChecksTombstoned != 0 {
261261+ t.Errorf("ChecksTombstoned = %d, want 0 (400 must not negate)", stats.ChecksTombstoned)
262262+ }
263263+264264+ labels, err := s.GetActiveLabelsForDID(ctx, did)
265265+ if err != nil {
266266+ t.Fatal(err)
267267+ }
268268+ if len(labels) == 0 {
269269+ t.Error("400 must not cause labels to be negated")
270270+ }
271271+}
272272+273273+// alwaysStatusFixture serves a fixed status code regardless of path.
274274+// Used to test the retry path without needing per-DID configuration.
275275+type alwaysStatusFixture struct{ status int }
276276+277277+func (f *alwaysStatusFixture) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
278278+ w.WriteHeader(f.status)
279279+}
280280+281281+// newFastRetryClient returns an http.Client with a tiny timeout so the
282282+// 5xx-retry test doesn't spend real seconds waiting on backoffs.
283283+func newFastRetryClient() *http.Client {
284284+ return &http.Client{Timeout: 500 * time.Millisecond}
285285+}
+6-11
internal/server/diagnostics.go
···66 "encoding/json"
77 "log"
88 "net/http"
99- "regexp"
99+1010+ didpkg "atmosphere-mail/internal/did"
1011)
11121212-// validDID matches did:plc (base32-lower, 24 chars) and did:web formats.
1313-// did:web allows alphanumeric, dots, hyphens, and colons (path separators).
1414-// Percent-encoding is excluded to prevent log injection via %0a/%0d.
1515-// did:web bounded to 253 chars (max DNS name).
1616-var validDID = regexp.MustCompile(`^(did:plc:[a-z2-7]{24}|did:web:[a-zA-Z0-9._:-]{1,253})$`)
1717-1813type verificationStatusResponse struct {
1919- DID string `json:"did"`
2020- Attestations []attestationStatus `json:"attestations"`
2121- ActiveLabels []string `json:"activeLabels"`
1414+ DID string `json:"did"`
1515+ Attestations []attestationStatus `json:"attestations"`
1616+ ActiveLabels []string `json:"activeLabels"`
2217}
23182419type attestationStatus struct {
···3631 }
37323833 did := r.URL.Query().Get("did")
3939- if !validDID.MatchString(did) {
3434+ if !didpkg.Valid(did) {
4035 http.Error(w, "did parameter required", http.StatusBadRequest)
4136 return
4237 }
+43
internal/server/server.go
···1919 maxBackfillLabels = 10000
2020)
21212222+// PLCTombstoneStats is the subset of internal/scheduler.TombstoneStats
2323+// the metrics endpoint needs. Defining the interface here (rather than
2424+// importing scheduler) avoids a server→scheduler import cycle when the
2525+// scheduler grows to depend on label, which depends on store, which is
2626+// where Server lives via internal/server.
2727+type PLCTombstoneStats struct {
2828+ ChecksOK int64
2929+ ChecksTombstoned int64
3030+ ChecksErr int64
3131+ LastRunAt time.Time // zero if the checker has never run
3232+}
3333+2234// Server handles XRPC endpoints for the labeler.
2335type Server struct {
2436 store *store.Store
···2638 mux *http.ServeMux
2739 wsConns atomic.Int64
28404141+ // plcTombstoneStats, when non-nil, is called by the /metrics handler
4242+ // to surface PLC-tombstone-check counters. The labeler wires this
4343+ // after constructing the checker; tests leave it nil to keep the
4444+ // metrics endpoint behavior stable.
4545+ plcTombstoneStats func() PLCTombstoneStats
4646+2947 // WebSocket connection tracking for graceful shutdown
3048 wsMu sync.Mutex
3149 wsTracked map[*websocket.Conn]struct{}
3250}
33515252+// SetPLCTombstoneStatsProvider wires a PLC tombstone-check stats source
5353+// into the metrics endpoint. Calling with nil unwires it. Safe to call
5454+// at most once during startup; not concurrency-safe with active /metrics
5555+// requests (those would observe a torn read of the func pointer).
5656+func (s *Server) SetPLCTombstoneStatsProvider(fn func() PLCTombstoneStats) {
5757+ s.plcTombstoneStats = fn
5858+}
5959+3460// New creates a labeler XRPC server.
3561func New(s *store.Store, labelerDID string) *Server {
3662 srv := &Server{
···79105 fmt.Fprintf(w, "# HELP atmosphere_websocket_connections Current number of WebSocket connections.\n")
80106 fmt.Fprintf(w, "# TYPE atmosphere_websocket_connections gauge\n")
81107 fmt.Fprintf(w, "atmosphere_websocket_connections %d\n", s.wsConns.Load())
108108+109109+ if s.plcTombstoneStats != nil {
110110+ ts := s.plcTombstoneStats()
111111+ fmt.Fprintf(w, "# HELP labeler_plc_status_checks_total PLC status checks per outcome.\n")
112112+ fmt.Fprintf(w, "# TYPE labeler_plc_status_checks_total counter\n")
113113+ fmt.Fprintf(w, "labeler_plc_status_checks_total{result=\"ok\"} %d\n", ts.ChecksOK)
114114+ fmt.Fprintf(w, "labeler_plc_status_checks_total{result=\"tombstoned\"} %d\n", ts.ChecksTombstoned)
115115+ fmt.Fprintf(w, "labeler_plc_status_checks_total{result=\"err\"} %d\n", ts.ChecksErr)
116116+ // last-run timestamp lets ops alert on staleness ("checker
117117+ // hasn't run in 48h" etc). Zero means never run, so emit only
118118+ // when populated.
119119+ if !ts.LastRunAt.IsZero() {
120120+ fmt.Fprintf(w, "# HELP labeler_plc_status_last_run_unix_seconds Unix timestamp of last completed PLC tombstone-check pass.\n")
121121+ fmt.Fprintf(w, "# TYPE labeler_plc_status_last_run_unix_seconds gauge\n")
122122+ fmt.Fprintf(w, "labeler_plc_status_last_run_unix_seconds %d\n", ts.LastRunAt.Unix())
123123+ }
124124+ }
82125}
8312684127func (s *Server) trackConn(conn *websocket.Conn) {
+8-5
osprey/config/labels.yaml
···149149 connotation: neutral
150150 description: "Destination domain reported a complaint in the last 7 days"
151151152152- # Content spray shadow labels (observe-only, no enforcement yet)
153153- shadow:content_spray:
152152+ # Content spray labels — #196 promoted live 2026-05-02 after a
153153+ # bake-in audit confirmed zero shadow:content_spray* firings against
154154+ # Osprey's entity_labels table on atmos-ops. Replaced the earlier
155155+ # shadow:content_spray and shadow:content_spray_extreme entries.
156156+ content_spray:
154157 valid_for: [SenderDID]
155158 connotation: negative
156156- description: "Shadow: same message body sent to 15+ unique recipients in last hour — possible bulk/newsletter"
159159+ description: "Same message body sent to 15+ unique recipients in last hour — observational, no verdict"
157160158158- shadow:content_spray_extreme:
161161+ content_spray_extreme:
159162 valid_for: [SenderDID]
160163 connotation: negative
161161- description: "Shadow: same message body sent to 50+ unique recipients in last hour — bulk mail"
164164+ description: "Same message body sent to 50+ unique recipients in last hour — hard reject"
+20-10
osprey/rules/rules/content_spray.sml
···99# same_content_recipients_last_hour counts distinct recipients who got
1010# the same fingerprint from this sender in the last hour.
1111#
1212-# Shadow mode first: labels are prefixed with shadow: so they're logged
1313-# but don't affect send behavior. Promote to real labels after bake-in
1414-# confirms zero false positives on production traffic.
1212+# Promoted from shadow mode to live enforcement on 2026-05-02 (#196).
1313+# Bake-in audit: zero shadow:content_spray firings across the entire
1414+# shadow window with three production members. The fingerprint
1515+# normalization is deliberately gentle (lowercase + collapse blank
1616+# lines), so transactional senders who include per-recipient tokens
1717+# fingerprint differently per recipient and never trip the threshold.
1518#
1619# Privacy: the relay stores only the sha256 hash, never email addresses
1720# or body content. The counter is a scalar — Osprey sees only the number.
···1922Import(rules=['models/relay.sml'])
20232124# Moderate content spray: same body to 15+ unique recipients in an hour.
2222-# Legitimate transactional senders won't hit this because each message
2323-# body contains recipient-specific tokens.
2525+# Observational label only — no verdict — because the upper bound on
2626+# legitimate small-scale "send the same announcement to a dozen friends"
2727+# style use cannot be ruled out for the cooperative's audience. The
2828+# 12h-expiring label feeds into reputation rules and gives operators a
2929+# trail without surprising members with rejects.
2430ContentSpray = Rule(
2531 when_all=[
2632 EventType == 'relay_attempt',
2733 SameContentRecipientsLastHour != None,
2834 SameContentRecipientsLastHour >= 15,
2935 ],
3030- description='Same message body sent to 15+ unique recipients in last hour — possible bulk/newsletter'
3636+ description='Same message body sent to 15+ unique recipients in last hour'
3137)
32383339WhenRules(
3440 rules_any=[ContentSpray],
3541 then=[
3636- LabelAdd(entity=SenderDID, label='shadow:content_spray', expires_after=TimeDelta(hours=12)),
4242+ LabelAdd(entity=SenderDID, label='content_spray', expires_after=TimeDelta(hours=12)),
3743 ],
3844)
39454046# Extreme content spray: same body to 50+ unique recipients in an hour.
4141-# No legitimate transactional use case produces this pattern.
4747+# No legitimate transactional pattern produces this; it's bulk mail. The
4848+# cooperative is not an ESP — list operators belong on dedicated infra
4949+# whose IP reputation is theirs alone. Hard reject + 3-day label so the
5050+# member sees a 550 immediately and the audit trail captures the event.
4251ExtremeContentSpray = Rule(
4352 when_all=[
4453 EventType == 'relay_attempt',
4554 SameContentRecipientsLastHour != None,
4655 SameContentRecipientsLastHour >= 50,
4756 ],
4848- description='Same message body sent to 50+ unique recipients in last hour — bulk mail'
5757+ description='Same message body sent to 50+ unique recipients in last hour — bulk mail reject'
4958)
50595160WhenRules(
5261 rules_any=[ExtremeContentSpray],
5362 then=[
5454- LabelAdd(entity=SenderDID, label='shadow:content_spray_extreme', expires_after=TimeDelta(days=1)),
6363+ LabelAdd(entity=SenderDID, label='content_spray_extreme', expires_after=TimeDelta(days=3)),
6464+ DeclareVerdict(verdict='reject'),
5565 ],
5666)
···11+description: |
22+ A relay_attempt where same_content_recipients_last_hour = 75 fires
33+ ExtremeContentSpray (#196). The rule applies the content_spray_extreme
44+ label and issues a reject verdict so the relay returns 550 at SMTP
55+ close. Also fires the moderate ContentSpray rule (15+ threshold) since
66+ 75 ≥ 15 — both labels are applied and the reject from the extreme
77+ rule wins.
88+99+ Uses recipient_count=1 to avoid triggering bulk_extreme/warming-bulk
1010+ rules that would muddy the label assertion. Uses member_age_days=60
1111+ to stay out of warming-tier rules.
1212+1313+labels_applied:
1414+ - SenderDID/content_spray/add
1515+ - SenderDID/content_spray_extreme/add
1616+1717+verdicts:
1818+ - reject
1919+2020+labels_forbidden:
2121+ - SenderDID/shadow:content_spray/add
2222+ - SenderDID/shadow:content_spray_extreme/add
2323+ - SenderDID/extreme_bulk/add
···11+description: |
22+ A relay_attempt where same_content_recipients_last_hour = 20 fires
33+ ContentSpray (#196) but NOT ExtremeContentSpray. Applies the
44+ observational content_spray label with no reject verdict — the
55+ moderate threshold is for reputation tracking, not active rejection.
66+77+ Uses recipient_count=1 + member_age_days=60 so unrelated bulk and
88+ warming rules stay quiet.
99+1010+labels_applied:
1111+ - SenderDID/content_spray/add
1212+1313+verdicts: []
1414+1515+labels_forbidden:
1616+ - SenderDID/content_spray_extreme/add
1717+ - SenderDID/shadow:content_spray/add
1818+ - SenderDID/extreme_bulk/add
···11+// Copyright 2021 The Prometheus Authors
22+// Licensed under the Apache License, Version 2.0 (the "License");
33+// you may not use this file except in compliance with the License.
44+// You may obtain a copy of the License at
55+//
66+// http://www.apache.org/licenses/LICENSE-2.0
77+//
88+// Unless required by applicable law or agreed to in writing, software
99+// distributed under the License is distributed on an "AS IS" BASIS,
1010+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1111+// See the License for the specific language governing permissions and
1212+// limitations under the License.
1313+1414+// Package collectors provides implementations of prometheus.Collector to
1515+// conveniently collect process and Go-related metrics.
1616+package collectors
1717+1818+import "github.com/prometheus/client_golang/prometheus"
1919+2020+// NewBuildInfoCollector returns a collector collecting a single metric
2121+// "go_build_info" with the constant value 1 and three labels "path", "version",
2222+// and "checksum". Their label values contain the main module path, version, and
2323+// checksum, respectively. The labels will only have meaningful values if the
2424+// binary is built with Go module support and from source code retrieved from
2525+// the source repository (rather than the local file system). This is usually
2626+// accomplished by building from outside of GOPATH, specifying the full address
2727+// of the main package, e.g. "GO111MODULE=on go run
2828+// github.com/prometheus/client_golang/examples/random". If built without Go
2929+// module support, all label values will be "unknown". If built with Go module
3030+// support but using the source code from the local file system, the "path" will
3131+// be set appropriately, but "checksum" will be empty and "version" will be
3232+// "(devel)".
3333+//
3434+// This collector uses only the build information for the main module. See
3535+// https://github.com/povilasv/prommod for an example of a collector for the
3636+// module dependencies.
3737+func NewBuildInfoCollector() prometheus.Collector {
3838+ //nolint:staticcheck // Ignore SA1019 until v2.
3939+ return prometheus.NewBuildInfoCollector()
4040+}
···11+// Copyright 2021 The Prometheus Authors
22+// Licensed under the Apache License, Version 2.0 (the "License");
33+// you may not use this file except in compliance with the License.
44+// You may obtain a copy of the License at
55+//
66+// http://www.apache.org/licenses/LICENSE-2.0
77+//
88+// Unless required by applicable law or agreed to in writing, software
99+// distributed under the License is distributed on an "AS IS" BASIS,
1010+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1111+// See the License for the specific language governing permissions and
1212+// limitations under the License.
1313+1414+package collectors
1515+1616+import (
1717+ "database/sql"
1818+1919+ "github.com/prometheus/client_golang/prometheus"
2020+)
2121+2222+type dbStatsCollector struct {
2323+ db *sql.DB
2424+2525+ maxOpenConnections *prometheus.Desc
2626+2727+ openConnections *prometheus.Desc
2828+ inUseConnections *prometheus.Desc
2929+ idleConnections *prometheus.Desc
3030+3131+ waitCount *prometheus.Desc
3232+ waitDuration *prometheus.Desc
3333+ maxIdleClosed *prometheus.Desc
3434+ maxIdleTimeClosed *prometheus.Desc
3535+ maxLifetimeClosed *prometheus.Desc
3636+}
3737+3838+// NewDBStatsCollector returns a collector that exports metrics about the given *sql.DB.
3939+// See https://golang.org/pkg/database/sql/#DBStats for more information on stats.
4040+func NewDBStatsCollector(db *sql.DB, dbName string) prometheus.Collector {
4141+ fqName := func(name string) string {
4242+ return "go_sql_" + name
4343+ }
4444+ return &dbStatsCollector{
4545+ db: db,
4646+ maxOpenConnections: prometheus.NewDesc(
4747+ fqName("max_open_connections"),
4848+ "Maximum number of open connections to the database.",
4949+ nil, prometheus.Labels{"db_name": dbName},
5050+ ),
5151+ openConnections: prometheus.NewDesc(
5252+ fqName("open_connections"),
5353+ "The number of established connections both in use and idle.",
5454+ nil, prometheus.Labels{"db_name": dbName},
5555+ ),
5656+ inUseConnections: prometheus.NewDesc(
5757+ fqName("in_use_connections"),
5858+ "The number of connections currently in use.",
5959+ nil, prometheus.Labels{"db_name": dbName},
6060+ ),
6161+ idleConnections: prometheus.NewDesc(
6262+ fqName("idle_connections"),
6363+ "The number of idle connections.",
6464+ nil, prometheus.Labels{"db_name": dbName},
6565+ ),
6666+ waitCount: prometheus.NewDesc(
6767+ fqName("wait_count_total"),
6868+ "The total number of connections waited for.",
6969+ nil, prometheus.Labels{"db_name": dbName},
7070+ ),
7171+ waitDuration: prometheus.NewDesc(
7272+ fqName("wait_duration_seconds_total"),
7373+ "The total time blocked waiting for a new connection.",
7474+ nil, prometheus.Labels{"db_name": dbName},
7575+ ),
7676+ maxIdleClosed: prometheus.NewDesc(
7777+ fqName("max_idle_closed_total"),
7878+ "The total number of connections closed due to SetMaxIdleConns.",
7979+ nil, prometheus.Labels{"db_name": dbName},
8080+ ),
8181+ maxIdleTimeClosed: prometheus.NewDesc(
8282+ fqName("max_idle_time_closed_total"),
8383+ "The total number of connections closed due to SetConnMaxIdleTime.",
8484+ nil, prometheus.Labels{"db_name": dbName},
8585+ ),
8686+ maxLifetimeClosed: prometheus.NewDesc(
8787+ fqName("max_lifetime_closed_total"),
8888+ "The total number of connections closed due to SetConnMaxLifetime.",
8989+ nil, prometheus.Labels{"db_name": dbName},
9090+ ),
9191+ }
9292+}
9393+9494+// Describe implements Collector.
9595+func (c *dbStatsCollector) Describe(ch chan<- *prometheus.Desc) {
9696+ ch <- c.maxOpenConnections
9797+ ch <- c.openConnections
9898+ ch <- c.inUseConnections
9999+ ch <- c.idleConnections
100100+ ch <- c.waitCount
101101+ ch <- c.waitDuration
102102+ ch <- c.maxIdleClosed
103103+ ch <- c.maxLifetimeClosed
104104+ ch <- c.maxIdleTimeClosed
105105+}
106106+107107+// Collect implements Collector.
108108+func (c *dbStatsCollector) Collect(ch chan<- prometheus.Metric) {
109109+ stats := c.db.Stats()
110110+ ch <- prometheus.MustNewConstMetric(c.maxOpenConnections, prometheus.GaugeValue, float64(stats.MaxOpenConnections))
111111+ ch <- prometheus.MustNewConstMetric(c.openConnections, prometheus.GaugeValue, float64(stats.OpenConnections))
112112+ ch <- prometheus.MustNewConstMetric(c.inUseConnections, prometheus.GaugeValue, float64(stats.InUse))
113113+ ch <- prometheus.MustNewConstMetric(c.idleConnections, prometheus.GaugeValue, float64(stats.Idle))
114114+ ch <- prometheus.MustNewConstMetric(c.waitCount, prometheus.CounterValue, float64(stats.WaitCount))
115115+ ch <- prometheus.MustNewConstMetric(c.waitDuration, prometheus.CounterValue, stats.WaitDuration.Seconds())
116116+ ch <- prometheus.MustNewConstMetric(c.maxIdleClosed, prometheus.CounterValue, float64(stats.MaxIdleClosed))
117117+ ch <- prometheus.MustNewConstMetric(c.maxLifetimeClosed, prometheus.CounterValue, float64(stats.MaxLifetimeClosed))
118118+ ch <- prometheus.MustNewConstMetric(c.maxIdleTimeClosed, prometheus.CounterValue, float64(stats.MaxIdleTimeClosed))
119119+}
···11+// Copyright 2021 The Prometheus Authors
22+// Licensed under the Apache License, Version 2.0 (the "License");
33+// you may not use this file except in compliance with the License.
44+// You may obtain a copy of the License at
55+//
66+// http://www.apache.org/licenses/LICENSE-2.0
77+//
88+// Unless required by applicable law or agreed to in writing, software
99+// distributed under the License is distributed on an "AS IS" BASIS,
1010+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1111+// See the License for the specific language governing permissions and
1212+// limitations under the License.
1313+1414+package collectors
1515+1616+import "github.com/prometheus/client_golang/prometheus"
1717+1818+// NewExpvarCollector returns a newly allocated expvar Collector.
1919+//
2020+// An expvar Collector collects metrics from the expvar interface. It provides a
2121+// quick way to expose numeric values that are already exported via expvar as
2222+// Prometheus metrics. Note that the data models of expvar and Prometheus are
2323+// fundamentally different, and that the expvar Collector is inherently slower
2424+// than native Prometheus metrics. Thus, the expvar Collector is probably great
2525+// for experiments and prototyping, but you should seriously consider a more
2626+// direct implementation of Prometheus metrics for monitoring production
2727+// systems.
2828+//
2929+// The exports map has the following meaning:
3030+//
3131+// The keys in the map correspond to expvar keys, i.e. for every expvar key you
3232+// want to export as Prometheus metric, you need an entry in the exports
3333+// map. The descriptor mapped to each key describes how to export the expvar
3434+// value. It defines the name and the help string of the Prometheus metric
3535+// proxying the expvar value. The type will always be Untyped.
3636+//
3737+// For descriptors without variable labels, the expvar value must be a number or
3838+// a bool. The number is then directly exported as the Prometheus sample
3939+// value. (For a bool, 'false' translates to 0 and 'true' to 1). Expvar values
4040+// that are not numbers or bools are silently ignored.
4141+//
4242+// If the descriptor has one variable label, the expvar value must be an expvar
4343+// map. The keys in the expvar map become the various values of the one
4444+// Prometheus label. The values in the expvar map must be numbers or bools again
4545+// as above.
4646+//
4747+// For descriptors with more than one variable label, the expvar must be a
4848+// nested expvar map, i.e. where the values of the topmost map are maps again
4949+// etc. until a depth is reached that corresponds to the number of labels. The
5050+// leaves of that structure must be numbers or bools as above to serve as the
5151+// sample values.
5252+//
5353+// Anything that does not fit into the scheme above is silently ignored.
5454+func NewExpvarCollector(exports map[string]*prometheus.Desc) prometheus.Collector {
5555+ //nolint:staticcheck // Ignore SA1019 until v2.
5656+ return prometheus.NewExpvarCollector(exports)
5757+}
···11+// Copyright 2021 The Prometheus Authors
22+// Licensed under the Apache License, Version 2.0 (the "License");
33+// you may not use this file except in compliance with the License.
44+// You may obtain a copy of the License at
55+//
66+// http://www.apache.org/licenses/LICENSE-2.0
77+//
88+// Unless required by applicable law or agreed to in writing, software
99+// distributed under the License is distributed on an "AS IS" BASIS,
1010+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1111+// See the License for the specific language governing permissions and
1212+// limitations under the License.
1313+1414+//go:build !go1.17
1515+// +build !go1.17
1616+1717+package collectors
1818+1919+import "github.com/prometheus/client_golang/prometheus"
2020+2121+// NewGoCollector returns a collector that exports metrics about the current Go
2222+// process. This includes memory stats. To collect those, runtime.ReadMemStats
2323+// is called. This requires to “stop the world”, which usually only happens for
2424+// garbage collection (GC). Take the following implications into account when
2525+// deciding whether to use the Go collector:
2626+//
2727+// 1. The performance impact of stopping the world is the more relevant the more
2828+// frequently metrics are collected. However, with Go1.9 or later the
2929+// stop-the-world time per metrics collection is very short (~25µs) so that the
3030+// performance impact will only matter in rare cases. However, with older Go
3131+// versions, the stop-the-world duration depends on the heap size and can be
3232+// quite significant (~1.7 ms/GiB as per
3333+// https://go-review.googlesource.com/c/go/+/34937).
3434+//
3535+// 2. During an ongoing GC, nothing else can stop the world. Therefore, if the
3636+// metrics collection happens to coincide with GC, it will only complete after
3737+// GC has finished. Usually, GC is fast enough to not cause problems. However,
3838+// with a very large heap, GC might take multiple seconds, which is enough to
3939+// cause scrape timeouts in common setups. To avoid this problem, the Go
4040+// collector will use the memstats from a previous collection if
4141+// runtime.ReadMemStats takes more than 1s. However, if there are no previously
4242+// collected memstats, or their collection is more than 5m ago, the collection
4343+// will block until runtime.ReadMemStats succeeds.
4444+//
4545+// NOTE: The problem is solved in Go 1.15, see
4646+// https://github.com/golang/go/issues/19812 for the related Go issue.
4747+func NewGoCollector() prometheus.Collector {
4848+ return prometheus.NewGoCollector()
4949+}
···11+// Copyright 2021 The Prometheus Authors
22+// Licensed under the Apache License, Version 2.0 (the "License");
33+// you may not use this file except in compliance with the License.
44+// You may obtain a copy of the License at
55+//
66+// http://www.apache.org/licenses/LICENSE-2.0
77+//
88+// Unless required by applicable law or agreed to in writing, software
99+// distributed under the License is distributed on an "AS IS" BASIS,
1010+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1111+// See the License for the specific language governing permissions and
1212+// limitations under the License.
1313+1414+//go:build go1.17
1515+// +build go1.17
1616+1717+package collectors
1818+1919+import (
2020+ "regexp"
2121+2222+ "github.com/prometheus/client_golang/prometheus"
2323+ "github.com/prometheus/client_golang/prometheus/internal"
2424+)
2525+2626+var (
2727+ // MetricsAll allows all the metrics to be collected from Go runtime.
2828+ MetricsAll = GoRuntimeMetricsRule{regexp.MustCompile("/.*")}
2929+ // MetricsGC allows only GC metrics to be collected from Go runtime.
3030+ // e.g. go_gc_cycles_automatic_gc_cycles_total
3131+ // NOTE: This does not include new class of "/cpu/classes/gc/..." metrics.
3232+ // Use custom metric rule to access those.
3333+ MetricsGC = GoRuntimeMetricsRule{regexp.MustCompile(`^/gc/.*`)}
3434+ // MetricsMemory allows only memory metrics to be collected from Go runtime.
3535+ // e.g. go_memory_classes_heap_free_bytes
3636+ MetricsMemory = GoRuntimeMetricsRule{regexp.MustCompile(`^/memory/.*`)}
3737+ // MetricsScheduler allows only scheduler metrics to be collected from Go runtime.
3838+ // e.g. go_sched_goroutines_goroutines
3939+ MetricsScheduler = GoRuntimeMetricsRule{regexp.MustCompile(`^/sched/.*`)}
4040+ // MetricsDebug allows only debug metrics to be collected from Go runtime.
4141+ // e.g. go_godebug_non_default_behavior_gocachetest_events_total
4242+ MetricsDebug = GoRuntimeMetricsRule{regexp.MustCompile(`^/godebug/.*`)}
4343+)
4444+4545+// WithGoCollectorMemStatsMetricsDisabled disables metrics that is gathered in runtime.MemStats structure such as:
4646+//
4747+// go_memstats_alloc_bytes
4848+// go_memstats_alloc_bytes_total
4949+// go_memstats_sys_bytes
5050+// go_memstats_mallocs_total
5151+// go_memstats_frees_total
5252+// go_memstats_heap_alloc_bytes
5353+// go_memstats_heap_sys_bytes
5454+// go_memstats_heap_idle_bytes
5555+// go_memstats_heap_inuse_bytes
5656+// go_memstats_heap_released_bytes
5757+// go_memstats_heap_objects
5858+// go_memstats_stack_inuse_bytes
5959+// go_memstats_stack_sys_bytes
6060+// go_memstats_mspan_inuse_bytes
6161+// go_memstats_mspan_sys_bytes
6262+// go_memstats_mcache_inuse_bytes
6363+// go_memstats_mcache_sys_bytes
6464+// go_memstats_buck_hash_sys_bytes
6565+// go_memstats_gc_sys_bytes
6666+// go_memstats_other_sys_bytes
6767+// go_memstats_next_gc_bytes
6868+//
6969+// so the metrics known from pre client_golang v1.12.0,
7070+//
7171+// NOTE(bwplotka): The above represents runtime.MemStats statistics, but they are
7272+// actually implemented using new runtime/metrics package. (except skipped go_memstats_gc_cpu_fraction
7373+// -- see https://github.com/prometheus/client_golang/issues/842#issuecomment-861812034 for explanation).
7474+//
7575+// Some users might want to disable this on collector level (although you can use scrape relabelling on Prometheus),
7676+// because similar metrics can be now obtained using WithGoCollectorRuntimeMetrics. Note that the semantics of new
7777+// metrics might be different, plus the names can be change over time with different Go version.
7878+//
7979+// NOTE(bwplotka): Changing metric names can be tedious at times as the alerts, recording rules and dashboards have to be adjusted.
8080+// The old metrics are also very useful, with many guides and books written about how to interpret them.
8181+//
8282+// As a result our recommendation would be to stick with MemStats like metrics and enable other runtime/metrics if you are interested
8383+// in advanced insights Go provides. See ExampleGoCollector_WithAdvancedGoMetrics.
8484+func WithGoCollectorMemStatsMetricsDisabled() func(options *internal.GoCollectorOptions) {
8585+ return func(o *internal.GoCollectorOptions) {
8686+ o.DisableMemStatsLikeMetrics = true
8787+ }
8888+}
8989+9090+// GoRuntimeMetricsRule allow enabling and configuring particular group of runtime/metrics.
9191+// TODO(bwplotka): Consider adding ability to adjust buckets.
9292+type GoRuntimeMetricsRule struct {
9393+ // Matcher represents RE2 expression will match the runtime/metrics from https://golang.bg/src/runtime/metrics/description.go
9494+ // Use `regexp.MustCompile` or `regexp.Compile` to create this field.
9595+ Matcher *regexp.Regexp
9696+}
9797+9898+// WithGoCollectorRuntimeMetrics allows enabling and configuring particular group of runtime/metrics.
9999+// See the list of metrics https://golang.bg/src/runtime/metrics/description.go (pick the Go version you use there!).
100100+// You can use this option in repeated manner, which will add new rules. The order of rules is important, the last rule
101101+// that matches particular metrics is applied.
102102+func WithGoCollectorRuntimeMetrics(rules ...GoRuntimeMetricsRule) func(options *internal.GoCollectorOptions) {
103103+ rs := make([]internal.GoCollectorRule, len(rules))
104104+ for i, r := range rules {
105105+ rs[i] = internal.GoCollectorRule{
106106+ Matcher: r.Matcher,
107107+ }
108108+ }
109109+110110+ return func(o *internal.GoCollectorOptions) {
111111+ o.RuntimeMetricRules = append(o.RuntimeMetricRules, rs...)
112112+ }
113113+}
114114+115115+// WithoutGoCollectorRuntimeMetrics allows disabling group of runtime/metrics that you might have added in WithGoCollectorRuntimeMetrics.
116116+// It behaves similarly to WithGoCollectorRuntimeMetrics just with deny-list semantics.
117117+func WithoutGoCollectorRuntimeMetrics(matchers ...*regexp.Regexp) func(options *internal.GoCollectorOptions) {
118118+ rs := make([]internal.GoCollectorRule, len(matchers))
119119+ for i, m := range matchers {
120120+ rs[i] = internal.GoCollectorRule{
121121+ Matcher: m,
122122+ Deny: true,
123123+ }
124124+ }
125125+126126+ return func(o *internal.GoCollectorOptions) {
127127+ o.RuntimeMetricRules = append(o.RuntimeMetricRules, rs...)
128128+ }
129129+}
130130+131131+// GoCollectionOption represents Go collection option flag.
132132+// Deprecated.
133133+type GoCollectionOption uint32
134134+135135+const (
136136+ // GoRuntimeMemStatsCollection represents the metrics represented by runtime.MemStats structure.
137137+ //
138138+ // Deprecated: Use WithGoCollectorMemStatsMetricsDisabled() function to disable those metrics in the collector.
139139+ GoRuntimeMemStatsCollection GoCollectionOption = 1 << iota
140140+ // GoRuntimeMetricsCollection is the new set of metrics represented by runtime/metrics package.
141141+ //
142142+ // Deprecated: Use WithGoCollectorRuntimeMetrics(GoRuntimeMetricsRule{Matcher: regexp.MustCompile("/.*")})
143143+ // function to enable those metrics in the collector.
144144+ GoRuntimeMetricsCollection
145145+)
146146+147147+// WithGoCollections allows enabling different collections for Go collector on top of base metrics.
148148+//
149149+// Deprecated: Use WithGoCollectorRuntimeMetrics() and WithGoCollectorMemStatsMetricsDisabled() instead to control metrics.
150150+func WithGoCollections(flags GoCollectionOption) func(options *internal.GoCollectorOptions) {
151151+ return func(options *internal.GoCollectorOptions) {
152152+ if flags&GoRuntimeMemStatsCollection == 0 {
153153+ WithGoCollectorMemStatsMetricsDisabled()(options)
154154+ }
155155+156156+ if flags&GoRuntimeMetricsCollection != 0 {
157157+ WithGoCollectorRuntimeMetrics(GoRuntimeMetricsRule{Matcher: regexp.MustCompile("/.*")})(options)
158158+ }
159159+ }
160160+}
161161+162162+// NewGoCollector returns a collector that exports metrics about the current Go
163163+// process using debug.GCStats (base metrics) and runtime/metrics (both in MemStats style and new ones).
164164+func NewGoCollector(opts ...func(o *internal.GoCollectorOptions)) prometheus.Collector {
165165+ //nolint:staticcheck // Ignore SA1019 until v2.
166166+ return prometheus.NewGoCollector(opts...)
167167+}
···11+// Copyright 2021 The Prometheus Authors
22+// Licensed under the Apache License, Version 2.0 (the "License");
33+// you may not use this file except in compliance with the License.
44+// You may obtain a copy of the License at
55+//
66+// http://www.apache.org/licenses/LICENSE-2.0
77+//
88+// Unless required by applicable law or agreed to in writing, software
99+// distributed under the License is distributed on an "AS IS" BASIS,
1010+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1111+// See the License for the specific language governing permissions and
1212+// limitations under the License.
1313+1414+package collectors
1515+1616+import "github.com/prometheus/client_golang/prometheus"
1717+1818+// ProcessCollectorOpts defines the behavior of a process metrics collector
1919+// created with NewProcessCollector.
2020+type ProcessCollectorOpts struct {
2121+ // PidFn returns the PID of the process the collector collects metrics
2222+ // for. It is called upon each collection. By default, the PID of the
2323+ // current process is used, as determined on construction time by
2424+ // calling os.Getpid().
2525+ PidFn func() (int, error)
2626+ // If non-empty, each of the collected metrics is prefixed by the
2727+ // provided string and an underscore ("_").
2828+ Namespace string
2929+ // If true, any error encountered during collection is reported as an
3030+ // invalid metric (see NewInvalidMetric). Otherwise, errors are ignored
3131+ // and the collected metrics will be incomplete. (Possibly, no metrics
3232+ // will be collected at all.) While that's usually not desired, it is
3333+ // appropriate for the common "mix-in" of process metrics, where process
3434+ // metrics are nice to have, but failing to collect them should not
3535+ // disrupt the collection of the remaining metrics.
3636+ ReportErrors bool
3737+}
3838+3939+// NewProcessCollector returns a collector which exports the current state of
4040+// process metrics including CPU, memory and file descriptor usage as well as
4141+// the process start time. The detailed behavior is defined by the provided
4242+// ProcessCollectorOpts. The zero value of ProcessCollectorOpts creates a
4343+// collector for the current process with an empty namespace string and no error
4444+// reporting.
4545+//
4646+// The collector only works on operating systems with a Linux-style proc
4747+// filesystem and on Microsoft Windows. On other operating systems, it will not
4848+// collect any metrics.
4949+func NewProcessCollector(opts ProcessCollectorOpts) prometheus.Collector {
5050+ //nolint:staticcheck // Ignore SA1019 until v2.
5151+ return prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{
5252+ PidFn: opts.PidFn,
5353+ Namespace: opts.Namespace,
5454+ ReportErrors: opts.ReportErrors,
5555+ })
5656+}