···11+# Data source fetches
22+33+An automation can declare a list of **fetches** that run after the trigger
44+event matches but before actions fire. Each fetch resolves to a named entry in
55+the `fetchContext`, which action templates can reference via
66+`{{fetchName.record.*}}` and per-fetch conditions can gate on.
77+88+There are two kinds today, discriminated by the `kind` field on the stored
99+`FetchStep` ([lib/db/schema.ts](../lib/db/schema.ts)):
1010+1111+- **`kind: "record"`** — resolve a specific AT URI to its record. The default
1212+ if `kind` is absent (legacy rows).
1313+- **`kind: "search"`** — find a record in a repo by field equality. Used to
1414+ answer "does a record already exist that matches X?" before acting.
1515+1616+Both kinds return a `FetchContextEntry` with a `found` flag; neither throws on
1717+"not found" — that signal is observed via `exists` / `not-exists` conditions
1818+on the fetch.
1919+2020+```ts
2121+type FetchContextEntry = {
2222+ found: boolean;
2323+ uri: string;
2424+ cid: string;
2525+ did?: string;
2626+ collection?: string;
2727+ rkey?: string;
2828+ record: Record<string, unknown>;
2929+};
3030+```
3131+3232+## Record fetches
3333+3434+The simple shape. One HTTP request, O(1), no pagination.
3535+3636+```json
3737+{
3838+ "kind": "record",
3939+ "name": "parentPost",
4040+ "uri": "at://{{event.commit.record.reply.parent.uri}}"
4141+}
4242+```
4343+4444+### Resolution
4545+4646+[lib/actions/fetcher.ts](../lib/actions/fetcher.ts):
4747+4848+1. The `uri` template is rendered against the event + upstream context + owner
4949+ DID. `{{self}}`, `{{event.*}}`, and `{{otherFetch.*}}` all work.
5050+2. The rendered string is validated against the AT URI shape
5151+ (`at://did/collection/rkey`). Non-AT-URI → the fetch errors with a log entry.
5252+3. The URI is fetched via `fetchRecord` ([lib/pds/resolver.ts](../lib/pds/resolver.ts)),
5353+ which resolves the DID to its PDS and calls `com.atproto.repo.getRecord`.
5454+4. A 404 writes `found: false`; any other failure is treated as an error.
5555+5656+Record fetches are **independent of each other** and run in parallel via
5757+`Promise.all`. Their `conditions` are evaluated once all record fetches have
5858+resolved (and before any search fetches begin).
5959+6060+### When to use
6161+6262+- Enriching the event with the parent post, the quoted record, the followed
6363+ subject's profile, etc.
6464+- Any case where you already know the exact AT URI to look up.
6565+6666+## Search fetches
6767+6868+The richer shape. Answers "is there a record in this repo/collection whose
6969+field X equals Y?" Used primarily to prevent duplicates in mirror-style
7070+automations — e.g. "only create a Sifa follow if no Sifa follow for this
7171+subject already exists."
7272+7373+```json
7474+{
7575+ "kind": "search",
7676+ "name": "existingMirror",
7777+ "repo": "{{self}}",
7878+ "collection": "id.sifa.graph.follow",
7979+ "where": [
8080+ { "field": "subject", "operator": "eq", "value": "{{event.commit.record.subject}}" }
8181+ ],
8282+ "limit": 1,
8383+ "conditions": [
8484+ { "field": "found", "operator": "not-exists", "value": "" }
8585+ ]
8686+}
8787+```
8888+8989+### Inputs
9090+9191+- `repo` — template that must resolve to a DID. Typically `{{self}}` but may
9292+ reference an event field or an upstream fetch.
9393+- `collection` — literal NSID (not a template). The collection to scan.
9494+- `where` — list of equality clauses. Currently only `operator: "eq"` is
9595+ supported. Multiple clauses are ANDed.
9696+- `limit` — max number of matches to accept. Defaults to 1. The current
9797+ executor always returns the *first* match as the context entry; `limit` just
9898+ controls how many matches the executor is willing to find before stopping.
9999+- `conditions` — per-fetch conditions evaluated after the search resolves.
100100+ Typically `found` + `exists` / `not-exists` to gate on presence.
101101+102102+### Execution strategy
103103+104104+[lib/actions/searcher.ts](../lib/actions/searcher.ts):
105105+106106+Search doesn't have a single primitive in AT Proto; there's no server-side
107107+equivalent of "get a record by field equality." The executor picks one of two
108108+strategies.
109109+110110+#### 1. Appview fast-path (Bluesky follows only)
111111+112112+The specific case of "does `actor` follow `subject` on Bluesky" is answered in
113113+O(1) by the Bluesky appview's `app.bsky.graph.getRelationships` endpoint. The
114114+executor detects this shape:
115115+116116+```ts
117117+if (step.collection === "app.bsky.graph.follow") {
118118+ const subjectClause = hasOnlyClause(step, "subject", "eq");
119119+ if (subjectClause) {
120120+ // → single appview request, parses the `following` AT URI out of the response
121121+ }
122122+}
123123+```
124124+125125+Trigger conditions:
126126+127127+- `collection === "app.bsky.graph.follow"`
128128+- Exactly one `where` clause
129129+- That clause is `subject eq <DID>`
130130+131131+Any other shape on `app.bsky.graph.follow` falls through to the generic path.
132132+133133+If the appview returns no `following` URI for the subject, the entry is
134134+`notFoundEntry()` (i.e. `found: false`). This is the correct answer — the user
135135+doesn't follow the subject — and it's distinct from an appview transport
136136+failure, which returns `null` and falls through to `listRecords`.
137137+138138+#### 2. Generic `listRecords` scan
139139+140140+The fallback. Resolves the repo's PDS endpoint, paginates
141141+`com.atproto.repo.listRecords`, and filters results client-side against the
142142+`where` clauses.
143143+144144+```
145145+LIST_RECORDS_PAGE_SIZE = 100 // capped by the listRecords lexicon
146146+MAX_LIST_RECORDS_PAGES = 100 // → 10k records scanned at most
147147+HTTP_TIMEOUT_MS = 10s // per page
148148+```
149149+150150+The page size is a hard ceiling: `com.atproto.repo.listRecords` declares
151151+`limit` as `minimum: 1, maximum: 100, default: 50`, so spec-compliant PDSs
152152+will reject or clamp anything higher. To scan further we can only add pages.
153153+154154+Per page:
155155+156156+1. Fetch up to 100 records via `listRecords`.
157157+2. For each record, check every `where` clause via dotted-path read (same
158158+ machinery the condition layer uses, `readPath` on the record value).
159159+3. Collect matches until `limit` is reached; the first match is what ends up
160160+ in the context entry.
161161+4. If the PDS returns a `cursor`, continue; otherwise stop.
162162+163163+If the scan completes without finding a match, the entry is `notFoundEntry()`.
164164+If the 100-page cap is hit *and* the cursor would continue, a warning is
165165+logged and the entry is still `notFoundEntry()` — we intentionally treat an
166166+exhausted scan as "not found" rather than erroring, because the usual caller
167167+is a `not-exists` gate and false negatives are preferable to hard failures.
168168+This is a tradeoff worth understanding: for collections large enough to
169169+exceed 10,000 records, the "does X exist?" answer may be incorrect. For the
170170+motivating use case — follow lists — this covers the vast majority of
171171+accounts; heavy-follower accounts (>10k followees) may see stale-not-found
172172+results until the cap is raised or replaced with an indexed data source.
173173+174174+### Cost
175175+176176+- **Appview path**: one HTTP request, a few KB, returned in tens of ms.
177177+- **listRecords path**: up to 100 HTTP requests (each up to 100 records). A
178178+ small repo resolves in one page; a repo on the order of 10k records is
179179+ typically a few seconds when the PDS is healthy. Each page is 10s-capped
180180+ individually, so the theoretical worst case is 1000s — and because
181181+ searches run sequentially in the fetcher, that serializes with any other
182182+ searches in the same automation. In practice that worst case only shows up
183183+ when a PDS is degraded or throttling; budget for a few seconds, plan for
184184+ a few more.
185185+186186+Searches are **sequential** in the fetcher — unlike record fetches which run in
187187+parallel — because a search's `repo` or `where` clauses can template against
188188+upstream fetch results. Running them serially keeps that dependency model
189189+straightforward for the MVP.
190190+191191+### The `where` clause model is deliberately narrow
192192+193193+The current shape (equality only, always AND) is intentional. AT Proto has no
194194+server-side query language for record fields, so every operator added to
195195+`where` must be evaluated in `listRecords`-scan code. Equality covers the
196196+anti-duplicate use case that motivated search; richer operators would encourage
197197+queries that are quietly expensive on large repos.
198198+199199+## Per-fetch conditions
200200+201201+Both fetch kinds support a `conditions` array. These run *after* the fetch
202202+resolves and are evaluated against the fetch's own entry — paths are
203203+entry-scoped, not event-scoped:
204204+205205+- `field: "found"` + `exists` / `not-exists` tests the boolean flag directly
206206+ (special-cased, because stringifying `false` would otherwise read as
207207+ non-empty).
208208+- `field: "record.subject"` walks into the fetched record.
209209+- `field: "uri"` / `field: "cid"` test the top-level entry fields.
210210+211211+If any condition on any fetch fails, `resolveFetches` returns `skip: true` and
212212+the handler short-circuits before actions. This is treated as normal filtering
213213+— no delivery log entry, no error. In dry-run mode the handler does write a
214214+"skipped by <fetchName>" log so authors can debug why the automation isn't
215215+firing.
216216+217217+## Interaction with fetch errors
218218+219219+A fetch **error** (bad URI, PDS unreachable, search throws) is distinct from a
220220+fetch **not-finding** anything. Errors are collected into the `errors` array
221221+on the `FetchResolution`; the entry is not added to the context, subsequent
222222+fetches that template against it get `undefined`. Dry-run surfaces these as
223223+"Fetch failed: <name>" entries; real runs log them to console but continue.
224224+225225+The typical pattern for "only act if X doesn't exist yet" therefore looks like:
226226+227227+```json
228228+{
229229+ "kind": "search",
230230+ "name": "existingMirror",
231231+ "repo": "{{self}}",
232232+ "collection": "id.sifa.graph.follow",
233233+ "where": [{ "field": "subject", "operator": "eq", "value": "{{event.commit.record.subject}}" }],
234234+ "limit": 1,
235235+ "conditions": [
236236+ { "field": "found", "operator": "not-exists", "value": "" }
237237+ ]
238238+}
239239+```
240240+241241+- Search resolves → `found: false` (nothing matched).
242242+- Condition `found not-exists` passes.
243243+- Handler proceeds to actions.
244244+245245+If the same subject already exists:
246246+247247+- Search resolves → `found: true` with the existing record.
248248+- Condition `found not-exists` fails.
249249+- `skip: true` bubbles up, actions never run, no log in production.
+178
docs/wanted-dids.md
···11+# `wantedDids` vs `event.did` conditions
22+33+Airglow exposes two overlapping ways to say "only run this automation for specific
44+accounts":
55+66+1. **`wantedDids`** — a list of DIDs set on the automation row. Passed to Jetstream
77+ as the `wantedDids` query parameter when the WebSocket subscription is opened.
88+2. **A trigger condition on `event.did`** — an entry in the automation's
99+ top-level `conditions` list using `field: "event.did"`. Evaluated in-process
1010+ by the matcher after Jetstream has already delivered the event.
1111+1212+Both can express "match only commits from DID X". They look redundant but live
1313+at different layers and have very different cost profiles. This document
1414+explains when each is appropriate, why both exist, and why some collections are
1515+forced to use `wantedDids` via `NSID_REQUIRES_DIDS`.
1616+1717+## Where each filter runs
1818+1919+```
2020+Jetstream firehose
2121+ │
2222+ │ wantedCollections + wantedDids filter here (server-side, AT Proto infra)
2323+ ▼
2424+Airglow WebSocket message
2525+ │
2626+ │ matchConditions(event, conditions, ownerDid) — in-process, per automation
2727+ ▼
2828+Fetches → actions
2929+```
3030+3131+`wantedDids` is a **subscription-level** filter: Jetstream never sends the event
3232+to Airglow in the first place. The trigger condition is an **in-process**
3333+filter: the event crosses the network, the worker parses the JSON, then the
3434+matcher decides the condition doesn't hold and drops it.
3535+3636+Two consequences follow from that difference:
3737+3838+- Only `wantedDids` reduces bandwidth and CPU on the Airglow worker.
3939+- Only a condition can combine DID filtering with other event shape checks
4040+ (e.g. "did X AND record.subject starts with Y").
4141+4242+## One subscription per canonical DID set
4343+4444+`JetstreamManager` partitions active automations by their resolved `wantedDids`
4545+list (`{{self}}` is expanded, entries deduped and sorted). Each distinct
4646+partition opens exactly one WebSocket; the empty partition is the "global"
4747+firehose subscription.
4848+4949+So adding `wantedDids` to an automation is not free at the subscription layer:
5050+if no other automation shares that exact DID set, a new WebSocket is opened for
5151+it. Conversely, many automations that share the same owner DID (via `{{self}}`)
5252+coalesce into a single subscription.
5353+5454+Consumers ([lib/jetstream/consumer.ts](../lib/jetstream/consumer.ts)):
5555+5656+```ts
5757+const resolvedDids = canonicalDids(row.wantedDids, row.did);
5858+const key = resolvedDids.join(",");
5959+// ...partition automations by `key`, then one JetstreamSubscription per partition.
6060+```
6161+6262+## Tradeoffs
6363+6464+### Prefer `wantedDids` when
6565+6666+- The NSID is **high-volume** (anything under `app.bsky.*`, `chat.bsky.*`,
6767+ etc.). Without a DID filter, Jetstream will fire thousands of events per
6868+ second just for `app.bsky.feed.post`.
6969+- The set of target accounts is **small and known** — typically just the owner
7070+ (`{{self}}`) or a handful of friends.
7171+- The filter is **stable**. Changing `wantedDids` triggers a subscription
7272+ reconfigure (and in the cross-partition case, reopens a WebSocket).
7373+7474+### Prefer a trigger condition on `event.did` when
7575+7676+- The NSID is **low-volume** (custom lexicons, niche collections). The global
7777+ subscription already carries the event at negligible cost.
7878+- You need to combine DID filtering with **other event-shape conditions**:
7979+8080+ ```json
8181+ {
8282+ "conditions": [
8383+ { "field": "event.did", "operator": "eq", "value": "did:plc:abc" },
8484+ { "field": "subject", "operator": "startsWith", "value": "did:plc:xyz" }
8585+ ]
8686+ }
8787+ ```
8888+8989+- You want a **small dynamic set** of DIDs without re-partitioning subscriptions
9090+ — e.g. block/allow lists that change often.
9191+9292+### When you'd use both
9393+9494+Nothing stops you from setting `wantedDids` *and* adding `event.did` conditions
9595+on top, and there's a real use case for it: `wantedDids` narrows the firehose
9696+to a set of accounts cheaply, and the condition layer then applies additional
9797+constraints (record fields, subject DIDs, etc.) that Jetstream can't express.
9898+9999+## The high-volume case: `app.bsky.*` and `NSID_REQUIRES_DIDS`
100100+101101+Bluesky collections (`app.bsky.feed.post`, `app.bsky.graph.follow`, …) are the
102102+busiest on the network by several orders of magnitude. Subscribing to
103103+`app.bsky.feed.post` with no DID filter is effectively subscribing to the
104104+firehose — millions of events per hour. A single Airglow worker running the
105105+per-automation matcher against every one of those is not a feasible shape.
106106+107107+To protect the instance from an automation author accidentally doing that,
108108+[lib/config.ts](../lib/config.ts) exposes a third NSID list:
109109+110110+```ts
111111+// NSIDs listed here are only allowed when the automation declares a non-empty
112112+// wantedDids. Used to gate high-volume collections (e.g. app.bsky.*) on
113113+// Jetstream-level DID filtering instead of a blanket firehose subscription.
114114+nsidRequireDids: env("NSID_REQUIRES_DIDS", "").split(",").filter(Boolean),
115115+```
116116+117117+The manager checks this list during partitioning
118118+([lib/jetstream/consumer.ts](../lib/jetstream/consumer.ts)):
119119+120120+```ts
121121+if (
122122+ nsidRequiresWantedDids(row.lexicon, config.nsidRequireDids) &&
123123+ resolvedDids.length === 0
124124+) {
125125+ console.warn(
126126+ `Jetstream: skipping ${row.uri} — ${row.lexicon} requires wantedDids but none are set`,
127127+ );
128128+ continue;
129129+}
130130+```
131131+132132+`nsidRequiresWantedDids` does glob matching — a pattern ending in `.*` matches
133133+by prefix ([lib/lexicons/match.ts](../lib/lexicons/match.ts)). So a typical
134134+production config sets:
135135+136136+```
137137+NSID_REQUIRES_DIDS=app.bsky.*,chat.bsky.*
138138+```
139139+140140+and any automation listening to those collections **must** declare a non-empty
141141+`wantedDids`. Automations that violate the rule are silently skipped (with a
142142+warning log) at partition time — they won't crash the manager, they just don't
143143+get a subscription.
144144+145145+A trigger condition on `event.did` does **not** satisfy the gate: the check is
146146+structural (`resolvedDids.length === 0`), because the whole point is to avoid
147147+opening the firehose subscription in the first place. If `wantedDids` is empty,
148148+the automation would end up in the global partition regardless of what its
149149+conditions say.
150150+151151+### Why the gate is config-driven rather than baked in
152152+153153+The three NSID env vars — `NSID_ALLOWLIST`, `NSID_BLOCKLIST`,
154154+`NSID_REQUIRES_DIDS` — together let an operator shape what an instance allows
155155+without code changes:
156156+157157+- `NSID_ALLOWLIST` / `NSID_BLOCKLIST` control **which** collections can be
158158+ subscribed to at all.
159159+- `NSID_REQUIRES_DIDS` controls **how** the expensive ones may be subscribed to.
160160+161161+A hobby instance can leave all three empty and run against the full firehose.
162162+A shared instance typically wants `app.bsky.*` in `NSID_REQUIRES_DIDS` so
163163+users can still automate on Bluesky events, but only scoped to accounts they
164164+care about.
165165+166166+## Rules of thumb
167167+168168+| Situation | Use |
169169+| ---------------------------------------------------------- | ----------------- |
170170+| Automation listens to an `app.bsky.*` collection | `wantedDids` |
171171+| Owner-only automation on a custom lexicon | Either; `wantedDids` preferred |
172172+| "Anyone posting about topic X on `run.airglow.*`" | Condition on record fields, no DID filter |
173173+| Stable list of ≤ a few hundred DIDs | `wantedDids` |
174174+| Rapidly changing DID allowlist | `event.did` condition |
175175+| Filter by DID **and** by record-shape in the same rule | `wantedDids` + conditions |
176176+177177+When in doubt: if the NSID is in `NSID_REQUIRES_DIDS`, you don't get a choice
178178+— the manager will skip the automation until `wantedDids` is populated.
+4-1
lib/actions/searcher.ts
···99} from "./template.js";
10101111const BSKY_APPVIEW = "https://api.bsky.app";
1212+// `com.atproto.repo.listRecords` caps `limit` at 100 per page in the lexicon,
1313+// so 100 is the ceiling regardless of what we'd prefer. To scan more records
1414+// we can only raise the page count.
1215const LIST_RECORDS_PAGE_SIZE = 100;
1313-const MAX_LIST_RECORDS_PAGES = 20;
1616+const MAX_LIST_RECORDS_PAGES = 100; // → 10k records scanned at most
1417const HTTP_TIMEOUT_MS = 10_000;
15181619function render(template: string, event: JetstreamEvent, upstream: FetchContext, ownerDid: string) {