docs: document data source fetches and wanted-dids · exosphere.site/airglow@b81fb65

+249

docs/data-source-fetches.md

# Data source fetches

An automation can declare a list of **fetches** that run after the trigger
event matches but before actions fire. Each fetch resolves to a named entry in
the `fetchContext`, which action templates can reference via
`{{fetchName.record.*}}` and per-fetch conditions can gate on.

There are two kinds today, discriminated by the `kind` field on the stored
`FetchStep` ([lib/db/schema.ts](../lib/db/schema.ts)):

- **`kind: "record"`** — resolve a specific AT URI to its record. The default
  if `kind` is absent (legacy rows).
- **`kind: "search"`** — find a record in a repo by field equality. Used to
  answer "does a record already exist that matches X?" before acting.

Both kinds return a `FetchContextEntry` with a `found` flag; neither throws on
"not found" — that signal is observed via `exists` / `not-exists` conditions
on the fetch.

```ts
type FetchContextEntry = {
  found: boolean;
  uri: string;
  cid: string;
  did?: string;
  collection?: string;
  rkey?: string;
  record: Record<string, unknown>;
};
```

## Record fetches

The simple shape. One HTTP request, O(1), no pagination.

```json
{
  "kind": "record",
  "name": "parentPost",
  "uri": "at://{{event.commit.record.reply.parent.uri}}"
}
```

### Resolution

[lib/actions/fetcher.ts](../lib/actions/fetcher.ts):

1. The `uri` template is rendered against the event + upstream context + owner
   DID. `{{self}}`, `{{event.*}}`, and `{{otherFetch.*}}` all work.
2. The rendered string is validated against the AT URI shape
   (`at://did/collection/rkey`). Non-AT-URI → the fetch errors with a log entry.
3. The URI is fetched via `fetchRecord` ([lib/pds/resolver.ts](../lib/pds/resolver.ts)),
   which resolves the DID to its PDS and calls `com.atproto.repo.getRecord`.
4. A 404 writes `found: false`; any other failure is treated as an error.

Record fetches are **independent of each other** and run in parallel via
`Promise.all`. Their `conditions` are evaluated once all record fetches have
resolved (and before any search fetches begin).

### When to use

- Enriching the event with the parent post, the quoted record, the followed
  subject's profile, etc.
- Any case where you already know the exact AT URI to look up.

## Search fetches

The richer shape. Answers "is there a record in this repo/collection whose
field X equals Y?" Used primarily to prevent duplicates in mirror-style
automations — e.g. "only create a Sifa follow if no Sifa follow for this
subject already exists."

```json
{
  "kind": "search",
  "name": "existingMirror",
  "repo": "{{self}}",
  "collection": "id.sifa.graph.follow",
  "where": [
    { "field": "subject", "operator": "eq", "value": "{{event.commit.record.subject}}" }
  ],
  "limit": 1,
  "conditions": [
    { "field": "found", "operator": "not-exists", "value": "" }
  ]
}
```

### Inputs

- `repo` — template that must resolve to a DID. Typically `{{self}}` but may
  reference an event field or an upstream fetch.
- `collection` — literal NSID (not a template). The collection to scan.
- `where` — list of equality clauses. Currently only `operator: "eq"` is
  supported. Multiple clauses are ANDed.
- `limit` — max number of matches to accept. Defaults to 1. The current
  executor always returns the *first* match as the context entry; `limit` just
  controls how many matches the executor is willing to find before stopping.
- `conditions` — per-fetch conditions evaluated after the search resolves.
  Typically `found` + `exists` / `not-exists` to gate on presence.

### Execution strategy

[lib/actions/searcher.ts](../lib/actions/searcher.ts):

Search doesn't have a single primitive in AT Proto; there's no server-side
equivalent of "get a record by field equality." The executor picks one of two
strategies.

#### 1. Appview fast-path (Bluesky follows only)

The specific case of "does `actor` follow `subject` on Bluesky" is answered in
O(1) by the Bluesky appview's `app.bsky.graph.getRelationships` endpoint. The
executor detects this shape:

```ts
if (step.collection === "app.bsky.graph.follow") {
  const subjectClause = hasOnlyClause(step, "subject", "eq");
  if (subjectClause) {
    // → single appview request, parses the `following` AT URI out of the response
  }
}
```

Trigger conditions:

- `collection === "app.bsky.graph.follow"`
- Exactly one `where` clause
- That clause is `subject eq <DID>`

Any other shape on `app.bsky.graph.follow` falls through to the generic path.

If the appview returns no `following` URI for the subject, the entry is
`notFoundEntry()` (i.e. `found: false`). This is the correct answer — the user
doesn't follow the subject — and it's distinct from an appview transport
failure, which returns `null` and falls through to `listRecords`.

#### 2. Generic `listRecords` scan

The fallback. Resolves the repo's PDS endpoint, paginates
`com.atproto.repo.listRecords`, and filters results client-side against the
`where` clauses.

```
LIST_RECORDS_PAGE_SIZE = 100   // capped by the listRecords lexicon
MAX_LIST_RECORDS_PAGES = 100   // → 10k records scanned at most
HTTP_TIMEOUT_MS        = 10s   // per page
```

The page size is a hard ceiling: `com.atproto.repo.listRecords` declares
`limit` as `minimum: 1, maximum: 100, default: 50`, so spec-compliant PDSs
will reject or clamp anything higher. To scan further we can only add pages.

Per page:

1. Fetch up to 100 records via `listRecords`.
2. For each record, check every `where` clause via dotted-path read (same
   machinery the condition layer uses, `readPath` on the record value).
3. Collect matches until `limit` is reached; the first match is what ends up
   in the context entry.
4. If the PDS returns a `cursor`, continue; otherwise stop.

If the scan completes without finding a match, the entry is `notFoundEntry()`.
If the 100-page cap is hit *and* the cursor would continue, a warning is
logged and the entry is still `notFoundEntry()` — we intentionally treat an
exhausted scan as "not found" rather than erroring, because the usual caller
is a `not-exists` gate and false negatives are preferable to hard failures.
This is a tradeoff worth understanding: for collections large enough to
exceed 10,000 records, the "does X exist?" answer may be incorrect. For the
motivating use case — follow lists — this covers the vast majority of
accounts; accounts that follow more than 10k others may see false not-found
results until the cap is raised or replaced with an indexed data source.

### Cost

- **Appview path**: one HTTP request, a few KB, returned in tens of ms.
- **listRecords path**: up to 100 HTTP requests (each up to 100 records). A
  small repo resolves in one page; a repo on the order of 10k records is
  typically a few seconds when the PDS is healthy. Each page is 10s-capped
  individually, so the theoretical worst case is 1000s — and because
  searches run sequentially in the fetcher, that serializes with any other
  searches in the same automation. In practice that worst case only shows up
  when a PDS is degraded or throttling; budget for a few seconds, plan for
  a few more.

Searches are **sequential** in the fetcher — unlike record fetches which run in
parallel — because a search's `repo` or `where` clauses can template against
upstream fetch results. Running them serially keeps that dependency model
straightforward for the MVP.

### The `where` clause model is deliberately narrow

The current shape (equality only, always AND) is intentional. AT Proto has no
server-side query language for record fields, so every operator added to
`where` must be evaluated in `listRecords`-scan code. Equality covers the
anti-duplicate use case that motivated search; richer operators would encourage
queries that are quietly expensive on large repos.

## Per-fetch conditions

Both fetch kinds support a `conditions` array. These run *after* the fetch
resolves and are evaluated against the fetch's own entry — paths are
entry-scoped, not event-scoped:

- `field: "found"` + `exists` / `not-exists` tests the boolean flag directly
  (special-cased, because stringifying `false` would otherwise read as
  non-empty).
- `field: "record.subject"` walks into the fetched record.
- `field: "uri"` / `field: "cid"` test the top-level entry fields.

If any condition on any fetch fails, `resolveFetches` returns `skip: true` and
the handler short-circuits before actions. This is treated as normal filtering
— no delivery log entry, no error. In dry-run mode the handler does write a
"skipped by <fetchName>" log so authors can debug why the automation isn't
firing.

## Interaction with fetch errors

A fetch **error** (bad URI, PDS unreachable, search throws) is distinct from a
fetch **not-finding** anything. Errors are collected into the `errors` array
on the `FetchResolution`; the entry is not added to the context, so subsequent
fetches that template against it get `undefined`. Dry-run surfaces these as
"Fetch failed: <name>" entries; real runs log them to console but continue.

The typical pattern for "only act if X doesn't exist yet" therefore looks like:

```json
{
  "kind": "search",
  "name": "existingMirror",
  "repo": "{{self}}",
  "collection": "id.sifa.graph.follow",
  "where": [{ "field": "subject", "operator": "eq", "value": "{{event.commit.record.subject}}" }],
  "limit": 1,
  "conditions": [
    { "field": "found", "operator": "not-exists", "value": "" }
  ]
}
```

- Search resolves → `found: false` (nothing matched).
- Condition `found not-exists` passes.
- Handler proceeds to actions.

If the same subject already exists:

- Search resolves → `found: true` with the existing record.
- Condition `found not-exists` fails.
- `skip: true` bubbles up, actions never run, no log in production.
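The "only act if X doesn't exist yet" gate can be sketched end to end as a few pure functions. This is a minimal illustration, not the code in `lib/actions/searcher.ts`: the simplified types, `matchesWhere`, `searchRecords`, `conditionHolds`, and `shouldSkip` are hypothetical names, the records are passed in as an array instead of being paged from a PDS, and comparing values as strings is an assumption.

```typescript
// Hypothetical, simplified shapes; the real types live in lib/db/schema.ts.
type WhereClause = { field: string; operator: "eq"; value: string };
type Condition = { field: string; operator: "exists" | "not-exists"; value: string };
type Entry = { found: boolean; uri: string; cid: string; record: Record<string, unknown> };
type ListedRecord = { uri: string; cid: string; value: Record<string, unknown> };

// Dotted-path read: walks "a.b.c" through nested objects, undefined if any hop is missing.
function readPath(source: unknown, path: string): unknown {
  let current: unknown = source;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Every clause must hold (AND). String comparison is an assumption of this sketch.
function matchesWhere(value: Record<string, unknown>, where: WhereClause[]): boolean {
  return where.every((clause) => {
    const actual = readPath(value, clause.field);
    return actual !== undefined && String(actual) === clause.value;
  });
}

function notFoundEntry(): Entry {
  return { found: false, uri: "", cid: "", record: {} };
}

// The first matching record becomes the context entry.
function searchRecords(records: ListedRecord[], where: WhereClause[]): Entry {
  const hit = records.find((r) => matchesWhere(r.value, where));
  return hit ? { found: true, uri: hit.uri, cid: hit.cid, record: hit.value } : notFoundEntry();
}

// `found` is read as a boolean; any other field counts as present when non-empty.
function conditionHolds(entry: Entry, condition: Condition): boolean {
  let present: boolean;
  if (condition.field === "found") {
    present = entry.found;
  } else {
    const v = readPath(entry, condition.field);
    present = v !== undefined && v !== null && String(v) !== "";
  }
  return condition.operator === "exists" ? present : !present;
}

// Any failing condition means the handler skips the actions.
function shouldSkip(entry: Entry, conditions: Condition[]): boolean {
  return conditions.some((c) => !conditionHolds(entry, c));
}

const existing: ListedRecord[] = [
  { uri: "at://did:plc:me/id.sifa.graph.follow/3k1", cid: "bafy1", value: { subject: "did:plc:alice" } },
];
const gate: Condition[] = [{ field: "found", operator: "not-exists", value: "" }];

const alice = searchRecords(existing, [{ field: "subject", operator: "eq", value: "did:plc:alice" }]);
const bob = searchRecords(existing, [{ field: "subject", operator: "eq", value: "did:plc:bob" }]);
console.log(shouldSkip(alice, gate)); // true: a mirror already exists, actions are skipped
console.log(shouldSkip(bob, gate)); // false: nothing matched, actions run
```

Keeping the `found` check separate from the string-based presence test mirrors the special case described under per-fetch conditions: `String(false)` is the non-empty string `"false"`, so a generic emptiness test would report the flag as present.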

+178

docs/wanted-dids.md

# `wantedDids` vs `event.did` conditions

Airglow exposes two overlapping ways to say "only run this automation for specific
accounts":

1. **`wantedDids`** — a list of DIDs set on the automation row. Passed to Jetstream
   as the `wantedDids` query parameter when the WebSocket subscription is opened.
2. **A trigger condition on `event.did`** — an entry in the automation's
   top-level `conditions` list using `field: "event.did"`. Evaluated in-process
   by the matcher after Jetstream has already delivered the event.

Both can express "match only commits from DID X". They look redundant but live
at different layers and have very different cost profiles. This document
explains when each is appropriate, why both exist, and why some collections are
forced to use `wantedDids` via `NSID_REQUIRES_DIDS`.

## Where each filter runs

```
Jetstream firehose
        │
        │  wantedCollections + wantedDids filter here (server-side, AT Proto infra)
        ▼
Airglow WebSocket message
        │
        │  matchConditions(event, conditions, ownerDid) — in-process, per automation
        ▼
Fetches → actions
```

`wantedDids` is a **subscription-level** filter: Jetstream never sends the event
to Airglow in the first place. The trigger condition is an **in-process**
filter: the event crosses the network, the worker parses the JSON, then the
matcher decides the condition doesn't hold and drops it.

Two consequences follow from that difference:

- Only `wantedDids` reduces bandwidth and CPU on the Airglow worker.
- Only a condition can combine DID filtering with other event shape checks
  (e.g. "did X AND record.subject starts with Y").

## One subscription per canonical DID set

`JetstreamManager` partitions active automations by their resolved `wantedDids`
list (`{{self}}` is expanded, entries deduped and sorted). Each distinct
partition opens exactly one WebSocket; the empty partition is the "global"
firehose subscription.

So adding `wantedDids` to an automation is not free at the subscription layer:
if no other automation shares that exact DID set, a new WebSocket is opened for
it. Conversely, many automations that share the same owner DID (via `{{self}}`)
coalesce into a single subscription.

Consumers ([lib/jetstream/consumer.ts](../lib/jetstream/consumer.ts)):

```ts
const resolvedDids = canonicalDids(row.wantedDids, row.did);
const key = resolvedDids.join(",");
// ...partition automations by `key`, then one JetstreamSubscription per partition.
```

## Tradeoffs

### Prefer `wantedDids` when

- The NSID is **high-volume** (anything under `app.bsky.*`, `chat.bsky.*`,
  etc.). Without a DID filter, Jetstream will fire thousands of events per
  second just for `app.bsky.feed.post`.
- The set of target accounts is **small and known** — typically just the owner
  (`{{self}}`) or a handful of friends.
- The filter is **stable**. Changing `wantedDids` triggers a subscription
  reconfigure (and in the cross-partition case, reopens a WebSocket).

### Prefer a trigger condition on `event.did` when

- The NSID is **low-volume** (custom lexicons, niche collections). The global
  subscription already carries the event at negligible cost.
- You need to combine DID filtering with **other event-shape conditions**:

  ```json
  {
    "conditions": [
      { "field": "event.did", "operator": "eq", "value": "did:plc:abc" },
      { "field": "subject", "operator": "startsWith", "value": "did:plc:xyz" }
    ]
  }
  ```

- You want a **small dynamic set** of DIDs without re-partitioning subscriptions
  — e.g. block/allow lists that change often.

### When you'd use both

Nothing stops you from setting `wantedDids` *and* adding `event.did` conditions
on top, and there's a real use case for it: `wantedDids` narrows the firehose
to a set of accounts cheaply, and the condition layer then applies additional
constraints (record fields, subject DIDs, etc.) that Jetstream can't express.

## The high-volume case: `app.bsky.*` and `NSID_REQUIRES_DIDS`

Bluesky collections (`app.bsky.feed.post`, `app.bsky.graph.follow`, …) are the
busiest on the network by several orders of magnitude. Subscribing to
`app.bsky.feed.post` with no DID filter is effectively subscribing to the
firehose — millions of events per hour. Running the per-automation matcher
against every one of those on a single Airglow worker is not feasible.

To protect the instance from an automation author accidentally doing that,
[lib/config.ts](../lib/config.ts) exposes a third NSID list:

```ts
// NSIDs listed here are only allowed when the automation declares a non-empty
// wantedDids. Used to gate high-volume collections (e.g. app.bsky.*) on
// Jetstream-level DID filtering instead of a blanket firehose subscription.
nsidRequireDids: env("NSID_REQUIRES_DIDS", "").split(",").filter(Boolean),
```

The manager checks this list during partitioning
([lib/jetstream/consumer.ts](../lib/jetstream/consumer.ts)):

```ts
if (
  nsidRequiresWantedDids(row.lexicon, config.nsidRequireDids) &&
  resolvedDids.length === 0
) {
  console.warn(
    `Jetstream: skipping ${row.uri} — ${row.lexicon} requires wantedDids but none are set`,
  );
  continue;
}
```

`nsidRequiresWantedDids` does glob matching — a pattern ending in `.*` matches
by prefix ([lib/lexicons/match.ts](../lib/lexicons/match.ts)). So a typical
production config sets:

```
NSID_REQUIRES_DIDS=app.bsky.*,chat.bsky.*
```

and any automation listening to those collections **must** declare a non-empty
`wantedDids`. Automations that violate the rule are skipped (with a
warning log) at partition time — they won't crash the manager, they just don't
get a subscription.

A trigger condition on `event.did` does **not** satisfy the gate: the check is
structural (`resolvedDids.length === 0`), because the whole point is to avoid
opening the firehose subscription in the first place. If `wantedDids` is empty,
the automation would end up in the global partition regardless of what its
conditions say.

### Why the gate is config-driven rather than baked in

The three NSID env vars — `NSID_ALLOWLIST`, `NSID_BLOCKLIST`,
`NSID_REQUIRES_DIDS` — together let an operator shape what an instance allows
without code changes:

- `NSID_ALLOWLIST` / `NSID_BLOCKLIST` control **which** collections can be
  subscribed to at all.
- `NSID_REQUIRES_DIDS` controls **how** the expensive ones may be subscribed to.

A hobby instance can leave all three empty and run against the full firehose.
A shared instance typically wants `app.bsky.*` in `NSID_REQUIRES_DIDS` so
users can still automate on Bluesky events, but only scoped to accounts they
care about.

## Rules of thumb

| Situation                                              | Use                                       |
| ------------------------------------------------------ | ----------------------------------------- |
| Automation listens to an `app.bsky.*` collection       | `wantedDids`                              |
| Owner-only automation on a custom lexicon              | Either; `wantedDids` preferred            |
| "Anyone posting about topic X on `run.airglow.*`"      | Condition on record fields, no DID filter |
| Stable list of ≤ a few hundred DIDs                    | `wantedDids`                              |
| Rapidly changing DID allowlist                         | `event.did` condition                     |
| Filter by DID **and** by record-shape in the same rule | `wantedDids` + conditions                 |

When in doubt: if the NSID is in `NSID_REQUIRES_DIDS`, you don't get a choice
— the manager will skip the automation until `wantedDids` is populated.
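The partitioning and gating rules can be sketched as three small functions. This is an illustration under stated assumptions, not the code in `lib/jetstream/consumer.ts` or `lib/lexicons/match.ts`: `canonicalDids` and `nsidRequiresWantedDids` reuse the names quoted in this document, but their bodies are guesses based on the prose (expand `{{self}}`, dedupe, sort; `.*` suffix means prefix match), and `AutomationRow` and `partition` are hypothetical.

```typescript
// Hypothetical row shape, reduced to the fields this document mentions.
type AutomationRow = { uri: string; did: string; lexicon: string; wantedDids: string[] };

// Expand `{{self}}` to the owner DID, then dedupe and sort so equal sets share one key.
function canonicalDids(wantedDids: string[], ownerDid: string): string[] {
  const expanded = wantedDids.map((d) => (d === "{{self}}" ? ownerDid : d));
  return Array.from(new Set(expanded)).sort();
}

// A pattern ending in `.*` matches by prefix; any other pattern must match exactly.
function nsidRequiresWantedDids(nsid: string, patterns: string[]): boolean {
  return patterns.some((p) =>
    p.endsWith(".*") ? nsid.startsWith(p.slice(0, -1)) : nsid === p,
  );
}

// Group rows by canonical DID set; the "" key is the global firehose partition.
function partition(rows: AutomationRow[], requireDids: string[]): Map<string, AutomationRow[]> {
  const partitions = new Map<string, AutomationRow[]>();
  for (const row of rows) {
    const resolved = canonicalDids(row.wantedDids, row.did);
    if (nsidRequiresWantedDids(row.lexicon, requireDids) && resolved.length === 0) {
      continue; // gated collection without a DID filter: no subscription
    }
    const key = resolved.join(",");
    const bucket = partitions.get(key) ?? [];
    bucket.push(row);
    partitions.set(key, bucket);
  }
  return partitions;
}

const rows: AutomationRow[] = [
  { uri: "a", did: "did:plc:me", lexicon: "app.bsky.feed.post", wantedDids: ["{{self}}"] },
  { uri: "b", did: "did:plc:me", lexicon: "app.bsky.graph.follow", wantedDids: ["did:plc:me", "{{self}}"] },
  { uri: "c", did: "did:plc:other", lexicon: "app.bsky.feed.post", wantedDids: [] },
  { uri: "d", did: "did:plc:other", lexicon: "run.airglow.example", wantedDids: [] },
];

const result = partition(rows, ["app.bsky.*", "chat.bsky.*"]);
console.log(result.size); // 2: one partition for did:plc:me, one global
```

Rows `a` and `b` coalesce because their DID sets canonicalize to the same key, row `c` is dropped by the gate, and row `d` lands in the global partition because its lexicon is not gated.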

+4 -1

lib/actions/searcher.ts

···
 } from "./template.js";

 const BSKY_APPVIEW = "https://api.bsky.app";
+// `com.atproto.repo.listRecords` caps `limit` at 100 per page in the lexicon,
+// so 100 is the ceiling regardless of what we'd prefer. To scan more records
+// we can only raise the page count.
 const LIST_RECORDS_PAGE_SIZE = 100;
-const MAX_LIST_RECORDS_PAGES = 20;
+const MAX_LIST_RECORDS_PAGES = 100; // → 10k records scanned at most
 const HTTP_TIMEOUT_MS = 10_000;

 function render(template: string, event: JetstreamEvent, upstream: FetchContext, ownerDid: string) {
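The effect of raising the page cap can be worked out from the three constants. The helper below is a hypothetical sketch, not part of `searcher.ts`; it assumes pages are fetched one after another and that the worst case is every page taking its full timeout.

```typescript
type ScanBudget = { maxRecords: number; worstCaseSeconds: number };

// Upper bounds for a sequential listRecords scan: records seen and wall-clock time.
function scanBudget(pageSize: number, maxPages: number, timeoutMs: number): ScanBudget {
  return {
    maxRecords: pageSize * maxPages,
    worstCaseSeconds: (maxPages * timeoutMs) / 1000,
  };
}

const before = scanBudget(100, 20, 10_000); // 2,000 records, 200 s worst case
const after = scanBudget(100, 100, 10_000); // 10,000 records, 1,000 s worst case
console.log(before, after);
```

Raising the cap from 20 to 100 pages increases both the records covered and the theoretical worst-case latency fivefold, since the page size cannot exceed 100.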
