very fast at protocol indexer with flexible filtering, xrpc queries, cursor-backed event stream, and more, built on fjall
rust fjall at-protocol atproto indexer
58
fork

Configure Feed

Select the types of activity you want to include in your feed.

[config] load .env

dawn 940dff13 76f20936

+93 -23
+63 -22
README.md
··· 1 1 # hydrant 2 2 3 - `hydrant` is an AT Protocol indexer built on the `fjall` database. it's built to be flexible, supporting both full-network indexing and filtered indexing (e.g., by DID), allowing querying with XRPCs (not only `com.atproto.*`!), providing an ordered event stream, etc. 3 + `hydrant` is an AT Protocol indexer built on the `fjall` database. it's built to 4 + be flexible, supporting both full-network indexing and filtered indexing (e.g., 5 + by DID), allowing querying with XRPCs (not only `com.atproto.*`!), providing an 6 + ordered event stream, etc. 4 7 5 - you can see [random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place) (standalone binary using http API) or the [statusphere example](./examples/statusphere.rs) (hydrant-as-library) for examples on how to use hydrant. 8 + you can see 9 + [random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place) 10 + (standalone binary using http API) or the [statusphere 11 + example](./examples/statusphere.rs) (hydrant-as-library) for examples on how to 12 + use hydrant. 6 13 7 - **WARNING: *the db format is not stable yet.*** it's in active development so if you are going to rely on the db format being stable, don't (eg. for query features, if you are using ephemeral mode this doesn't matter for example, or you dont mind losing your existing backfilled data in hydrant if you already processed them.). 14 + **WARNING: *the db format is not stable yet.*** it's in active development so if 15 + you are going to rely on the db format being stable, don't (eg. for query 16 + features, if you are using ephemeral mode this doesn't matter for example, or 17 + you dont mind losing your existing backfilled data in hydrant if you already 18 + processed them.). 8 19 9 20 ## vs `tap` 10 21 11 - while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is designed as a firehose consumer and simply just propagates events while handling sync, `hydrant` is flexible, it allows you to directly query the database for records, and it also provides an ordered view of events, allowing the use of a cursor to fetch events from a specific point. it can act as both an indexer or an ephemeral view of some window of events. 22 + while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is 23 + designed as a firehose consumer and simply just propagates events while handling 24 + sync, `hydrant` is flexible, it allows you to directly query the database for 25 + records, and it also provides an ordered view of events, allowing the use of a 26 + cursor to fetch events from a specific point. it can act as both an indexer or 27 + an ephemeral view of some window of events. 12 28 13 29 ### stream behavior 14 30 ··· 24 40 25 41 ### multiple relay support 26 42 27 - `hydrant` supports connecting to multiple relays simultaneously for firehose ingestion. when `RELAY_HOSTS` is configured with multiple URLs: 43 + `hydrant` supports connecting to multiple relays simultaneously for firehose 44 + ingestion. when `RELAY_HOSTS` is configured with multiple URLs: 28 45 29 46 - one independent firehose stream loop is spawned per relay 30 47 - each relay maintains its own firehose cursor state 31 48 - all ingestion loops share the same worker pool and database 32 49 33 - commit events are de-duplicated according to the repo `rev`. account / identity events are de-duplicated using the `time` field. 34 - todo: decide what to do on relay-side account takedowns or if relays set the `time` field. 50 + commit events are de-duplicated according to the repo `rev`. account / identity 51 + events are de-duplicated using the `time` field. todo: decide what to do on 52 + relay-side account takedowns or if relays set the `time` field. 35 53 36 54 ### crawler sources 37 55 38 - the crawler is configured separately from the firehose via `CRAWLER_URLS`. each source is a `[mode::]url` entry where the mode prefix is optional and defaults to `by_collection` in filter mode or `relay` in full-network mode. 56 + the crawler is configured separately from the firehose via `CRAWLER_URLS`. each 57 + source is a `[mode::]url` entry where the mode prefix is optional and defaults 58 + to `by_collection` in filter mode or `relay` in full-network mode. 39 59 40 - - `relay`: enumerates the network via `com.atproto.sync.listRepos`, then checks each repo's collections via `describeRepo`. used for full-network discovery. 41 - - `by_collection`: queries `com.atproto.sync.listReposByCollection` for each configured signal. more efficient for filtered indexing since it only surfaces repos that have matching records. 42 - cursors are stored per collection. 60 + - `relay`: enumerates the network via `com.atproto.sync.listRepos`, then checks 61 + each repo's collections via `describeRepo`. used for full-network discovery. 62 + - `by_collection`: queries `com.atproto.sync.listReposByCollection` for each 63 + configured signal. more efficient for filtered indexing since it only surfaces 64 + repos that have matching records. cursors are stored per collection. 43 65 44 66 ``` 45 67 CRAWLER_URLS=by_collection::https://lightrail.microcosm.blue,relay::wss://bsky.network ··· 49 71 50 72 ## configuration 51 73 52 - `hydrant` is configured via environment variables. all variables are prefixed with `HYDRANT_` (except `RUST_LOG`). 74 + `hydrant` is configured via environment variables. all variables are prefixed 75 + with `HYDRANT_` (except `RUST_LOG`). if a `.env` file exists in the working 76 + directory, it will also be loaded automatically. 53 77 54 78 | variable | default | description | 55 79 | :--- | :--- | :--- | ··· 91 115 92 116 - `GET /ingestion`: get the current ingestion status. 93 117 - returns `{ "crawler": bool, "firehose": bool, "backfill": bool }`. 94 - - `PATCH /ingestion`: enable or disable ingestion components at runtime without restarting. 95 - - body: `{ "crawler"?: bool, "firehose"?: bool, "backfill"?: bool }` — only provided fields are updated. 96 - - when disabled, each component finishes its current task before pausing (e.g. the backfill worker completes any in-flight repo syncs, the firehose finishes processing the current message). they resume immediately when re-enabled. 118 + - `PATCH /ingestion`: enable or disable ingestion components at runtime without 119 + restarting. 120 + - body: `{ "crawler"?: bool, "firehose"?: bool, "backfill"?: bool }` — only 121 + provided fields are updated. 122 + - when disabled, each component finishes its current task before pausing (e.g. 123 + the backfill worker completes any in-flight repo syncs, the firehose 124 + finishes processing the current message). they resume immediately when 125 + re-enabled. 97 126 98 127 #### database operations 99 128 100 - - `POST /db/train`: train zstd compression dictionaries for the `repos`, `blocks`, and `events` keyspaces. dictionaries are written to disk; a restart is required to apply them. the crawler, firehose, and backfill worker are paused for the duration and restored on completion. 101 - - `POST /db/compact`: trigger a full major compaction of all database keyspaces in parallel. the crawler, firehose, and backfill worker are paused for the duration and restored on completion. 102 - - `DELETE /cursors`: reset all stored cursors for a given URL. body: `{ "key": "..." }` where key is a URL. clears the relay crawler cursor, and any by-collection cursors associated with that URL. causes the next crawler pass to restart from the beginning. 129 + - `POST /db/train`: train zstd compression dictionaries for the `repos`, 130 + `blocks`, and `events` keyspaces. dictionaries are written to disk; a restart 131 + is required to apply them. the crawler, firehose, and backfill worker are 132 + paused for the duration and restored on completion. 133 + - `POST /db/compact`: trigger a full major compaction of all database keyspaces 134 + in parallel. the crawler, firehose, and backfill worker are paused for the 135 + duration and restored on completion. 136 + - `DELETE /cursors`: reset all stored cursors for a given URL. body: `{ "key": "..." }` 137 + where key is a URL. clears the relay crawler cursor, and any by-collection cursors 138 + associated with that URL. causes the next crawler pass to restart from the beginning. 103 139 104 140 #### filter mode 105 141 ··· 139 175 - `limit`: max results (default 100, max 1000) 140 176 - `cursor`: opaque key for paginating. 141 177 - `partition`: `all` (default), `pending` (backfill queue), or `resync` (retries) 142 - - `GET /repos/{did}`: get the sync status and metadata of a specific repository. also returns the handle, PDS URL and the atproto signing key (these won't be available before the repo has been backfilled once at least). 178 + - `GET /repos/{did}`: get the sync status and metadata of a specific repository. 179 + also returns the handle, PDS URL and the atproto signing key (these won't be 180 + available before the repo has been backfilled once at least). 143 181 - `PUT /repos`: explicitly track repositories. accepts an NDJSON body of `{"did": "..."}` (or JSON array of the same). 144 182 - `DELETE /repos`: untrack repositories. accepts an NDJSON body of `{"did": "..."}` (or JSON array of the same). 145 183 ··· 182 220 183 221 ### blue.microcosm.links.* 184 222 185 - hydrant implements a subset of [microcosm constellation](https://constellation.microcosm.blue/) when it's built with the `backlinks` cargo feature (`cargo build --features backlinks`). 223 + hydrant implements a subset of [microcosm constellation](https://constellation.microcosm.blue/) 224 + when it's built with the `backlinks` cargo feature (`cargo build --features backlinks`). 186 225 187 - when enabled, hydrant indexes all AT URI and DID references found inside stored records into a reverse index. this lets you efficiently answer "what records link to this subject?". 226 + when enabled, hydrant indexes all AT URI and DID references found inside stored records into a 227 + reverse index. this lets you efficiently answer "what records link to this subject?". 188 228 189 229 #### blue.microcosm.links.getBacklinks 190 230 ··· 200 240 201 241 returns `{ backlinks: [{ uri, cid }], cursor? }`. 202 242 203 - results are ordered by source record rkey (ascending by default, descending when `reverse=true`). the cursor is stable across new insertions for TID rkey records. 243 + results are ordered by source record rkey (ascending by default, descending when `reverse=true`). 244 + the cursor is stable across new insertions for TID rkey records. 204 245 205 246 #### blue.microcosm.links.getBacklinksCount 206 247
+30 -1
src/config.rs
··· 5 5 use std::time::Duration; 6 6 use url::Url; 7 7 8 + /// loads `.env` from the current directory, setting any variables not already in the environment. 9 + fn load_dotenv() { 10 + let Ok(contents) = std::fs::read_to_string(".env") else { 11 + return; 12 + }; 13 + for line in contents.lines() { 14 + let line = line.trim(); 15 + if line.is_empty() || line.starts_with('#') { 16 + continue; 17 + } 18 + let Some((key, val)) = line.split_once('=') else { 19 + continue; 20 + }; 21 + let key = key.trim(); 22 + let val = val.trim(); 23 + let val = val 24 + .strip_prefix('"') 25 + .and_then(|v| v.strip_suffix('"')) 26 + .or_else(|| val.strip_prefix('\'').and_then(|v| v.strip_suffix('\''))) 27 + .unwrap_or(val); 28 + if std::env::var(key).is_err() { 29 + // SAFETY: single-threaded at startup; no other threads are reading env yet. 30 + unsafe { std::env::set_var(key, val) }; 31 + } 32 + } 33 + } 34 + 8 35 #[derive(Debug, Clone, Copy, PartialEq, Eq)] 9 36 pub enum CrawlerMode { 10 37 /// enumerate via `com.atproto.sync.listRepos`, then check signals with `describeRepo`. ··· 316 343 } 317 344 } 318 345 319 - /// reads and builds the config from environment variables. 346 + /// reads and builds the config from environment variables, loading `.env` first if present. 320 347 pub fn from_env() -> Result<Self> { 348 + load_dotenv(); 349 + 321 350 macro_rules! cfg { 322 351 (@val $key:expr) => { 323 352 std::env::var(concat!("HYDRANT_", $key))