···
 # hydrant

-`hydrant` is an AT Protocol indexer built on the `fjall` database. it's built to be flexible, supporting both full-network indexing and filtered indexing (e.g., by DID), allowing querying with XRPCs (not only `com.atproto.*`!), providing an ordered event stream, etc.
+`hydrant` is an AT Protocol indexer built on the `fjall` database. it's built to
+be flexible, supporting both full-network indexing and filtered indexing (e.g.,
+by DID), allowing querying with XRPCs (not only `com.atproto.*`!), providing an
+ordered event stream, etc.

-you can see [random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place) (standalone binary using http API) or the [statusphere example](./examples/statusphere.rs) (hydrant-as-library) for examples on how to use hydrant.
+you can see
+[random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place)
+(standalone binary using http API) or the [statusphere
+example](./examples/statusphere.rs) (hydrant-as-library) for examples on how to
+use hydrant.

-**WARNING: *the db format is not stable yet.*** it's in active development so if you are going to rely on the db format being stable, don't (eg. for query features, if you are using ephemeral mode this doesn't matter for example, or you dont mind losing your existing backfilled data in hydrant if you already processed them.).
+**WARNING: *the db format is not stable yet.*** it's in active development so if
+you are going to rely on the db format being stable, don't (eg. for query
+features, if you are using ephemeral mode this doesn't matter for example, or
+you dont mind losing your existing backfilled data in hydrant if you already
+processed them.).

 ## vs `tap`

-while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is designed as a firehose consumer and simply just propagates events while handling sync, `hydrant` is flexible, it allows you to directly query the database for records, and it also provides an ordered view of events, allowing the use of a cursor to fetch events from a specific point. it can act as both an indexer or an ephemeral view of some window of events.
+while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is
+designed as a firehose consumer and simply just propagates events while handling
+sync, `hydrant` is flexible, it allows you to directly query the database for
+records, and it also provides an ordered view of events, allowing the use of a
+cursor to fetch events from a specific point. it can act as both an indexer or
+an ephemeral view of some window of events.

 ### stream behavior

···

 ### multiple relay support

-`hydrant` supports connecting to multiple relays simultaneously for firehose ingestion. when `RELAY_HOSTS` is configured with multiple URLs:
+`hydrant` supports connecting to multiple relays simultaneously for firehose
+ingestion. when `RELAY_HOSTS` is configured with multiple URLs:

 - one independent firehose stream loop is spawned per relay
 - each relay maintains its own firehose cursor state
 - all ingestion loops share the same worker pool and database

-commit events are de-duplicated according to the repo `rev`. account / identity events are de-duplicated using the `time` field.
-todo: decide what to do on relay-side account takedowns or if relays set the `time` field.
+commit events are de-duplicated according to the repo `rev`. account / identity
+events are de-duplicated using the `time` field. todo: decide what to do on
+relay-side account takedowns or if relays set the `time` field.
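
as a rough illustration of the `rev`-based de-duplication described in this hunk, the sketch below keeps the newest `rev` seen per repo and drops any commit that is not strictly newer, so the same commit arriving from two relays is processed once. `RevDedup` and `accept` are hypothetical names, and the strict-ordering rule is an assumption (it relies on `rev` values being TIDs, which sort lexicographically in time order); hydrant's actual logic may differ.

```rust
use std::collections::HashMap;

/// hypothetical tracker: remembers the newest `rev` seen for each repo (keyed by DID).
#[derive(Default)]
struct RevDedup {
    latest: HashMap<String, String>,
}

impl RevDedup {
    /// returns true if the commit should be processed, false if it is a duplicate
    /// or older than what was already seen.
    /// assumption: `rev` is a TID, so plain string comparison matches time order.
    fn accept(&mut self, did: &str, rev: &str) -> bool {
        match self.latest.get(did) {
            Some(seen) if rev <= seen.as_str() => false,
            _ => {
                self.latest.insert(did.to_owned(), rev.to_owned());
                true
            }
        }
    }
}

fn main() {
    let mut dedup = RevDedup::default();
    // the same commit delivered by two relays: only the first copy is accepted
    println!("{}", dedup.accept("did:plc:example", "3jzfcijpj2z2a"));
    println!("{}", dedup.accept("did:plc:example", "3jzfcijpj2z2a"));
    // a newer rev for the same repo is accepted
    println!("{}", dedup.accept("did:plc:example", "3jzfcijpj2z2b"));
}
```

account / identity events would need a separate comparison on the `time` field, which this sketch does not cover.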

 ### crawler sources

-the crawler is configured separately from the firehose via `CRAWLER_URLS`. each source is a `[mode::]url` entry where the mode prefix is optional and defaults to `by_collection` in filter mode or `relay` in full-network mode.
+the crawler is configured separately from the firehose via `CRAWLER_URLS`. each
+source is a `[mode::]url` entry where the mode prefix is optional and defaults
+to `by_collection` in filter mode or `relay` in full-network mode.

-- `relay`: enumerates the network via `com.atproto.sync.listRepos`, then checks each repo's collections via `describeRepo`. used for full-network discovery.
-- `by_collection`: queries `com.atproto.sync.listReposByCollection` for each configured signal. more efficient for filtered indexing since it only surfaces repos that have matching records.
-cursors are stored per collection.
+- `relay`: enumerates the network via `com.atproto.sync.listRepos`, then checks
+  each repo's collections via `describeRepo`. used for full-network discovery.
+- `by_collection`: queries `com.atproto.sync.listReposByCollection` for each
+  configured signal. more efficient for filtered indexing since it only surfaces
+  repos that have matching records. cursors are stored per collection.
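
to make the `[mode::]url` entry format concrete, here is a minimal parsing sketch. `Mode`, `parse_source` and `parse_sources` are hypothetical names for illustration (not the `CrawlerMode` type from `src/config.rs`), and treating a `::` that sits inside the URL itself (e.g. an IPv6 host) as "no prefix" is an assumption.

```rust
/// hypothetical stand-in for the crawler mode; not hydrant's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Relay,
    ByCollection,
}

/// splits one `[mode::]url` entry; `default` is used when no mode prefix is present.
fn parse_source(entry: &str, default: Mode) -> Result<(Mode, &str), String> {
    match entry.split_once("::") {
        Some(("relay", url)) => Ok((Mode::Relay, url)),
        Some(("by_collection", url)) => Ok((Mode::ByCollection, url)),
        // a prefix containing `/` is part of a URL (e.g. `http://[::1]`), not a mode
        Some((prefix, _)) if prefix.contains('/') => Ok((default, entry)),
        Some((other, _)) => Err(format!("unknown crawler mode: {other}")),
        None => Ok((default, entry)),
    }
}

/// parses a comma-separated list of entries, skipping empty ones.
fn parse_sources(value: &str, default: Mode) -> Result<Vec<(Mode, &str)>, String> {
    value
        .split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .map(|entry| parse_source(entry, default))
        .collect()
}

fn main() {
    let value = "by_collection::https://lightrail.microcosm.blue,relay::wss://bsky.network";
    // in filter mode the default would be `by_collection`, per the text above
    println!("{:?}", parse_sources(value, Mode::ByCollection));
}
```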

 ```
 CRAWLER_URLS=by_collection::https://lightrail.microcosm.blue,relay::wss://bsky.network
···

 ## configuration

-`hydrant` is configured via environment variables. all variables are prefixed with `HYDRANT_` (except `RUST_LOG`).
+`hydrant` is configured via environment variables. all variables are prefixed
+with `HYDRANT_` (except `RUST_LOG`). if a `.env` file exists in the working
+directory, it will also be loaded automatically.

 | variable | default | description |
 | :--- | :--- | :--- |
···

 - `GET /ingestion`: get the current ingestion status.
   - returns `{ "crawler": bool, "firehose": bool, "backfill": bool }`.
-- `PATCH /ingestion`: enable or disable ingestion components at runtime without restarting.
-  - body: `{ "crawler"?: bool, "firehose"?: bool, "backfill"?: bool }` — only provided fields are updated.
-  - when disabled, each component finishes its current task before pausing (e.g. the backfill worker completes any in-flight repo syncs, the firehose finishes processing the current message). they resume immediately when re-enabled.
+- `PATCH /ingestion`: enable or disable ingestion components at runtime without
+  restarting.
+  - body: `{ "crawler"?: bool, "firehose"?: bool, "backfill"?: bool }` — only
+    provided fields are updated.
+  - when disabled, each component finishes its current task before pausing (e.g.
+    the backfill worker completes any in-flight repo syncs, the firehose
+    finishes processing the current message). they resume immediately when
+    re-enabled.
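
the "only provided fields are updated" rule for `PATCH /ingestion` can be sketched as a merge of optional fields over the current status. `IngestionStatus`, `IngestionPatch` and `apply` are hypothetical names chosen for this illustration; JSON decoding and the actual pausing of components are left out.

```rust
/// mirrors the shape returned by `GET /ingestion`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct IngestionStatus {
    crawler: bool,
    firehose: bool,
    backfill: bool,
}

/// mirrors the `PATCH /ingestion` body: every field is optional.
#[derive(Debug, Default, Clone, Copy)]
struct IngestionPatch {
    crawler: Option<bool>,
    firehose: Option<bool>,
    backfill: Option<bool>,
}

impl IngestionStatus {
    /// returns the status after the patch; fields left as `None` keep their value.
    fn apply(self, patch: IngestionPatch) -> Self {
        Self {
            crawler: patch.crawler.unwrap_or(self.crawler),
            firehose: patch.firehose.unwrap_or(self.firehose),
            backfill: patch.backfill.unwrap_or(self.backfill),
        }
    }
}

fn main() {
    let status = IngestionStatus { crawler: true, firehose: true, backfill: true };
    // equivalent to sending `{ "backfill": false }`
    let patch = IngestionPatch { backfill: Some(false), ..Default::default() };
    println!("{:?}", status.apply(patch));
}
```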

 #### database operations

-- `POST /db/train`: train zstd compression dictionaries for the `repos`, `blocks`, and `events` keyspaces. dictionaries are written to disk; a restart is required to apply them. the crawler, firehose, and backfill worker are paused for the duration and restored on completion.
-- `POST /db/compact`: trigger a full major compaction of all database keyspaces in parallel. the crawler, firehose, and backfill worker are paused for the duration and restored on completion.
-- `DELETE /cursors`: reset all stored cursors for a given URL. body: `{ "key": "..." }` where key is a URL. clears the relay crawler cursor, and any by-collection cursors associated with that URL. causes the next crawler pass to restart from the beginning.
+- `POST /db/train`: train zstd compression dictionaries for the `repos`,
+  `blocks`, and `events` keyspaces. dictionaries are written to disk; a restart
+  is required to apply them. the crawler, firehose, and backfill worker are
+  paused for the duration and restored on completion.
+- `POST /db/compact`: trigger a full major compaction of all database keyspaces
+  in parallel. the crawler, firehose, and backfill worker are paused for the
+  duration and restored on completion.
+- `DELETE /cursors`: reset all stored cursors for a given URL. body: `{ "key": "..." }`
+  where key is a URL. clears the relay crawler cursor, and any by-collection cursors
+  associated with that URL. causes the next crawler pass to restart from the beginning.

 #### filter mode

···
   - `limit`: max results (default 100, max 1000)
   - `cursor`: opaque key for paginating.
   - `partition`: `all` (default), `pending` (backfill queue), or `resync` (retries)
-- `GET /repos/{did}`: get the sync status and metadata of a specific repository. also returns the handle, PDS URL and the atproto signing key (these won't be available before the repo has been backfilled once at least).
+- `GET /repos/{did}`: get the sync status and metadata of a specific repository.
+  also returns the handle, PDS URL and the atproto signing key (these won't be
+  available before the repo has been backfilled once at least).
 - `PUT /repos`: explicitly track repositories. accepts an NDJSON body of `{"did": "..."}` (or JSON array of the same).
 - `DELETE /repos`: untrack repositories. accepts an NDJSON body of `{"did": "..."}` (or JSON array of the same).

···

 ### blue.microcosm.links.*

-hydrant implements a subset of [microcosm constellation](https://constellation.microcosm.blue/) when it's built with the `backlinks` cargo feature (`cargo build --features backlinks`).
+hydrant implements a subset of [microcosm constellation](https://constellation.microcosm.blue/)
+when it's built with the `backlinks` cargo feature (`cargo build --features backlinks`).

-when enabled, hydrant indexes all AT URI and DID references found inside stored records into a reverse index. this lets you efficiently answer "what records link to this subject?".
+when enabled, hydrant indexes all AT URI and DID references found inside stored records into a
+reverse index. this lets you efficiently answer "what records link to this subject?".

 #### blue.microcosm.links.getBacklinks

···

 returns `{ backlinks: [{ uri, cid }], cursor? }`.

-results are ordered by source record rkey (ascending by default, descending when `reverse=true`).  the cursor is stable across new insertions for TID rkey records.
+results are ordered by source record rkey (ascending by default, descending when `reverse=true`).
+the cursor is stable across new insertions for TID rkey records.
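
one way to see why a key-based cursor stays stable across insertions is a keyset-pagination sketch over an ordered set of rkeys. this is only an illustration: `page` is a hypothetical function, the in-memory `BTreeSet` stands in for the real index, and using the last returned rkey as the cursor is an assumption (hydrant's cursor is opaque and may be encoded differently).

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

/// returns up to `limit` rkeys strictly after `cursor` (or before it when
/// `reverse` is set), plus a cursor for the next page if this page was full.
fn page(
    rkeys: &BTreeSet<String>,
    cursor: Option<&str>,
    limit: usize,
    reverse: bool,
) -> (Vec<String>, Option<String>) {
    let items: Vec<String> = match (cursor, reverse) {
        (None, false) => rkeys.iter().take(limit).cloned().collect(),
        (None, true) => rkeys.iter().rev().take(limit).cloned().collect(),
        (Some(c), false) => rkeys
            .range::<str, _>((Excluded(c), Unbounded))
            .take(limit)
            .cloned()
            .collect(),
        (Some(c), true) => rkeys
            .range::<str, _>((Unbounded, Excluded(c)))
            .rev()
            .take(limit)
            .cloned()
            .collect(),
    };
    // the cursor is the last key returned, so it does not depend on item positions
    let next = if items.len() == limit { items.last().cloned() } else { None };
    (items, next)
}

fn main() {
    let mut rkeys: BTreeSet<String> = ["3jzfcijpj2z2a", "3jzfcijpj2z2b", "3jzfcijpj2z2c"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let (first, cursor) = page(&rkeys, None, 2, false);
    println!("{first:?} next={cursor:?}");
    // a newer TID sorts after every existing key, so the saved cursor is still valid
    rkeys.insert("3jzfcijpj2z2d".to_string());
    let (second, _) = page(&rkeys, cursor.as_deref(), 2, false);
    println!("{second:?}");
}
```

because the cursor names a key rather than an offset, records inserted between two requests neither repeat nor skip entries that were already returned.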

 #### blue.microcosm.links.getBacklinksCount

+30-1
src/config.rs
···
 use std::time::Duration;
 use url::Url;

+/// loads `.env` from the current directory, setting any variables not already in the environment.
+fn load_dotenv() {
+    let Ok(contents) = std::fs::read_to_string(".env") else {
+        return;
+    };
+    for line in contents.lines() {
+        let line = line.trim();
+        if line.is_empty() || line.starts_with('#') {
+            continue;
+        }
+        let Some((key, val)) = line.split_once('=') else {
+            continue;
+        };
+        let key = key.trim();
+        let val = val.trim();
+        let val = val
+            .strip_prefix('"')
+            .and_then(|v| v.strip_suffix('"'))
+            .or_else(|| val.strip_prefix('\'').and_then(|v| v.strip_suffix('\'')))
+            .unwrap_or(val);
+        if std::env::var(key).is_err() {
+            // SAFETY: single-threaded at startup; no other threads are reading env yet.
+            unsafe { std::env::set_var(key, val) };
+        }
+    }
+}
+
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub enum CrawlerMode {
     /// enumerate via `com.atproto.sync.listRepos`, then check signals with `describeRepo`.
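
the line handling inside `load_dotenv` in the hunk above can be exercised in isolation by pulling it into a pure function. the sketch below is a hypothetical refactoring, not part of this change: `parse_env_line` is an invented name, and skipping empty keys is an added assumption (`std::env::set_var` may panic when given an empty key, which a line such as `=value` would otherwise produce).

```rust
/// parses one `.env` line into `(key, value)`, or `None` if the line should be skipped.
/// follows the same rules as the loop body of `load_dotenv`, plus an empty-key check.
fn parse_env_line(line: &str) -> Option<(&str, &str)> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let (key, val) = line.split_once('=')?;
    let key = key.trim();
    // assumption: skip empty keys instead of passing them to `set_var`
    if key.is_empty() {
        return None;
    }
    let val = val.trim();
    // strip one pair of matching double or single quotes, if present
    let val = val
        .strip_prefix('"')
        .and_then(|v| v.strip_suffix('"'))
        .or_else(|| val.strip_prefix('\'').and_then(|v| v.strip_suffix('\'')))
        .unwrap_or(val);
    Some((key, val))
}

fn main() {
    let sample = "# relay to read from\nHYDRANT_RELAY_HOSTS=\"wss://bsky.network\"\n\n=ignored\n";
    for line in sample.lines() {
        println!("{:?}", parse_env_line(line));
    }
}
```

like the original, this does not handle `export KEY=value` prefixes or trailing `# comments` after a value.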
···
         }
     }

-    /// reads and builds the config from environment variables.
+    /// reads and builds the config from environment variables, loading `.env` first if present.
     pub fn from_env() -> Result<Self> {
+        load_dotenv();
+
         macro_rules! cfg {
             (@val $key:expr) => {
                 std::env::var(concat!("HYDRANT_", $key))