very fast at protocol indexer with flexible filtering, xrpc queries, cursor-backed event stream, and more, built on fjall
rust fjall at-protocol atproto indexer
58
fork

Configure Feed

Select the types of activity you want to include in your feed.

[docs] update agents.md

dawn 7dd2efc7 5d9cdd75

+35 -10
+35 -10
AGENTS.md
··· 34 34 ## System architecture 35 35 36 36 Hydrant consists of several components: 37 - - **[`hydrant::ingest::firehose`]**: Connects to an upstream Firehose (Relay) and filters events. It manages the transition between discovery and synchronization. 37 + - **[`hydrant::ingest::firehose`]**: Connects to one or more upstream Firehose relays and filters events. It manages the transition between discovery and synchronization. Multiple relay sources are supported and can be managed at runtime via the API. 38 38 - **[`hydrant::ingest::worker`]**: Processes buffered Firehose messages concurrently using sharded workers. Verifies signatures, updates repository state (handling account status events like deactivations), detects gaps for backfill, and persists records. 39 - - **[`hydrant::crawler`]**: Periodically enumerates the network via `com.atproto.sync.listRepos` to discover new repositories. In `Full` mode it is enabled by default; in `Filter` mode it is opt-in via `HYDRANT_ENABLE_CRAWLER`. 39 + - **[`hydrant::crawler`]**: Enumerates the network to discover repositories. Supports two modes: `Relay` (via `com.atproto.sync.listRepos`) and `ByCollection` (via `com.atproto.sync.listReposByCollection`). Multiple sources can be configured, each with their own mode and cursor. In `Full` mode the relay crawler is enabled by default; `Filter` mode requires opt-in via `HYDRANT_ENABLE_CRAWLER`. 40 40 - **[`hydrant::resolver`]**: Manages DID resolution and key lookups. Supports multiple PLC directory sources with failover and caching. 41 41 - **[`hydrant::backfill`]**: A dedicated worker that fetches full repository CAR files. Uses LIFO prioritization and adaptive concurrency to manage backfill load efficiently. 42 - - **[`hydrant::api`]**: An Axum-based XRPC server implementing repository read methods (`getRecord`, `listRecords`) and system stats. It also provides a WebSocket event stream and management APIs: 42 + - **[`hydrant::backlinks`]** (feature-gated): Maintains a reverse index of record references. Exposes `blue.microcosm.links.getBacklinks` and `blue.microcosm.links.getBacklinksCount` XRPC endpoints. 43 + - **[`hydrant::api`]**: An Axum-based XRPC server implementing repository read methods (`getRecord`, `listRecords`, `countRecords`) and system stats. It also provides a WebSocket event stream and management APIs: 43 44 - `/filter` (`GET`/`PATCH`): Configure indexing mode, signals, and collection patterns. 44 - - `/repos` (`GET`/`PUT`/`DELETE`): Repository management. 45 + - `/repos` (`GET`/`PUT`/`DELETE`): Repository management (supports pagination). 46 + - `/ingestion` (`GET`/`PATCH`): Pause/resume crawler, firehose, and backfill components at runtime. 47 + - `/crawler/sources` (`GET`/`POST`/`DELETE`): Manage crawler relay sources at runtime. 48 + - `/firehose/sources` (`GET`/`POST`/`DELETE`): Manage firehose relay sources at runtime. 49 + - `/db/train` (`POST`): Train per-keyspace zstd dictionaries. 50 + - `/db/compact` (`POST`): Trigger manual database compaction. 51 + - `/cursors` (`DELETE`): Reset cursors. 45 52 - Persistence worker (in `src/main.rs`): Manages periodic background flushes of the LSM-tree and cursor state. 46 53 47 54 ### Lazy event inflation 48 55 49 56 To minimize latency in `apply_commit` and the backfill worker, events are stored in a compact `StoredEvent` format. The expansion into full TAP-compatible JSON (including fetching record content from the CAS and DAG-CBOR parsing) is performed lazily within the WebSocket stream handler. 50 57 58 + ### Library API 59 + 60 + Hydrant can be used as an embedded library. The public surface is exposed via `src/lib.rs`: 61 + - `hydrant::config` — configuration structs and builder helpers 62 + - `hydrant::control` — high-level `Hydrant` handle, `RepoHandle`, `RepoManager` 63 + - `hydrant::filter` — filter types and operations 64 + - `hydrant::types` — shared data types (`RepoState`, etc.) 65 + 66 + See `examples/statusphere.rs` for a usage example. 67 + 51 68 ## General conventions 52 69 53 70 ### Correctness over convenience ··· 68 85 69 86 ### Storage and serialization 70 87 - **State**: Use `rmp-serde` (MessagePack) for all internal state (`RepoState`, `ErrorState`, `StoredEvent`). 71 - - **Blocks**: Store IPLD blocks as raw DAG-CBOR bytes in the CAS. This avoids expensive transcoding and allows direct serving of block content. 88 + - **Blocks**: Store IPLD blocks as raw DAG-CBOR bytes in the CAS, keyed by `{collection}|{CID bytes}`. The collection prefix enables per-collection zstd dictionary training for better compression ratios. 72 89 - **Cursors**: Store cursors as big-endian bytes (`u64`/`i64`). 90 + - **Compression**: Configurable via `HYDRANT_DATA_COMPRESSION` (`lz4`, `zstd`, `none`). Per-keyspace zstd dictionaries can be trained via `POST /db/train` and are stored as `dict_{keyspace}.bin` in the database directory. 73 91 - **Keyspaces**: Use the `keys.rs` module to maintain consistent composite key formats. 74 92 75 93 ## Database schema (keyspaces) ··· 77 95 Hydrant uses multiple `fjall` keyspaces: 78 96 - `repos`: Maps `{DID}` -> `RepoState` (MessagePack). 79 97 - `records`: Maps `{DID}|{COL}|{RKey}` -> `{CID}` (Binary). 80 - - `blocks`: Maps `{CID}` -> `Block Data` (Raw CBOR). 81 - - `events`: Maps `{ID}` (u64) -> `StoredEvent` (MessagePack). This is the source for the JSON stream API. 82 - - `cursors`: Maps `firehose_cursor` or `crawler_cursor` -> `Value` (u64/i64 BE Bytes). 83 - - `pending`: Queue of `{Timestamp}|{DID}` -> `Empty` (Backfill queue). 98 + - `blocks`: Maps `{collection}|{CID bytes}` -> `Block Data` (Raw CBOR). The collection prefix enables per-collection zstd dictionary training. 99 + - `events`: Maps `{ID}` (u64 BE) -> `StoredEvent` (MessagePack). This is the source for the JSON stream API. 100 + - `cursors`: Maps per-relay cursor keys -> `Value` (u64/i64 BE Bytes). Keys: `firehose_cursor|{relay}`, `crawler_cursor|{relay}`, `by_collection_cursor|{url}|{collection}`. 101 + - `pending`: Queue of `{ID}` (u64 BE) -> `Empty` (Backfill queue). 84 102 - `resync`: Maps `{DID}` -> `ResyncState` (MessagePack) for retry logic/tombstones. 85 103 - `resync_buffer`: Maps `{DID}|{Rev}` -> `Commit` (MessagePack). Used to buffer live events during backfill. 86 104 - `counts`: Maps `k|{NAME}` or `r|{DID}|{COL}` -> `Count` (u64 BE Bytes). 87 105 - `filter`: Stores filter config. Handled by the `db::filter` module. Includes mode key `m` -> `FilterMode` (MessagePack), and set entries for signals (`s|{NSID}`), collections (`c|{NSID}`), and excludes (`x|{DID}`) -> empty value. 88 - - `crawler`: Stores crawler state with prefixed keys. Failed crawl entries use `f|{DID}` -> empty value, representing repos that failed signal checking during crawl discovery. 106 + - `crawler`: Stores crawler and firehose source URLs with prefixed keys. Retry entries use `ret|{DID}` -> empty value. Crawler sources use `src|{URL}` -> empty value. Firehose sources use `firehose|{URL}` -> empty value. 107 + - `backlinks` (feature-gated): Reverse index of record references for the backlinks feature. 89 108 90 109 ## Safe commands 91 110 ··· 96 115 - `nu tests/stream_test.nu` - Tests WebSocket streaming functionality. Verifies both live event streaming during backfill and historical replay with cursor. 97 116 - `nu tests/authenticated_stream_test.nu` - Tests authenticated event streaming. Verifies that create, update, and delete actions on a real account are correctly streamed by Hydrant in the correct order. Requires `TEST_REPO` and `TEST_PASSWORD` in `.env`. 98 117 - `nu tests/debug_endpoints.nu` - Tests debug/introspection endpoints (`/debug/iter`, `/debug/get`) and verifies DB content and serialization. 118 + - `nu tests/api_test.nu` - Tests management API endpoints (filter, repos, ingestion, sources). 119 + - `nu tests/repos_api_test.nu` - Tests the `/repos` API endpoints including pagination and single-repo lookup. 120 + - `nu tests/signal_filter_test.nu` - Verifies signal-based filtered indexing. 121 + - `nu tests/collection_index_test.nu` - Tests collection-indexed crawling via `listReposByCollection`. 122 + - `nu tests/backlinks_test.nu` - Tests backlinks indexing and XRPC query endpoints (requires `backlinks` feature). 123 + - `nu tests/ephemeral_gc.nu` - Tests ephemeral mode TTL expiration and event watermark cleanup. 99 124 100 125 ## Rust code style 101 126