very fast at protocol indexer with flexible filtering, xrpc queries, cursor-backed event stream, and more, built on fjall
rust fjall at-protocol atproto indexer
58
fork

Configure Feed

Select the types of activity you want to include in your feed.

[seed] implement seeding firehose sources from relay listHosts

dawn 3ccb5332 f037e1d1

+309 -6
+33 -4
README.md
··· 1 1 #### table-of-contents 2 2 3 3 -> [hydrant](#hydrant)</br> 4 - -> [vs tap](#vs-tap) | [stream](#stream-behavior) | [multi-relay](#multiple-relay-support) | [crawler sources](#crawler-sources)</br> 4 + -> [vs tap](#vs-tap) | [stream](#stream-behavior) | [multi-relay](#multiple-relay-support) | [seeding](#firehose-seeding) | [crawler sources](#crawler-sources)</br> 5 5 -> [configuration](#configuration) | [build features](#build-features)</br> 6 6 -> [rest api](#rest-api) | [filter](#filter-management) | [ingestion](#ingestion-control) | [crawler](#crawler-management) | [firehose](#firehose-management) | [pds](#pds-management) | [repos](#repository-management)</br> 7 7 -> [xrpc api](#data-access-xrpc) | [backlinks](#bluemicrocosmlinks) | [identity](#bluemicrocosmidentity) | [atproto](#comatproto) | [custom](#systemsgazehydrant) ··· 78 78 since they forward commits from many PDSes by design. this means you will trust 79 79 the relay on this though. 80 80 81 + #### firehose seeding 82 + 83 + <small>[<- back to toc](#table-of-contents)</small> 84 + 85 + in relay mode, `RELAY_HOSTS` defaults to empty. set `SEED_HOSTS` to one or more 86 + relay base URLs and hydrant will call `com.atproto.sync.listHosts` on each at 87 + startup, adding every returned PDS as a firehose source: 88 + 89 + ``` 90 + HYDRANT_SEED_HOSTS=https://bsky.network 91 + ``` 92 + 93 + seeding runs as a background task so the main firehose loop is not blocked. seed 94 + URLs are fetched concurrently (up to four at a time) and the full `listHosts` 95 + pagination is consumed for each. if a request fails partway through, the hosts 96 + collected so far are still added and the failure is logged. 97 + 98 + each discovered host is added as a persistent PDS firehose source (`is_pds: true`), 99 + equivalent to calling `POST /firehose/sources`. 100 + 101 + banned hosts (`status: "banned"`) are skipped. all other statuses are included 102 + since the firehose ingestor retries on disconnect and transiently-unavailable 103 + hosts will reconnect on their own. 104 + 105 + seeding runs from latest cursor on restart so new PDS' added to the upstream relay 106 + since the last start are picked up automatically (if they haven't through firehose). 107 + sources that are already running are detected and skipped, so re-seeding is idempotent. 108 + 81 109 ### crawler sources 82 110 83 111 <small>[<- back to toc](#table-of-contents)</small> ··· 117 145 | :--- | :--- | :--- | 118 146 | `DATABASE_PATH` | `./hydrant.db` | path to the database folder. | 119 147 | `RUST_LOG` | `info` | log filter directives (e.g., `debug`, `hydrant=trace`). [`tracing` env-filter syntax](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html). | 120 - | `RELAY_HOST` | `wss://relay.fire.hose.cam/` | URL of the relay (firehose only). | 121 - | `RELAY_HOSTS` | | comma-separated list of firehose sources (firehose only). if unset, falls back to `RELAY_HOST`. prefix a URL with `pds::` to mark it as a direct PDS connection (e.g. `pds::wss://pds.example.com`). bare URLs are treated as relays. | 148 + | `RELAY_HOST` | `wss://relay.fire.hose.cam/` (indexer), empty (relay) | URL of a firehose source. | 149 + | `RELAY_HOSTS` | | comma-separated list of firehose sources. if unset, falls back to `RELAY_HOST`. prefix a URL with `pds::` to mark it as a direct PDS connection (e.g. `pds::wss://pds.example.com`). bare URLs are treated as relays. defaults to empty in relay mode, PDS' are expected to be seeded via `SEED_HOSTS` or the firehose management API. | 150 + | `SEED_HOSTS` | `https://bsky.network` (relay) | comma-separated list of base URLs to call `com.atproto.sync.listHosts` on at startup. hydrant adds every non-banned host as a PDS firehose source. see [firehose seeding](#firehose-seeding). | 122 151 | `CRAWLER_URLS` | relay hosts in full-network mode, `https://lightrail.microcosm.blue` in filter mode | comma-separated list of `[mode::]url` crawler sources. mode is `relay` or `by_collection`; bare URLs use the default mode. set to empty string to disable crawling. | 123 152 | `PLC_URL` | `https://plc.wtf`, `https://plc.directory` if full network | base URL(s) of the PLC directory (comma-separated for multiple). | 124 153 | `EPHEMERAL` | `false` | if enabled, no records are stored. events are deleted after a certain duration (`EPHEMERAL_TTL`). | 125 154 | `EPHEMERAL_TTL` | `60min`, `3d` in relay mode | decides after how long events should be deleted. | 126 - | `FULL_NETWORK` | `false` | if `true`, discovers and indexes all repositories in the network. | 155 + | `FULL_NETWORK` | `false` (indexer), `true` (relay) | if `true`, discovers and indexes all repositories in the network. | 127 156 | `FILTER_SIGNALS` | | comma-separated list of NSID patterns to use for the filter (e.g. `app.bsky.feed.post,app.bsky.graph.*`). | 128 157 | `FILTER_COLLECTIONS` | | comma-separated list of NSID patterns to use for the collections filter. | 129 158 | `FILTER_EXCLUDES` | | comma-separated list of DIDs to exclude from indexing. |
+57 -2
src/config.rs
··· 362 362 /// set via `HYDRANT_ENABLE_BACKLINKS=true`. 363 363 pub enable_backlinks: bool, 364 364 365 + /// base URL(s) of relay or aggregator services to seed firehose PDS sources from at startup. 366 + /// 367 + /// hydrant calls `com.atproto.sync.listHosts` on each URL and adds the returned PDSes 368 + /// as firehose sources (with `is_pds = true`). account counts from the response are 369 + /// applied to newly-seen hosts to initialise rate-limiting immediately. 370 + /// 371 + /// set via `HYDRANT_SEED_HOSTS` as a comma-separated list of base URLs. 372 + pub seed_hosts: Vec<Url>, 365 373 /// list of trusted PDS/relay hosts to pre-assign to the "trusted" rate tier at startup. 366 374 /// set via `HYDRANT_TRUSTED_HOSTS` as a comma-separated list of hostnames. 367 375 /// hosts not present in this list use the "default" tier unless assigned via the API. ··· 424 432 const BASE_MEMTABLE_MB: u64 = 32; 425 433 Self { 426 434 database_path: PathBuf::from("./hydrant.db"), 427 - full_network: false, 428 435 ephemeral: false, 429 436 #[cfg(feature = "indexer")] 430 437 ephemeral_ttl: Duration::from_secs(3600), // 1 hour 431 438 #[cfg(feature = "relay")] 432 439 ephemeral_ttl: Duration::from_secs(3600 * 24 * 3), // 3 days 440 + #[cfg(not(feature = "relay"))] 441 + full_network: false, 442 + #[cfg(feature = "relay")] 443 + full_network: true, 444 + #[cfg(not(feature = "relay"))] 433 445 relays: vec![FirehoseSource { 434 446 url: Url::parse("wss://relay.fire.hose.cam/").unwrap(), 435 447 is_pds: false, 436 448 }], 449 + #[cfg(feature = "relay")] 450 + relays: vec![], 451 + #[cfg(not(feature = "relay"))] 452 + seed_hosts: vec![], 453 + #[cfg(feature = "relay")] 454 + seed_hosts: vec![Url::parse("https://bsky.network").unwrap()], 437 455 plc_urls: vec![Url::parse("https://plc.wtf").unwrap()], 438 456 enable_firehose: true, 439 457 firehose_workers: 8, ··· 503 521 load_dotenv(); 504 522 505 523 // full_network is read first since it determines which defaults to use. 506 - let full_network: bool = cfg!("FULL_NETWORK", false); 524 + // relay mode defaults to true so that the network is indexed by default. 525 + #[cfg(feature = "relay")] 526 + let default_full_network = true; 527 + #[cfg(not(feature = "relay"))] 528 + let default_full_network = false; 529 + let full_network: bool = cfg!("FULL_NETWORK", default_full_network); 507 530 let defaults = full_network 508 531 .then(Self::full_network) 509 532 .unwrap_or_else(Self::default); ··· 642 665 } 643 666 } 644 667 668 + let seed_hosts: Vec<Url> = std::env::var("HYDRANT_SEED_HOSTS") 669 + .ok() 670 + .map(|s| { 671 + s.split(',') 672 + .filter_map(|u| { 673 + let u = u.trim(); 674 + if u.is_empty() { 675 + return None; 676 + } 677 + Url::parse(u).ok().or_else(|| { 678 + tracing::warn!("invalid seed host URL: {u}"); 679 + None 680 + }) 681 + }) 682 + .collect() 683 + }) 684 + .unwrap_or_else(|| defaults.seed_hosts.clone()); 685 + 645 686 let trusted_hosts = std::env::var("HYDRANT_TRUSTED_HOSTS") 646 687 .ok() 647 688 .map(|s| { ··· 676 717 database_path, 677 718 full_network, 678 719 ephemeral, 720 + seed_hosts, 679 721 ephemeral_ttl, 680 722 relays: relay_hosts, 681 723 plc_urls, ··· 809 851 } 810 852 if self.enable_backlinks { 811 853 config_line!(f, "backlinks", "enabled")?; 854 + } 855 + if !self.seed_hosts.is_empty() { 856 + config_line!( 857 + f, 858 + "seed hosts", 859 + format_args!( 860 + "{:?}", 861 + self.seed_hosts 862 + .iter() 863 + .map(|u| u.as_str()) 864 + .collect::<Vec<_>>() 865 + ) 866 + )?; 812 867 } 813 868 Ok(()) 814 869 }
+11
src/control/mod.rs
··· 5 5 pub(crate) mod firehose; 6 6 pub(crate) mod pds; 7 7 pub(crate) mod repos; 8 + mod seed; 8 9 pub(crate) mod stream; 9 10 10 11 pub use crawler::{CrawlerHandle, CrawlerSourceInfo}; ··· 430 431 .tasks 431 432 .insert_async(source.url.clone(), handle) 432 433 .await; 434 + } 435 + 436 + // 10c. seed firehose PDS sources from listHosts on configured seed URLs 437 + if !config.seed_hosts.is_empty() { 438 + let seed_urls = config.seed_hosts.clone(); 439 + let firehose = firehose.clone(); 440 + let state = state.clone(); 441 + tokio::spawn(async move { 442 + seed::seed_from_list_hosts(&seed_urls, &firehose, &state).await; 443 + }); 433 444 } 434 445 435 446 // 11. spawn crawler infrastructure
+200
src/control/seed.rs
··· 1 + use std::sync::Arc; 2 + use std::time::Duration; 3 + 4 + use futures::StreamExt; 5 + use jacquard_api::com_atproto::sync::HostStatus; 6 + use jacquard_api::com_atproto::sync::list_hosts::ListHostsOutput; 7 + use miette::IntoDiagnostic; 8 + use tracing::{info, warn}; 9 + use url::Url; 10 + 11 + use super::firehose::FirehoseHandle; 12 + use crate::db::{self, keys}; 13 + use crate::state::AppState; 14 + 15 + const MAX_CONCURRENT_SEEDS: usize = 4; 16 + 17 + /// seed firehose pds sources by calling `com.atproto.sync.listHosts` on each seed URL. 18 + /// banned pds' are not added, everything else is (including offline) 19 + pub(crate) async fn seed_from_list_hosts( 20 + seed_urls: &[Url], 21 + firehose: &FirehoseHandle, 22 + state: &Arc<AppState>, 23 + ) { 24 + info!("will seed urls..."); 25 + 26 + let http = reqwest::Client::builder() 27 + .user_agent(concat!( 28 + env!("CARGO_PKG_NAME"), 29 + "/", 30 + env!("CARGO_PKG_VERSION") 31 + )) 32 + .timeout(Duration::from_secs(10)) 33 + .build() 34 + .expect("that reqwest will build"); 35 + 36 + let mut futs = futures::stream::iter(seed_urls.iter().cloned()) 37 + .map(|seed_url| { 38 + let firehose = firehose.clone(); 39 + let state = state.clone(); 40 + let http = http.clone(); 41 + async move { seed_one(&seed_url, &firehose, &state, &http).await } 42 + }) 43 + .buffer_unordered(MAX_CONCURRENT_SEEDS); 44 + 45 + while let Some(_) = futs.next().await {} 46 + } 47 + 48 + #[tracing::instrument(skip_all, fields(seed_url = %seed_url))] 49 + async fn seed_one( 50 + seed_url: &Url, 51 + firehose: &FirehoseHandle, 52 + state: &Arc<AppState>, 53 + http: &reqwest::Client, 54 + ) { 55 + let cursor_key = keys::seed_cursor_key(seed_url.as_str()); 56 + 57 + // resume from the last saved cursor so we don't re-page through already-seen hosts 58 + let mut cursor: Option<String> = { 59 + let ks = state.db.cursors.clone(); 60 + let key = cursor_key.clone(); 61 + match db::Db::get(ks, key).await { 62 + Ok(Some(b)) => rmp_serde::from_slice::<String>(&b).ok(), 63 + Ok(None) => None, 64 + Err(e) => { 65 + warn!(err = %e, "failed to load seed cursor, starting from scratch"); 66 + None 67 + } 68 + } 69 + }; 70 + 71 + if cursor.is_some() { 72 + info!(cursor = ?cursor, "resuming seed from saved cursor"); 73 + } else { 74 + info!("seeding firehose sources from listHosts"); 75 + } 76 + 77 + let mut total = 0usize; 78 + let mut added = 0usize; 79 + 80 + loop { 81 + let url = list_hosts_url(seed_url, cursor.as_deref()); 82 + let resp = match http.get(url).send().await { 83 + Ok(r) => r, 84 + Err(e) => { 85 + warn!(err = %e, "failed to fetch listHosts, stopping"); 86 + break; 87 + } 88 + }; 89 + 90 + if !resp.status().is_success() { 91 + warn!(status = %resp.status(), "listHosts returned error status, stopping"); 92 + break; 93 + } 94 + 95 + let bytes = match resp.bytes().await { 96 + Ok(b) => b, 97 + Err(e) => { 98 + warn!(err = %e, "failed to read listHosts response, stopping"); 99 + break; 100 + } 101 + }; 102 + 103 + let body: ListHostsOutput<'_> = match serde_json::from_slice(&bytes) { 104 + Ok(b) => b, 105 + Err(e) => { 106 + warn!(err = %e, "failed to parse listHosts response, stopping"); 107 + break; 108 + } 109 + }; 110 + 111 + let next_cursor = body.cursor.as_deref().map(str::to_owned); 112 + total += body.hosts.len(); 113 + 114 + for host in &body.hosts { 115 + // skip banned hosts; everything else (active, idle, offline, throttled) is included 116 + // since the firehose ingestor handles reconnection for transiently-unavailable hosts 117 + if matches!(host.status, Some(HostStatus::Banned)) { 118 + continue; 119 + } 120 + 121 + let wss_url_str = format!("wss://{}/", host.hostname); 122 + let wss_url = match Url::parse(&wss_url_str) { 123 + Ok(u) => u, 124 + Err(e) => { 125 + warn!(hostname = %host.hostname, err = %e, "invalid hostname in listHosts response, skipping"); 126 + continue; 127 + } 128 + }; 129 + 130 + // skip sources that are already running 131 + if firehose.tasks.contains_async(&wss_url).await { 132 + continue; 133 + } 134 + 135 + // initialise account count for hosts we haven't seen before 136 + if let Some(count) = host.account_count.filter(|&c| c > 0) { 137 + let count_key = keys::pds_account_count_key(host.hostname.as_ref()); 138 + let current = state.db.get_count(&count_key).await; 139 + if current == 0 { 140 + state.db.update_count_async(&count_key, count).await; 141 + } 142 + } 143 + 144 + match firehose.add_source(wss_url, true).await { 145 + Ok(()) => added += 1, 146 + Err(e) => { 147 + warn!(hostname = %host.hostname, err = %e, "failed to add firehose source"); 148 + } 149 + } 150 + } 151 + 152 + cursor = next_cursor; 153 + 154 + // persist cursor after each page so a restart can resume where we left off 155 + if let Some(ref c) = cursor { 156 + let value = match rmp_serde::to_vec(c) { 157 + Ok(v) => v, 158 + Err(e) => { 159 + warn!(err = %e, "failed to serialize seed cursor"); 160 + continue; 161 + } 162 + }; 163 + let state = state.clone(); 164 + let key: Vec<u8> = cursor_key.clone(); 165 + let result = tokio::task::spawn_blocking(move || -> miette::Result<()> { 166 + let mut batch = state.db.inner.batch(); 167 + batch.insert(&state.db.cursors, key, &value); 168 + batch.commit().into_diagnostic() 169 + }) 170 + .await 171 + .into_diagnostic() 172 + .flatten(); 173 + if let Err(e) = result { 174 + warn!(err = %e, "failed to persist seed cursor"); 175 + } 176 + } 177 + 178 + if cursor.is_none() { 179 + break; 180 + } 181 + } 182 + 183 + info!( 184 + total, 185 + added, "finished seeding firehose sources from listHosts" 186 + ); 187 + } 188 + 189 + fn list_hosts_url(base: &Url, cursor: Option<&str>) -> Url { 190 + let mut url = base.clone(); 191 + url.set_path("/xrpc/com.atproto.sync.listHosts"); 192 + { 193 + let mut pairs = url.query_pairs_mut(); 194 + pairs.append_pair("limit", "1000"); 195 + if let Some(c) = cursor { 196 + pairs.append_pair("cursor", c); 197 + } 198 + } 199 + url 200 + }
+8
src/db/keys/mod.rs
··· 220 220 key 221 221 } 222 222 223 + pub const SEED_CURSOR_PREFIX: &[u8] = b"seed_cursor|"; 224 + 225 + pub fn seed_cursor_key(url: &str) -> Vec<u8> { 226 + let mut key = SEED_CURSOR_PREFIX.to_vec(); 227 + key.extend_from_slice(url.as_bytes()); 228 + key 229 + } 230 + 223 231 pub const FIREHOSE_CURSOR_PREFIX: &[u8] = b"firehose_cursor|"; 224 232 225 233 pub const FIREHOSE_SOURCE_PREFIX: &[u8] = b"firehose|";