···77axum = "0.8.8"
88clap = { version = "4.5.60", features = ["derive", "env"] }
99fjall = "3.0.3"
1010-iroh-car = "0.5.1"
1110jacquard-api = { version = "0.9.5", default-features = false, features = ["com_atproto"] }
1211jacquard-axum = { version = "0.9.6", default-features = false, features = ["tracing"] }
1312jacquard-common = { version = "0.9.5", features = ["websocket", "reqwest-client"] }
1413jacquard-repo = "0.9.6"
1514metrics = "0.24.3"
1616-metrics-exporter-prometheus = { version = "0.18.1", features = ["http-listener"] }
1715bytes = "1"
1616+metrics-exporter-prometheus = { version = "0.18.1", features = ["http-listener"] }
1817reqwest = { version = "0.12", default-features = false, features = ["rustls-tls"] }
1918serde = { version = "1", features = ["derive"] }
2019thiserror = "2.0.18"
+92
authenticated-collection-list.md
···11+# authenticated collection listing (future work)
22+33+right now we're just doing `describeRepo` to get collections, which *is* what
44+`collectiondir` also does, but it's not what we want to stick with because:
55+66+- the collections list isn't paginated. not clear what happens if a repo has a huge
77+ number of collections -- will `describeRepo` eventually fail?
88+- the contents of the collections list aren't authenticated. it's *possible* for a
99+ PDS to lie and make our index incorrect, but the threat we're considering here is
1010+ more about just PDS bugs causing the list to be wrong.
1111+- there is no `commit` or even `rev` in the response, so actually we can't know if
1212+ firehose commits after `describeRepo` follow correctly/without gaps.
1313+1414+there are a few ways we can do better.
1515+1616+1717+## `com.atproto.sync.getRepo`
1818+1919+obviously we can just do full backfill of repo contents. but then we couldn't call
2020+ourselves *light*rail.
2121+2222+what we can do is detect small repos and use `getRepo` just for them. repo size can
2323+be estimated from any CAR slice by measuring the root node height. we get a car slice
2424+from firehose commits and from any `sync.getRecord` request.
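
a rough sketch of the size estimate, under stated assumptions: the layer rule (leading zero bits of the key's SHA-256 digest, counted in 2-bit chunks) is my reading of the atproto MST layout, the `4^layer` estimate is a crude order-of-magnitude guess rather than a measured model, and `looks_small` with its threshold is a hypothetical helper. the digest is assumed to be computed elsewhere.

```rust
/// Layer of an MST key, given the SHA-256 digest of that key: the number of
/// leading zero bits, counted in 2-bit chunks (assumed layout, fanout ~4).
fn mst_layer(digest: &[u8; 32]) -> u32 {
    let mut zeros = 0u32;
    for byte in digest {
        if *byte == 0 {
            zeros += 8;
        } else {
            zeros += byte.leading_zeros();
            break;
        }
    }
    zeros / 2
}

/// Crude estimate: a key lands on layer >= h with probability 4^-h, so a
/// root at layer h suggests a repo with on the order of 4^h keys.
fn estimated_keys(root_layer: u32) -> u64 {
    4u64.saturating_pow(root_layer)
}

/// Hypothetical decision helper: treat the repo as small enough for
/// `getRepo` when the estimate is at or below `max_keys`.
fn looks_small(root_layer: u32, max_keys: u64) -> bool {
    estimated_keys(root_layer) <= max_keys
}
```

the root layer can be read off the highest-layer key in any CAR slice, so one probe is enough to choose between `getRepo` and further probing.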
2525+2626+2727+## collection-boundary `com.atproto.sync.getRecord` probing
2828+2929+mst keys have the form `<collection>/<rkey>` (lexicographic order).
3030+`com.atproto.sync.getRecord` returns a CAR proof path from the repo root to the
3131+queried key, and that usually includes keys immediately adjacent to the queried key.
3232+in particular, when the record does *not* exist, the proof path must include adjacent
3333+keys (required to prove the key is absent).
3434+3535+we can exploit this:
3636+3737+1. query `getRecord` with the **minimum legal MST key**
3838+ (`a-----...0.0-----...0.A/-`). the record usually won't exist, but the right-
3939+ adjacent key in the CAR slice reveals the first collection present in the repo.
4040+2. for that collection, compute the **maximum** legal rkey (`~` × 512) and query
4141+ `getRecord` with `<collection>/<max_rkey>`. The right-adjacent key, if present, is
4242+ the first key of the *next* collection.
4343+3. if we don't have an immediate right-adjacent key, we can *increment* the rkey to the
4444+ minimum next legal key and retry until we do get the next collection.
4545+4. repeat from step 2 until the end (no more right-adjacent collections).
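
the steps above can be simulated without a network. in this sketch a sorted `BTreeSet` stands in for the repo, and `right_adjacent` stands in for "send a `getRecord` probe and read the right-adjacent key out of the CAR slice". both function names and the `min_key` parameter are assumptions for illustration. the stand-in always returns the right neighbour, so the retry in step 3 is not exercised.

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

/// Stand-in for one `getRecord` probe: the key immediately to the right of
/// `probe` in the repo, if any. A real probe would extract this from the
/// CAR slice in the response.
fn right_adjacent<'a>(repo: &'a BTreeSet<String>, probe: &str) -> Option<&'a str> {
    repo.range::<str, _>((Excluded(probe), Unbounded))
        .next()
        .map(String::as_str)
}

/// Enumerate collections with one simulated probe per collection, plus one
/// final probe that finds nothing to the right.
fn enumerate_collections(repo: &BTreeSet<String>, min_key: &str) -> Vec<String> {
    let max_rkey = "~".repeat(512);
    let mut found = Vec::new();
    let mut probe = min_key.to_owned();
    while let Some(key) = right_adjacent(repo, &probe) {
        // MST keys are `<collection>/<rkey>`; stop on anything malformed.
        let Some((collection, _)) = key.split_once('/') else { break };
        found.push(collection.to_owned());
        // Jump past every record in this collection.
        probe = format!("{collection}/{max_rkey}");
    }
    found
}
```

all keys in one collection share the `<collection>/` prefix and so sit next to each other in sorted order, which is why a single jump to the maximum rkey skips the whole collection.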
4646+4747+4848+this probing costs ~one request per collection discovered. wrinkles:
4949+5050+- on the first request, estimate repo size and just do `getRepo` if it's small.
5151+ probing requests count toward PDS rate-limit.
5252+5353+- the repository can update while we are probing. this is easily detected because
5454+ every CAR slice response includes the commit object and MST root, which updates
5555+ for any update to the MST. the really nice way to deal with this is to maintain
5656+ a sparse MST tree built up from all the probe requests, which can usually be
5757+ *updated* directly from the upper changed nodes. at the end, we have a repo-
5858+ spanning valid-but-sparse MST that proves all collection boundaries simultaneously.
5959+6060+ what do we do if a collection is added or removed by a mid-probe update? TODO!
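
a minimal sketch of just the detection part, assuming each probe response has already been reduced to its MST root CID (kept as a plain string here). `ProbeSession` and `ProbeStatus` are hypothetical names, and what to do after a change (patch the sparse tree, restart, or fall back to `getRepo`) is left open.

```rust
/// Whether the repo still matches what earlier probes saw.
#[derive(Debug, PartialEq)]
enum ProbeStatus {
    Consistent,
    RepoChanged,
}

/// Tracks the most recent MST root observed across probe responses.
struct ProbeSession {
    root: Option<String>,
}

impl ProbeSession {
    fn new() -> Self {
        Self { root: None }
    }

    /// Record the MST root from a probe response and report whether it
    /// differs from the previously observed root.
    fn observe(&mut self, mst_root: &str) -> ProbeStatus {
        let changed = matches!(self.root.as_deref(), Some(seen) if seen != mst_root);
        self.root = Some(mst_root.to_owned());
        if changed {
            ProbeStatus::RepoChanged
        } else {
            ProbeStatus::Consistent
        }
    }
}
```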
6161+6262+6363+## skeleton shower from `com.atproto.sync.getBlocks`
6464+6565+instead of scanning across the key range on collection boundaries, we could build
6666+our own sparse collection-boundary tree top-down:
6767+6868+1. make any `..sync.getRecord` query, to obtain the MST root node
6969+2. request every MST child node that spans a collection change, using
7070+ `com.atproto.sync.getBlocks`.
7171+3. continue down like this, layer by layer, until reaching the bottom layer. since
7272+ `getBlocks` accepts multiple CIDs, we can fetch everything required from each
7373+ layer together in one request per layer (unless we need too many blocks to fit
7474+ in the querystring).
7575+7676+we end up with a nice sparse tree that proves all collection boundaries. MSTs are not
7777+very tall so this might actually be pretty nice, and we directly build a consistent
7878+point-in-time snapshot.
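
a sketch of the per-node decision in step 2, using a simplified node shape that is an assumption for illustration (real MST nodes use prefix-compressed entries and CID links). a child subtree only needs fetching when the keys bounding it belong to different collections, or when one side is unbounded.

```rust
/// Simplified MST node: `keys` are sorted, `children[i]` is the subtree
/// before `keys[i]`, and the last entry is the subtree after the final key.
/// CIDs are plain strings here.
struct Node {
    keys: Vec<String>,
    children: Vec<Option<String>>, // expected len: keys.len() + 1
}

/// Collection part of a `<collection>/<rkey>` key.
fn collection(key: &str) -> &str {
    key.split_once('/').map_or(key, |(c, _)| c)
}

/// CIDs of children that may contain a collection boundary. `lower` and
/// `upper` are the nearest bounding keys inherited from ancestor nodes
/// (`None` at the edges of the tree).
fn children_to_fetch(node: &Node, lower: Option<&str>, upper: Option<&str>) -> Vec<String> {
    let mut out = Vec::new();
    for (i, child) in node.children.iter().enumerate() {
        let Some(cid) = child else { continue };
        let lo = if i == 0 {
            lower
        } else {
            node.keys.get(i - 1).map(String::as_str)
        };
        let hi = node.keys.get(i).map(String::as_str).or(upper);
        // Same collection on both sides: every key in between shares the
        // `<collection>/` prefix, so no boundary can hide in this subtree.
        let same = matches!((lo, hi), (Some(a), Some(b)) if collection(a) == collection(b));
        if !same {
            out.push(cid.clone());
        }
    }
    out
}
```

the CIDs returned for all nodes in a layer could then be batched into one `getBlocks` request, as described in step 3.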
7979+8080+this fails when
8181+8282+- any block we need is updated or removed while we're climbing down. in that case we
8383+ can retry or fall back to `getRepo`.
8484+- a PDS doesn't implement `getBlocks`. (i have no idea how common it is?)
8585+8686+8787+## we can dream: `"com.atproto.sync.getRepoCollections"`
8888+8989+maybe one day a PDS endpoint like this will exist, which serves the sparse MST
9090+containing blocks on all collection boundary paths our approaches here end up
9191+building, proving the exact set of collections present in the repo associated with an
9292+exact commit.
+1
hacking.md
···2424- iroh-car: robust, simple, async
2525- manual CAR processing: since we need access to adjacent keys
2626 - TODO: repo-stream will expose this soon probably
2727+ - TODO: right now we use jacquard_repo but i think it's easier in our case to handle it more manually.
2728- fjall: workload is write-heavy so LSM is a good fit, space efficiency also very desirable
28292930
+4-8
readme.md
···2233**status: in development**
4455-lightrail uses the adjacent keys included in CAR slices from firehose commits and `com.atproto.sync.getRecord` responses to detect the first record added and last record removed from a collection in an atproto repo.
66-77-for backfill of large repositories, lightrail probes the repo with `getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate every collection.
88-99-repo `#sync` events similarly probe the repository to diff against the recorded repo collections.
1010-1111-small repositories are detected by inspecting the MST nodes near the root of a CAR slice (eg., from the first `getRecord` probe): small repos are statistically unlikely to contain keys with high MST levels (every CAR slice must include the maximum-level keys of a repository), providing a statistical basis for estimating total repo size. Small repos are fetched in their entirety with `com.atproto.sync.getRepo` instead of probing with numerous `getRecord` requests.
55+lightrail uses the adjacent keys included in CAR slices from firehose commits to detect the first record added and last record removed from a collection in an atproto repo.
126137compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:
148···1610- handles sync1.1 `#sync` events
1711- avoids updating its index unless commits actually add or remove collections
1812- removes repos from the index when the last record from a repo's collection is removed
1919-- uses authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo`
20132114lightrail's CAR slice techniques enable its lightweight implementation, but its primary focus is on accuracy and correctness.
22151616+for backfill, lightrail currently uses `com.atproto.repo.describeRepo`, like Bluesky's `collectiondir`. This is not a robust approach, and will hopefully be replaced by probing that uses authenticated repo contents (see [./authenticated-collection-list.md](./authenticated-collection-list.md)) soon.
1717+23182419### wishlist features (probably doable?):
25202621- accept multiple collections for `listReposByCollection` (merge + dedup by DID; works bc key is `<collection>||<did>`)
2722- `listReposByCollectionPrefix`, either with additional indexes up the NSID hierarchy, or via merge+dedup.
2823- subscribe to multiple relays
2424+- use authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo` (see [./authenticated-collection-list.md](./authenticated-collection-list.md))
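
A rough sketch of the merge + dedup idea from the first wishlist item, assuming each per-collection index scan yields DIDs already in sorted order (which the `<collection>||<did>` key layout suggests). `merge_dedup` is a hypothetical name, and a real version would stream results and honour pagination rather than collect into a `Vec`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// k-way merge of sorted DID streams, dropping duplicates.
fn merge_dedup<I: Iterator<Item = String>>(mut streams: Vec<I>) -> Vec<String> {
    // Min-heap of (next DID, index of the stream it came from).
    let mut heap = BinaryHeap::new();
    for (i, stream) in streams.iter_mut().enumerate() {
        if let Some(did) = stream.next() {
            heap.push(Reverse((did, i)));
        }
    }

    let mut out: Vec<String> = Vec::new();
    while let Some(Reverse((did, i))) = heap.pop() {
        // Refill from the stream we just consumed.
        if let Some(next) = streams[i].next() {
            heap.push(Reverse((next, i)));
        }
        // Inputs are sorted, so duplicates arrive back to back.
        if out.last() != Some(&did) {
            out.push(did);
        }
    }
    out
}
```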
292530263127## contributing
···88use std::net::SocketAddr;
991010use jacquard_api::com_atproto::sync::{
1111- get_repo_status::GetRepoStatusRequest,
1212- list_repos_by_collection::ListReposByCollectionRequest,
1111+ get_repo_status::GetRepoStatusRequest, list_repos_by_collection::ListReposByCollectionRequest,
1312};
1413use jacquard_axum::IntoRouter;
15141616-use crate::db::DbRef;
1715use crate::error::Result;
1616+use crate::storage::DbRef;
18171918/// Build and serve the axum application on `addr`.
2019///
-17
src/backfill/list_repos.rs
···11-//! Walk `com.atproto.sync.listRepos` with cursor pagination.
22-//!
33-//! For each DID encountered, either enqueues it for probing (large repos) or
44-//! dispatches it to the small-repo fast path.
55-66-use crate::db::DbRef;
77-use crate::error::Result;
88-99-/// Walk the full `listRepos` feed for `host`, persisting progress after each
1010-/// page so it can be resumed on restart.
1111-///
1212-/// Uses `jacquard-api`'s `com_atproto::sync::list_repos` XRPC call with cursor
1313-/// pagination. Per-page progress is written via `db::cursor::set_list_repos_progress`.
1414-pub async fn run(host: &str, db: DbRef) -> Result<()> {
1515- let _ = (host, db);
1616- todo!("paginate listRepos, probe each DID, persist cursor after each page")
1717-}
-25
src/backfill/mod.rs
···11-//! Backfill subsystem.
22-//!
33-//! Walks `com.atproto.sync.listRepos` and probes each repository to populate
44-//! the rbc/cbr index before or alongside the live firehose feed.
55-//!
66-//! Large repos are enumerated via sequential `getRecord` probing (`probe`):
77-//! one request per collection, walking right-adjacent MST keys from the minimum
88-//! legal key to the end of the repo.
99-//! Small repos take the fast path of fetching the full repo CAR (`small_repo`).
1010-1111-pub mod list_repos;
1212-pub mod probe;
1313-pub mod small_repo;
1414-1515-use crate::db::DbRef;
1616-use crate::error::Result;
1717-1818-/// Run the backfill subsystem for `host`.
1919-///
2020-/// Resumes from the last-saved listRepos cursor if one exists, then pages
2121-/// through all repos and probes each one. Runs indefinitely until an error
2222-/// occurs (fatal errors) or until the full backfill completes.
2323-pub async fn run(host: String, db: DbRef) -> Result<()> {
2424- list_repos::run(&host, db).await
2525-}
-172
src/backfill/probe.rs
···11-//! `getRecord`-probing for large-repo backfill.
22-//!
33-//! MST keys have the form `<collection>/<rkey>`, where `collection` is an NSID
44-//! and `rkey` is a Record Key, both subject to format restrictions defined in
55-//! the AT Protocol specs.
66-//!
77-//! `getRecord` usually includes the keys adjacent to the queried key in its CAR
88-//! slice response, and always does when the requested key does not exist. The
99-//! probing algorithm exploits this to enumerate every collection with one
1010-//! request per collection:
1111-//!
1212-//! 1. Query `getRecord` with the **minimum legal MST key** — the
1313-//! lexicographically lowest string that is a valid `<collection>/<rkey>`.
1414-//! The record won't exist, but the right-adjacent key in the CAR slice is
1515-//! the lowest key actually present in the repo, revealing the first
1616-//! collection.
1717-//! 2. For that collection, compute the **maximum legal rkey** and query
1818-//! `getRecord` with `<collection>/<max_rkey>`. The right-adjacent key in
1919-//! the response is the first key of the *next* collection in the repo.
2020-//! 3. Repeat step 2 for each newly discovered collection until no right-
2121-//! adjacent key is returned, signalling that all collections have been
2222-//! found.
2323-//!
2424-//! It is *possible* for the maximum legal rkey of a collection to be present in
2525-//! a repo, and for its immediate right-adjacent key *not* to be present in the
2626-//! CAR slice response. In such cases, we can compute the very next legal
2727-//! collection and request it with the minimum legal rkey.
2828-//!
2929-//! Repositories can update while being probed. This is detectable because every
3030-//! probe response includes at least one parent block on any changed key's path.
3131-//! Need to get in the weeds but I think if we maintain a sparse tree from the
3232-//! probes, we might even be able to know whether an update added or removed any
3333-//! collections within the area we've already covered?
3434-//!
3535-//! TODO also: there is a `getBlocks` endpoint, which might be an alternative
3636-//! to probing: we could do one probe to get the root, then walk down the tree
3737-//! (on parallel paths even) to build out the sparse collection-boundary
3838-//! skeleton tree. Is this more efficient than probing with min/max rkeys?
3939-//!
4040-//! in either case, if handling repo updates leads to too many re-fetches, we
4141-//! should fall back to `getRepo` and full mst walking.
4242-//!
4343-//! Each discovered `(did, collection)` pair is written to the rbc/cbr index
4444-//! via `db::index::insert`.
4545-4646-use bytes::Bytes;
4747-use jacquard_api::com_atproto::sync::get_record::{GetRecord, GetRecordError};
4848-use jacquard_common::{
4949- error::ClientErrorKind,
5050- types::string::{Did, Nsid, RecordKey, Rkey},
5151- xrpc::{XrpcError, XrpcExt},
5252-};
5353-5454-use crate::db::DbRef;
5555-use crate::error::{Error, Result};
5656-5757-/// minimum legal NSID
5858-///
5959-/// - whole domain authority must be lowercase
6060-/// - top level domain must start with an alphabetic character
6161-/// - other domain segments cannot begin or end with hyphens
6262-/// - max 253 chars of domain authority before the last name segment
6363-/// - name segment accepts uppercase and must begin with an alphabetic
6464-const MIN_COLLECTION: &str = "a-------------------------------------------------------------0.0-------------------------------------------------------------0.0-------------------------------------------------------------0.0-----------------------------------------------------------0.A";
6565-6666-/// minimum legal rkey: `-` = ordinal 45 (.:_~ are 46, 58, 95, 127 respectively)
6767-const MIN_RKEY: &str = "-";
6868-6969-/// maximum legal rkey: 512 of the max legal character
7070-const MAX_RKEY: &str = "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~";
7171-7272-/// Extract the collection segment from an MST key of the form
7373-/// `<collection>/<rkey>`.
7474-fn collection_from_key(key: &str) -> Option<&str> {
7575- key.split_once('/').map(|(col, _)| col)
7676-}
7777-7878-/// Probe `did` to enumerate its collections via sequential `getRecord` requests.
7979-///
8080-/// Starts from the minimum legal MST key and follows right-adjacent keys one
8181-/// collection at a time until the end of the repo is reached. One XRPC request
8282-/// is issued per collection present in the repo.
8383-pub async fn probe_repo(host: &str, did: &str, db: DbRef) -> Result<()> {
8484- // MAX_RKEY is 512 '~' characters — the lexicographically largest valid rkey.
8585- let max_rkey: String = "~".repeat(512);
8686-8787- let client = reqwest::Client::new();
8888- let base: jacquard_common::url::Url = format!("https://{}", host)
8989- .parse()
9090- .map_err(|e: jacquard_common::url::ParseError| Error::Other(e.to_string()))?;
9191-9292- // Step 1: probe the minimum legal MST key to discover the first collection.
9393- let probe_key = format!("{}/{}", MIN_COLLECTION, MIN_RKEY);
9494- let car = match fetch_car(&client, &base, did, MIN_COLLECTION, MIN_RKEY).await? {
9595- Some(bytes) => bytes,
9696- None => return Ok(()), // repo is inaccessible or does not exist
9797- };
9898- let adjacent = crate::mst::adjacent::extract_adjacent(&car, &probe_key).await?;
9999- let mut current_collection = match adjacent.next.as_deref().and_then(collection_from_key) {
100100- Some(col) => col.to_owned(),
101101- None => return Ok(()), // repo has no records
102102- };
103103-104104- // Steps 2+: for each discovered collection, insert it and walk to the next.
105105- loop {
106106- crate::db::index::insert(&db, did, &current_collection)?;
107107-108108- let probe_key = format!("{}/{}", current_collection, max_rkey);
109109- let car = match fetch_car(&client, &base, did, &current_collection, &max_rkey).await? {
110110- Some(bytes) => bytes,
111111- None => break, // repo became inaccessible mid-probe
112112- };
113113- let adjacent = crate::mst::adjacent::extract_adjacent(&car, &probe_key).await?;
114114- let next_collection = match adjacent.next.as_deref().and_then(collection_from_key) {
115115- Some(col) => col.to_owned(),
116116- None => break, // no more collections
117117- };
118118-119119- if next_collection == current_collection {
120120- // Safety guard: the adjacent key should always be in the next
121121- // collection, but avoid an infinite loop if it is not.
122122- break;
123123- }
124124- current_collection = next_collection;
125125- }
126126-127127- Ok(())
128128-}
129129-130130-/// Make one `com.atproto.sync.getRecord` request and return the raw CAR bytes.
131131-///
132132-/// Returns `None` if the repository is inaccessible (taken down, suspended,
133133-/// deactivated, or not found) or if an unexpected HTTP error occurs.
134134-async fn fetch_car(
135135- client: &reqwest::Client,
136136- base: &jacquard_common::url::Url,
137137- did: &str,
138138- collection: &str,
139139- rkey: &str,
140140-) -> Result<Option<Bytes>> {
141141- let req = GetRecord {
142142- collection: Nsid::new_owned(collection).map_err(|e| Error::Other(e.to_string()))?,
143143- did: Did::new_owned(did).map_err(|e| Error::Other(e.to_string()))?,
144144- rkey: RecordKey(Rkey::new_owned(rkey).map_err(|e| Error::Other(e.to_string()))?),
145145- };
146146-147147- let resp = match client.xrpc(base.clone()).send(&req).await {
148148- Ok(resp) => resp,
149149- Err(e) => {
150150- return match e.kind() {
151151- // Network or unexpected HTTP-level errors: skip this repo.
152152- ClientErrorKind::Transport | ClientErrorKind::Http { .. } => Ok(None),
153153- _ => Err(Error::Other(e.to_string())),
154154- };
155155- }
156156- };
157157-158158- // resp is HTTP 200 or 400 at this point (401 with WWW-Authenticate is
159159- // already surfaced as Err by send()).
160160- match resp.parse() {
161161- Ok(output) => Ok(Some(output.body)),
162162- Err(XrpcError::Xrpc(err)) => match err {
163163- GetRecordError::RepoNotFound(_)
164164- | GetRecordError::RepoTakendown(_)
165165- | GetRecordError::RepoSuspended(_)
166166- | GetRecordError::RepoDeactivated(_)
167167- | GetRecordError::RecordNotFound(_)
168168- | GetRecordError::Unknown(_) => Ok(None),
169169- },
170170- Err(e) => Err(Error::Other(e.to_string())),
171171- }
172172-}
-20
src/backfill/small_repo.rs
···11-//! Fast path for small repositories: fetch the full repo CAR and extract all
22-//! collections in a single pass.
33-//!
44-//! Small repos are detected by inspecting MST node levels from an initial
55-//! `getRecord` probe — a high-level node present in a partial CAR slice means
66-//! the repo is large; absence of high-level nodes suggests a small repo.
77-88-use crate::db::DbRef;
99-use crate::error::Result;
1010-1111-/// Fetch the entire repo CAR for `did` via `com.atproto.sync.getRepo` and
1212-/// extract all collection NSIDs in one pass.
1313-///
1414-/// Streams the CAR response (via `iroh-car` or `jacquard-repo`'s CAR reader),
1515-/// walks every MST leaf to collect all record keys, groups them by collection,
1616-/// and writes the result to the rbc/cbr index.
1717-pub async fn index_small_repo(host: &str, did: &str, db: DbRef) -> Result<()> {
1818- let _ = (host, did, db);
1919- todo!("stream getRepo CAR, parse all MST leaves, index collections")
2020-}
···11//! Per-repo state storage.
2233-use crate::db::{keys, DbRef};
43use crate::error::Result;
44+use crate::storage::{DbRef, keys};
5566/// High-level lifecycle state of a repo.
77#[derive(Debug, Clone, PartialEq, Eq)]
+1-1
src/db/resync.rs
src/storage/resync.rs
···33//! Keys: `"repoResyncQueue"\0<ts_be:u64>\0<did>`
44//! Values: CBOR payload with the triggering commit, retry count, and retry reason.
5566-use crate::db::DbRef;
76use crate::error::Result;
77+use crate::storage::DbRef;
8899/// An item waiting in the resync queue.
1010#[derive(Debug, Clone)]
-20
src/firehose/mod.rs
···11-//! Firehose subsystem.
22-//!
33-//! Connects to an ATProto relay, validates incoming commits via sync1.1 inductive
44-//! proofs, and updates the rbc/cbr index on collection additions/removals.
55-66-mod subscriber;
77-88-pub use subscriber::Subscriber;
99-1010-use crate::db::DbRef;
1111-use crate::error::Result;
1212-1313-/// Spawn the firehose subscriber task for `host` and run until it returns.
1414-///
1515-/// This is the top-level entry point called from `main`. The subscriber handles
1616-/// reconnection internally, so this future only resolves on a fatal error.
1717-pub async fn run(host: String, db: DbRef) -> Result<()> {
1818- let mut sub = Subscriber::new(host, db);
1919- sub.run().await
2020-}
···11+//! Firehose subsystem.
22+//!
33+//! Connects to an ATProto relay, validates incoming commits via sync1.1 inductive
44+//! proofs, and updates the rbc/cbr index on collection additions/removals.
55+16//! Firehose WebSocket subscriber.
27//!
38//! Connects to an ATProto relay using `jacquard-common`'s `SubscriptionExt` +
49//! `TungsteniteClient`, persists/restores the sequence cursor via `db::cursor`,
510//! and dispatches decoded events to the appropriate handlers.
61177-use crate::db::DbRef;
1212+// pub use subscriber::Subscriber;
1313+814use crate::error::Result;
1515+use crate::storage::DbRef;
9161017/// Manages a single WebSocket connection to a relay firehose.
1118pub struct Subscriber {
···11+//! Walk `com.atproto.sync.listRepos` with cursor pagination.
22+//!
33+//! For each DID encountered, calls `get_repo::index_repo` to enumerate its
44+//! collections via `describeRepo` (or `getRepo` fallback) and write them to
55+//! the rbc/cbr index.
66+77+use crate::error::Result;
88+use crate::storage::DbRef;
99+1010+/// Walk the full `listRepos` feed for `host`, persisting progress after each
1111+/// page so it can be resumed on restart.
1212+///
1313+/// Uses `jacquard-api`'s `com_atproto::sync::list_repos` XRPC call with cursor
1414+/// pagination. For each DID, calls `get_repo::index_repo`. Per-page progress
1515+/// is written via `db::cursor::set_list_repos_progress`.
1616+pub async fn run(host: &str, db: DbRef) -> Result<()> {
1717+ let _ = (host, db);
1818+ todo!("paginate listRepos, call get_repo::index_repo for each DID, persist cursor")
1919+}
+48
src/sync/describe_repo.rs
···11+//! Per-repository indexing via `com.atproto.repo.describeRepo` with a
22+//! `com.atproto.sync.getRepo` full-CAR fallback.
33+//!
44+//! `describeRepo` is tried first because it is cheap: one request, no CAR
55+//! parsing, and the PDS directly returns its `collections` array. If the
66+//! request fails or returns no collections, the full repository CAR is fetched
77+//! and every MST leaf is walked to enumerate collections.
88+99+use jacquard_api::com_atproto::repo::describe_repo::DescribeRepo;
1010+use jacquard_common::{error::ClientErrorKind, types::ident::AtIdentifier, xrpc::XrpcExt};
1111+1212+use crate::error::{Error, Result};
1313+1414+/// Call `com.atproto.repo.describeRepo` and return the collections list.
1515+///
1616+/// Returns `None` on any network or XRPC error (the caller falls back to
1717+/// getRepo). Returns `Some(vec![])` if the response contains no collections.
1818+async fn try_describe_repo(
1919+ client: &reqwest::Client,
2020+ base: &jacquard_common::url::Url,
2121+ did: &str,
2222+) -> Result<Option<Vec<String>>> {
2323+ let req = DescribeRepo {
2424+ repo: AtIdentifier::new_owned(did).map_err(|e| Error::Other(e.to_string()))?,
2525+ };
2626+2727+ let resp = match client.xrpc(base.clone()).send(&req).await {
2828+ Ok(resp) => resp,
2929+ Err(e) => {
3030+ return match e.kind() {
3131+ ClientErrorKind::Transport | ClientErrorKind::Http { .. } => Ok(None),
3232+ _ => Err(Error::Other(e.to_string())),
3333+ };
3434+ }
3535+ };
3636+3737+ match resp.parse() {
3838+ Ok(output) => {
3939+ let collections = output
4040+ .collections
4141+ .iter()
4242+ .map(|c| c.as_str().to_owned())
4343+ .collect();
4444+ Ok(Some(collections))
4545+ }
4646+ Err(_) => Ok(None),
4747+ }
4848+}
+138
src/sync/get_repo.rs
···11+use crate::storage::DbRef;
22+use crate::sync::try_describe_repo;
33+use bytes::Bytes;
44+use jacquard_api::com_atproto::sync::get_repo::GetRepoError;
55+use jacquard_common::error::ClientErrorKind;
66+use jacquard_common::xrpc::XrpcError;
77+use jacquard_common::xrpc::XrpcExt;
88+use jacquard_repo::MemoryBlockStore;
99+use jacquard_repo::Mst;
1010+use jacquard_repo::car::parse_car_bytes;
1111+use jacquard_repo::commit::Commit;
1212+use jacquard_repo::mst::CursorPosition;
1313+use jacquard_repo::mst::MstCursor;
1414+use std::sync::Arc;
1515+1616+use crate::error::{Error, Result};
1717+use jacquard_api::com_atproto::sync::get_repo::GetRepo;
1818+1919+use jacquard_common::types::string::Did;
2020+2121+/// Fetch the full repo CAR via `com.atproto.sync.getRepo`.
2222+///
2323+/// Returns `None` if the repo is inaccessible (taken down, suspended, etc.).
2424+async fn fetch_repo_car(
2525+ client: &reqwest::Client,
2626+ base: &jacquard_common::url::Url,
2727+ did: &str,
2828+) -> Result<Option<Bytes>> {
2929+ let req = GetRepo {
3030+ did: Did::new_owned(did).map_err(|e| Error::Other(e.to_string()))?,
3131+ since: None,
3232+ };
3333+3434+ let resp = match client.xrpc(base.clone()).send(&req).await {
3535+ Ok(resp) => resp,
3636+ Err(e) => {
3737+ return match e.kind() {
3838+ ClientErrorKind::Transport | ClientErrorKind::Http { .. } => Ok(None),
3939+ _ => Err(Error::Other(e.to_string())),
4040+ };
4141+ }
4242+ };
4343+4444+ match resp.parse() {
4545+ Ok(output) => Ok(Some(output.body)),
4646+ Err(XrpcError::Xrpc(err)) => match err {
4747+ GetRepoError::RepoNotFound(_)
4848+ | GetRepoError::RepoTakendown(_)
4949+ | GetRepoError::RepoSuspended(_)
5050+ | GetRepoError::RepoDeactivated(_)
5151+ | GetRepoError::Unknown(_) => Ok(None),
5252+ },
5353+ Err(e) => Err(Error::Other(e.to_string())),
5454+ }
5555+}
5656+5757+/// Fetch the full repo CAR via `com.atproto.sync.getRepo`, walk every MST
5858+/// leaf, and write each discovered collection to the index.
5959+async fn index_via_get_repo(
6060+ client: &reqwest::Client,
6161+ base: &jacquard_common::url::Url,
6262+ did: &str,
6363+ db: &DbRef,
6464+) -> Result<()> {
6565+ let car = match fetch_repo_car(client, base, did).await? {
6666+ Some(b) => b,
6767+ None => return Ok(()), // repo inaccessible; skip silently
6868+ };
6969+7070+ let parsed = parse_car_bytes(&car)
7171+ .await
7272+ .map_err(|e| Error::Other(e.to_string()))?;
7373+7474+ let mst_root = {
7575+ let commit_bytes = parsed
7676+ .blocks
7777+ .get(&parsed.root)
7878+ .ok_or_else(|| Error::Other("getRepo CAR has no commit block".into()))?;
7979+ let commit = Commit::from_cbor(commit_bytes.as_ref())
8080+ .map_err(|e| Error::Other(format!("bad commit in getRepo CAR: {}", e)))?;
8181+ *commit.data()
8282+ };
8383+8484+ let storage = Arc::new(MemoryBlockStore::new_from_blocks(parsed.blocks));
8585+ let mst = Mst::load(storage, mst_root, None);
8686+8787+ // MST keys are `<collection>/<rkey>` in sorted order. Records within a
8888+ // collection are consecutive, so tracking the last-seen collection avoids
8989+ // redundant index writes.
9090+ let mut cursor = MstCursor::new(mst);
9191+ let mut last_col: Option<String> = None;
9292+9393+ loop {
9494+ match cursor.current() {
9595+ CursorPosition::End => break,
9696+ CursorPosition::Leaf { key, .. } => {
9797+ if let Some(col) = key.as_str().split_once('/').map(|(c, _)| c) {
9898+ if last_col.as_deref() != Some(col) {
9999+ crate::storage::index::insert(db, did, col)?;
100100+ last_col = Some(col.to_owned());
101101+ }
102102+ }
103103+ cursor.advance().await.map_err(|e| Error::Other(e.to_string()))?;
104104+ }
105105+ CursorPosition::Tree { .. } => {
106106+ cursor.advance().await.map_err(|e| Error::Other(e.to_string()))?;
107107+ }
108108+ }
109109+ }
110110+111111+ Ok(())
112112+}
113113+114114+/// Index a repository by enumerating its collections.
115115+///
116116+/// Tries `com.atproto.repo.describeRepo` first. On success, inserts each
117117+/// returned collection into the rbc/cbr index and returns. On failure or an
118118+/// empty collection list, fetches the full repo CAR via
119119+/// `com.atproto.sync.getRepo` and walks every MST leaf instead.
120120+pub async fn index_repo(host: &str, did: &str, db: DbRef) -> Result<()> {
121121+ let client = reqwest::Client::new();
122122+ let base: jacquard_common::url::Url = format!("https://{}", host)
123123+ .parse()
124124+ .map_err(|e: jacquard_common::url::ParseError| Error::Other(e.to_string()))?;
125125+126126+ // Try describeRepo first — it is cheap and usually sufficient.
127127+ if let Some(collections) = try_describe_repo(&client, &base, did).await? {
128128+ if !collections.is_empty() {
129129+ for col in &collections {
130130+ crate::storage::index::insert(&db, did, col)?;
131131+ }
132132+ return Ok(());
133133+ }
134134+ }
135135+136136+ // Fall back to getRepo: fetch the full CAR and walk the MST.
137137+ index_via_get_repo(&client, &base, did, &db).await
138138+}
+46
src/sync/mod.rs
···11+use jacquard_api::com_atproto::repo::describe_repo::DescribeRepo;
22+use jacquard_common::error::ClientErrorKind;
33+44+use crate::error::{Error, Result};
55+use jacquard_common::types::string::AtIdentifier;
66+use jacquard_common::xrpc::XrpcExt;
77+pub mod backfill;
88+pub mod describe_repo;
99+pub mod get_repo;
1010+pub mod subscribe_repos;
1111+1212+/// Call `com.atproto.repo.describeRepo` and return the collections list.
1313+///
1414+/// Returns `None` on any network or XRPC error (the caller falls back to
1515+/// getRepo). Returns `Some(vec![])` if the response contains no collections.
1616+async fn try_describe_repo(
1717+ client: &reqwest::Client,
1818+ base: &jacquard_common::url::Url,
1919+ did: &str,
2020+) -> Result<Option<Vec<String>>> {
2121+ let req = DescribeRepo {
2222+ repo: AtIdentifier::new_owned(did).map_err(|e| Error::Other(e.to_string()))?,
2323+ };
2424+2525+ let resp = match client.xrpc(base.clone()).send(&req).await {
2626+ Ok(resp) => resp,
2727+ Err(e) => {
2828+ return match e.kind() {
2929+ ClientErrorKind::Transport | ClientErrorKind::Http { .. } => Ok(None),
3030+ _ => Err(Error::Other(e.to_string())),
3131+ };
3232+ }
3333+ };
3434+3535+ match resp.parse() {
3636+ Ok(output) => {
3737+ let collections = output
3838+ .collections
3939+ .iter()
4040+ .map(|c| c.as_str().to_owned())
4141+ .collect();
4242+ Ok(Some(collections))
4343+ }
4444+ Err(_) => Ok(None),
4545+ }
4646+}