# Plan: pds-yrs Improvements (v1)

## Context

pds-yrs is a Rust crate that syncs Yrs CRDT documents to/from an AT Protocol PDS. The benchmark work showed that pure CRDT merge is both the fastest and the most correct approach (2ms for 10 files, 81ms for 200 files with guaranteed conflicting edits, and no unresolved conflicts in any run). Now we need to harden pds-yrs for real-world use.

Current limitations: text-only files, no binary support, no rate limit handling, no blob batching, no compression, no token refresh, no chunking for large files, no file deletion support, and batch-only operation (no real-time sync).

## Improvements

### 1. File Manifest + Deletion Support

**Problem**: `SiteRecord.files` is a HashMap that only grows: `save.rs` adds/updates entries but never removes them. Deleted local files persist in the record forever and get restored on load. There's no way to distinguish "file was deleted" from "file was never synced."

**Approach**:
- During `save`: collect the set of local files, compare against `SiteRecord.files`, and tombstone entries for files that no longer exist locally
- Add a `deleted_at: Option<String>` tombstone field to `FileEntry` instead of hard-deleting; this preserves deletion intent for merge
- Tombstoned entries: `content` cleared, `snapshot_blob` kept (for undo within a window), `deleted_at` set to an ISO timestamp
- During `load`: skip files where `deleted_at` is set (don't restore deleted files)
- During `save`: if a local file exists that matches a tombstoned entry, clear the tombstone (file was re-created)
- Tombstone garbage collection: on save, remove tombstones older than 30 days (configurable) to keep record size bounded

**Merge behavior (edit wins)**:
- If Site A has `deleted_at` set and Site B has an updated version (no tombstone), **keep Site B's version**; never lose edits
- If both sites tombstoned the same file, keep the tombstone
- If Site A tombstoned and Site B has no entry at all, keep the tombstone (propagate deletion)
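
The three merge rules above can be sketched as a small decision function. This is only an illustration: `Entry` is a cut-down stand-in for the real `FileEntry`, `Resolution` and `resolve` are hypothetical names, and the sketch treats any live entry as "an updated version" because the plan does not say how an unchanged live entry is told apart from an edited one.

```rust
/// Cut-down stand-in for `FileEntry` (assumption: only these fields matter here).
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    content: String,
    /// ISO timestamp set when the file was tombstoned.
    deleted_at: Option<String>,
}

/// Hypothetical outcome of resolving one path across two sites.
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Neither site knows the path.
    Absent,
    /// Keep this live entry.
    Live(Entry),
    /// Keep this tombstone so the deletion propagates.
    Tombstone(Entry),
    /// Both sides are live: hand over to the normal Yrs CRDT merge.
    NeedsCrdtMerge,
}

fn classify(e: &Entry) -> Resolution {
    if e.deleted_at.is_some() {
        Resolution::Tombstone(e.clone())
    } else {
        Resolution::Live(e.clone())
    }
}

/// Apply the edit-wins rules to the entries two sites hold for one path.
fn resolve(a: Option<&Entry>, b: Option<&Entry>) -> Resolution {
    match (a, b) {
        (None, None) => Resolution::Absent,
        // Only one site has an entry: keep it, tombstone or not.
        (Some(e), None) | (None, Some(e)) => classify(e),
        (Some(x), Some(y)) => match (x.deleted_at.is_some(), y.deleted_at.is_some()) {
            // Both deleted: keep the tombstone.
            (true, true) => Resolution::Tombstone(x.clone()),
            // Delete vs. live: the live side wins.
            (true, false) => Resolution::Live(y.clone()),
            (false, true) => Resolution::Live(x.clone()),
            (false, false) => Resolution::NeedsCrdtMerge,
        },
    }
}
```

Returning `NeedsCrdtMerge` rather than picking a side keeps this function limited to the delete-vs-edit question and leaves text merging to the existing CRDT path.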

**Files to modify**:
- `src/types.rs`: add `deleted_at: Option<String>` to FileEntry
- `src/save.rs`: detect deleted files, create tombstones, GC old tombstones, un-tombstone re-created files
- `src/load.rs`: skip tombstoned entries
- `src/merge.rs`: edit-wins resolution for delete-vs-edit conflicts

### 2. Binary / Non-CRDT File Support

**Problem**: Binary files (images, PDFs, fonts, etc.) are silently skipped. They can't use Yrs CRDT merging.

**Approach**:
- Add a `FileKind` enum to `types.rs`: `Text` (Yrs CRDT) vs `Binary` (raw blob)
- `FileEntry` gets a `kind: FileKind` field
- Binary files: store raw content as a single blob; the `content` field stores a hash or an empty string (not the binary data)
- On save: detect binary vs text by extension (complement `is_text_extension` with an `is_binary_extension`, and treat unknown extensions as binary)
- On load: binary files download the blob and write raw bytes to disk

**Conflict handling for binary files**:
- Each FileEntry stores ONE blob ref (the file's content)
- During `merge_sites()`, if two sites have different blobs for the same binary file (different CIDs), create TWO FileEntries:
  - `images/photo.png` → `images/photo.creator1.png` + `images/photo.creator2.png`
  - Both entries stored in the SiteRecord's files HashMap
  - A `conflict_source: Option<String>` field on FileEntry tracks the original path
- User resolves by keeping one and deleting the other (via next save)
- Text files continue to use CRDT merge (no conflicts ever)
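
As an illustration of the renaming scheme above, here is a minimal sketch of how a conflict path could be derived. The function name `conflict_path` and the handling of extension-less files are assumptions; the plan only gives the `photo.png` example.

```rust
use std::path::Path;

/// Insert `creator` between the file stem and the extension,
/// e.g. `images/photo.png` + `creator1` -> `images/photo.creator1.png`.
/// Files without an extension get the creator appended (an assumption).
fn conflict_path(path: &str, creator: &str) -> String {
    let p = Path::new(path);
    let stem = p.file_stem().and_then(|s| s.to_str()).unwrap_or("");
    let name = match p.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{}.{}.{}", stem, creator, ext),
        None => format!("{}.{}", stem, creator),
    };
    // Re-attach the parent directory, if there is one.
    match p.parent().and_then(|d| d.to_str()) {
        Some(dir) if !dir.is_empty() => format!("{}/{}", dir, name),
        _ => name,
    }
}
```

The original path would then be recorded in `conflict_source` on both new entries so a later save can tell which file the pair came from.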

**Files to modify**:
- `src/types.rs`: add `FileKind`, `conflict_source` to FileEntry
- `src/save.rs`: handle binary files in `collect_text_files()` (rename to `collect_files()`), upload raw blobs for binary
- `src/load.rs`: handle binary FileEntries (write raw bytes)
- `src/merge.rs`: binary conflict detection + dual-file creation

### 3. Adaptive Bundling + Rate Limit Retry

**Problem**: Each file gets a separate `upload_blob()` call. 200 files = 200 HTTP requests. No retry on rate limits. Bundling adds complexity, though, so it shouldn't be forced when not needed.

**Approach: configurable bundling**:
- `BundleStrategy` enum: `None` (one blob per request, default), `Auto` (bundle when rate-limited), `Always(usize)` (bundle N blobs per request)
- `--bundle` CLI flag: `none`, `auto`, `always[:N]` (default N=50)
- In `Auto` mode: start with individual uploads. On the first 429, switch to bundling for the remaining blobs (concatenate small blobs into a single upload with an index header, like git-remote-pds's bundle format, and store bundle CID + offset/length in FileEntry)
- In `None` mode: always individual uploads, still retries on transient errors
- In `Always` mode: always bundle, good for large initial syncs

**Approach: retry with backoff** (always active regardless of bundle mode):
- Add `RetryConfig` struct: `max_retries: u32`, `base_delay_ms: u64`, `max_delay_ms: u64`
- Add `request_with_retry()` wrapper in `pds_client.rs`
- On 429: read `Retry-After` header if present, else exponential backoff (100ms, 200ms, 400ms, ...)
- On 5xx: same retry logic
- On 4xx (non-429): fail immediately
- Default: 3 retries, 100ms base, 5s max
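
The retry policy above can be sketched without any HTTP code. `RetryConfig` and its fields come from the plan; `is_retryable` and `next_delay` are hypothetical helper names. The sketch assumes `Retry-After` has already been parsed as a number of seconds, and it honors that value without applying the `max_delay_ms` cap, which the plan leaves unspecified.

```rust
use std::time::Duration;

struct RetryConfig {
    max_retries: u32,
    base_delay_ms: u64,
    max_delay_ms: u64,
}

impl Default for RetryConfig {
    /// Defaults from the plan: 3 retries, 100ms base, 5s max.
    fn default() -> Self {
        Self { max_retries: 3, base_delay_ms: 100, max_delay_ms: 5_000 }
    }
}

/// 429 and 5xx are retried; every other status fails immediately.
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Delay before retry number `attempt` (0-based), or `None` once the
/// retry budget is used up.
fn next_delay(cfg: &RetryConfig, attempt: u32, retry_after_secs: Option<u64>) -> Option<Duration> {
    if attempt >= cfg.max_retries {
        return None;
    }
    let ms = match retry_after_secs {
        // Server-provided hint wins over our own schedule.
        Some(secs) => secs.saturating_mul(1_000),
        // base * 2^attempt, capped at max_delay_ms.
        None => cfg
            .base_delay_ms
            .saturating_mul(2u64.saturating_pow(attempt))
            .min(cfg.max_delay_ms),
    };
    Some(Duration::from_millis(ms))
}
```

A `request_with_retry()` wrapper would call `is_retryable` on each response status and sleep for `next_delay` before trying again.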

**Files to modify**:
- `src/pds_client.rs`: add retry wrapper, `BundleStrategy`, apply to `upload_blob()`, `put_record()`, `get_blob()`
- `src/main.rs`: add `--bundle` CLI arg to save/sync commands

### 4. Large File Chunking (>40MB)

**Problem**: PDS has a ~50MB blob limit. Large files (videos, database dumps) would fail to upload.

**Approach**: Port the chunking strategy from `git-remote-pds/src/chunk.rs`:
- `DEFAULT_CHUNK_SIZE = 40MB` (safe headroom under the 50MB limit)
- Files > 40MB are split into chunks, each uploaded as a separate blob
- FileEntry gets `parts: Option<Vec<BlobRef>>` for multi-part files
- On download: reassemble chunks in order
- Reuse the existing `chunk.rs` logic (or copy the simple split/join functions)
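
The split/join helpers mentioned above might look roughly like this. It is a sketch, not the code from `git-remote-pds`: the function names are hypothetical, and reading "40MB" as 40 × 1024 × 1024 bytes is an assumption.

```rust
/// Chunk size with headroom under the ~50MB blob limit
/// (assumption: "40MB" means 40 * 1024 * 1024 bytes).
const DEFAULT_CHUNK_SIZE: usize = 40 * 1024 * 1024;

/// Split `data` into consecutive chunks of at most `chunk_size` bytes.
/// Each chunk borrows from `data`, so nothing is copied here.
fn split_chunks(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    data.chunks(chunk_size).collect()
}

/// Reassemble downloaded chunks; they must be passed in upload order.
fn join_chunks(parts: &[Vec<u8>]) -> Vec<u8> {
    parts.concat()
}
```

Each slice returned by `split_chunks(&bytes, DEFAULT_CHUNK_SIZE)` would be uploaded as its own blob, with the resulting refs stored in order in `parts`.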

**Files to modify**:
- `src/types.rs`: add `parts: Option<Vec<BlobRef>>` to FileEntry
- `src/yrs_pds.rs`: chunked upload/download helpers
- `src/save.rs`: chunk large blobs before upload
- `src/load.rs`: reassemble chunks on download

### 5. Blob Compression

**Problem**: Yrs snapshots and binary blobs are uploaded as raw bytes. Compression would reduce bandwidth and storage.

**Approach**:
- Use the `flate2` crate for gzip compression/decompression
- Compress blobs before upload, decompress on download
- Add a `compressed: bool` field to BlobRef (or use mime type `application/gzip`)
- Transparent to the rest of the code: compress/decompress at the pds_client layer
- Skip compression for already-compressed formats (jpg, png, zip, gz)
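
The "skip already-compressed formats" rule could be a simple extension check, sketched below. The helper name `should_compress` is hypothetical, and the extension list is exactly the one given above; a real implementation would likely extend it.

```rust
use std::path::Path;

/// Extensions the plan lists as already compressed.
const ALREADY_COMPRESSED: [&str; 4] = ["jpg", "png", "zip", "gz"];

/// Decide whether a blob for `path` should be gzip-compressed before upload.
/// The comparison is case-insensitive; files without an extension are compressed.
fn should_compress(path: &str) -> bool {
    match Path::new(path).extension().and_then(|e| e.to_str()) {
        Some(ext) => !ALREADY_COMPRESSED.contains(&ext.to_ascii_lowercase().as_str()),
        None => true,
    }
}
```

The result of this check is what the `compressed` flag on BlobRef would record, so the download side never has to guess.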

**Files to modify**:
- `Cargo.toml`: add `flate2` dependency
- `src/pds_client.rs` or `src/yrs_pds.rs`: compress before `upload_blob()`, decompress after `get_blob()`
- `src/types.rs`: track compression state in BlobRef

### 6. Token Refresh

**Problem**: Long-running syncs (1000 files) may exceed the Bearer token TTL. There is no refresh logic.

**Approach**:
- Track token expiry (conservative 90min TTL like git-remote-pds)
- Before each PDS call, check if the token is about to expire
- If expiring: call `com.atproto.server.refreshSession` with refresh_jwt
- Store both access_jwt and refresh_jwt from the login response (currently only access_jwt is stored)
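
The expiry check above can be sketched with `std::time` alone. `Session`, `needs_refresh`, and the 5-minute safety margin are assumptions for illustration; the plan only fixes the 90-minute TTL and the two token fields.

```rust
use std::time::{Duration, Instant};

/// Conservative access-token lifetime from the plan.
const TOKEN_TTL: Duration = Duration::from_secs(90 * 60);
/// Refresh this long before the assumed expiry (hypothetical margin).
const REFRESH_MARGIN: Duration = Duration::from_secs(5 * 60);

struct Session {
    access_jwt: String,
    refresh_jwt: String,
    /// When the current access token was obtained.
    issued_at: Instant,
}

impl Session {
    /// True once the token is within `REFRESH_MARGIN` of its assumed expiry.
    fn needs_refresh(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.issued_at) + REFRESH_MARGIN >= TOKEN_TTL
    }
}
```

A `maybe_refresh_token()` helper would call `needs_refresh(Instant::now())` before each request and, when it returns true, exchange `refresh_jwt` for a new pair and reset `issued_at`.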

**Files to modify**:
- `src/pds_client.rs`: store refresh_jwt, add `maybe_refresh_token()`, call before requests

### 7. Real-Time Sync Relay (PDS as WebRTC Fallback)

**Problem**: When NAT traversal prevents direct yrs-webrtc connections between clients, there's no fallback for real-time collaborative editing.

**Approach: PDS as a relay**:
- Add a `sync` subcommand that runs continuously (poll loop)
- Each client periodically:
  1. Encodes its local Yrs state vector
  2. Fetches the remote SiteRecord from PDS
  3. Computes diff: `encode_diff_v1(remote_sv)` for what remote is missing
  4. Uploads the diff as an `updates_blob` on the record
  5. Downloads remote's updates and applies locally
- **Polling interval**: configurable, default 2-5 seconds (higher latency than WebRTC but functional)
- **Conflict-free by design**: Yrs CRDT ensures all updates converge regardless of order

**Data model changes**:
- SiteRecord gets a `sync_cursor: Option<String>`: state vector of last sync
- FileEntry's `updates_blob` becomes actively used (not just for compaction)
- Add `last_synced_at: Option<String>` timestamp per file

**Optimization: AT Protocol Event Stream**:
- Instead of polling, subscribe to the PDS firehose for record changes
- `com.atproto.sync.subscribeRepos` WebSocket: get notified when the SiteRecord changes
- Reduces latency from poll-interval to near-real-time (~100-500ms)
- Fall back to polling if firehose unavailable

**CLI**:
```
pds-yrs sync --dir DIR --handle HANDLE --site RKEY --password PASS [--interval 3s] [--verbose]
```

**Files to modify/create**:
- `src/sync.rs` (new): poll loop, diff computation, update application
- `src/main.rs`: add `sync` subcommand
- `src/types.rs`: add sync_cursor, last_synced_at
- `src/yrs_pds.rs`: incremental diff helpers (mostly exist already)

### 8. Configurable File Filters

**Problem**: Hardcoded list of text extensions. Users may want to include/exclude specific patterns.

**Approach**:
- Add optional `--include` and `--exclude` glob patterns to save/sync commands
- Default: current behavior (known text extensions as CRDT, skip hidden/node_modules/target)
- With `--include "*.md,*.txt"`: only sync matching files
- Binary detection still automatic by extension, but user can override
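
To make the include/exclude semantics concrete, here is a dependency-free sketch. It is deliberately simplified: only `*` is supported and it also matches `/`, which differs from full glob syntax. The function names are hypothetical, and "exclude beats include" is an assumption the plan does not state.

```rust
/// Match `text` against `pattern`, where `*` matches any (possibly empty)
/// run of characters. Simplified: no `?`, `**`, or character classes.
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0usize, 0usize);
    // Position of the last `*` seen and the text index it was tried at.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any pattern left over must be all `*`.
    p[pi..].iter().all(|&c| c == '*')
}

/// Split a comma-separated CLI value such as `"*.md,*.txt"`.
fn parse_patterns(arg: &str) -> Vec<&str> {
    arg.split(',').map(str::trim).filter(|s| !s.is_empty()).collect()
}

/// Exclude patterns win; an empty include list means "everything".
fn is_selected(path: &str, include: &[&str], exclude: &[&str]) -> bool {
    if exclude.iter().any(|pat| wildcard_match(pat, path)) {
        return false;
    }
    include.is_empty() || include.iter().any(|pat| wildcard_match(pat, path))
}
```

`collect_files()` could apply `is_selected` to each relative path before deciding whether the file is text or binary.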

**Files to modify**:
- `src/save.rs`: accept filter config in `collect_files()`
- `src/main.rs`: add CLI args

## Implementation Order

1. **File manifest + deletion**: foundational, everything else depends on knowing what files exist
2. **Binary file support**: most impactful, enables syncing real sites (builds on manifest for tombstones)
3. **Retry logic + adaptive bundling**: required for reliability (bundling configurable: none/auto/always)
4. **Token refresh**: required for long syncs
5. **Compression**: easy win for bandwidth
6. **Large file chunking**: needed for completeness
7. **Real-time sync relay**: new capability, most complex
8. **Configurable filters**: nice to have

## Verification

For each improvement:
- Unit tests for new logic (tombstone GC, binary detection, compression round-trip, chunking)
- Update existing E2E tests to cover deletion, binary files, and compressed blobs
- New E2E test for real-time sync (two clients syncing via PDS relay)
- Benchmark: re-run merge-bench Strategy 4 with compression enabled to measure impact
+151
plans/improvements-v2-crdt-manifest.md
# Plan: pds-yrs Improvements (v2: CRDT Manifest)

**Status**: Superseded by active plan file. Key refinements made:
- Manifest changed from Yrs Text (line-per-path) to **Yrs Map** (key=path, value=kind) to avoid character interleaving
- Bundling changed from adaptive (none/auto/always) to **pack blob with index** (always, fewest HTTP calls)

## Context

Same as v1. The key difference: instead of tombstone-based deletion tracking, the file manifest is itself a Yrs CRDT document. Deletions are CRDT operations that propagate automatically through normal Yrs merging.

## Core Design: CRDT Manifest

### What it is

A special Yrs text document where each line is a file path:

```
docs/index.md
docs/about.md
images/logo.png
style.css
```

The manifest is stored as a dedicated FileEntry in the SiteRecord (e.g., key `_manifest`). It uses the same Yrs CRDT infrastructure as all other text files: snapshot_blob, state_vector, updates_blob, compaction.

### Operations

- **New file**: insert line into manifest Yrs doc + create FileEntry for the file's content
- **Delete file**: remove line from manifest Yrs doc (on next save, when local file is gone)
- **Edit file**: update the file's FileEntry only; the manifest is untouched
- **Rename file**: remove old line + add new line (Yrs handles as delete + insert)

### Save flow

1. Collect local files on disk
2. Fetch existing SiteRecord (includes manifest Yrs doc + all FileEntries)
3. Reconstruct manifest Yrs doc from snapshot + updates
4. Compute diff between local file list and manifest text:
   - Files on disk but NOT in manifest → insert line into manifest Yrs doc, create FileEntry
   - Files in manifest but NOT on disk → remove line from manifest Yrs doc (deletion)
   - Files in both → check if content changed, update FileEntry if so
5. Upload updated manifest + changed FileEntries
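
Step 4 of the save flow is a three-way set comparison. The sketch below works on the materialized manifest text rather than on the Yrs doc itself; `ManifestDiff` and `diff_manifest` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Result of comparing the manifest with the files found on disk.
#[derive(Debug, Default, PartialEq)]
struct ManifestDiff {
    /// On disk but not in the manifest: insert a line, create a FileEntry.
    added: Vec<String>,
    /// In the manifest but not on disk: remove the line (deletion).
    removed: Vec<String>,
    /// In both: compare content and update the FileEntry if it changed.
    kept: Vec<String>,
}

fn diff_manifest(manifest_text: &str, local_files: &[&str]) -> ManifestDiff {
    // One path per line; blank lines are ignored.
    let listed: BTreeSet<&str> = manifest_text
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    let local: BTreeSet<&str> = local_files.iter().copied().collect();
    ManifestDiff {
        added: local.difference(&listed).map(|s| s.to_string()).collect(),
        removed: listed.difference(&local).map(|s| s.to_string()).collect(),
        kept: listed.intersection(&local).map(|s| s.to_string()).collect(),
    }
}
```

Using `BTreeSet` keeps each output list sorted, which makes the resulting manifest edits deterministic across runs.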

### Load flow

1. Fetch SiteRecord
2. Materialize manifest Yrs doc → get list of file paths
3. For each path in manifest: download and reconstruct FileEntry, write to disk
4. Ignore FileEntries not listed in manifest (they're deleted or orphaned)

### Merge flow

1. Fetch all SiteRecords
2. CRDT-merge the manifest Yrs docs (standard Yrs merge, which handles concurrent inserts/deletes)
3. Materialize the merged manifest → file list
4. For each file in merged manifest: CRDT-merge the file's Yrs docs across sites
5. **Edit-wins reconciliation**: for each FileEntry that exists in ANY site but is NOT in the merged manifest:
   - Check if the FileEntry content differs from the merge base (the common ancestor)
   - If content was modified → someone edited while someone else deleted → re-add line to manifest (edit wins)
   - If content is unchanged → file was just deleted, no concurrent edit → leave it out
6. For binary files not in manifest: same check; if blob CID differs from base, re-add
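
Steps 5 and 6 reduce to one comparison per orphaned entry. In the sketch below each site is represented as a map from path to a content identifier (a text hash or a blob CID); that representation, the function name, and the way the merge base is supplied are all assumptions, since the plan does not say how the common ancestor is obtained. Paths unknown to the base are left alone.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Return the paths that must be re-added to the merged manifest because
/// some site changed them relative to the merge base (edit wins).
fn edit_wins_readds(
    merged_manifest: &BTreeSet<String>,
    site_entries: &[BTreeMap<String, String>], // per site: path -> content id
    base: &BTreeMap<String, String>,           // merge base: path -> content id
) -> BTreeSet<String> {
    let mut readd = BTreeSet::new();
    for entries in site_entries {
        for (path, id) in entries {
            if merged_manifest.contains(path) {
                continue; // still listed, nothing to reconcile
            }
            match base.get(path) {
                // Changed since the base: a concurrent edit, so it survives.
                Some(base_id) if base_id != id => {
                    readd.insert(path.clone());
                }
                // Unchanged (true deletion) or unknown to the base: leave out.
                _ => {}
            }
        }
    }
    readd
}
```

The same function covers text and binary files as long as the identifier changes whenever the content does.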

### Edge cases

- **Both sites delete same file**: both remove the line, CRDT merge still removes it → clean deletion
- **Both sites add same new file**: both insert a line, so Yrs may produce duplicate lines. Deduplicate by normalizing manifest text (sort + dedup) after merge
- **Site A deletes, Site B edits**: edit-wins check catches this and re-adds to manifest
- **Site A deletes, Site B doesn't touch it**: line removed in merged manifest, FileEntry unchanged from base → true deletion
- **Orphaned FileEntries**: FileEntries with no manifest line and no edits since base → safe to garbage collect during save
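
The "sort + dedup" normalization for concurrently added paths can be sketched at the string level as follows. `normalize_manifest` is a hypothetical name, and writing the normalized text back into the Yrs doc is left out.

```rust
/// Sort manifest lines and drop duplicates and blank lines.
/// The result ends with a newline unless the manifest is empty.
fn normalize_manifest(text: &str) -> String {
    let mut lines: Vec<&str> = text
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    lines.sort_unstable();
    lines.dedup(); // duplicates are adjacent after sorting
    let mut out = lines.join("\n");
    if !out.is_empty() {
        out.push('\n');
    }
    out
}
```

The function is idempotent, so running it after every merge is safe even when no duplicates were produced.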

### Why this is better than tombstones (v1)

- No tombstone GC: deleted files just disappear from the manifest
- No `deleted_at` field cluttering the data model
- Deletion is a first-class CRDT operation, not a convention
- The manifest doubles as a human-readable file listing
- Merge behavior emerges from CRDT semantics + one application-level check

## Other Improvements (same as v1, renumbered)

### 2. Binary / Non-CRDT File Support

Same as v1. Binary files get a FileEntry with `kind: Binary` and a raw blob. The manifest lists them alongside text files; it doesn't distinguish file types (that's in the FileEntry).

Conflict handling for binary files during merge:
- If two sites have different blob CIDs for the same binary file path → create `file.creator1.ext` + `file.creator2.ext` entries in manifest + FileEntries
- User resolves by keeping one, deleting the other (which removes the line from manifest on next save)

### 3. Adaptive Bundling + Rate Limit Retry

Same as v1. Configurable bundling:
- `BundleStrategy::None` (default): individual uploads with retry
- `BundleStrategy::Auto`: start individual, switch to bundling on first 429
- `BundleStrategy::Always(N)`: always bundle N blobs per request

Retry always active (3 retries, exponential backoff, respect Retry-After header).

### 4. Large File Chunking (>40MB)

Same as v1. Port from `git-remote-pds/src/chunk.rs`.

### 5. Blob Compression

Same as v1. `flate2` gzip, skip already-compressed formats.

### 6. Token Refresh

Same as v1. Store refresh_jwt, auto-refresh before expiry.

### 7. Real-Time Sync Relay (PDS as WebRTC Fallback)

Same as v1, but the manifest CRDT makes this cleaner:
- Sync loop pushes/pulls manifest updates alongside file updates
- New files from remote appear in manifest → trigger download
- Deleted files disappear from manifest → trigger local cleanup
- The manifest state vector tracks what the remote has seen

### 8. Configurable File Filters

Same as v1. `--include` and `--exclude` glob patterns.

## Implementation Order

1. **CRDT manifest + deletion**: foundational, replaces the SiteRecord.files HashMap as source of truth for file existence
2. **Binary file support**: enables syncing real sites
3. **Retry logic + adaptive bundling**: reliability
4. **Token refresh**: long sync support
5. **Compression**: bandwidth optimization
6. **Large file chunking**: completeness
7. **Real-time sync relay**: new capability
8. **Configurable filters**: nice to have

## Key Files to Modify

For the manifest specifically:
- `src/types.rs`: add `FileKind` enum, `kind` + `conflict_source` to FileEntry, manifest constant (e.g., `MANIFEST_KEY = "_manifest"`)
- `src/save.rs`: reconstruct manifest doc, diff local files vs manifest, insert/remove lines, update FileEntries
- `src/load.rs`: materialize manifest to get file list, only load listed files
- `src/merge.rs`: CRDT-merge manifest docs, edit-wins reconciliation, binary conflict detection
- `src/yrs_pds.rs`: helpers for manifest text manipulation (insert line, remove line, deduplicate)

## Verification

- Unit test: manifest round-trip (add files, delete files, materialize, verify)
- Unit test: concurrent add + delete merges correctly via CRDT
- Unit test: edit-wins check (Site A deletes, Site B edits → file survives)
- Unit test: both-delete → file gone
- Unit test: duplicate line dedup after concurrent adds
- E2E test: save with deletions, load respects manifest
- E2E test: merge two sites with conflicting deletes/edits
- Benchmark: re-run merge-bench with manifest overhead
+55
src/export.rs
//! Export site content from PDS as plain text files.
//!
//! This is the data portability escape hatch: it reads the `content` field
//! from each FileEntry without requiring Yrs decoding.

use std::path::Path;

use crate::pds_client::PdsClient;
use crate::types::{SiteRecord, COLLECTION};

/// Export a site from PDS to plain text files.
///
/// Reads only the `content` field from each FileEntry, so no Yrs
/// decoding or blob downloads are needed. This works even if the Yrs
/// library is unavailable.
pub async fn export(
    client: &PdsClient,
    did: &str,
    rkey: &str,
    output_dir: &Path,
    verbose: bool,
) -> Result<usize, String> {
    // Fetch site record
    let record = client
        .get_record(did, COLLECTION, rkey)
        .await?
        .ok_or_else(|| format!("site record not found: {}", rkey))?;

    let site: SiteRecord = serde_json::from_value(record.value)
        .map_err(|e| format!("parse SiteRecord: {}", e))?;

    let mut files_exported = 0;

    for (rel_path, entry) in &site.files {
        if verbose {
            eprintln!("pds-yrs: export {}", rel_path);
        }

        // Create parent directories, then write the plain text content
        let output_path = output_dir.join(rel_path);
        if let Some(parent) = output_path.parent() {
            std::fs::create_dir_all(parent)
                .map_err(|e| format!("create dir {:?}: {}", parent, e))?;
        }
        std::fs::write(&output_path, &entry.content)
            .map_err(|e| format!("write {:?}: {}", output_path, e))?;

        files_exported += 1;
    }

    if verbose {
        eprintln!("pds-yrs: exported {} file(s)", files_exported);
    }

    Ok(files_exported)
}
+19
src/lib.rs
//! pds-yrs: Sync Yrs CRDT documents via AT Protocol PDS.
//!
//! No git involved: files are stored as Yrs Doc state on the PDS,
//! with plain text content alongside for portability.

pub mod export;
pub mod load;
pub mod merge;
pub mod pds_client;
pub mod save;
pub mod types;
pub mod yrs_pds;

pub use export::export;
pub use load::load;
pub use merge::merge_sites;
pub use pds_client::PdsClient;
pub use save::save;
pub use types::COLLECTION;
+67
src/load.rs
//! Load a site from PDS into a local directory.

use std::path::Path;

use crate::pds_client::PdsClient;
use crate::types::{LoadResult, SiteRecord, COLLECTION};
use crate::yrs_pds;

/// Load a site from PDS into a directory.
///
/// Fetches the SiteRecord, downloads blob data, reconstructs Yrs Docs,
/// and materializes text files into the output directory.
pub async fn load(
    client: &PdsClient,
    did: &str,
    rkey: &str,
    output_dir: &Path,
    verbose: bool,
) -> Result<LoadResult, String> {
    // Fetch site record
    let record = client
        .get_record(did, COLLECTION, rkey)
        .await?
        .ok_or_else(|| format!("site record not found: {}", rkey))?;

    let site: SiteRecord = serde_json::from_value(record.value)
        .map_err(|e| format!("parse SiteRecord: {}", e))?;

    let mut files_loaded = 0;
    let mut blobs_downloaded = 0;

    for (rel_path, entry) in &site.files {
        if verbose {
            eprintln!("pds-yrs: loading {}", rel_path);
        }

        // Reconstruct Doc from PDS blobs
        let doc = yrs_pds::file_entry_to_doc(entry, client, did).await?;
        // Count the snapshot blob, plus the updates blob when present
        blobs_downloaded += 1;
        if entry.updates_blob.is_some() {
            blobs_downloaded += 1;
        }

        // Materialize text
        let content = yrs_pds::materialize(&doc);

        // Write to output directory
        let output_path = output_dir.join(rel_path);
        if let Some(parent) = output_path.parent() {
            std::fs::create_dir_all(parent)
                .map_err(|e| format!("create dir {:?}: {}", parent, e))?;
        }
        std::fs::write(&output_path, &content)
            .map_err(|e| format!("write {:?}: {}", output_path, e))?;

        files_loaded += 1;
    }

    if verbose {
        eprintln!("pds-yrs: loaded {} file(s)", files_loaded);
    }

    Ok(LoadResult {
        files_loaded,
        blobs_downloaded,
    })
}