tranquil-store: embedded storage engine for Tranquil PDS
RFC draft, 2026-03-22
By Lewis!

-- TLDR --

Add an embedded storage engine as an alternative to Postgres (and leapfrog SQLite-per-actor)
that treats Tranquil's three storage workloads as three separate problems:

- BlockStore: bitcask-esque append log for immutable CID-keyed blocks [4]
- MetaStore: Fjall LSM keyspaces for mutable metadata [5]
- EventLog: segmented append log for the firehose

Group commit across users, content dedup, sub-ms firehose delivery.
We will use deterministic simulation testing [18][19].
Postgres will of course stay as the existing alternative backend.

-- Intro --

The reference PDS hits structural limits around 300k accounts [2].
SQLite-per-actor means no cross-user write batching.

tranquil-store is an embedded Rust library. It lives in-process, with no external dependencies.
Postgres remains supported; we plan a storage transition path
so users can seamlessly snapshot-n-switch between the backends.

The BlockStore is a bitcask-style append log [4] with a Fjall key index [5] that
maps each CID to a (file, offset, length) tuple. We use key-value separation
as per WiscKey [6]. Because blocks are immutable and keyed by CID, the value log
never needs compaction. An LRU hot tier keeps frequently accessed blocks in
memory, and hint files allow fast index reconstruction on restart [4].
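
The read path can be sketched as one index probe plus one positioned read. This is a minimal stand-in, not the real API: the actual index lives in Fjall and the value log on disk, while here a `Vec<u8>` plays the data file and `BlockStore`, `Loc`, `put`, and `get` are illustrative names:

```rust
use std::collections::HashMap;

/// Where a block lives in the value log: the (file, offset, length) tuple.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Loc { file: u32, offset: u64, len: u32 }

/// Bitcask-style store: append-only log plus a CID -> Loc index.
struct BlockStore {
    log: Vec<u8>,                  // stands in for the current data file
    index: HashMap<Vec<u8>, Loc>,  // stands in for the Fjall key index
}

impl BlockStore {
    fn new() -> Self { Self { log: Vec::new(), index: HashMap::new() } }

    /// O(1) write: append once; a CID already present is never rewritten,
    /// because blocks are immutable and content-addressed.
    fn put(&mut self, cid: &[u8], block: &[u8]) -> Loc {
        if let Some(loc) = self.index.get(cid) { return *loc; }
        let loc = Loc { file: 0, offset: self.log.len() as u64, len: block.len() as u32 };
        self.log.extend_from_slice(block);
        self.index.insert(cid.to_vec(), loc);
        loc
    }

    /// O(1) read: one index lookup, one positioned read at (offset, len).
    fn get(&self, cid: &[u8]) -> Option<&[u8]> {
        let loc = self.index.get(cid)?;
        let start = loc.offset as usize;
        Some(&self.log[start..start + loc.len as usize])
    }
}

fn main() {
    let mut store = BlockStore::new();
    let first = store.put(b"cid-a", b"hello");
    let dup = store.put(b"cid-a", b"hello");  // dedup: same Loc, no new bytes
    assert_eq!(first, dup);
    assert_eq!(store.get(b"cid-a"), Some(&b"hello"[..]));
}
```

Because the value log is never compacted, a `Loc` stays valid for the life of its data file; only GC rewrites can move it.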

The main throughput enabler is group commit [7]. The reference PDS fsyncs once per user
per mutation [1], but the BlockStore batches all concurrent commits into a single
write-and-sync cycle.
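
The amortization can be sketched as follows; the fsync is simulated with a counter and `CommitLog`, `submit`, and `commit_cycle` are hypothetical names, but the shape is the point: N staged mutations, one write, one sync:

```rust
/// Group commit sketch: mutations from many users are staged into one
/// batch, then flushed with a single write and a single (simulated) fsync.
struct CommitLog {
    buf: Vec<u8>,   // pending batch for the current cycle
    file: Vec<u8>,  // stands in for the on-disk append log
    syncs: usize,   // fsyncs actually issued
}

impl CommitLog {
    fn new() -> Self { Self { buf: Vec::new(), file: Vec::new(), syncs: 0 } }

    /// Called once per user mutation; only stages bytes.
    fn submit(&mut self, record: &[u8]) {
        self.buf.extend_from_slice(record);
    }

    /// One cycle: one write, one fsync, regardless of how many users
    /// contributed to the batch.
    fn commit_cycle(&mut self) {
        if self.buf.is_empty() { return; }
        self.file.extend_from_slice(&self.buf);
        self.buf.clear();
        self.syncs += 1;
    }
}

fn main() {
    let mut log = CommitLog::new();
    for user in 0..100 {
        log.submit(format!("mutation-{user};").as_bytes());
    }
    log.commit_cycle();
    assert_eq!(log.syncs, 1); // vs 100 fsyncs under per-user sync
}
```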

Content dedup occurs naturally: identical MST subtrees across users share
one CID-keyed block instead of N copies [3].

MetaStore uses Fjall [5] keyspaces for all mutable data. We chose Fjall over
redb and LMDB because both of those are single-writer [8][9]. Each keyspace
compacts independently.

For cross-store atomicity we use an intent log. Each mutation writes a single
intent record containing the BlockStore refcount updates, MetaStore changes,
and the serialized EventLog payload, fsynced via the group commit. After the fsync,
the changes are applied to MetaStore and the event is appended to the EventLog,
then the intent is marked committed. Recovery replays any incomplete intents,
re-applying both metadata changes and event appends. This gives us crash-atomic
mutations across all three stores without full MVCC, since mutations
are already serialized per user [3].
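
A minimal sketch of the protocol, with the three stores reduced to in-memory maps and `Intent`, `Stores`, `apply`, and `recover` as illustrative names (a real recovery path must also make application idempotent, e.g. by tracking which intents were partially applied):

```rust
use std::collections::HashMap;

/// One durable record per mutation: everything the three stores need.
struct Intent {
    refcount_deltas: Vec<(String, i64)>, // BlockStore: CID -> refcount delta
    meta_puts: Vec<(String, String)>,    // MetaStore: key -> value
    event: Vec<u8>,                      // EventLog: serialized payload
    committed: bool,                     // set only after full application
}

/// Stand-ins for BlockStore refcounts, MetaStore, and the EventLog.
struct Stores {
    refcounts: HashMap<String, i64>,
    meta: HashMap<String, String>,
    events: Vec<Vec<u8>>,
}

/// Apply one fsynced intent to all three stores.
fn apply(intent: &Intent, s: &mut Stores) {
    for (cid, d) in &intent.refcount_deltas {
        *s.refcounts.entry(cid.clone()).or_insert(0) += d;
    }
    for (k, v) in &intent.meta_puts {
        s.meta.insert(k.clone(), v.clone());
    }
    s.events.push(intent.event.clone());
}

/// Crash recovery: replay every intent that never reached `committed`.
fn recover(log: &mut [Intent], s: &mut Stores) {
    for intent in log.iter_mut().filter(|i| !i.committed) {
        apply(intent, s);
        intent.committed = true;
    }
}

fn main() {
    // Simulate a crash after the intent was fsynced but before application.
    let mut log = vec![Intent {
        refcount_deltas: vec![("cid-a".into(), 1)],
        meta_puts: vec![("record/1".into(), "cid-a".into())],
        event: b"evt-1".to_vec(),
        committed: false,
    }];
    let mut s = Stores {
        refcounts: HashMap::new(), meta: HashMap::new(), events: Vec::new(),
    };
    recover(&mut log, &mut s);
    assert_eq!(s.refcounts["cid-a"], 1);
    assert_eq!(s.events.len(), 1);
    assert!(log[0].committed);
}
```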

EventLog stores the firehose as segmented append-only files. Live subscribers
receive events via tokio broadcast; catching-up consumers read from mmap'ed
segments [10]. Each event receives a monotonic u64 sequence number.
Segment headers store the base sequence number; a per-segment index maps sequence
ranges to byte offsets. This decouples consumer cursors from physical layout,
allowing transparent addition of per-segment zstd compression per the loom-v2
spec [11] without invalidating checkpoints. Retention is just deleting old
segments!
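
Cursor resolution can be sketched in a few lines: binary-search the segment headers for the right base sequence, then look up the byte offset in that segment's index (`Segment` and `resolve` are illustrative names, and the per-segment index is simplified to one offset per event):

```rust
/// One segment: its header's base sequence number plus an index where
/// offsets[i] is the byte offset of event (base_seq + i).
struct Segment {
    base_seq: u64,
    offsets: Vec<u64>,
}

/// Resolve a consumer cursor (a u64 sequence) to (segment index, byte
/// offset), independent of physical layout.
fn resolve(segments: &[Segment], seq: u64) -> Option<(usize, u64)> {
    // Last segment whose base_seq <= seq (segments are sorted by base_seq).
    let i = segments.partition_point(|s| s.base_seq <= seq).checked_sub(1)?;
    let seg = &segments[i];
    let rel = (seq - seg.base_seq) as usize;
    seg.offsets.get(rel).map(|&off| (i, off))
}

fn main() {
    let segments = vec![
        Segment { base_seq: 0, offsets: vec![0, 40, 95] },
        Segment { base_seq: 3, offsets: vec![0, 64] },
    ];
    assert_eq!(resolve(&segments, 1), Some((0, 40)));
    assert_eq!(resolve(&segments, 4), Some((1, 64)));
    assert_eq!(resolve(&segments, 9), None); // past the tail
}
```

Because cursors are pure sequence numbers, a segment can later be rewritten compressed without touching any checkpoint, only its index.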

GC uses refcounted key index entries and is epoch-gated by the group commit
cycle: a block is only eligible for collection if its refcount reached zero in
a prior completed commit cycle. This prevents races between concurrent dedup
(which skips the block write but increments the refcount in the same batch) and
collection. Blocks past the epoch gate are collected by rewriting any data
files that fall below a liveness threshold.
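
The epoch gate reduces to remembering, per block, the commit cycle in which its refcount hit zero; a sketch with illustrative names (`Gc`, `collectable`) and the commit cycle reduced to a counter:

```rust
use std::collections::HashMap;

/// Epoch-gated refcounts: a block is collectable only if it hit zero in a
/// cycle that has since fully completed.
#[derive(Default)]
struct Gc {
    epoch: u64,                         // last completed commit cycle
    refs: HashMap<String, (i64, u64)>,  // cid -> (refcount, zero_epoch)
}

impl Gc {
    fn incr(&mut self, cid: &str) {
        self.refs.entry(cid.to_string()).or_insert((0, 0)).0 += 1;
    }
    fn decr(&mut self, cid: &str) {
        if let Some(e) = self.refs.get_mut(cid) {
            e.0 -= 1;
            if e.0 == 0 { e.1 = self.epoch; } // remember when it hit zero
        }
    }
    fn end_cycle(&mut self) { self.epoch += 1; }

    /// Eligible only if refcount is zero AND it reached zero in a prior,
    /// completed cycle; an in-flight dedup re-incrementing the refcount in
    /// the current batch therefore always wins over collection.
    fn collectable(&self, cid: &str) -> bool {
        matches!(self.refs.get(cid), Some(&(0, z)) if z < self.epoch)
    }
}

fn main() {
    let mut gc = Gc::default();
    gc.incr("cid-a");
    gc.decr("cid-a");                  // hits zero in the current cycle
    assert!(!gc.collectable("cid-a")); // gate holds: cycle not complete
    gc.end_cycle();
    assert!(gc.collectable("cid-a"));  // safe to collect now
}
```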

For serialization we use postcard on disk and rkyv [12] for in-memory caches
only. All data files carry a version tag.

Memory is divided into fixed slices from a configurable total budget: the Fjall
block cache, the BlockStore hot tier, and the CID index each receive a configured
percentage. Actual usage per component is exposed as metrics. The EventLog's
mmap pages live in the OS page cache and are excluded from the budget.
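
The slicing is simple integer arithmetic over one total; in this sketch the component names and percentages are made-up defaults, not the real configuration:

```rust
/// Split one total budget into fixed slices by configured percentage.
fn slice_budget(total_bytes: u64, pct: &[(&str, u64)]) -> Vec<(String, u64)> {
    pct.iter()
        .map(|&(name, p)| (name.to_string(), total_bytes * p / 100))
        .collect()
}

fn main() {
    // Hypothetical split of a 1 GiB budget.
    let slices = slice_budget(1 << 30, &[
        ("fjall_block_cache", 50),
        ("blockstore_hot_tier", 35),
        ("cid_index", 15),
    ]);
    assert_eq!(slices[0].1, 512 << 20); // 50% of 1 GiB = 512 MiB
}
```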

Backup acquires the group commit lock, which quiesces all writes at the next
commit boundary. Under the lock, the system notes the EventLog position and the
BlockStore file list and takes a Fjall snapshot, then releases the lock.
Sealed data files and segments are immutable and can be copied without
coordination after the snapshot. The quiesce window is bounded by one commit
cycle. Point-in-time recovery replays the EventLog against a prior snapshot.
For continuous replication, a background process tails the EventLog and copies
sealed files to remote storage.

-- Runtime --

The storage core runs on tokio. It is synchronous internally, accessed through
dedicated handler threads that communicate via async channels [13]. Requests
are dispatched by hashing the DID, which gives us per-user write serialization
without locks. Global operations use round-robin. All disk IO goes through
pread/pwrite directly [13].
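
The dispatch rule is just a stable hash modulo the thread count: every request for a given DID lands on the same handler thread, so its writes are serialized without locks. A sketch (the real dispatcher sends over channels; `shard_for` is an illustrative name, and `DefaultHasher` stands in for whatever stable hash is used):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick the handler thread for a DID. Same DID -> same thread, always.
fn shard_for(did: &str, n_threads: usize) -> usize {
    let mut h = DefaultHasher::new();
    did.hash(&mut h);
    (h.finish() % n_threads as u64) as usize
}

fn main() {
    let n = 8;
    let a = shard_for("did:plc:alice", n);
    // Deterministic: repeated dispatches of one user never cross threads.
    assert_eq!(a, shard_for("did:plc:alice", n));
    assert!(a < n);
}
```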

We rejected io_uring for three reasons: it creates orphan kernel operations
when futures are cancelled [22], it is blocked by default in both Docker [16]
and Podman [17] seccomp profiles, and it accounts for 60% of Google's kernel
vulnerability rewards [15].

We also rejected thread-per-core runtimes (glommio, etc.) because they are
incompatible with the tokio ecosystem. DID-sharded handler threads give us
the same shared-nothing property without a runtime split.

-- Testing --

We use deterministic simulation testing, following FoundationDB [18] and
TigerBeetle's VOPR [19]. All IO sits behind a StorageIO trait, and tests use an
in-memory implementation that injects faults: partial writes, bit flips, sync
failures, and misdirected writes. A single seed controls the entire fault
schedule, so any failure reproduces exactly [20][21].
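
The harness shape can be sketched with one fault type. This is a toy version of the idea, not the real trait: a seeded PRNG (a tiny LCG here, since determinism is the whole point) drives partial-write injection behind the IO trait, so the same seed replays the same fault schedule byte-for-byte:

```rust
/// All IO goes through a trait so tests can swap in a faulty in-memory impl.
trait StorageIo {
    fn write(&mut self, buf: &[u8]) -> usize; // returns bytes written
}

/// In-memory backend that injects partial (torn) writes from a seeded PRNG.
struct FaultyMem {
    data: Vec<u8>,
    rng: u64, // LCG state, seeded once per simulated run
}

impl FaultyMem {
    fn new(seed: u64) -> Self { Self { data: Vec::new(), rng: seed } }

    /// Minimal deterministic PRNG (Knuth's MMIX LCG constants).
    fn next(&mut self) -> u64 {
        self.rng = self.rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.rng
    }
}

impl StorageIo for FaultyMem {
    fn write(&mut self, buf: &[u8]) -> usize {
        // 1-in-4 chance the write is torn partway through.
        let n = if self.next() % 4 == 0 {
            (self.next() as usize) % buf.len().max(1)
        } else {
            buf.len()
        };
        self.data.extend_from_slice(&buf[..n]);
        n
    }
}

fn main() {
    // Same seed -> identical fault schedule -> identical outcome.
    let mut a = FaultyMem::new(42);
    let mut b = FaultyMem::new(42);
    let wrote_a: Vec<usize> = (0..16).map(|_| a.write(b"record")).collect();
    let wrote_b: Vec<usize> = (0..16).map(|_| b.write(b"record")).collect();
    assert_eq!(wrote_a, wrote_b);
}
```

The real harness layers bit flips, sync failures, and misdirected writes the same way: every fault decision is drawn from the seeded stream, never from wall-clock or OS state.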

-- Why these choices --

Bitcask for blocks:
Key-value separation [6] using bitcask [4] for immutable CID blocks:
O(1) writes, O(1) reads, zero write amplification, and no compaction!

Fjall for metadata:
The only pure-Rust embedded engine with concurrent writers [5].
Otherwise we'd write our own.

Segmented log for events:
Write once -> scan forward -> delete by age.
Quite straightforward!

Postcard on disk:
rkyv is apparently faster [12] but couples the on-disk format to the library version.

Tokio and handler threads:
spawn_blocking plus pread matches io_uring without the security and compatibility costs [13][14][15][16].

Deterministic simulation:
Catches bug classes conventional testing can't reach [18][19].
The StorageIO trait is needed anyway, and building harness-first is a one-time cost [20][21].

-- References --

[1] Bluesky PDS SQLite migration. github.com/bluesky-social/atproto/pull/1705
[2] G. Orosz. Building Bluesky: a Distributed Social Network. Pragmatic Engineer, April 2024.
    newsletter.pragmaticengineer.com/p/bluesky
    K. Suder. Introduction to AT Protocol. August 2025. mackuba.eu/2025/08/20/introduction-to-atproto
    Bluesky PDS "Going to Production" guide. atproto.com/guides/going-to-production
[3] AT Protocol repository spec. atproto.com/specs/repository
[4] Bitcask: A Log-Structured Hash Table for Fast Key/Value Data. Riak, 2010. riak.com/assets/bitcask-intro.pdf
[5] Fjall: LSM-based embedded storage engine. github.com/fjall-rs/fjall
[6] Lu et al. WiscKey: Separating Keys from Values in SSD-Conscious Storage. USENIX FAST 2016.
    usenix.org/conference/fast16/technical-sessions/presentation/lu
[7] Phil Eaton. A Write-Ahead Log Is Not a Universal Part of Durability. July 2024.
    notes.eatonphil.com/2024-07-01-a-write-ahead-log-is-not-a-universal-part-of-durability.html
[8] redb design document. github.com/cberner/redb/blob/master/docs/design.md
[9] LMDB source repository. github.com/LMDB/lmdb
[10] Crotty et al. Are You Sure You Want to Use MMAP in Your DBMS? CIDR 2022.
     cs.brown.edu/people/acrotty/pubs/p13-crotty.pdf
[11] ybzeek. RFC: com.atproto.sync.getZstdStream (zstd-compressed relay streams).
     github.com/bluesky-social/atproto/discussions/4582
[12] rkyv: zero-copy deserialization framework for Rust. rkyv.org
[13] Tonbo. Exploring Better Async Rust Disk IO. tonbo.io/blog/exploring-better-async-rust-disk-io
[14] Iroh. Async Rust Challenges in Iroh. iroh.computer/blog/async-rust-challenges-in-iroh
[15] Google restricting io_uring. phoronix.com/news/Google-Restricting-IO_uring
[16] Docker 4.42.0 and io_uring. forums.docker.com/t/4-42-0-and-io-uring/148620
[17] Podman io_uring discussion. github.com/containers/podman/discussions/27772
[18] FoundationDB simulation testing. apple.github.io/foundationdb/testing.html
[19] TigerBeetle VOPR. tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness
[20] DST in Rust (S2). s2.dev/blog/dst
[21] Phil Eaton. What's the big deal about Deterministic Simulation Testing? August 2024.
     notes.eatonphil.com/2024-08-20-deterministic-simulation-testing.html
[22] Tonbo. Async Rust Is Not Safe with io_uring. tonbo.io/blog/async-rust-is-not-safe-with-io-uring

Thank you for reading! Let's do some great work together.