tranquil-store: embedded storage engine for Tranquil PDS
RFC draft, 2026-03-22
By Lewis!

-- TLDR --

Add an embedded storage engine as an alternative to Postgres (and leapfrog SQLite-per-actor)
that treats Tranquil's three storage workloads as three separate problems:

- BlockStore: bitcask-esque append log for immutable CID-keyed blocks [4]
- MetaStore: Fjall LSM keyspaces for mutable metadata [5]
- EventLog: segmented append log for the firehose

Group commit across users, content dedup, sub-ms firehose delivery.
We will use deterministic simulation testing [18][19].
Postgres will of course stay as the existing alternative backend.

-- Intro --

The reference PDS hits structural limits around 300k accounts [2].
SQLite-per-actor means no cross-user write batching.

tranquil-store is an embedded Rust library. It lives in-process, with no external dependencies.
Postgres remains supported; we plan a storage transition path
so users can seamlessly snapshot-n-switch between the backends.

The BlockStore is a bitcask-style append log [4] with a Fjall key index [5] that
maps each CID to a (file, offset, length) tuple. We use key-value separation
as per WiscKey [6]. Because blocks are immutable and keyed by CID, the value log
never needs compaction. An LRU hot tier keeps frequently accessed blocks in
memory, and hint files allow fast index reconstruction on restart [4].
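
The read path can be sketched as one index probe plus one positioned read. This is a minimal stand-in, not the real API: the actual index lives in Fjall and the value log on disk, while here a `Vec<u8>` plays the data file and `BlockStore`, `Loc`, `put`, and `get` are illustrative names:

```rust
use std::collections::HashMap;

/// Where a block lives in the value log: the (file, offset, length) tuple.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Loc { file: u32, offset: u64, len: u32 }

/// Bitcask-style store: append-only log plus a CID -> Loc index.
struct BlockStore {
    log: Vec<u8>,                  // stands in for the current data file
    index: HashMap<Vec<u8>, Loc>,  // stands in for the Fjall key index
}

impl BlockStore {
    fn new() -> Self { Self { log: Vec::new(), index: HashMap::new() } }

    /// O(1) write: append once; a CID already present is never rewritten,
    /// because blocks are immutable and content-addressed.
    fn put(&mut self, cid: &[u8], block: &[u8]) -> Loc {
        if let Some(loc) = self.index.get(cid) { return *loc; }
        let loc = Loc { file: 0, offset: self.log.len() as u64, len: block.len() as u32 };
        self.log.extend_from_slice(block);
        self.index.insert(cid.to_vec(), loc);
        loc
    }

    /// O(1) read: one index lookup, one positioned read at (offset, len).
    fn get(&self, cid: &[u8]) -> Option<&[u8]> {
        let loc = self.index.get(cid)?;
        let start = loc.offset as usize;
        Some(&self.log[start..start + loc.len as usize])
    }
}

fn main() {
    let mut store = BlockStore::new();
    let first = store.put(b"cid-a", b"hello");
    let dup = store.put(b"cid-a", b"hello");  // dedup: same Loc, no new bytes
    assert_eq!(first, dup);
    assert_eq!(store.get(b"cid-a"), Some(&b"hello"[..]));
}
```

Because the value log is never compacted, a `Loc` stays valid for the life of its data file; only GC rewrites can move it.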

The main throughput enabler is group commit [7]. The reference PDS fsyncs once per user
per mutation [1], but the BlockStore batches all concurrent commits into a single
write-and-sync cycle.
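
The amortization can be sketched as follows; the fsync is simulated with a counter and `CommitLog`, `submit`, and `commit_cycle` are hypothetical names, but the shape is the point: N staged mutations, one write, one sync:

```rust
/// Group commit sketch: mutations from many users are staged into one
/// batch, then flushed with a single write and a single (simulated) fsync.
struct CommitLog {
    buf: Vec<u8>,   // pending batch for the current cycle
    file: Vec<u8>,  // stands in for the on-disk append log
    syncs: usize,   // fsyncs actually issued
}

impl CommitLog {
    fn new() -> Self { Self { buf: Vec::new(), file: Vec::new(), syncs: 0 } }

    /// Called once per user mutation; only stages bytes.
    fn submit(&mut self, record: &[u8]) {
        self.buf.extend_from_slice(record);
    }

    /// One cycle: one write, one fsync, regardless of how many users
    /// contributed to the batch.
    fn commit_cycle(&mut self) {
        if self.buf.is_empty() { return; }
        self.file.extend_from_slice(&self.buf);
        self.buf.clear();
        self.syncs += 1;
    }
}

fn main() {
    let mut log = CommitLog::new();
    for user in 0..100 {
        log.submit(format!("mutation-{user};").as_bytes());
    }
    log.commit_cycle();
    assert_eq!(log.syncs, 1); // vs 100 fsyncs under per-user sync
}
```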

Content dedup occurs naturally: identical MST subtrees across users share
one CID-keyed block instead of N copies [3].

MetaStore uses Fjall [5] keyspaces for all mutable data. We chose Fjall over
redb and LMDB because both of those are single-writer [8][9]. Each keyspace
compacts independently.

For cross-store atomicity we use an intent log. Each mutation writes a single
intent record containing the BlockStore refcount updates, MetaStore changes,
and the serialized EventLog payload, fsynced via the group commit. After the fsync,
the changes are applied to MetaStore and the event is appended to the EventLog,
then the intent is marked committed. Recovery replays any incomplete intents,
re-applying both metadata changes and event appends. This gives us crash-atomic
mutations across all three stores without full MVCC, since mutations
are already serialized per user [3].
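
A minimal sketch of the protocol, with the three stores reduced to in-memory maps and `Intent`, `Stores`, `apply`, and `recover` as illustrative names (a real recovery path must also make application idempotent, e.g. by tracking which intents were partially applied):

```rust
use std::collections::HashMap;

/// One durable record per mutation: everything the three stores need.
struct Intent {
    refcount_deltas: Vec<(String, i64)>, // BlockStore: CID -> refcount delta
    meta_puts: Vec<(String, String)>,    // MetaStore: key -> value
    event: Vec<u8>,                      // EventLog: serialized payload
    committed: bool,                     // set only after full application
}

/// Stand-ins for BlockStore refcounts, MetaStore, and the EventLog.
struct Stores {
    refcounts: HashMap<String, i64>,
    meta: HashMap<String, String>,
    events: Vec<Vec<u8>>,
}

/// Apply one fsynced intent to all three stores.
fn apply(intent: &Intent, s: &mut Stores) {
    for (cid, d) in &intent.refcount_deltas {
        *s.refcounts.entry(cid.clone()).or_insert(0) += d;
    }
    for (k, v) in &intent.meta_puts {
        s.meta.insert(k.clone(), v.clone());
    }
    s.events.push(intent.event.clone());
}

/// Crash recovery: replay every intent that never reached `committed`.
fn recover(log: &mut [Intent], s: &mut Stores) {
    for intent in log.iter_mut().filter(|i| !i.committed) {
        apply(intent, s);
        intent.committed = true;
    }
}

fn main() {
    // Simulate a crash after the intent was fsynced but before application.
    let mut log = vec![Intent {
        refcount_deltas: vec![("cid-a".into(), 1)],
        meta_puts: vec![("record/1".into(), "cid-a".into())],
        event: b"evt-1".to_vec(),
        committed: false,
    }];
    let mut s = Stores {
        refcounts: HashMap::new(), meta: HashMap::new(), events: Vec::new(),
    };
    recover(&mut log, &mut s);
    assert_eq!(s.refcounts["cid-a"], 1);
    assert_eq!(s.events.len(), 1);
    assert!(log[0].committed);
}
```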

EventLog stores the firehose as segmented append-only files. Live subscribers
receive events via tokio broadcast; catching-up consumers read from mmap'ed
segments [10]. Each event receives a monotonic u64 sequence number.
Segment headers store the base sequence number; a per-segment index maps sequence
ranges to byte offsets. This decouples consumer cursors from physical layout,
allowing transparent addition of per-segment zstd compression per the loom-v2
spec [11] without invalidating checkpoints. Retention is just deleting old
segments!
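
Cursor resolution can be sketched in a few lines: binary-search the segment headers for the right base sequence, then look up the byte offset in that segment's index (`Segment` and `resolve` are illustrative names, and the per-segment index is simplified to one offset per event):

```rust
/// One segment: its header's base sequence number plus an index where
/// offsets[i] is the byte offset of event (base_seq + i).
struct Segment {
    base_seq: u64,
    offsets: Vec<u64>,
}

/// Resolve a consumer cursor (a u64 sequence) to (segment index, byte
/// offset), independent of physical layout.
fn resolve(segments: &[Segment], seq: u64) -> Option<(usize, u64)> {
    // Last segment whose base_seq <= seq (segments are sorted by base_seq).
    let i = segments.partition_point(|s| s.base_seq <= seq).checked_sub(1)?;
    let seg = &segments[i];
    let rel = (seq - seg.base_seq) as usize;
    seg.offsets.get(rel).map(|&off| (i, off))
}

fn main() {
    let segments = vec![
        Segment { base_seq: 0, offsets: vec![0, 40, 95] },
        Segment { base_seq: 3, offsets: vec![0, 64] },
    ];
    assert_eq!(resolve(&segments, 1), Some((0, 40)));
    assert_eq!(resolve(&segments, 4), Some((1, 64)));
    assert_eq!(resolve(&segments, 9), None); // past the tail
}
```

Because cursors are pure sequence numbers, a segment can later be rewritten compressed without touching any checkpoint, only its index.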

GC uses refcounted key index entries and is epoch-gated by the group commit
cycle: a block is only eligible for collection if its refcount reached zero in
a prior completed commit cycle. This prevents races between concurrent dedup
(which skips the block write but increments the refcount in the same batch) and
collection. Blocks past the epoch gate are collected by rewriting any data
files that fall below a liveness threshold.
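
The epoch gate reduces to remembering, per block, the commit cycle in which its refcount hit zero; a sketch with illustrative names (`Gc`, `collectable`) and the commit cycle reduced to a counter:

```rust
use std::collections::HashMap;

/// Epoch-gated refcounts: a block is collectable only if it hit zero in a
/// cycle that has since fully completed.
#[derive(Default)]
struct Gc {
    epoch: u64,                         // last completed commit cycle
    refs: HashMap<String, (i64, u64)>,  // cid -> (refcount, zero_epoch)
}

impl Gc {
    fn incr(&mut self, cid: &str) {
        self.refs.entry(cid.to_string()).or_insert((0, 0)).0 += 1;
    }
    fn decr(&mut self, cid: &str) {
        if let Some(e) = self.refs.get_mut(cid) {
            e.0 -= 1;
            if e.0 == 0 { e.1 = self.epoch; } // remember when it hit zero
        }
    }
    fn end_cycle(&mut self) { self.epoch += 1; }

    /// Eligible only if refcount is zero AND it reached zero in a prior,
    /// completed cycle; an in-flight dedup re-incrementing the refcount in
    /// the current batch therefore always wins over collection.
    fn collectable(&self, cid: &str) -> bool {
        matches!(self.refs.get(cid), Some(&(0, z)) if z < self.epoch)
    }
}

fn main() {
    let mut gc = Gc::default();
    gc.incr("cid-a");
    gc.decr("cid-a");                  // hits zero in the current cycle
    assert!(!gc.collectable("cid-a")); // gate holds: cycle not complete
    gc.end_cycle();
    assert!(gc.collectable("cid-a"));  // safe to collect now
}
```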

For serialization we use postcard on disk and rkyv [12] for in-memory caches
only. All data files carry a version tag.

Memory is divided into fixed slices from a configurable total budget: the Fjall
block cache, the BlockStore hot tier, and the CID index each receive a configured
percentage. Actual usage per component is exposed as metrics. The EventLog's
mmap pages live in the OS page cache and are excluded from the budget.
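
The slicing is simple integer arithmetic over one total; in this sketch the component names and percentages are made-up defaults, not the real configuration:

```rust
/// Split one total budget into fixed slices by configured percentage.
fn slice_budget(total_bytes: u64, pct: &[(&str, u64)]) -> Vec<(String, u64)> {
    pct.iter()
        .map(|&(name, p)| (name.to_string(), total_bytes * p / 100))
        .collect()
}

fn main() {
    // Hypothetical split of a 1 GiB budget.
    let slices = slice_budget(1 << 30, &[
        ("fjall_block_cache", 50),
        ("blockstore_hot_tier", 35),
        ("cid_index", 15),
    ]);
    assert_eq!(slices[0].1, 512 << 20); // 50% of 1 GiB = 512 MiB
}
```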

Backup acquires the group commit lock, which quiesces all writes at the next
commit boundary. Under the lock, the system notes the EventLog position and the
BlockStore file list and takes a Fjall snapshot, then releases the lock.
Sealed data files and segments are immutable and can be copied without
coordination after the snapshot. The quiesce window is bounded by one commit
cycle. Point-in-time recovery replays the EventLog against a prior snapshot.
For continuous replication, a background process tails the EventLog and copies
sealed files to remote storage.

-- Runtime --

The storage core runs on tokio. It is synchronous internally, accessed through
dedicated handler threads that communicate via async channels [13]. Requests
are dispatched by hashing the DID, which gives us per-user write serialization
without locks. Global operations use round-robin. All disk IO goes through
pread/pwrite directly [13].
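
The dispatch rule is just a stable hash modulo the thread count: every request for a given DID lands on the same handler thread, so its writes are serialized without locks. A sketch (the real dispatcher sends over channels; `shard_for` is an illustrative name, and `DefaultHasher` stands in for whatever stable hash is used):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick the handler thread for a DID. Same DID -> same thread, always.
fn shard_for(did: &str, n_threads: usize) -> usize {
    let mut h = DefaultHasher::new();
    did.hash(&mut h);
    (h.finish() % n_threads as u64) as usize
}

fn main() {
    let n = 8;
    let a = shard_for("did:plc:alice", n);
    // Deterministic: repeated dispatches of one user never cross threads.
    assert_eq!(a, shard_for("did:plc:alice", n));
    assert!(a < n);
}
```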

We rejected io_uring for three reasons: it creates orphan kernel operations
when futures are cancelled [22], it is blocked by default in both Docker [16]
and Podman [17] seccomp profiles, and it accounts for 60% of Google's kernel
vulnerability rewards [15].

We also rejected thread-per-core runtimes (glommio, etc.) because they are
incompatible with the tokio ecosystem. DID-sharded handler threads give us
the same shared-nothing property without a runtime split.

-- Testing --

We use deterministic simulation testing, following FoundationDB [18] and
TigerBeetle's VOPR [19]. All IO sits behind a StorageIO trait, and tests use an
in-memory implementation that injects faults: partial writes, bit flips, sync
failures, and misdirected writes. A single seed controls the entire fault
schedule, so any failure reproduces exactly [20][21].
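
The harness shape can be sketched with one fault type. This is a toy version of the idea, not the real trait: a seeded PRNG (a tiny LCG here, since determinism is the whole point) drives partial-write injection behind the IO trait, so the same seed replays the same fault schedule byte-for-byte:

```rust
/// All IO goes through a trait so tests can swap in a faulty in-memory impl.
trait StorageIo {
    fn write(&mut self, buf: &[u8]) -> usize; // returns bytes written
}

/// In-memory backend that injects partial (torn) writes from a seeded PRNG.
struct FaultyMem {
    data: Vec<u8>,
    rng: u64, // LCG state, seeded once per simulated run
}

impl FaultyMem {
    fn new(seed: u64) -> Self { Self { data: Vec::new(), rng: seed } }

    /// Minimal deterministic PRNG (Knuth's MMIX LCG constants).
    fn next(&mut self) -> u64 {
        self.rng = self.rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.rng
    }
}

impl StorageIo for FaultyMem {
    fn write(&mut self, buf: &[u8]) -> usize {
        // 1-in-4 chance the write is torn partway through.
        let n = if self.next() % 4 == 0 {
            (self.next() as usize) % buf.len().max(1)
        } else {
            buf.len()
        };
        self.data.extend_from_slice(&buf[..n]);
        n
    }
}

fn main() {
    // Same seed -> identical fault schedule -> identical outcome.
    let mut a = FaultyMem::new(42);
    let mut b = FaultyMem::new(42);
    let wrote_a: Vec<usize> = (0..16).map(|_| a.write(b"record")).collect();
    let wrote_b: Vec<usize> = (0..16).map(|_| b.write(b"record")).collect();
    assert_eq!(wrote_a, wrote_b);
}
```

The real harness layers bit flips, sync failures, and misdirected writes the same way: every fault decision is drawn from the seeded stream, never from wall-clock or OS state.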

-- Why these choices --

Bitcask for blocks:
Key-value separation [6] using bitcask [4] for immutable CID blocks:
O(1) writes, O(1) reads, zero write amplification, and no compaction!

Fjall for metadata:
The only pure-Rust embedded engine with concurrent writers [5].
Otherwise we'd write our own.

Segmented log for events:
Write once -> scan forward -> delete by age.
Quite straightforward!

Postcard on disk:
rkyv is apparently faster [12] but couples the on-disk format to the library version.

Tokio and handler threads:
spawn_blocking plus pread matches io_uring without the security and compatibility costs [13][14][15][16].

Deterministic simulation:
Catches bug classes conventional testing can't reach [18][19].
The StorageIO trait is needed anyway, and building harness-first is a one-time cost [20][21].

-- References --

[1] Bluesky PDS SQLite migration. github.com/bluesky-social/atproto/pull/1705
[2] G. Orosz. Building Bluesky: a Distributed Social Network. Pragmatic Engineer, April 2024.
    newsletter.pragmaticengineer.com/p/bluesky
    K. Suder. Introduction to AT Protocol. August 2025. mackuba.eu/2025/08/20/introduction-to-atproto
    Bluesky PDS "Going to Production" guide. atproto.com/guides/going-to-production
[3] AT Protocol repository spec. atproto.com/specs/repository
[4] Bitcask: A Log-Structured Hash Table for Fast Key/Value Data. Riak, 2010. riak.com/assets/bitcask-intro.pdf
[5] Fjall: LSM-based embedded storage engine. github.com/fjall-rs/fjall
[6] Lu et al. WiscKey: Separating Keys from Values in SSD-Conscious Storage. USENIX FAST 2016.
    usenix.org/conference/fast16/technical-sessions/presentation/lu
[7] Phil Eaton. A Write-Ahead Log Is Not a Universal Part of Durability. July 2024.
    notes.eatonphil.com/2024-07-01-a-write-ahead-log-is-not-a-universal-part-of-durability.html
[8] redb design document. github.com/cberner/redb/blob/master/docs/design.md
[9] LMDB source repository. github.com/LMDB/lmdb
[10] Crotty et al. Are You Sure You Want to Use MMAP in Your DBMS? CIDR 2022.
     cs.brown.edu/people/acrotty/pubs/p13-crotty.pdf
[11] ybzeek. RFC: com.atproto.sync.getZstdStream (zstd-compressed relay streams).
     github.com/bluesky-social/atproto/discussions/4582
[12] rkyv: zero-copy deserialization framework for Rust. rkyv.org
[13] Tonbo. Exploring Better Async Rust Disk IO. tonbo.io/blog/exploring-better-async-rust-disk-io
[14] Iroh. Async Rust Challenges in Iroh. iroh.computer/blog/async-rust-challenges-in-iroh
[15] Google restricting io_uring. phoronix.com/news/Google-Restricting-IO_uring
[16] Docker 4.42.0 and io_uring. forums.docker.com/t/4-42-0-and-io-uring/148620
[17] Podman io_uring discussion. github.com/containers/podman/discussions/27772
[18] FoundationDB simulation testing. apple.github.io/foundationdb/testing.html
[19] TigerBeetle VOPR. tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness
[20] DST in Rust (S2). s2.dev/blog/dst
[21] Phil Eaton. What's the big deal about Deterministic Simulation Testing? August 2024.
     notes.eatonphil.com/2024-08-20-deterministic-simulation-testing.html
[22] Tonbo. Async Rust Is Not Safe with io_uring. tonbo.io/blog/async-rust-is-not-safe-with-io-uring

Thank you for reading! Let's do some great work together.