Irmin#
Irmin is a content-addressable store that models data as typed, hash-linked DAGs. You navigate them with cursors, merge them with schema-defined strategies, and sync them over the network. Backends are pluggable: the same code works against a Git repository, an ATProto PDS, an OCI registry, or an in-memory heap.
This version of Irmin is rebuilt around a new core abstraction: a typed cursor over a content-addressed DAG, with codecs that define block structure and merge semantics at the schema level rather than per-backend. It uses Eio for concurrent I/O and requires OCaml >= 5.2.
Why Irmin#
A Git repository is a content-addressed store. So is an ATProto repository, an OCI image, or an IPFS object. They all share the same core operations: hash a block, store it, link blocks by hash, traverse, diff, merge, sync.
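The shared operations listed above (hash, store, link, traverse) can be sketched with the standard library alone. This is a toy model, not Irmin's API: the `block` type and the `put` and `reachable` functions are hypothetical, and `Digest` (MD5) stands in for SHA-1 or SHA-256.

```ocaml
(* Toy content-addressed store. Digest (MD5) from the standard library
   stands in for SHA-1 / SHA-256; none of these names are Irmin's API. *)
type block = { data : string; links : string list (* hex hashes of children *) }

let store : (string, block) Hashtbl.t = Hashtbl.create 16

(* Hash a block: the digest covers the payload and the child hashes. *)
let hash b =
  Digest.to_hex (Digest.string (String.concat "\x00" (b.data :: b.links)))

(* Store a block under its own hash and return that hash. *)
let put b =
  let h = hash b in
  Hashtbl.replace store h b;
  h

(* Traverse: collect the payloads reachable from a root hash, depth first. *)
let rec reachable h =
  match Hashtbl.find_opt store h with
  | None -> []
  | Some b -> b.data :: List.concat_map reachable b.links

let () =
  let a = put { data = "hello"; links = [] } in
  let a' = put { data = "hello"; links = [] } in
  assert (a = a');                      (* same content, same address *)
  let root = put { data = "root"; links = [ a; a' ] } in
  assert (Hashtbl.length store = 2);    (* the duplicate leaf is stored once *)
  assert (reachable root = [ "root"; "hello"; "hello" ])
```

Because the address is derived from the content, writing the same leaf twice costs nothing extra, which is the property every backend listed above relies on.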
Irmin makes that pattern a library. You define a schema (how blocks are structured), and Irmin gives you:
- Typed cursors that navigate the DAG lazily, fetching blocks on demand
- Schema-driven merge with typed merge functions (counters, sets, text, LWW) composed at the tree level
- Merkle proofs — record a computation's reads into a subheap, replay and verify elsewhere
- Sync protocols — anti-entropy gossip, Merkle descent, Bloom-slice transfer (Journault & Gazagnaire, 2014)
- A CLI for inspecting, editing, and serving stores from the terminal
The cursor abstraction means application code is backend-independent. Switch from Git to ATProto by changing the heap, not the logic.
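To make the typed merge functions from the list above concrete, here is a minimal sketch of a three-way counter merge and a last-writer-wins merge. The definitions are illustrative assumptions (including the timestamp used for LWW) and are not taken from Irmin's source.

```ocaml
(* Three-way merge of a counter: apply both sides' deltas to the common
   ancestor. An illustrative definition, not taken from Irmin's source. *)
let merge_counter ~base a b = a + b - base

(* Lift a typed merge to string-encoded blocks via a decode/encode pair. *)
let lift ~decode ~encode merge ~base a b =
  encode (merge ~base:(decode base) (decode a) (decode b))

let merge_counter_block =
  lift ~decode:int_of_string ~encode:string_of_int merge_counter

(* Last-writer-wins over (timestamp, value) pairs; the timestamp is an
   assumption made for this sketch. *)
let merge_lww_pair (ta, a) (tb, b) = if tb >= ta then (tb, b) else (ta, a)

let () =
  (* From 0, one branch adds 5 and the other adds 3. *)
  assert (merge_counter ~base:0 5 3 = 8);
  (* From 10, the branches reach 15 and 13: both increments survive. *)
  assert (merge_counter_block ~base:"10" "15" "13" = "18");
  assert (merge_lww_pair (1, "old") (2, "new") = (2, "new"))
```

The counter never conflicts because increments commute, whereas LWW resolves a conflict by discarding one side.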
Installation#
Install with opam:
$ opam install nox-irmin
If opam cannot find the package, it may not yet be released in the public
opam-repository. Add the overlay repository, then install it:
$ opam repo add samoht https://tangled.org/gazagnaire.org/opam-overlay.git
$ opam update
$ opam install nox-irmin
Quick Start#
Navigate a Git repository#
open Irmin_git
let walk_tree () =
  Eio_main.run @@ fun env ->
  let fs = Eio.Stdenv.fs env in
  Eio.Switch.run @@ fun sw ->
  let heap = open_ ~sw ~fs ~path:(Fpath.v ".") in
  match head heap ~branch:"main" with
  | None -> print_endline "empty repository"
  | Some h ->
      let c = at heap tree h in
      list c
      |> List.iter (fun (name, kind) ->
             let tag = match kind with `Node -> "/" | `Leaf -> "" in
             Fmt.pr " %s%s@." name tag)
Define a custom schema#
(* Reuse the Git tree codec but override the leaf merge strategy *)
let my_tree =
  fix (fun self ->
      node ~name:"application/x-tree"
        ~dec:tree_parse ~enc:tree_serialize
        ~merge:merge_lww (* last-writer-wins at leaves *)
        ~rules:[ "*" => self ] ())
Links vs inlines#
A child of a node is either linked (a separate content-addressed
block, referenced by hash, deduplicated across the DAG) or inlined
(bytes stored inside the parent block, not content-addressed on their
own). The schema's dec/enc decides per child; the cursor walks
both transparently.
(* A Git tree entry: the permission bits live inline in the
   parent, the target blob lives as a separate Link. *)
let example_children =
  let blob_hash = Git.Hash.digest_string ~kind:`Blob "hello" in
  Named
    [ ("mode", inline "100644");
      ("target", link (irmin_hash blob_hash)) ]
Rule of thumb: anything you want to share across blocks (deduplicated,
independently fetchable, reachable from proofs) should be a link.
Anything you always want to materialise with the parent (a small flag,
a permission, a short tag) should be inline. The schema stays pure
either way — heap writes happen at flush, not inside enc.
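The trade-off in the rule of thumb above can be seen in a small standalone model. The `child` type and the `link_blob` and `encode` functions below are hypothetical and use only the standard library; `Digest` (MD5) stands in for the backend hash.

```ocaml
(* Hypothetical model of a node's children; Digest (MD5) stands in for the
   backend hash. These types are not Irmin's. *)
type child =
  | Inline of string  (* bytes embedded in the parent block *)
  | Link of string    (* hex hash of a separately stored block *)

let heap : (string, string) Hashtbl.t = Hashtbl.create 16

(* Store a blob as its own block and return a link to it. *)
let link_blob bytes =
  let h = Digest.to_hex (Digest.string bytes) in
  Hashtbl.replace heap h bytes;
  Link h

(* Serialise a parent: inline bytes are copied in, links contribute a hash. *)
let encode children =
  children
  |> List.map (fun (name, c) ->
         match c with
         | Inline s -> Printf.sprintf "%s=inline:%s" name s
         | Link h -> Printf.sprintf "%s=link:%s" name h)
  |> String.concat "\n"

let () =
  let entry mode = [ ("mode", Inline mode); ("target", link_blob "hello") ] in
  let e1 = encode (entry "100644") and e2 = encode (entry "100755") in
  assert (e1 <> e2);                (* the inline mode differs per parent *)
  assert (Hashtbl.length heap = 1)  (* the linked blob is shared, stored once *)
```

Two parents with different modes still point at the same blob block, while the mode bytes are duplicated in each parent.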
Merge with typed strategies#
(* Lift a typed counter merge into a block-level merge strategy *)
let counter_merge =
  Irmin.Merge.v
    ~decode:int_of_string
    ~encode:string_of_int
    Irmin.Merge.counter
let counter_leaf =
  leaf ~name:"counter" ~merge:counter_merge ()
(* 5 + 3 = 8, not a conflict *)
let text_leaf =
  leaf ~name:"text" ~merge:merge_lww ()
(* Tree-level merge composes leaf strategies automatically *)
let merged_tree =
  fix (fun self ->
      directory
        [ "*.count" => counter_leaf;
          "*.txt" => text_leaf;
          "*" => self ])
CLI#
$ irmin init -r mystore
$ irmin set -r mystore config/db.json '{"host":"localhost"}'
$ irmin get -r mystore config/db.json
{"host":"localhost"}
$ irmin tree -r mystore
config/
  db.json
$ irmin log -r mystore
[abc1234] Set config/db.json
$ irmin branches -r mystore
main
Moving data in and out#
Four commands, split by source medium (file on disk vs. another live store) and direction (ingest vs. emit):
| | from a file | from another store |
|---|---|---|
| in | irmin import FILE | irmin pull REMOTE |
| out | irmin export FILE | irmin push REMOTE |
Plus the onboarding shortcut:
$ irmin clone SOURCE [DIR]
This seeds a fresh store under DIR from a CAR archive (supported today) or a
remote URL (planned). It is the one-shot equivalent of init, import, and
setting HEAD. With no DIR, the target directory is inferred from the source
basename, matching git clone's convention.
The archival pair (import / export) and the sync pair
(pull / push) are deliberately different workflows:
- Archive: a CAR file is a self-contained, hash-integral snapshot. No
  refs, no merge, no network. Hand it to someone, commit it to backup,
  re-hydrate with import. clone is the "get started" shortcut over this pair.
- Sync: two live stores agreeing on a ref. push sends the delta, pull
  fetches and merges. Refs and merge strategy matter; the target must be
  reachable and writable for push.
CAR support currently targets the PDS/ATProto backend; Git-backed
stores export via git bundle on the underlying .git.
Backends#
| Module | Type | Block format | Status |
|---|---|---|---|
| Irmin_git | Backend | SHA-1 tree/blob/commit | Read/write, full Heap.S |
| Irmin_atproto | Backend | DAG-CBOR (SHA-256) | Read/write, full Heap.S |
| Irmin_tar | Backend | SHA-256 file blobs | Read/write |
| Irmin_json | Codec | In-memory JSON values | Read-only (parse JSON blocks) |
| Irmin_cbor | Codec | DAG-CBOR blocks | Read-only (parse CBOR blocks) |
| Irmin_oci | Codec | SHA-256 manifests/layers | Read-only (parse OCI manifests) |
Full backends implement Heap.S — content-addressed block storage with
named refs, put, get, and mem. Codecs provide the dec/enc functions
that interpret blocks but do not manage storage.
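As a rough picture of what a full backend provides, the sketch below guesses at an interface from the operations named above (named refs, put, get, mem) and implements it in memory. The `HEAP` signature, the `Mem` module, and the `set_ref` and `find_ref` names are assumptions; the real Heap.S may look quite different.

```ocaml
(* A guess at the shape of a heap interface, based only on the operations
   named above (refs, put, get, mem). The real Heap.S may differ. *)
module type HEAP = sig
  type t
  val create : unit -> t
  val put : t -> string -> string  (* store a block, return its hash *)
  val get : t -> string -> string option
  val mem : t -> string -> bool
  val set_ref : t -> string -> string -> unit  (* point a named ref at a hash *)
  val find_ref : t -> string -> string option
end

(* In-memory implementation; Digest (MD5) stands in for SHA-1 / SHA-256. *)
module Mem : HEAP = struct
  type t = {
    blocks : (string, string) Hashtbl.t;
    refs : (string, string) Hashtbl.t;
  }
  let create () = { blocks = Hashtbl.create 16; refs = Hashtbl.create 4 }
  let put t bytes =
    let h = Digest.to_hex (Digest.string bytes) in
    Hashtbl.replace t.blocks h bytes;
    h
  let get t h = Hashtbl.find_opt t.blocks h
  let mem t h = Hashtbl.mem t.blocks h
  let set_ref t name h = Hashtbl.replace t.refs name h
  let find_ref t name = Hashtbl.find_opt t.refs name
end

let () =
  let heap = Mem.create () in
  let h = Mem.put heap "blob" in
  Mem.set_ref heap "main" h;
  assert (Mem.mem heap h);
  assert (Mem.get heap h = Some "blob");
  assert (Mem.find_ref heap "main" = Some h);
  assert (Mem.find_ref heap "dev" = None)
```

Blocks are immutable and addressed by hash, so the refs table is the only mutable state in this model.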
Architecture#
┌──────────────────────────────────────────┐
│ Schema: codecs, cursors, merge, proofs │ typed, backend-agnostic
├────────────────────┬─────────────────────┤
│ Sync │ Worktree │ coordination
│ gossip / Merkle / │ checkout / status │
│ Bloom-slice │ / commit │
├────────────────────┴──────┬──────────────┤
│ Git heap │ ATProto heap │ backend layer
│ (SHA-1, packfiles) │ (SHA-256) │
└───────────────────────────┴──────────────┘
Module Structure#
| Module | Purpose |
|---|---|
| Irmin.Hash | Phantom-typed SHA-1 / SHA-256 hashes |
| Irmin.Heap | Content-addressed block store with named refs |
| Irmin.Schema.Make(H) | Codecs, cursors, merge, diff, proofs over H |
| Irmin.Sync | Backend-agnostic sync (gossip, Merkle, Bloom) |
| Irmin.Worktree | Filesystem checkout / status / commit |
| Irmin.SHA1 | Pre-built Schema.Make for SHA-1 |
| Irmin.SHA256 | Pre-built Schema.Make for SHA-256 |
| Irmin_git | Git backend |
| Irmin_atproto | ATProto backend |
| Irmin_oci | OCI registry backend |
| Irmin_json | JSON in-memory backend |
| Irmin_cbor | CBOR block backend |
| Irmin_tar | TAR archive backend |
Key Concepts#
Heap: A typed, content-addressed persistent store. Blocks are stored and retrieved by hash. Named refs (branches, HEAD) provide mutable entry points. The heap does not interpret blocks — structure is the codec's job.
Codec: A typed description of a block at one level of the DAG.
Codecs carry a MIME-style name, a decode/encode pair, an optional
merge function, and navigation rules that map child names to codecs.
The fix combinator handles recursive structures (trees that contain
trees).
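One way to picture how a fix combinator can tie a recursive codec to itself is with a lazy self-reference. The `codec` record, the `child_codec` lookup, the catch-all treatment of "*", and the "application/x-blob" name are all hypothetical simplifications, not Irmin's definitions.

```ocaml
(* Hypothetical codec record: a name plus rules mapping child names to the
   codec used for that child. "*" is treated as a catch-all. *)
type codec = { name : string; rules : (string * codec Lazy.t) list }

(* Tie the knot: the codec under construction receives a lazy reference
   to itself, which is only forced after construction has finished. *)
let fix (f : codec Lazy.t -> codec) : codec =
  let rec self = lazy (f self) in
  Lazy.force self

(* Pick the codec for a child: exact name first, then the "*" rule. *)
let child_codec c name =
  match List.assoc_opt name c.rules with
  | Some k -> Some (Lazy.force k)
  | None -> Option.map Lazy.force (List.assoc_opt "*" c.rules)

let blob = { name = "application/x-blob"; rules = [] }

(* A tree whose README child is a blob and whose other children are trees. *)
let tree =
  fix (fun self ->
      { name = "application/x-tree";
        rules = [ ("README", lazy blob); ("*", self) ] })

let () =
  let sub = Option.get (child_codec tree "src") in
  assert (sub.name = "application/x-tree");
  assert (Option.map (fun c -> c.name) (child_codec sub "README")
          = Some "application/x-blob");
  assert (child_codec blob "anything" = None)
```

The lazy indirection is what lets the definition refer to itself without looping at construction time.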
Cursor: A position in the DAG with a known value type. Reads are
lazy: step descends to a child by name or typed field, fetching the
block only when needed. Writes accumulate locally until flush
persists them.
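The lazy-read and buffered-write behaviour described above can be modelled in a few lines. This is a toy cursor over an in-memory table; `at`, `step`, `set`, and `flush` here are stand-ins written for this sketch and do not reproduce Irmin's signatures.

```ocaml
(* Toy cursor over an in-memory tree store. All names are illustrative. *)
type node =
  | Leaf of string
  | Node of (string * string) list  (* child name -> block id *)

let store : (string, node) Hashtbl.t = Hashtbl.create 16
let fetches = ref 0

(* Every real read of the store goes through here and is counted. *)
let fetch id =
  incr fetches;
  Hashtbl.find store id

(* A cursor holds a block id, a lazily fetched value, and pending writes. *)
type cursor = {
  id : string;
  value : node Lazy.t;
  pending : (string, node) Hashtbl.t;
}

let at id = { id; value = lazy (fetch id); pending = Hashtbl.create 4 }

(* Descend to a named child; the child's block is not fetched yet. *)
let step c name =
  match Lazy.force c.value with
  | Node children -> Option.map at (List.assoc_opt name children)
  | Leaf _ -> None

(* Buffer a write locally; nothing reaches the store until [flush]. *)
let set c id node = Hashtbl.replace c.pending id node

let flush c =
  Hashtbl.iter (Hashtbl.replace store) c.pending;
  Hashtbl.reset c.pending

let () =
  Hashtbl.replace store "root" (Node [ ("a", "blk-a"); ("b", "blk-b") ]);
  Hashtbl.replace store "blk-a" (Leaf "hello");
  Hashtbl.replace store "blk-b" (Leaf "world");
  let root = at "root" in
  assert (!fetches = 0);                     (* creating a cursor reads nothing *)
  let a = Option.get (step root "a") in
  assert (!fetches = 1);                     (* only the root was fetched *)
  assert (Lazy.force a.value = Leaf "hello");
  assert (!fetches = 2);                     (* blk-b was never touched *)
  set root "blk-c" (Leaf "new");
  assert (not (Hashtbl.mem store "blk-c"));  (* still only in the cursor *)
  flush root;
  assert (Hashtbl.mem store "blk-c")
```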
Proof: A subheap recording every block read during a computation.
produce runs a function against a full heap and captures the reads;
verify replays the function against the subheap alone, confirming
the result without trusting the full store.
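A minimal sketch of the produce and verify idea, assuming blocks are plain strings keyed by their digest. The function shapes below are invented for illustration and `Digest` (MD5) stands in for the real hash.

```ocaml
(* Toy version of the produce / verify idea. Blocks are keyed by their
   Digest (MD5, standing in for the real hash); names are illustrative. *)
type heap = (string, string) Hashtbl.t

let put (h : heap) bytes =
  let k = Digest.to_hex (Digest.string bytes) in
  Hashtbl.replace h k bytes;
  k

(* Run [f] with a reader that copies every block it reads into a subheap. *)
let produce (full : heap) f =
  let sub : heap = Hashtbl.create 8 in
  let read k =
    let v = Hashtbl.find full k in
    Hashtbl.replace sub k v;
    v
  in
  let result = f read in
  (sub, result)

(* Replay [f] against the subheap only, checking each block against its hash. *)
let verify (sub : heap) f expected =
  let read k =
    let v = Hashtbl.find sub k in
    if Digest.to_hex (Digest.string v) <> k then failwith "corrupt block";
    v
  in
  match f read with
  | result -> result = expected
  | exception (Not_found | Failure _) -> false

let () =
  let full : heap = Hashtbl.create 8 in
  let a = put full "40" and b = put full "2" in
  let _unused = put full "never read" in
  let sum read = int_of_string (read a) + int_of_string (read b) in
  let proof, result = produce full sum in
  assert (result = 42);
  assert (Hashtbl.length proof = 2);  (* only the blocks that were read *)
  assert (verify proof sum 42);
  assert (not (verify proof sum 43));
  Hashtbl.remove proof a;
  assert (not (verify proof sum 42))  (* a missing block fails verification *)
```

The verifier needs only the subheap and the root hashes it already trusts, so the proof stays proportional to what the computation touched.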
Sync: The protocol for exchanging blocks between stores. Irmin provides three default strategies: anti-entropy gossip (exchange head lists), Merkle descent (walk the DAG to find divergence), and Bloom-slice sync (probabilistic set difference using layered Bloom filters — efficient for large DAGs with small deltas).
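Of the three strategies, Merkle descent is the simplest to sketch: walk down from the remote root and stop at any hash the local store already holds. The code below is an in-memory illustration under that assumption, with hypothetical names; it does not model gossip or Bloom-slice sync, and it is not Irmin's implementation.

```ocaml
(* Toy Merkle descent between two in-memory stores. A block is a payload
   plus child hashes; Digest (MD5) stands in for the real hash. *)
type block = { data : string; links : string list }
type store = (string, block) Hashtbl.t

let put (s : store) b =
  let h =
    Digest.to_hex (Digest.string (String.concat "\x00" (b.data :: b.links)))
  in
  Hashtbl.replace s h b;
  h

(* Copy everything reachable from [root] that [dst] lacks. If [dst] already
   holds a hash, its whole subtree is assumed present and is skipped.
   Children are copied before their parent so that assumption stays true.
   Returns the number of blocks transferred. *)
let rec pull ~(src : store) ~(dst : store) root =
  if Hashtbl.mem dst root then 0
  else begin
    let b = Hashtbl.find src root in
    let n = List.fold_left (fun acc h -> acc + pull ~src ~dst h) 0 b.links in
    Hashtbl.replace dst root b;
    n + 1
  end

let () =
  let remote : store = Hashtbl.create 16 in
  let local : store = Hashtbl.create 16 in
  let x = put remote { data = "x"; links = [] } in
  let y = put remote { data = "y"; links = [] } in
  let v1 = put remote { data = "v1"; links = [ x; y ] } in
  assert (pull ~src:remote ~dst:local v1 = 3);  (* first sync copies everything *)
  let z = put remote { data = "z"; links = [] } in
  let v2 = put remote { data = "v2"; links = [ x; y; z ] } in
  assert (pull ~src:remote ~dst:local v2 = 2);  (* only v2 and z are new *)
  assert (Hashtbl.mem local z);
  assert (pull ~src:remote ~dst:local v2 = 0)   (* already in sync *)
```

The cost of the second pull is proportional to the delta rather than to the size of the DAG, which is the point of descending by hash.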
References#
- Irmin on GitHub — previous version of Irmin (Lwt-based, MirageOS-focused)
- Git Internals — the content-addressed model that Irmin generalises
- AT Protocol Repository Spec — ATProto's MST-based content-addressed repository
- B. Journault and T. Gazagnaire. Irmin: a branch-consistent distributed library database. OCaml Workshop, 2014.
License#
ISC