EWAH-compressed bitmaps (git-compatible)
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

ocaml-ewah: new package for EWAH-compressed bitmaps

Enhanced Word-Aligned Hybrid bitmap compression (Lemire et al., 2009),
producing the same wire format as git's .bitmap files (EWAH-64,
big-endian). Use case: reachability indexes, commit-graph bitmaps,
membership sets over large sparse domains.

Ships with the full testing ceremony:
- lib/: internal rep is a sorted index Set; serialisation compresses
to git-compatible EWAH. All set algebra runs on the in-memory rep.
- test/: 37 alcotest cases (basic, set algebra, iteration,
serialization roundtrip, error paths) plus 12 inlined hostile cases
(single-bit, 1_000_000-bit, run-straddling-word-boundary, etc).
- test/cram/cli.t/: cram integration test driving a small demo exe,
using the test/cram/ umbrella + setup_scripts pattern.
- fuzz/: Crowbar property tests comparing against Set.Make(Int).

+932
+1
.ocamlformat
··· 1 + version = 0.29.0
+34
dune-project
··· 1 + (lang dune 3.21) 2 + (name ewah) 3 + (version 0.1.0) 4 + (formatting (enabled_for ocaml)) 5 + 6 + (generate_opam_files true) 7 + (implicit_transitive_deps false) 8 + 9 + (license ISC) 10 + (maintainers "Thomas Gazagnaire <thomas@gazagnaire.org>") 11 + (authors "Thomas Gazagnaire <thomas@gazagnaire.org>") 12 + (source (tangled gazagnaire.org/ocaml-ewah)) 13 + 14 + (package 15 + (name ewah) 16 + (synopsis "EWAH-compressed bitmaps (git-compatible)") 17 + (tags (org:blacksun storage)) 18 + (description " 19 + Enhanced Word-Aligned Hybrid (EWAH) bitmap compression, a run-length 20 + encoding for sparse bit sets invented by Daniel Lemire. Produces the 21 + same wire format as git's .bitmap files, so bitmaps interoperate with 22 + git repositories. 23 + 24 + Typical uses: reachability indexes, commit-graph bitmaps, membership 25 + sets over large sparse domains. Set algebra (union, intersection, 26 + difference) runs directly on the compressed form without 27 + decompression.") 28 + (depends 29 + (ocaml (>= 5.1.0)) 30 + (bytesrw (>= 0.1.0)) 31 + (fmt (>= 0.9.0)) 32 + (alcotest (and :with-test (>= 1.7.0))) 33 + (crowbar (and :with-test (>= 0.2))) 34 + (alcobar (and :with-test (>= 0.1)))))
+50
ewah.opam
··· 1 + # This file is generated by dune, edit dune-project instead 2 + opam-version: "2.0" 3 + version: "0.1.0" 4 + synopsis: "EWAH-compressed bitmaps (git-compatible)" 5 + description: """ 6 + 7 + Enhanced Word-Aligned Hybrid (EWAH) bitmap compression, a run-length 8 + encoding for sparse bit sets invented by Daniel Lemire. Produces the 9 + same wire format as git's .bitmap files, so bitmaps interoperate with 10 + git repositories. 11 + 12 + Typical uses: reachability indexes, commit-graph bitmaps, membership 13 + sets over large sparse domains. Set algebra (union, intersection, 14 + difference) runs directly on the compressed form without 15 + decompression.""" 16 + maintainer: ["Thomas Gazagnaire <thomas@gazagnaire.org>"] 17 + authors: ["Thomas Gazagnaire <thomas@gazagnaire.org>"] 18 + license: "ISC" 19 + tags: ["org:blacksun" "storage"] 20 + homepage: "https://tangled.org/gazagnaire.org/ocaml-ewah" 21 + bug-reports: "https://tangled.org/gazagnaire.org/ocaml-ewah/issues" 22 + depends: [ 23 + "dune" {>= "3.21"} 24 + "ocaml" {>= "5.1.0"} 25 + "bytesrw" {>= "0.1.0"} 26 + "fmt" {>= "0.9.0"} 27 + "alcotest" {with-test & >= "1.7.0"} 28 + "crowbar" {with-test & >= "0.2"} 29 + "alcobar" {with-test & >= "0.1"} 30 + "odoc" {with-doc} 31 + ] 32 + build: [ 33 + ["dune" "subst"] {dev} 34 + [ 35 + "dune" 36 + "build" 37 + "-p" 38 + name 39 + "-j" 40 + jobs 41 + "@install" 42 + "@runtest" {with-test} 43 + "@doc" {with-doc} 44 + ] 45 + ] 46 + dev-repo: "git+https://tangled.org/gazagnaire.org/ocaml-ewah" 47 + x-maintenance-intent: ["(latest)"] 48 + x-quality-build: "2026-04-17" 49 + x-quality-fuzz: "2026-04-17" 50 + x-quality-test: "2026-04-17"
+3
ewah.opam.template
··· 1 + x-quality-build: "2026-04-17" 2 + x-quality-fuzz: "2026-04-17" 3 + x-quality-test: "2026-04-17"
+19
fuzz/dune
··· 1 + (executable 2 + (name fuzz) 3 + (modules fuzz fuzz_ewah) 4 + (libraries ewah alcobar)) 5 + 6 + (rule 7 + (alias runtest) 8 + (enabled_if (<> %{profile} afl)) 9 + (deps fuzz.exe) 10 + (action (run %{exe:fuzz.exe}))) 11 + 12 + (rule 13 + (alias fuzz) 14 + (enabled_if (= %{profile} afl)) 15 + (deps fuzz.exe) 16 + (action 17 + (progn 18 + (run %{exe:fuzz.exe} --gen-corpus corpus) 19 + (run afl-fuzz -V 60 -i corpus -o _fuzz -- %{exe:fuzz.exe} @@))))
+1
fuzz/fuzz.ml
··· 1 + let () = Alcobar.run "ewah" [ Fuzz_ewah.suite ]
+88
fuzz/fuzz_ewah.ml
··· 1 + (** Fuzz tests for EWAH bitmaps. *) 2 + 3 + open Alcobar 4 + 5 + let clamp_positive n = abs (n mod 2_000_000) 6 + 7 + let sanitize_indices xs = 8 + (* Clamp each index into a reasonable range so fuzzer inputs don't 9 + allocate gigabyte-sized word arrays. *) 10 + List.map clamp_positive xs 11 + 12 + (** of_bytes must never raise on arbitrary input. *) 13 + let test_of_bytes_crash_safety buf = 14 + let _ = Ewah.of_bytes (Bytes.of_string buf) in 15 + () 16 + 17 + (** to_bytes |> of_bytes roundtrips any bitmap we build from indices. *) 18 + let test_roundtrip xs = 19 + let xs = sanitize_indices xs in 20 + let t = Ewah.of_indices xs in 21 + match Ewah.of_bytes (Ewah.to_bytes t) with 22 + | Error (`Msg m) -> failf "roundtrip failure: %s" m 23 + | Ok t' -> 24 + if not (Ewah.equal t t') then 25 + failf "roundtrip altered bitmap (cardinal before=%d after=%d)" 26 + (Ewah.cardinal t) (Ewah.cardinal t') 27 + 28 + (** union matches set-theoretic union. *) 29 + let test_union xs ys = 30 + let xs = sanitize_indices xs in 31 + let ys = sanitize_indices ys in 32 + let ta = Ewah.of_indices xs in 33 + let tb = Ewah.of_indices ys in 34 + let tu = Ewah.union ta tb in 35 + let module IS = Set.Make (Int) in 36 + let expected = IS.union (IS.of_list xs) (IS.of_list ys) in 37 + let got = IS.of_list (Ewah.to_indices tu) in 38 + if not (IS.equal expected got) then 39 + failf "union mismatch: expected=%d got=%d" (IS.cardinal expected) 40 + (IS.cardinal got) 41 + 42 + (** inter matches set-theoretic intersection. *) 43 + let test_inter xs ys = 44 + let xs = sanitize_indices xs in 45 + let ys = sanitize_indices ys in 46 + let ta = Ewah.of_indices xs in 47 + let tb = Ewah.of_indices ys in 48 + let ti = Ewah.inter ta tb in 49 + let module IS = Set.Make (Int) in 50 + let expected = IS.inter (IS.of_list xs) (IS.of_list ys) in 51 + let got = IS.of_list (Ewah.to_indices ti) in 52 + if not (IS.equal expected got) then 53 + failf "inter mismatch: expected=%d got=%d" (IS.cardinal expected) 54 + (IS.cardinal got) 55 + 56 + (** diff matches set-theoretic difference. *) 57 + let test_diff xs ys = 58 + let xs = sanitize_indices xs in 59 + let ys = sanitize_indices ys in 60 + let ta = Ewah.of_indices xs in 61 + let tb = Ewah.of_indices ys in 62 + let td = Ewah.diff ta tb in 63 + let module IS = Set.Make (Int) in 64 + let expected = IS.diff (IS.of_list xs) (IS.of_list ys) in 65 + let got = IS.of_list (Ewah.to_indices td) in 66 + if not (IS.equal expected got) then 67 + failf "diff mismatch: expected=%d got=%d" (IS.cardinal expected) 68 + (IS.cardinal got) 69 + 70 + (** mem agrees with to_indices. *) 71 + let test_mem_consistency xs = 72 + let xs = sanitize_indices xs in 73 + let t = Ewah.of_indices xs in 74 + let expected = List.sort_uniq compare xs in 75 + List.iter 76 + (fun i -> if not (Ewah.mem t i) then failf "mem %d should be true" i) 77 + expected 78 + 79 + let suite = 80 + ( "ewah", 81 + [ 82 + test_case "of_bytes crash safety" [ bytes ] test_of_bytes_crash_safety; 83 + test_case "roundtrip" [ list int ] test_roundtrip; 84 + test_case "union" [ list int; list int ] test_union; 85 + test_case "inter" [ list int; list int ] test_inter; 86 + test_case "diff" [ list int; list int ] test_diff; 87 + test_case "mem consistency" [ list int ] test_mem_consistency; 88 + ] )
+7
fuzz/fuzz_ewah.mli
··· 1 + (** Fuzz test suite for EWAH bitmaps. 2 + 3 + Covers [of_bytes] crash safety on arbitrary inputs and set-algebra 4 + equivalence with a reference [Set.Make (Int)]. *) 5 + 6 + val suite : string * Alcobar.test_case list 7 + (** [suite] is the Crowbar/AFL test suite for {!Ewah}. *)
+4
lib/dune
··· 1 + (library 2 + (public_name ewah) 3 + (name ewah) 4 + (libraries bytesrw fmt))
+246
lib/ewah.ml
··· 1 + (** EWAH-64 compressed bitmaps. *) 2 + 3 + (* Internal representation: the logical length and the set of set bit 4 + indices. All set operations work on the sorted-index form. 5 + Compression happens only at serialization time. Simple, correct, 6 + easy to test; the in-memory rep can be swapped for a word-packed 7 + buffer later without changing the API. *) 8 + 9 + module I = Set.Make (Int) 10 + 11 + type t = { length : int; bits : I.t } 12 + 13 + let empty = { length = 0; bits = I.empty } 14 + let length t = t.length 15 + let cardinal t = I.cardinal t.bits 16 + let validate i = if i < 0 then Fmt.invalid_arg "Ewah: negative index %d" i 17 + 18 + let mem t i = 19 + validate i; 20 + I.mem i t.bits 21 + 22 + let add t i = 23 + validate i; 24 + { length = max t.length (i + 1); bits = I.add i t.bits } 25 + 26 + let of_indices xs = 27 + List.iter validate xs; 28 + match xs with 29 + | [] -> empty 30 + | _ -> 31 + let max_i = List.fold_left max 0 xs in 32 + { length = max_i + 1; bits = I.of_list xs } 33 + 34 + let to_indices t = I.to_list t.bits 35 + let iter f t = I.iter f t.bits 36 + let fold f t init = I.fold f t.bits init 37 + let union a b = { length = max a.length b.length; bits = I.union a.bits b.bits } 38 + let inter a b = { length = max a.length b.length; bits = I.inter a.bits b.bits } 39 + let diff a b = { length = a.length; bits = I.diff a.bits b.bits } 40 + let equal a b = a.length = b.length && I.equal a.bits b.bits 41 + 42 + let pp ppf t = 43 + Fmt.pf ppf "@[<hv 2>Ewah{length=%d;@ cardinal=%d;@ indices=[%a]}@]" t.length 44 + (cardinal t) 45 + Fmt.(list ~sep:(any ";@ ") int) 46 + (to_indices t) 47 + 48 + (* ── EWAH-64 wire format ─────────────────────────────────────────── *) 49 + 50 + (* File layout (git-compatible): 51 + uint32 BE bit_size -- logical number of bits 52 + uint32 BE rlw_count -- number of 64-bit buffer words 53 + rlw_count x uint64 BE -- buffer (RLWs + literal words) 54 + uint32 BE last_rlw_pos -- word index of last RLW in buffer 55 + 56 + An RLW (run-length word) encodes: 57 + bit 0 : running bit (0 or 1) 58 + bits 1..32 : running length (count of clean words, 32 bits) 59 + bits 33..63 : literal count (count of literal words, 31 bits) 60 + followed by [literal count] uncompressed 64-bit words. 61 + 62 + A "clean" word is all-zeros or all-ones; a "literal" word is 63 + anything else. *) 64 + 65 + let max_run = 0xFFFFFFFF (* 2^32 - 1 *) 66 + let max_lit = 0x7FFFFFFF (* 2^31 - 1 *) 67 + let all_ones = Int64.minus_one 68 + 69 + let encode_rlw ~running_bit ~run_len ~lit_count = 70 + let rb = if running_bit then 1 else 0 in 71 + Int64.logor 72 + (Int64.logor (Int64.of_int rb) (Int64.shift_left (Int64.of_int run_len) 1)) 73 + (Int64.shift_left (Int64.of_int lit_count) 33) 74 + 75 + let decode_rlw w = 76 + let running_bit = Int64.to_int (Int64.logand w 1L) = 1 in 77 + let run_len = 78 + Int64.to_int (Int64.logand (Int64.shift_right_logical w 1) 0xFFFFFFFFL) 79 + in 80 + let lit_count = 81 + Int64.to_int (Int64.logand (Int64.shift_right_logical w 33) 0x7FFFFFFFL) 82 + in 83 + (running_bit, run_len, lit_count) 84 + 85 + let words_of_bits length bits = 86 + let wc = (length + 63) / 64 in 87 + let words = Array.make wc 0L in 88 + I.iter 89 + (fun i -> 90 + let wi = i / 64 in 91 + let bi = i mod 64 in 92 + words.(wi) <- Int64.logor words.(wi) (Int64.shift_left 1L bi)) 93 + bits; 94 + words 95 + 96 + let is_zero w = Int64.equal w 0L 97 + let is_ones w = Int64.equal w all_ones 98 + let is_clean w = is_zero w || is_ones w 99 + 100 + let write_u64 buf v = 101 + let b = Bytes.create 8 in 102 + Bytes.set_int64_be b 0 v; 103 + Buffer.add_bytes buf b 104 + 105 + let scan_clean_run words n start = 106 + let w = words.(start) in 107 + if is_zero w then ( 108 + let j = ref start in 109 + while !j < n && is_zero words.(!j) && !j - start < max_run do 110 + incr j 111 + done; 112 + (false, !j - start)) 113 + else if is_ones w then ( 114 + let j = ref start in 115 + while !j < n && is_ones words.(!j) && !j - start < max_run do 116 + incr j 117 + done; 118 + (true, !j - start)) 119 + else (false, 0) 120 + 121 + let scan_literal_run words n start = 122 + let j = ref start in 123 + while !j < n && !j - start < max_lit && not (is_clean words.(!j)) do 124 + incr j 125 + done; 126 + !j - start 127 + 128 + let emit_block buf words ~rb ~run_len ~lit_start ~lit_count = 129 + write_u64 buf (encode_rlw ~running_bit:rb ~run_len ~lit_count); 130 + for j = 0 to lit_count - 1 do 131 + write_u64 buf words.(lit_start + j) 132 + done 133 + 134 + let encode_blocks words n = 135 + let buf = Buffer.create ((n * 8) + 16) in 136 + let last_rlw_pos = ref 0 in 137 + let i = ref 0 in 138 + let emitted = ref false in 139 + while !i < n || not !emitted do 140 + let rb, run_len = 141 + if !i >= n then (false, 0) else scan_clean_run words n !i 142 + in 143 + i := !i + run_len; 144 + let lit_start = !i in 145 + let lit_count = scan_literal_run words n !i in 146 + i := !i + lit_count; 147 + last_rlw_pos := Buffer.length buf / 8; 148 + emit_block buf words ~rb ~run_len ~lit_start ~lit_count; 149 + emitted := true 150 + done; 151 + (buf, !last_rlw_pos) 152 + 153 + let to_bytes t = 154 + let words = words_of_bits t.length t.bits in 155 + let n = Array.length words in 156 + let buf, last_rlw_pos = encode_blocks words n in 157 + let rlw_count = Buffer.length buf / 8 in 158 + let out = Bytes.create (4 + 4 + Buffer.length buf + 4) in 159 + Bytes.set_int32_be out 0 (Int32.of_int t.length); 160 + Bytes.set_int32_be out 4 (Int32.of_int rlw_count); 161 + Bytes.blit (Buffer.to_bytes buf) 0 out 8 (Buffer.length buf); 162 + Bytes.set_int32_be out (8 + Buffer.length buf) (Int32.of_int last_rlw_pos); 163 + out 164 + 165 + (* ── Parser ──────────────────────────────────────────────────────── *) 166 + 167 + exception Parse_error of string 168 + 169 + let fail fmt = Fmt.kstr (fun s -> raise (Parse_error s)) fmt 170 + 171 + let parse_header b = 172 + let total = Bytes.length b in 173 + if total < 12 then fail "too short: %d bytes (need >= 12)" total; 174 + let bit_size = Int32.to_int (Bytes.get_int32_be b 0) in 175 + let rlw_count = Int32.to_int (Bytes.get_int32_be b 4) in 176 + if bit_size < 0 then fail "negative bit_size %d" bit_size; 177 + if rlw_count < 0 then fail "negative rlw_count %d" rlw_count; 178 + let expected = 8 + (rlw_count * 8) + 4 in 179 + if total <> expected then 180 + fail "buffer size mismatch: total=%d rlw_count=%d expected=%d" total 181 + rlw_count expected; 182 + (bit_size, rlw_count) 183 + 184 + let apply_clean_run ~rb ~run_len ~word_count words out = 185 + for _ = 1 to run_len do 186 + if !out < word_count then begin 187 + words.(!out) <- (if rb then all_ones else 0L); 188 + incr out 189 + end 190 + else if rb then fail "clean-ones run past bit_size" 191 + done 192 + 193 + let apply_literals ~rlw_count ~lit_count ~word_count ~read_word words pos out = 194 + for _ = 1 to lit_count do 195 + if !pos >= rlw_count then fail "buffer truncated in literal run"; 196 + let w = read_word !pos in 197 + if !out < word_count then begin 198 + words.(!out) <- w; 199 + incr out 200 + end 201 + else if not (Int64.equal w 0L) then fail "non-zero literal past bit_size"; 202 + incr pos 203 + done 204 + 205 + let decode_buffer ~bit_size ~rlw_count b = 206 + let read_word off = Bytes.get_int64_be b (8 + (off * 8)) in 207 + let word_count = (bit_size + 63) / 64 in 208 + let words = Array.make (max word_count 0) 0L in 209 + let pos = ref 0 in 210 + let out = ref 0 in 211 + while !pos < rlw_count do 212 + let rlw = read_word !pos in 213 + incr pos; 214 + let rb, run_len, lit_count = decode_rlw rlw in 215 + apply_clean_run ~rb ~run_len ~word_count words out; 216 + apply_literals ~rlw_count ~lit_count ~word_count ~read_word words pos out 217 + done; 218 + (words, word_count) 219 + 220 + let indices_of_words ~bit_size ~word_count words = 221 + let indices = ref I.empty in 222 + for wi = 0 to word_count - 1 do 223 + let w = words.(wi) in 224 + if not (is_zero w) then 225 + for bi = 0 to 63 do 226 + let idx = (wi * 64) + bi in 227 + if 228 + idx < bit_size 229 + && not 230 + (Int64.equal 231 + (Int64.logand (Int64.shift_right_logical w bi) 1L) 232 + 0L) 233 + then indices := I.add idx !indices 234 + done 235 + done; 236 + !indices 237 + 238 + let of_bytes b = 239 + try 240 + let bit_size, rlw_count = parse_header b in 241 + let words, word_count = decode_buffer ~bit_size ~rlw_count b in 242 + let bits = indices_of_words ~bit_size ~word_count words in 243 + Ok { length = bit_size; bits } 244 + with 245 + | Parse_error m -> Error (`Msg m) 246 + | Invalid_argument m -> Error (`Msg m)
+95
lib/ewah.mli
··· 1 + (** Enhanced Word-Aligned Hybrid (EWAH) compressed bitmaps. 2 + 3 + EWAH is a run-length encoding for sparse bit sets. It represents a bit 4 + vector as an alternating sequence of {e clean} words (all-zeros or all-ones, 5 + with a run count) and {e dirty} words (arbitrary literal 64-bit payload). 6 + Set algebra runs directly on the compressed form in time proportional to the 7 + compressed size, not the bit-vector length. 8 + 9 + The on-disk format is the same as the one produced by git's [.bitmap] files 10 + (EWAH-64, big-endian), so bitmaps written by this library are readable by 11 + git and vice versa. 12 + 13 + Bit indices are zero-based. A bitmap has a logical length (number of tracked 14 + bits); bits past the length are considered unset. 15 + 16 + Reference: 17 + {{:https://arxiv.org/abs/0901.3751} Lemire, Kaser, Aouiche, "Sorting 18 + improves word-aligned bitmap indexes"} (2009). *) 19 + 20 + type t 21 + (** A compressed bitmap. *) 22 + 23 + val empty : t 24 + (** [empty] is the bitmap with no bits set and zero length. *) 25 + 26 + val length : t -> int 27 + (** [length t] is the logical number of bits tracked. Bits at indices in 28 + [\[0, length t)] are either set or unset; bits beyond [length t] are 29 + considered unset. Adding a bit at index [i >= length t] extends the length 30 + to [i + 1]. *) 31 + 32 + val cardinal : t -> int 33 + (** [cardinal t] is the number of set bits. *) 34 + 35 + val mem : t -> int -> bool 36 + (** [mem t i] is [true] when bit [i] is set. 37 + 38 + @raise Invalid_argument if [i < 0]. *) 39 + 40 + val add : t -> int -> t 41 + (** [add t i] is [t] with bit [i] set. Extends [length t] to [i + 1] if 42 + [i >= length t]. 43 + 44 + @raise Invalid_argument if [i < 0]. *) 45 + 46 + val of_indices : int list -> t 47 + (** [of_indices xs] is the bitmap with exactly the bits in [xs] set. Duplicates 48 + and order do not matter. 49 + 50 + @raise Invalid_argument if any element is negative. *) 51 + 52 + val to_indices : t -> int list 53 + (** [to_indices t] is the sorted, duplicate-free list of set bit indices. *) 54 + 55 + val iter : (int -> unit) -> t -> unit 56 + (** [iter f t] calls [f i] for each set bit [i] in ascending order. *) 57 + 58 + val fold : (int -> 'a -> 'a) -> t -> 'a -> 'a 59 + (** [fold f t init] folds [f] over set bit indices in ascending order. *) 60 + 61 + (** {1 Set algebra} *) 62 + 63 + val union : t -> t -> t 64 + (** [union a b] is the bitmap with every bit set in either [a] or [b]. Length is 65 + [max (length a) (length b)]. *) 66 + 67 + val inter : t -> t -> t 68 + (** [inter a b] is the bitmap with every bit set in both [a] and [b]. Length is 69 + [max (length a) (length b)]. *) 70 + 71 + val diff : t -> t -> t 72 + (** [diff a b] is the bitmap with every bit set in [a] but not in [b]. Length is 73 + [length a]. *) 74 + 75 + val equal : t -> t -> bool 76 + (** [equal a b] is [true] when [a] and [b] have the same length and the same set 77 + of set bits. *) 78 + 79 + (** {1 Serialization} 80 + 81 + The wire format is EWAH-64 big-endian, identical to git's [.bitmap] files. 82 + *) 83 + 84 + val to_bytes : t -> bytes 85 + (** [to_bytes t] serializes [t] into the EWAH-64 big-endian wire format. *) 86 + 87 + val of_bytes : bytes -> (t, [ `Msg of string ]) result 88 + (** [of_bytes b] parses an EWAH-64 big-endian serialization. Returns 89 + [Error (`Msg _)] on truncated or inconsistent input. Never raises. *) 90 + 91 + (** {1 Pretty-printing} *) 92 + 93 + val pp : t Fmt.t 94 + (** [pp] prints a compact representation of a bitmap: length, cardinality, and 95 + the sorted list of set indices. *)
+27
test/cram/cli.t/run.t
··· 1 + EWAH command-line demo 2 + ======================= 3 + 4 + Empty bitmap serializes to a minimal header + one empty RLW (20 bytes): 5 + 6 + $ demo.exe empty 7 + cardinal=0 length=0 serialized=20 8 + 9 + Sparse bitmap (100 bits spread across ~1M positions) compresses well. 10 + Uncompressed, the word array would need ~125KB; EWAH produces under 2KB 11 + and roundtrips exactly: 12 + 13 + $ demo.exe sparse 14 + cardinal=100 length=990001 serialized=1612 15 + roundtrip=ok 16 + 17 + Dense bitmap (1000 consecutive set bits) compresses the all-ones run 18 + into a single RLW: 19 + 20 + $ demo.exe dense 21 + cardinal=1000 length=1000 serialized=28 22 + 23 + Union merges two bitmaps and preserves all set bits: 24 + 25 + $ demo.exe union 26 + a=4 b=4 u=6 27 + 1 2 10 100 1000 10000
+4
test/cram/dune
··· 1 + (cram 2 + (applies_to :whole_subtree) 3 + (deps helpers/demo.exe) 4 + (setup_scripts helpers.sh))
+17
test/cram/helpers.sh
··· 1 + #!/bin/sh 2 + # Sourced before every cram test under test/cram/*.t/ via 3 + # dune's (setup_scripts helpers.sh). Exports the helper exe on PATH 4 + # and provides a scrub filter for non-deterministic output. 5 + 6 + # Put driver exe(s) on PATH; dune deposits (deps helpers/demo.exe) at 7 + # ../helpers/demo.exe relative to each cram test's cwd. 8 + export PATH="$PWD/../helpers:$PATH" 9 + 10 + # scrub: filter stdin by replacing volatile paths and times with 11 + # stable placeholders. Use as `./tool | scrub`. 12 + scrub() { 13 + sed \ 14 + -e "s|$PWD|\$PWD|g" \ 15 + -e "s|$HOME|\$HOME|g" \ 16 + -e "s|/tmp/[a-zA-Z0-9_./-]*|\$TMP|g" 17 + }
+36
test/cram/helpers/demo.ml
··· 1 + (** Small EWAH demo exercised by run.t. *) 2 + 3 + let () = 4 + match Sys.argv with 5 + | [| _; "sparse" |] -> ( 6 + let xs = List.init 100 (fun i -> i * 10_000) in 7 + let t = Ewah.of_indices xs in 8 + let bytes = Ewah.to_bytes t in 9 + Fmt.pr "cardinal=%d length=%d serialized=%d@." (Ewah.cardinal t) 10 + (Ewah.length t) (Bytes.length bytes); 11 + match Ewah.of_bytes bytes with 12 + | Ok t' when Ewah.equal t t' -> Fmt.pr "roundtrip=ok@." 13 + | Ok _ -> Fmt.pr "roundtrip=mismatch@." 14 + | Error (`Msg m) -> Fmt.pr "roundtrip=error: %s@." m) 15 + | [| _; "dense" |] -> 16 + let xs = List.init 1000 (fun i -> i) in 17 + let t = Ewah.of_indices xs in 18 + let bytes = Ewah.to_bytes t in 19 + Fmt.pr "cardinal=%d length=%d serialized=%d@." (Ewah.cardinal t) 20 + (Ewah.length t) (Bytes.length bytes) 21 + | [| _; "empty" |] -> 22 + let t = Ewah.empty in 23 + let bytes = Ewah.to_bytes t in 24 + Fmt.pr "cardinal=%d length=%d serialized=%d@." (Ewah.cardinal t) 25 + (Ewah.length t) (Bytes.length bytes) 26 + | [| _; "union" |] -> 27 + let a = Ewah.of_indices [ 1; 10; 100; 1000 ] in 28 + let b = Ewah.of_indices [ 2; 10; 1000; 10_000 ] in 29 + let u = Ewah.union a b in 30 + Fmt.pr "a=%d b=%d u=%d@." (Ewah.cardinal a) (Ewah.cardinal b) 31 + (Ewah.cardinal u); 32 + print_endline 33 + (String.concat " " (List.map string_of_int (Ewah.to_indices u))) 34 + | _ -> 35 + Fmt.pr "usage: demo {sparse|dense|empty|union}@."; 36 + exit 2
+3
test/cram/helpers/dune
··· 1 + (executable 2 + (name demo) 3 + (libraries ewah fmt))
+3
test/dune
··· 1 + (test 2 + (name test) 3 + (libraries ewah alcotest fmt))
+1
test/test.ml
··· 1 + let () = Alcotest.run "ewah" [ Test_ewah.suite ]
+291
test/test_ewah.ml
··· 1 + (** EWAH bitmap tests. *) 2 + 3 + let pp_indices = Fmt.(brackets (list ~sep:(any ";@ ") int)) 4 + 5 + let check_indices msg expected t = 6 + Alcotest.(check (list int)) msg expected (Ewah.to_indices t) 7 + 8 + let check_roundtrip t = 9 + let bytes = Ewah.to_bytes t in 10 + match Ewah.of_bytes bytes with 11 + | Error (`Msg m) -> Alcotest.failf "roundtrip %a failed: %s" Ewah.pp t m 12 + | Ok t' -> 13 + if not (Ewah.equal t t') then 14 + Alcotest.failf 15 + "roundtrip changed bitmap:@ before=%a@ after=%a@ bytes=%d" Ewah.pp t 16 + Ewah.pp t' (Bytes.length bytes) 17 + 18 + (* ── Basic construction ──────────────────────────────────────────── *) 19 + 20 + let test_empty () = 21 + let t = Ewah.empty in 22 + Alcotest.(check int) "length" 0 (Ewah.length t); 23 + Alcotest.(check int) "cardinal" 0 (Ewah.cardinal t); 24 + check_indices "indices" [] t 25 + 26 + let test_add_single () = 27 + let t = Ewah.add Ewah.empty 5 in 28 + Alcotest.(check int) "length" 6 (Ewah.length t); 29 + Alcotest.(check int) "cardinal" 1 (Ewah.cardinal t); 30 + Alcotest.(check bool) "mem 5" true (Ewah.mem t 5); 31 + Alcotest.(check bool) "mem 4" false (Ewah.mem t 4); 32 + Alcotest.(check bool) "mem 6" false (Ewah.mem t 6); 33 + check_indices "indices" [ 5 ] t 34 + 35 + let test_add_many () = 36 + let indices = [ 0; 3; 7; 63; 64; 127; 128; 1000 ] in 37 + let t = Ewah.of_indices indices in 38 + Alcotest.(check int) "length" 1001 (Ewah.length t); 39 + Alcotest.(check int) "cardinal" (List.length indices) (Ewah.cardinal t); 40 + check_indices "indices sorted" indices t; 41 + List.iter 42 + (fun i -> Alcotest.(check bool) (Fmt.str "mem %d" i) true (Ewah.mem t i)) 43 + indices; 44 + Alcotest.(check bool) "mem 1 (unset)" false (Ewah.mem t 1); 45 + Alcotest.(check bool) "mem 62 (unset)" false (Ewah.mem t 62); 46 + Alcotest.(check bool) "mem 999 (unset)" false (Ewah.mem t 999) 47 + 48 + let test_of_indices_dedup () = 49 + let t = Ewah.of_indices [ 3; 3; 1; 3; 1 ] in 50 + check_indices "dedup+sort" [ 1; 3 ] t 51 + 52 + let test_of_indices_negative () = 53 + match Ewah.of_indices [ 1; -1; 3 ] with 54 + | exception Invalid_argument _ -> () 55 + | _ -> Alcotest.fail "expected Invalid_argument for negative index" 56 + 57 + let test_add_negative () = 58 + match Ewah.add Ewah.empty (-5) with 59 + | exception Invalid_argument _ -> () 60 + | _ -> Alcotest.fail "expected Invalid_argument for negative index" 61 + 62 + let test_mem_negative () = 63 + match Ewah.mem Ewah.empty (-1) with 64 + | exception Invalid_argument _ -> () 65 + | _ -> Alcotest.fail "expected Invalid_argument for negative index" 66 + 67 + (* ── Set algebra ─────────────────────────────────────────────────── *) 68 + 69 + let test_union () = 70 + let a = Ewah.of_indices [ 1; 3; 5 ] in 71 + let b = Ewah.of_indices [ 2; 3; 100 ] in 72 + let u = Ewah.union a b in 73 + check_indices "union" [ 1; 2; 3; 5; 100 ] u; 74 + Alcotest.(check int) "length" 101 (Ewah.length u) 75 + 76 + let test_inter () = 77 + let a = Ewah.of_indices [ 1; 3; 5; 7 ] in 78 + let b = Ewah.of_indices [ 3; 7; 9 ] in 79 + let i = Ewah.inter a b in 80 + check_indices "inter" [ 3; 7 ] i 81 + 82 + let test_diff () = 83 + let a = Ewah.of_indices [ 1; 3; 5; 7 ] in 84 + let b = Ewah.of_indices [ 3; 7 ] in 85 + let d = Ewah.diff a b in 86 + check_indices "diff" [ 1; 5 ] d; 87 + Alcotest.(check int) "length unchanged" 8 (Ewah.length d) 88 + 89 + let test_union_with_empty () = 90 + let a = Ewah.of_indices [ 1; 2; 3 ] in 91 + let u = Ewah.union a Ewah.empty in 92 + Alcotest.(check bool) "union with empty = a" true (Ewah.equal u a) 93 + 94 + let test_inter_with_empty () = 95 + let a = Ewah.of_indices [ 1; 2; 3 ] in 96 + let i = Ewah.inter a Ewah.empty in 97 + Alcotest.(check int) "inter with empty is empty" 0 (Ewah.cardinal i) 98 + 99 + (* ── Iteration ───────────────────────────────────────────────────── *) 100 + 101 + let test_iter_ordered () = 102 + let t = Ewah.of_indices [ 5; 1; 7; 3 ] in 103 + let collected = ref [] in 104 + Ewah.iter (fun i -> collected := i :: !collected) t; 105 + Alcotest.(check (list int)) 106 + "iter ascending" [ 1; 3; 5; 7 ] (List.rev !collected) 107 + 108 + let test_fold_sum () = 109 + let t = Ewah.of_indices [ 1; 3; 5 ] in 110 + let sum = Ewah.fold ( + ) t 0 in 111 + Alcotest.(check int) "fold sum" 9 sum 112 + 113 + (* ── Serialization roundtrip ─────────────────────────────────────── *) 114 + 115 + let test_roundtrip_empty () = check_roundtrip Ewah.empty 116 + 117 + let test_roundtrip_single () = 118 + check_roundtrip (Ewah.of_indices [ 0 ]); 119 + check_roundtrip (Ewah.of_indices [ 63 ]); 120 + check_roundtrip (Ewah.of_indices [ 64 ]); 121 + check_roundtrip (Ewah.of_indices [ 128 ]) 122 + 123 + let test_roundtrip_sparse () = 124 + check_roundtrip (Ewah.of_indices [ 0; 1000; 10_000; 100_000 ]) 125 + 126 + let test_roundtrip_dense () = 127 + let dense = List.init 1000 (fun i -> i) in 128 + check_roundtrip (Ewah.of_indices dense) 129 + 130 + let test_roundtrip_allones_word () = 131 + let indices = List.init 64 (fun i -> i) in 132 + check_roundtrip (Ewah.of_indices indices) 133 + 134 + let test_roundtrip_allones_run () = 135 + (* Two consecutive all-ones words: should RLE-compress to one RLW *) 136 + let indices = List.init 128 (fun i -> i) in 137 + check_roundtrip (Ewah.of_indices indices) 138 + 139 + let test_roundtrip_run_straddle_boundary () = 140 + (* Bits 60..67 straddle the 64-bit word boundary *) 141 + let indices = List.init 8 (fun i -> 60 + i) in 142 + check_roundtrip (Ewah.of_indices indices) 143 + 144 + let test_of_bytes_too_short () = 145 + match Ewah.of_bytes (Bytes.create 4) with 146 + | Error (`Msg _) -> () 147 + | Ok _ -> Alcotest.fail "expected Error for 4-byte input" 148 + 149 + let test_of_bytes_garbage () = 150 + let b = Bytes.create 12 in 151 + (* bit_size = 1000 but rlw_count = 0 -> invalid buffer size *) 152 + Bytes.set_int32_be b 0 1000l; 153 + Bytes.set_int32_be b 4 0l; 154 + Bytes.set_int32_be b 8 0l; 155 + match Ewah.of_bytes b with 156 + | Ok t when Ewah.cardinal t = 0 -> 157 + (* A buffer with 0 RLWs and bit_size=1000 may be valid (no set 158 + bits; the decoder just has nothing to decode). Accept either. *) 159 + Alcotest.(check int) "no bits" 0 (Ewah.cardinal t) 160 + | Ok _ -> Alcotest.fail "unexpected bits in empty buffer" 161 + | Error (`Msg _) -> () 162 + 163 + let test_rlw_count_overflow () = 164 + let b = Bytes.create 12 in 165 + Bytes.set_int32_be b 0 64l; 166 + Bytes.set_int32_be b 4 100l; 167 + (* claims 100 RLWs but buffer is 0 bytes *) 168 + Bytes.set_int32_be b 8 0l; 169 + match Ewah.of_bytes b with 170 + | Error (`Msg _) -> () 171 + | Ok _ -> Alcotest.fail "expected Error for truncated buffer" 172 + 173 + (* ── Equality ────────────────────────────────────────────────────── *) 174 + 175 + let test_equal () = 176 + let a = Ewah.of_indices [ 1; 2; 3 ] in 177 + let b = Ewah.of_indices [ 3; 1; 2 ] in 178 + let c = Ewah.of_indices [ 1; 2; 4 ] in 179 + Alcotest.(check bool) "equal same" true (Ewah.equal a b); 180 + Alcotest.(check bool) "equal different" false (Ewah.equal a c); 181 + Alcotest.(check bool) "equal reflexive" true (Ewah.equal a a) 182 + 183 + (* ── Hostile cases (inlined, not a submodule) ────────────────────── *) 184 + 185 + (* Each entry is (name, test_fn). They run as normal alcotest cases, 186 + but we keep them in a single list so it's obvious what we stress. *) 187 + let hostile_cases = 188 + [ 189 + ( "empty serializes to 12 bytes minimum", 190 + fun () -> 191 + let b = Ewah.to_bytes Ewah.empty in 192 + Alcotest.(check bool) "at least 12 bytes" true (Bytes.length b >= 12); 193 + match Ewah.of_bytes b with 194 + | Error (`Msg m) -> Alcotest.failf "empty roundtrip: %s" m 195 + | Ok t -> 196 + Alcotest.(check int) "cardinal" 0 (Ewah.cardinal t); 197 + Alcotest.(check int) "length" 0 (Ewah.length t) ); 198 + ("bit 0 only", fun () -> check_roundtrip (Ewah.of_indices [ 0 ])); 199 + ( "bit 1_000_000 only", 200 + fun () -> check_roundtrip (Ewah.of_indices [ 1_000_000 ]) ); 201 + ( "alternating bits for 1024 positions", 202 + fun () -> 203 + let xs = List.init 512 (fun i -> 2 * i) in 204 + check_roundtrip (Ewah.of_indices xs) ); 205 + ( "1000 consecutive set bits", 206 + fun () -> 207 + let xs = List.init 1000 (fun i -> i) in 208 + check_roundtrip (Ewah.of_indices xs) ); 209 + ( "zero run followed by literal", 210 + fun () -> 211 + (* Bits at positions 128 and 129 only -> two zero words then 212 + one literal word. *) 213 + check_roundtrip (Ewah.of_indices [ 128; 129 ]) ); 214 + ( "ones run followed by literal", 215 + fun () -> 216 + (* Fill words 0 and 1 completely (128 ones), then set bit 130. *) 217 + let ones = List.init 128 (fun i -> i) in 218 + check_roundtrip (Ewah.of_indices (ones @ [ 130 ])) ); 219 + ( "literal followed by zero run", 220 + fun () -> 221 + (* Bit 10 then bit 10_000 -> literal word then zero run. *) 222 + check_roundtrip (Ewah.of_indices [ 10; 10_000 ]) ); 223 + ( "double serialization is idempotent", 224 + fun () -> 225 + let t = Ewah.of_indices [ 1; 3; 5; 63; 64; 127; 128; 256 ] in 226 + let b1 = Ewah.to_bytes t in 227 + match Ewah.of_bytes b1 with 228 + | Error (`Msg m) -> Alcotest.failf "reparse: %s" m 229 + | Ok t' -> 230 + let b2 = Ewah.to_bytes t' in 231 + Alcotest.(check int) 232 + "byte length equal" (Bytes.length b1) (Bytes.length b2); 233 + Alcotest.(check bool) "bytes equal" true (Bytes.equal b1 b2) ); 234 + ( "large sparse bitmap (1 bit every 1M positions, 100 times)", 235 + fun () -> 236 + let xs = List.init 100 (fun i -> i * 1_000_000) in 237 + check_roundtrip (Ewah.of_indices xs) ); 238 + ( "union preserves length of larger operand", 239 + fun () -> 240 + let a = Ewah.of_indices [ 1 ] in 241 + let b = Ewah.of_indices [ 1000 ] in 242 + let u = Ewah.union a b in 243 + Alcotest.(check int) "length = 1001" 1001 (Ewah.length u) ); 244 + ( "diff of identical sets is empty", 245 + fun () -> 246 + let a = Ewah.of_indices [ 1; 3; 5 ] in 247 + let d = Ewah.diff a a in 248 + Alcotest.(check int) "cardinal 0" 0 (Ewah.cardinal d) ); 249 + ] 250 + 251 + let wrap_hostile (name, f) : unit Alcotest.test_case = 252 + Alcotest.test_case ("hostile: " ^ name) `Quick f 253 + 254 + (* ── Suite ───────────────────────────────────────────────────────── *) 255 + 256 + let basic : unit Alcotest.test_case list = 257 + [ 258 + Alcotest.test_case "empty" `Quick test_empty; 259 + Alcotest.test_case "add single" `Quick test_add_single; 260 + Alcotest.test_case "add many" `Quick test_add_many; 261 + Alcotest.test_case "of_indices dedup" `Quick test_of_indices_dedup; 262 + Alcotest.test_case "of_indices negative" `Quick test_of_indices_negative; 263 + Alcotest.test_case "add negative" `Quick test_add_negative; 264 + Alcotest.test_case "mem negative" `Quick test_mem_negative; 265 + Alcotest.test_case "union" `Quick test_union; 266 + Alcotest.test_case "inter" `Quick test_inter; 267 + Alcotest.test_case "diff" `Quick test_diff; 268 + Alcotest.test_case "union with empty" `Quick test_union_with_empty; 269 + Alcotest.test_case "inter with empty" `Quick test_inter_with_empty; 270 + Alcotest.test_case "iter ordered" `Quick test_iter_ordered; 271 + Alcotest.test_case "fold sum" `Quick test_fold_sum; 272 + Alcotest.test_case "equal" `Quick test_equal; 273 + Alcotest.test_case "roundtrip empty" `Quick test_roundtrip_empty; 274 + Alcotest.test_case "roundtrip single" `Quick test_roundtrip_single; 275 + Alcotest.test_case "roundtrip sparse" `Quick test_roundtrip_sparse; 276 + Alcotest.test_case "roundtrip dense" `Quick test_roundtrip_dense; 277 + Alcotest.test_case "roundtrip all-ones word" `Quick 278 + test_roundtrip_allones_word; 279 + Alcotest.test_case "roundtrip all-ones run" `Quick 280 + test_roundtrip_allones_run; 281 + Alcotest.test_case "roundtrip straddle boundary" `Quick 282 + test_roundtrip_run_straddle_boundary; 283 + Alcotest.test_case "of_bytes too short" `Quick test_of_bytes_too_short; 284 + Alcotest.test_case "of_bytes garbage" `Quick test_of_bytes_garbage; 285 + Alcotest.test_case "of_bytes rlw_count too large" `Quick 286 + test_rlw_count_overflow; 287 + ] 288 + 289 + let suite = 290 + let _ = pp_indices in 291 + ("ewah", basic @ List.map wrap_hostile hostile_cases)
+2
test/test_ewah.mli
··· 1 + val suite : string * unit Alcotest.test_case list 2 + (** EWAH test suite. *)