commits
ocaml-protobuf/fuzz: drop bytesrw — flagged by Dead_lib, the fuzz
driver only uses alcobar/nox-protobuf.
ocaml-websocket/lib: extend the (mdx ...) stanza to also exercise
handshake.mli alongside websocket.mli.
Remove libraries declared in '(libraries ...)' clauses but unreferenced
by any module in the same source tree, as flagged by 'monopam lint'
after the new Dead_lib detection landed. Touches 131 dune files across
~80 packages.
A few stanzas needed a positive correction instead of a pure removal:
- ocaml-git/bin/diag: depended on eio_main + bytesrw-eio for an
Eio_posix.run call site; the umbrella was overkill, switch to the
precise eio_posix package.
- ocaml-scaleway/lib, ocaml-s3/lib: scaleway.mli / s3.mli reference
Eio_unix.Stdenv.base; eio.unix is required and was missing.
- merlint/lib: pull in bytesrw + nox-opam.bytesrw to expose
Opam_bytesrw, which rule e915 and the lint helpers use.
Stanzas where Dead_lib was a false positive (transitive dep needed
for module visibility, virtual-library impls) are left untouched —
e.g. helix.jx.jsoo for ocaml-globe/demo retains its (libraries ...)
entry because it provides the impl of the helix.jx virtual lib.
Run mdx on lib/protobuf.mli so the {[ ... ]} odoc block now type-checks
and the encode-decode round-trip is verified.
Renamed the local `person` value to `person_codec` so it doesn't
shadow the `person` type, bound `ada` once and reused it for both
the encode and the round-trip assertion. Asserted that the wire
output is non-empty and that decoding it via `Protobuf.of_string`
returns the original record; the error path reports via
`Fmt.failwith "%a" Protobuf.Error.pp`.
The READMEs all share the standard install/overlay snippet, but the
sh blocks lacked the "<!-- $MDX skip -->" directive. `dune test`
would shell out to `opam install` against the live switch, which
either prompts interactively or fails with a package conflict;
either way the mdx diff shows up as a test failure.
Bulk-add skip directives in front of every install/overlay block.
Also collapse the doubled "non-deterministic + skip" stack on five
READMEs (memtrace, ocaml-dpop, ocaml-pid1, ocaml-yaml, merlint) where
`skip` already implies the runtime is bypassed.
Each README's 'opam install <pkg>' instructions now match the post-rename
opam package names. Auto-generated by 'monopam lint --fix' after the
nox-* prefix landed on the underlying packages.
Extends the nox- prefix to the remaining encoding/codec packages —
none clash with opam-repository today, but the rule "blacksun forks
get nox-" applies the same way regardless of conflict status.
Renamed: json, xml, meta, opam, protobuf -> nox-*
Renames 35 packages to make blacksun forks distinguishable from their
opam-repository upstreams. Module names (Git.x, Tls.x, ...) stay bare;
opam package names and dune (public_name) findlib references move to
nox-X. After this commit, zero local package names overlap with
opam-repository.
Renamed:
- nox-git, nox-irmin
- nox-crypto, nox-crypto-pk, nox-crypto-rng, nox-crypto-ec
- nox-tls, nox-tls-eio, nox-tar, nox-tar-eio, nox-tty, nox-tty-eio
- nox-arp, nox-ca-certs, nox-cbor, nox-cookie, nox-crc, nox-csv
- nox-gpt, nox-hkdf, nox-http, nox-jwt, nox-kdf, nox-loc
- nox-memtrace, nox-pds, nox-sexp, nox-slack, nox-toml
- nox-websocket, nox-x509, nox-xdge, nox-yaml
Also drops orphan tar-mirage and tar-unix opam templates that had no
matching package stanza.
Pure formatting changes from `dune fmt`: doc comment placement moves
from above the binding to below it for `type`s, multi-line `match`
expressions collapse onto one line where they fit, and infix operator
applications pick up spaces (`Soup.($?)` -> `Soup.( $? )`). No
semantic changes.
Object combinators: [Object.mem] -> [Object.member], [Object.opt_mem]
-> [Object.opt_member], [Object.case_mem] -> [Object.case_member]. The
sibling submodules [Object.Mem] / [Object.Mems] become
[Object.Member] / [Object.Members]. RFC 8259 §4 calls these
"name/value pairs, referred to as the members", so mirror the spec
name rather than the shortened [mem].
[Object.finish] -> [Object.seal]. "Seal" reads as "close the map, no
more members added", which is what the operation does.
Value constructors/queries: [Value.mem] (function) -> [Value.member];
[Value.mem_find] -> [Value.member_key]; [Value.mem_names] ->
[Value.member_names]; [Value.mem_keys] -> [Value.member_keys].
[type mem = ...] -> [type member = ...]; [type object'] still points
at [member list].
Downstream (~80 files across slack, sbom, stripe, sigstore, requests,
claude, irmin, freebox) updated via perl-pie. dune build clean,
dune test ocaml-json clean.
This reverts commit 58d9592128efa768a384d69bcb6a544fa3b753ca.
Expose only what's actually called. kinded' / or_kind were added
speculatively across libraries that don't use them; revert.
- toml, xml, csv: keep to_string, pp, kinded (kinded has callers —
csv 3, toml 2, xml 1). Drop kinded' and or_kind.
- cbor, protobuf: keep to_string, pp only. Drop kinded, kinded',
or_kind (no callers).
- json keeps the full API — 38 callers including kinded' (16) and
or_kind (11).
Factor [kinded] through [kinded' ~kind s = if kind = "" then s else
kind ^ " " ^ s], matching the shape of json's and toml's Sort
modules. Consistent API across every encoding library's Sort: every
module exposes [to_string], [pp], [or_kind], [kinded'], [kinded].
protobuf previously lacked [or_kind] and [kinded] entirely; add them
alongside [kinded'].
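A minimal sketch of the shared Sort shape, for illustration only. Only
the body of [kinded'] is quoted from this message; the sort
constructors and the bodies of [kinded] and [or_kind] are assumptions
about how the other helpers plausibly relate to it.

```ocaml
(* Hypothetical Sort module; constructor names are illustrative. *)
module Sort = struct
  type t = Varint | Fixed32 | Fixed64 | Length_delim | Message

  let to_string = function
    | Varint -> "varint"
    | Fixed32 -> "fixed32"
    | Fixed64 -> "fixed64"
    | Length_delim -> "length-delimited"
    | Message -> "message"

  let pp ppf s = Format.pp_print_string ppf (to_string s)

  (* Quoted from the commit message. *)
  let kinded' ~kind s = if kind = "" then s else kind ^ " " ^ s

  (* Assumption: [kinded] is [kinded'] applied to the sort's name. *)
  let kinded ~kind sort = kinded' ~kind (to_string sort)

  (* Assumption: prefer the kind, fall back to the sort's name. *)
  let or_kind ~kind sort = if kind = "" then to_string sort else kind
end

let () =
  print_endline (Sort.kinded ~kind:"" Sort.Varint);
  print_endline (Sort.kinded ~kind:"timestamp" Sort.Message)
```

With an empty kind the helpers degrade to the bare sort name, so
callers never need to special-case unnamed values.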
Add the missing pieces to Protobuf.Error so callers don't need to reach
into Loc.Error directly:
- Verb re-exports: v, msg, raise, fail, failf, push_array, push_object,
kind_to_string (were previously missing; users had to import Loc.Error
separately).
- Typed kinds: Sort_mismatch, Kinded_sort_mismatch (for schema-vs-payload
sort errors at the codec layer, complementing the existing wire-layer
kinds).
- Shape helpers: expected, sort, kinded_sort, missing_mems, unexpected_mems,
index_out_of_range, number_range, integer_range, no_decoder, no_encoder,
decode_todo, encode_todo.
Brings the protobuf facade in line with the json exemplar and toml.
Move Sort out of codec.ml into its own sort.ml at the top level. Sort
was previously private to the codec interpreter, even though it labels
error contexts exposed to users. Surface it at Protobuf.Sort so the
module structure matches the other encoding libraries (Json.Sort,
Xml.Sort, Toml.Sort, Cbor.Sort, Csv.Sort).
Schema-free layer on top of [Codec]: two new modules that together
let callers inspect, query, and rewrite arbitrary protobuf messages
without a typed schema.
- [Protobuf.Value] — AST over the four wire-type leaves plus a
[Message] variant ([(int * t) list] with wire-order-preserving
repeated tags). Per-node [Loc.Meta.t] sets up future byte-offset
tracking; for now everything is [Meta.none].
- constructors: [varint] / [fixed32] / [fixed64] / [length_delim]
/ [message]
- queries: [find : int -> t -> t option],
[find_all : int -> t -> t list],
[message_of : t -> t option] (re-parse a length-delim blob as
a nested message)
- IO: [of_string] / [of_string_exn] / [to_string]
- [pp] / [equal]
Length-delim blobs stay raw in the AST because schema-free parsing
cannot distinguish a string from bytes or from a nested message.
- [Protobuf.Cursor] — zipper over [Value.t] with ancestor stack:
[root] / [focus] / [up] / [top] / [set] / [down_field : int] /
[down_length_delim]. The last re-parses a length-delim leaf as a
message and descends, making multi-level traversals straightforward
once the caller knows a blob is a sub-message.
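The Value/Cursor split described above can be sketched as follows.
This is a simplified stand-in, not the library's code: it drops the
per-node [Loc.Meta.t], leaves out [down_length_delim] (which needs the
wire parser), and assumes [find] returns the first match in wire
order. The [frame] type is a hypothetical name for the ancestor-stack
entry.

```ocaml
(* Schema-free AST: four wire-type leaves plus a message node. *)
type value =
  | Varint of int64
  | Fixed32 of int32
  | Fixed64 of int64
  | Length_delim of string
  | Message of (int * value) list  (* wire order, repeated tags kept *)

(* Assumption: first occurrence of the tag in wire order. *)
let find tag = function
  | Message fields -> List.assoc_opt tag fields
  | _ -> None

let find_all tag = function
  | Message fields ->
      List.filter_map (fun (t, v) -> if t = tag then Some v else None) fields
  | _ -> []

(* One ancestor: siblings before the focus (reversed), its tag, and
   the siblings after it. *)
type frame = {
  before : (int * value) list;
  tag : int;
  after : (int * value) list;
}

type cursor = { focus : value; stack : frame list }

let root v = { focus = v; stack = [] }
let focus c = c.focus
let set v c = { c with focus = v }

(* Descend into the first field carrying [tag]. *)
let down_field tag c =
  match c.focus with
  | Message fields ->
      let rec go before = function
        | [] -> None
        | (t, v) :: after when t = tag ->
            Some { focus = v; stack = { before; tag; after } :: c.stack }
        | f :: after -> go (f :: before) after
      in
      go [] fields
  | _ -> None

(* Rebuild the parent message around the (possibly edited) focus. *)
let up c =
  match c.stack with
  | [] -> None
  | { before; tag; after } :: stack ->
      let fields = List.rev_append before ((tag, c.focus) :: after) in
      Some { focus = Message fields; stack }

let rec top c = match up c with None -> c | Some p -> top p

let () =
  let msg = Message [ (1, Varint 1L); (2, Message [ (3, Varint 7L) ]) ] in
  match Option.bind (down_field 2 (root msg)) (down_field 3) with
  | None -> assert false
  | Some c ->
      let edited = focus (top (set (Varint 8L) c)) in
      assert (edited = Message [ (1, Varint 1L); (2, Message [ (3, Varint 8L) ]) ])
```

Because [up] rebuilds each parent from its frame, an edit made deep in
the tree is reflected at the root without mutating the original value.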
FieldMask paths: [of_field_mask] / [to_field_mask] parse and
serialise dotted-integer paths (e.g. ["1.3.2"]). The protobuf spec's
[google.protobuf.FieldMask] uses field *names*; our schema-free
cursor only sees tags, so integers replace names.
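A rough sketch of the dotted-integer path format, under assumptions:
the real [of_field_mask] / [to_field_mask] signatures are not shown in
this message, so the option-returning parser, the rejection of empty
segments, and the requirement that tags be positive are all choices
made for this example. [tag_of_string] is a hypothetical helper.

```ocaml
(* Accept only plain decimal digits denoting a positive tag. *)
let tag_of_string s =
  let is_digit c = c >= '0' && c <= '9' in
  let chars = List.init (String.length s) (String.get s) in
  if s <> "" && List.for_all is_digit chars then
    match int_of_string_opt s with
    | Some n when n > 0 -> Some n
    | _ -> None
  else None

(* "1.3.2" -> Some [1; 3; 2]; any malformed segment yields None. *)
let of_field_mask s =
  let rec go acc = function
    | [] -> Some (List.rev acc)
    | p :: rest -> (
        match tag_of_string p with
        | Some t -> go (t :: acc) rest
        | None -> None)
  in
  go [] (String.split_on_char '.' s)

let to_field_mask tags = String.concat "." (List.map string_of_int tags)

let () =
  assert (of_field_mask "1.3.2" = Some [ 1; 3; 2 ]);
  assert (to_field_mask [ 1; 3; 2 ] = "1.3.2")
```

Each integer in the parsed path would then drive one descent step of
the cursor.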
Top-level [Protobuf] re-exports [module Value = Value] and
[module Cursor = Cursor].
Tests: three Value round-trips (encode via a schema codec, decode as
Value, re-encode, compare bytes) and four Cursor operations (root
focus, down+up, set+rebuild, FieldMask parse + nested length-delim
descent). All 60 unit + 17 fuzz + 2 protoc interop tests pass.
Move the codec type, the 15 scalar codecs, the [Message] combinators,
and [fix] into [lib/codec.ml] / [lib/codec.mli]. The top-level
[Protobuf] module now aliases [type 'a t = 'a Codec.t], re-exports
the scalars and [Message] for ergonomics, and keeps only the
reading/writing entry points ([of_string] / [to_string] /
[of_reader] / [to_writer] plus [_exn] twins).
Motivation: match the one-file-per-concern layout already used in
[ocaml-json], [ocaml-cbor], and [ocaml-xml]. A later split will
extract a [Value.t] AST and a [Cursor] zipper.
[codec.mli] exposes:
- [type 'a t] (abstract)
- the 15 scalar codecs
- [module Message] with required/optional/repeated/packed/map/oneof
- [fix]
The codec's four IO walkers ([encode_string] / [decode_string] /
[encode] / [decode] and the unknown-fields pair) are under a
[(**/**)] internal section because the top-level [Protobuf] module's
[of_*] / [to_*] functions are the stable public surface.
Files:
    lib/codec.ml      [new, 886 lines]
    lib/codec.mli     [new]
    lib/protobuf.ml   [rewritten: 52-line re-export + IO shim]
    lib/protobuf.mli  [rewritten to match]
All 53 unit + 17 fuzz + 2 protoc interop tests pass.
Match the naming pattern already used elsewhere in the monorepo
(ocaml-json, ocaml-cbor, ocaml-xml): [of_*] decodes, [to_*] encodes,
and each [of_*] has an [_exn] twin that raises {!Loc.Error} on the
error path.
    encode_string -> to_string
    decode_string -> of_string
    encode        -> to_writer
    decode        -> of_reader
New entries:
    val of_string_exn : 'a t -> string -> 'a
    val of_reader_exn : 'a t -> Bytesrw.Bytes.Reader.t -> 'a
Call sites in [test/test_protobuf.ml], [test/test_wire.ml],
[test/interop/protoc/test.ml] and [fuzz/fuzz_protobuf.ml] are
mechanically updated. The old names are gone; no backwards-compat
aliases.
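The result / [_exn] pairing can be sketched in isolation. Everything
here is a stand-in: [Decode_error] plays the role of [Loc.Error], and
the toy [of_string] parses an integer instead of running a codec, so
only the relationship between the two entry points is illustrated.

```ocaml
(* Stand-in for Loc.Error; the real exception carries a structured error. *)
exception Decode_error of string

(* Toy result-returning decoder standing in for a codec-driven [of_string]. *)
let of_string (s : string) : (int, string) result =
  match int_of_string_opt s with
  | Some n -> Ok n
  | None -> Error ("not an integer: " ^ s)

(* The [_exn] twin only converts the error branch into a raise. *)
let of_string_exn s =
  match of_string s with
  | Ok v -> v
  | Error e -> raise (Decode_error e)

let () =
  assert (of_string_exn "42" = 42);
  match of_string_exn "nope" with
  | _ -> assert false
  | exception Decode_error _ -> ()
```

Keeping the twin as a thin wrapper means both entry points share one
decoding path and cannot drift apart.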
All 53 unit + 17 fuzz + 2 protoc interop tests pass.
Merlint caps identifiers at 4 underscores; CVE-numbered test names
with year + CVE-number + description crossed it. Drop the year
component, keep just the CVE number prefix:
    test_cve_2015_5237_huge_length        -> test_cve5237_huge_length
    test_cve_2021_22569_many_small_groups -> test_cve22569_many_small
    test_cve_2022_3171_group_wire_type_4  -> test_cve3171_group_wt4
    ...
Also trim:
    test_unknowns_empty_when_schema_matches -> test_unknowns_empty_on_match
    test_int32_negative_is_10_bytes         -> test_int32_neg_10_bytes
Bodies and comments unchanged; CVE references stay in the docstring
comments so the provenance is still visible.
Merlint issue count drops from 57 to 25 (documentation nits remain).
Follow up to the module rename: update the remaining callers that
still referenced [Err] (library [claude.ml{,i}], [client.ml], the test
driver [test.ml]), and fix one stray [^ e] string concatenation in
hermest's CLI that needed [Json.Error.to_string e] now that
[Json.of_string] yields a structured error.
Protobuf [oneof] groups a set of mutually exclusive optional fields at
distinct tags. Encoding emits whichever case matches; decoding picks
the case with the highest wire-order sequence (protobuf "last wins").
API:
    val case :
      int -> 'a t -> inject:('a -> 'b) -> extract:('b -> 'a option) -> 'b case
    val oneof : default:'a -> ('o -> 'a) -> 'a case list -> ('o, 'a) field
Typical usage lifts the oneof alternatives into an OCaml polymorphic
variant:
    type payload = [ `None | `Text of string | `Num of int32 ]
    let msg_codec =
      finish
        (let* payload =
           oneof ~default:`None (fun r -> r.payload)
             [ case 1 string ~inject:(fun s -> `Text s)
                 ~extract:(function `Text s -> Some s | _ -> None);
               case 2 int32 ~inject:(fun n -> `Num n)
                 ~extract:(function `Num n -> Some n | _ -> None) ] in
         return { payload })
Internals:
- [parse_wire] now stamps each wire entry with a sequence counter so
[take_oneof_last] can find the case whose tag came last in wire
order. Hashtbl buckets still record per-tag wire order (reversed,
prepend-on-insert); the counter adds cross-tag ordering.
- [decode_fields] handles the new [Oneof] GADT constructor.
- [encode_fields] iterates the case list, picks the first whose
[extract] returns [Some], and emits that tag. If every extractor
returns [None] (e.g. value is the default variant), no wire bytes
are written -- matching protoc's behaviour for unset oneofs.
- [take_oneof_last] consumes every case tag from the table on exit
so oneof fields don't leak into the unknown-fields bag.
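The "last wins" selection from the internals above can be sketched
with plain lists. This is an assumption-laden simplification: the real
decoder keeps entries in a Hashtbl keyed by tag, whereas this version
uses a list, and the [entry] record and string payloads are
hypothetical.

```ocaml
(* A wire entry stamped with its position in the input. *)
type entry = { seq : int; tag : int; payload : string }

(* Stamp each (tag, payload) pair with a sequence counter in wire order. *)
let stamp fields =
  List.mapi (fun seq (tag, payload) -> { seq; tag; payload }) fields

(* Pick the case entry with the highest sequence number and drop every
   entry that belongs to the oneof, so none is left over as unknown. *)
let take_oneof_last case_tags entries =
  let is_case e = List.mem e.tag case_tags in
  let cases, rest = List.partition is_case entries in
  let last =
    List.fold_left
      (fun best e ->
        match best with
        | Some b when b.seq > e.seq -> best
        | _ -> Some e)
      None cases
  in
  (last, rest)

let () =
  let entries = stamp [ (1, "a"); (2, "b"); (9, "x"); (1, "c") ] in
  let last, rest = take_oneof_last [ 1; 2 ] entries in
  assert (last = Some { seq = 3; tag = 1; payload = "c" });
  assert (rest = [ { seq = 2; tag = 9; payload = "x" } ])
```

Per-tag ordering alone would not be enough here: tags 1 and 2 are both
cases of the same oneof, so only a counter shared across tags can say
which arrived last.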
Tests: roundtrip through each case variant, empty-wire for the
default variant, and a "last wins" test where three consecutive
oneof tags appear on the wire and the decoder picks the third.
All 53 unit + 17 fuzz + 2 protoc interop tests pass.
Two merlint-driven structural cleanups:
- [Wire.wire_type] -> [Wire.t], [Wire.wire_type_to_int] -> [Wire.to_int],
[Wire.wire_type_of_int] -> [Wire.of_int], [Wire.pp_wire_type] ->
[Wire.pp]. Merlint's E330 rule flags the redundant module prefix;
callers using [Wire.Fixed32]/etc. already disambiguate the sort by
module path, so the type can be the idiomatic [Wire.t]. Labeled
argument [~wire_type:] on [write_tag] / [read_tag] stays.
- Merge [test_hostile.ml] into [test_protobuf.ml] as a
[hostile_cases : unit Alcotest.test_case list] appended onto the
main suite. Matches the user's established convention -- hostile
inputs are tested alongside the happy-path cases, not in a separate
test_<nonexistent-library>.ml that merlint E610 objects to.
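For orientation, the post-rename [Wire] surface might look like the
sketch below. The numeric wire-type values come from the protobuf
encoding spec; the constructor names other than [Fixed32], the
option-returning [of_int], and the [tag_key] helper are assumptions
made for this example.

```ocaml
module Wire = struct
  (* The type is simply [t]; the module path already says "wire". *)
  type t = Varint | Fixed64 | Length_delim | Start_group | End_group | Fixed32

  let to_int = function
    | Varint -> 0
    | Fixed64 -> 1
    | Length_delim -> 2
    | Start_group -> 3
    | End_group -> 4
    | Fixed32 -> 5

  (* Assumption: unknown wire types are reported as [None]. *)
  let of_int = function
    | 0 -> Some Varint
    | 1 -> Some Fixed64
    | 2 -> Some Length_delim
    | 3 -> Some Start_group
    | 4 -> Some End_group
    | 5 -> Some Fixed32
    | _ -> None

  let pp ppf t =
    Format.pp_print_string ppf
      (match t with
      | Varint -> "varint"
      | Fixed64 -> "fixed64"
      | Length_delim -> "length-delimited"
      | Start_group -> "start-group"
      | End_group -> "end-group"
      | Fixed32 -> "fixed32")
end

(* Hypothetical helper: a field key is (field number << 3) | wire type. *)
let tag_key ~wire_type field = (field lsl 3) lor Wire.to_int wire_type

let () = assert (tag_key ~wire_type:Wire.Varint 1 = 0x08)
```

The labeled [~wire_type] argument still reads naturally at call sites
even though the type itself lost the prefix.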
All 49 unit + 17 fuzz + 2 protoc interop tests pass.
Remaining merlint items: long identifier names (CVE-numbered tests
with > 4 underscores), a [Fmt] usage hint, and a small set of doc
tags. Low-signal nits; deferred.
- [Message.map tag get key_codec value_codec] declares a [map<K,V>]
field. On the wire this is sugar for a repeated nested message with
[key = 1] and [value = 2], and the decoder handles both forms.
Internal [map_entry_codec] builds the entry submessage inline
without routing through [let* / finish] -- the entry is an ephemeral
tuple rather than a named record.
- [decode_with_unknowns_string] / [encode_with_unknowns_string] let
forward-compatible pipelines preserve fields whose tag was not in
the schema. Decode returns [Ok (value, unknown_wire)] where the
byte string can be tacked onto a later encode via the matching
[~unknowns] argument. Unknowns are re-serialized in canonical form
and sorted by tag, so round-trip preserves semantics but not
byte-identity. Standard [decode_string] / [encode_string] still
silently drop unknowns.
Implementation: [Message.take_last] and [take_all] now consume the
matched entries from the parse_wire hashtable; what remains after
[decode_fields] returns is exactly the unknown-field set.
- Hostile-input suite is rewritten around CVE identifiers. Each test
cites the upstream vulnerability:
    CVE-2015-5237 (C++ 2015): huge length prefix, over-long varint,
      truncated tag -- integer overflow / DoS
    CVE-2021-22569 (Java 2021): many small groups -- memory
      amplification
    CVE-2022-1941 (C++ 2022): all-unknown-fields schema -- null deref
    CVE-2022-3171 (Java 2022): deprecated group wire types 3 & 4
    CVE-2024-7254 (Go 2024): deep nesting in known and unknown
      message fields
    CVE-2024-47554 (Rust prost 2024): length past end, packed
      corrupt body
Plus spec-conformance tests for reserved tag 0, wire-type mismatch,
non-UTF-8 string content (must accept), empty input (proto3
defaults), overrun rejection, and map duplicate keys (last-wins
but decoder preserves wire order).
- GADT tweak: drop the [_t] suffix from [Fixed32_t] / [Fixed64_t]
codec constructors. OCaml's type-directed constructor
disambiguation resolves the name collision with [Wire.Fixed32] /
[Wire.Fixed64] by context.
- Add [Protobuf.pp : 'a t -> _] printing the codec's sort (for
debugging / merlint E415).
- Add a top-level [.ocamlformat] (version 0.29.0) to match the
monorepo convention.
- Add one-line docstrings to every [Wire.read_*] entry in [wire.mli].
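The unknown-field bookkeeping from the second item above (consume
matched entries, treat the remainder as unknown) can be sketched with
a plain Hashtbl. This is an illustration under assumptions: payloads
are strings rather than wire values, and [parse_wire], [remove_all]
and [unknowns] are simplified stand-ins for the real internals.

```ocaml
(* Toy "parsed wire": every (tag, payload) pair goes into the table. *)
let parse_wire fields =
  let tbl = Hashtbl.create 8 in
  List.iter (fun (tag, payload) -> Hashtbl.add tbl tag payload) fields;
  tbl

(* Hashtbl.remove drops one binding at a time, so loop until none remain. *)
let remove_all tbl tag =
  while Hashtbl.mem tbl tag do
    Hashtbl.remove tbl tag
  done

(* Last occurrence wins; all occurrences are consumed. *)
let take_last tbl tag =
  let v = Hashtbl.find_opt tbl tag in
  remove_all tbl tag;
  v

(* Every occurrence in wire order; all occurrences are consumed. *)
let take_all tbl tag =
  let vs = List.rev (Hashtbl.find_all tbl tag) in
  remove_all tbl tag;
  vs

(* Whatever the schema did not consume, sorted by tag. *)
let unknowns tbl =
  Hashtbl.fold (fun tag payload acc -> (tag, payload) :: acc) tbl []
  |> List.stable_sort (fun (a, _) (b, _) -> compare a b)

let () =
  let tbl =
    parse_wire [ (1, "a"); (7, "x"); (1, "b"); (2, "p"); (2, "q"); (9, "y") ]
  in
  assert (take_last tbl 1 = Some "b");
  assert (take_all tbl 2 = [ "p"; "q" ]);
  assert (unknowns tbl = [ (7, "x"); (9, "y") ])
```

Since the leftovers are sorted by tag before re-serialisation, a
round-trip keeps the meaning of the message but not necessarily its
original byte order.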
All 49 unit + 17 fuzz + 2 protoc interop tests pass.
Remaining merlint items (queued for next session): inline
test_hostile.ml into test_protobuf.ml as a [hostile_cases] list per
the user's established pattern; shorten test identifiers to
<= 4 underscores; rename [Wire.wire_type] to [Wire.t].
The previous [type 'a t = { wire_type; write_value; read_wire; ... }]
was a record of closures. Interpreters couldn't be added without
editing every combinator, the structure was opaque to tooling, and
the shape didn't match the ocaml-json / encodings-skill design.
Rewrite as a finally-tagged GADT whose constructors name protobuf's
wire-level alphabet:
    type _ t =
      | Varint : (int64, 'a) base -> 'a t
      | Fixed32_t : (int32, 'a) base -> 'a t
      | Fixed64_t : (int64, 'a) base -> 'a t
      | Length_delim : (string, 'a) base -> 'a t
      | Message : 'a message_spec -> 'a t
      | Rec : 'a t Lazy.t -> 'a t
Each scalar codec now produces a typed GADT node carrying its
[Sort.t] (one of the 15 protobuf scalar types — int32, uint32,
sint32, fixed32, sfixed32, float, ..., bytes, message). Sort feeds
into error messages: instead of "expected varint, got
length-delimited" the decoder now says "int32: expected wire type
varint, got length-delimited", which is what users want when a schema
says [int32 a = 1] but the wire carries a length-delim.
[fix] switches from a mutable forwarding placeholder to a [Lazy]-
wrapped [Rec] node. Cleaner: the recursive-codec forcing is explicit
in the GADT shape.
encode_value / decode_value are now [type a. a t -> ...] walkers that
pattern-match on the wire sort. Adding a new interpreter (schema
printer, pp, diff) is adding a new walker alongside these, no change
to the combinator call sites.
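To make the "new interpreter = new walker" point concrete, here is a
toy GADT in the same spirit. It is deliberately not the library's
type: [Pair], [List] and [Map] are hypothetical constructors standing
in for the message machinery, and [show] is an example interpreter.
Only the [Rec] / [Lazy] treatment of recursion mirrors the text.

```ocaml
type _ t =
  | Varint : int64 t
  | Length_delim : string t
  | Pair : 'a t * 'b t -> ('a * 'b) t
  | List : 'a t -> 'a list t
  | Map : ('b -> 'a) * 'a t -> 'b t  (* projection only, enough to print *)
  | Rec : 'a t Lazy.t -> 'a t

(* Tie the recursive knot: the codec refers to itself through a lazy
   [Rec] node that is only forced when a walker reaches it. *)
let fix f =
  let rec c = lazy (f (Rec c)) in
  Lazy.force c

(* One interpreter: a locally abstract type lets each branch refine [a]. *)
let rec show : type a. a t -> a -> string =
 fun c v ->
  match c with
  | Varint -> Int64.to_string v
  | Length_delim -> Printf.sprintf "%S" v
  | Pair (a, b) ->
      let x, y = v in
      "(" ^ show a x ^ ", " ^ show b y ^ ")"
  | List e -> "[" ^ String.concat "; " (List.map (show e) v) ^ "]"
  | Map (f, inner) -> show inner (f v)
  | Rec (lazy inner) -> show inner v

type tree = Node of int64 * tree list

let tree : tree t =
  fix (fun self ->
      Map ((fun (Node (n, kids)) -> (n, kids)), Pair (Varint, List self)))

let () =
  print_endline (show tree (Node (1L, [ Node (2L, []); Node (3L, []) ])))
```

A second interpreter (a schema printer, say) would be another function
of the same shape next to [show], with no change to how [tree] is
built.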
Message combinators (Message.required / optional / repeated / packed
and the [let*] chain) retain their shape at the user-facing level;
internally [Message.finish] now produces a [Message { encode_body;
decode_body; msg_default }] GADT node.
Still to do from the encoding skill:
- Split into separate [value.ml] / [codec.ml] / [error.ml] / [foo.ml]
layer files (this commit keeps everything in protobuf.ml for
minimal diff).
- Expose [Protobuf.Value.t] AST and [Cursor].
- Migrate errors to [Loc.Error.kind].
- Add six-verb API (of_string / to_string / of_reader / to_writer /
decode / encode) with _exn twins.
All 40 unit + 17 fuzz + 2 protoc interop tests pass.
Fixes and extensions building on the scaffolding landed in fd396b81e.
- Encoder: proto3 scalar fields equal to their codec default are now
omitted from the wire. This is the first real interop bug — protoc
output for [Test1 {a = 0}] is empty, but we were emitting "0800".
[Message.encode_fields] checks [v <> codec.default] before writing
required fields.
- Fuzz suite (fuzz/): 17 alcobar invariants covering scalar round-trip,
the kitchen-sink message (every scalar plus optional/repeated/packed/
nested), and decoder robustness against arbitrary bytes (must return
Ok or Error; never raise, loop, or allocate unboundedly).
- Hostile-input tests (test_hostile): eleven regressions for known
protobuf decoder CVE classes — huge length prefix DoS, over-long
varint, truncated tag, reserved tag 0, unsupported wire type, wire
type mismatch, empty input -> defaults, overrun rejected, length past
end, packed corrupt body, and many-repeated scaling. Depth-limited
recursion noted as a TODO (needs a Lazy-wrapped recursive codec and
an explicit depth bound in the decoder).
- Interop test against protoc (test/interop/protoc/): Python oracle
using grpcio-tools 1.73.0 + protobuf 6.31.0, generating two trace
CSVs for a Test1 message and an Everything message covering all 15
scalar types plus optional/repeated/packed/nested. The OCaml test
asserts byte-for-byte equality in both directions (encode matches
protoc, decode reproduces protoc's values). [dune build @regen-traces]
from the package root refreshes traces.
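The default-omission fix from the first item above can be reproduced
in a few lines. This is a standalone sketch rather than the library's
encoder: [encode_int_field] is a hypothetical helper handling a single
non-negative varint field, with the tag key and varint layout taken
from the protobuf encoding spec.

```ocaml
(* Base-128 varint, least significant group first. Non-negative only. *)
let add_varint buf n =
  if n < 0 then invalid_arg "add_varint: negative";
  let rec go n =
    if n < 0x80 then Buffer.add_char buf (Char.chr n)
    else begin
      Buffer.add_char buf (Char.chr ((n land 0x7f) lor 0x80));
      go (n lsr 7)
    end
  in
  go n

(* proto3: a scalar equal to its default contributes no bytes at all. *)
let encode_int_field ~tag ~default v =
  let buf = Buffer.create 8 in
  if v <> default then begin
    add_varint buf ((tag lsl 3) lor 0) (* wire type 0 = varint *);
    add_varint buf v
  end;
  Buffer.contents buf

let to_hex s =
  let b = Buffer.create (2 * String.length s) in
  String.iter (fun c -> Buffer.add_string b (Printf.sprintf "%02x" (Char.code c))) s;
  Buffer.contents b

let () =
  (* Test1 {a = 0}: empty, not "0800". *)
  assert (to_hex (encode_int_field ~tag:1 ~default:0 0) = "");
  (* Test1 {a = 150}: the classic spec example. *)
  assert (to_hex (encode_int_field ~tag:1 ~default:0 150) = "089601")
```

The second assertion matches the worked example in the protobuf
encoding documentation, which makes it a convenient sanity check
before running the full protoc interop suite.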
Total test count: 38 unit + 17 fuzz + 2 interop (all passing). The
interop layer is the one that actually proves this speaks protobuf —
the earlier tests just verified self-consistency.
Warning 69 (unused-field, mutable-never-assigned). Four independent
record fields were flagged. Three are declared mutable although the code
only mutates their referents in place and never rebinds the record slot
itself; the fourth is never read:
- ocaml-wal/lib/wal.ml: [t.file] (the Eio file resource; methods call
Eio.File.pwrite_all etc., the slot is set once at open time).
- ocaml-block/lib/block.ml: [Memory.state.data] (the backing bytes,
written via Bytes.blit_string; [Bytes.t] is already mutable).
- ocaml-sse/lib/sse.ml: [Parser.t.data_buf] (a Buffer.t, written via
Buffer.add_*; the slot never changes).
- ocaml-zephyr/lib/zephyr.ml: drop [mode : Read | Write] entirely —
set at open-time, read nowhere. The open_read / open_write
constructors already distinguish the two call shapes, so mode
tracking was redundant.
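A small illustration of why the [mutable] keyword was unnecessary in
the first three cases. The [sse_parser] record and [feed] function are
hypothetical; the point is only that a [Buffer.t] can be written to
through an immutable record field.

```ocaml
(* No [mutable]: the slot always points at the same buffer. *)
type sse_parser = { data_buf : Buffer.t }

(* Mutates the buffer's contents, not the record. *)
let feed p s = Buffer.add_string p.data_buf s

let () =
  let p = { data_buf = Buffer.create 16 } in
  feed p "data: ";
  feed p "hello";
  assert (Buffer.contents p.data_buf = "data: hello")
```

[mutable] is only needed when the field itself is reassigned with
[p.data_buf <- ...], which none of the flagged modules did.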
Remove libraries declared in '(libraries ...)' clauses but unreferenced
by any module in the same source tree, as flagged by 'monopam lint'
after the new Dead_lib detection landed. Touches 131 dune files across
~80 packages.
A few stanzas needed a positive correction instead of a pure removal:
- ocaml-git/bin/diag: depended on eio_main + bytesrw-eio for an
Eio_posix.run call site; the umbrella was overkill, switch to the
precise eio_posix package.
- ocaml-scaleway/lib, ocaml-s3/lib: scaleway.mli / s3.mli reference
Eio_unix.Stdenv.base; eio.unix is required and was missing.
- merlint/lib: pulled bytesrw + nox-opam.bytesrw to surface
Opam_bytesrw, used by rule e915 and lint helpers.
Stanzas where Dead_lib was a false positive (transitive dep needed
for module visibility, virtual-library impls) are left untouched —
e.g. helix.jx.jsoo for ocaml-globe/demo retains its (libraries ...)
entry because it provides the impl of the helix.jx virtual lib.
Run mdx on lib/protobuf.mli so the {[ ... ]} odoc block now type-checks
and the encode-decode round-trip is verified.
Renamed the local `person` value to `person_codec` so it doesn't
shadow the `person` type, bound `ada` once and reused it for both
the encode and the round-trip assertion. Asserted that the wire
output is non-empty and that decoding it via `Protobuf.of_string`
returns the original record; the error path reports via
`Fmt.failwith "%a" Protobuf.Error.pp`.
The READMEs all share the standard install/overlay snippet, but the
sh blocks lacked the "<!-- $MDX skip -->" directive. `dune test`
would shell out to `opam install` against the live switch, which
either prompts interactively or fails with a package conflict —
either way diffing as a test failure.
Bulk-add skip directives in front of every install/overlay block.
Also collapse the doubled "non-deterministic + skip" stack on three
READMEs (memtrace, ocaml-dpop, ocaml-pid1, ocaml-yaml, merlint) where
`skip` already implies the runtime is bypassed.
Renames 35 packages to make blacksun forks distinguishable from their
opam-repository upstreams. Module names (Git.x, Tls.x, ...) stay bare;
opam package names and dune (public_name) findlib references move to
nox-X. After this commit, zero local package names overlap with
opam-repository.
Renamed:
- nox-git, nox-irmin
- nox-crypto, nox-crypto-pk, nox-crypto-rng, nox-crypto-ec
- nox-tls, nox-tls-eio, nox-tar, nox-tar-eio, nox-tty, nox-tty-eio
- nox-arp, nox-ca-certs, nox-cbor, nox-cookie, nox-crc, nox-csv
- nox-gpt, nox-hkdf, nox-http, nox-jwt, nox-kdf, nox-loc
- nox-memtrace, nox-pds, nox-sexp, nox-slack, nox-toml
- nox-websocket, nox-x509, nox-xdge, nox-yaml
Also drops orphan tar-mirage and tar-unix opam templates that had no
matching package stanza.
Object combinators: [Object.mem] -> [Object.member], [Object.opt_mem]
-> [Object.opt_member], [Object.case_mem] -> [Object.case_member]. The
sibling submodules [Object.Mem] / [Object.Mems] become
[Object.Member] / [Object.Members]. RFC 8259 §4 calls these
"name/value pairs, referred to as the members", so mirror the spec
name rather than the shortened [mem].
[Object.finish] -> [Object.seal]. "Seal" reads as "close the map, no
more members added", which is what the operation does.
Value constructors/queries: [Value.mem] (function) -> [Value.member];
[Value.mem_find] -> [Value.member_key]; [Value.mem_names] ->
[Value.member_names]; [Value.mem_keys] -> [Value.member_keys].
[type mem = ...] -> [type member = ...]; [type object'] still points
at [member list].
Downstream (~80 files across slack, sbom, stripe, sigstore, requests,
claude, irmin, freebox) updated via perl-pie. dune build clean,
dune test ocaml-json clean.
Expose only what's actually called. kinded' / or_kind were added
speculatively across libraries that don't use them; revert.
- toml, xml, csv: keep to_string, pp, kinded (kinded has callers —
csv 3, toml 2, xml 1). Drop kinded' and or_kind.
- cbor, protobuf: keep to_string, pp only. Drop kinded, kinded',
or_kind (no callers).
- json keeps the full API — 38 callers including kinded' (16) and
or_kind (11).
Factor [kinded] through [kinded' ~kind s = if kind = "" then s else
kind ^ " " ^ s], matching the shape of json's and toml's Sort
modules. Consistent API across every encoding library's Sort: every
module exposes [to_string], [pp], [or_kind], [kinded'], [kinded].
protobuf previously lacked [or_kind] and [kinded] entirely; add them
alongside [kinded'].
Add the missing pieces to Protobuf.Error so callers don't need to reach
into Loc.Error directly:
- Verb re-exports: v, msg, raise, fail, failf, push_array, push_object,
kind_to_string (were previously missing; users had to import Loc.Error
separately).
- Typed kinds: Sort_mismatch, Kinded_sort_mismatch (for schema-vs-payload
sort errors at the codec layer, complementing the existing wire-layer
kinds).
- Shape helpers: expected, sort, kinded_sort, missing_mems, unexpected_mems,
index_out_of_range, number_range, integer_range, no_decoder, no_encoder,
decode_todo, encode_todo.
Brings the protobuf facade in line with the json exemplar and toml.
Move Sort out of codec.ml into its own sort.ml at the top level. Sort
was previously private to the codec interpreter, even though it labels
error contexts exposed to users. Surface it at Protobuf.Sort so the
module structure matches the other encoding libraries (Json.Sort,
Xml.Sort, Toml.Sort, Cbor.Sort, Csv.Sort).
Schema-free layer on top of [Codec]: two new modules that together
let callers inspect, query, and rewrite arbitrary protobuf messages
without a typed schema.
- [Protobuf.Value] — AST over the four wire-type leaves plus a
[Message] variant ([(int * t) list] with wire-order-preserving
repeated tags). Per-node [Loc.Meta.t] sets up future byte-offset
tracking; for now everything is [Meta.none].
- constructors: [varint] / [fixed32] / [fixed64] / [length_delim]
/ [message]
- queries: [find : int -> t -> t option],
[find_all : int -> t -> t list],
[message_of : t -> t option] (re-parse a length-delim blob as
a nested message)
- IO: [of_string] / [of_string_exn] / [to_string]
- [pp] / [equal]
Length-delim blobs stay raw in the AST because schema-free parsing
cannot distinguish a string from a bytes from a nested message.
- [Protobuf.Cursor] — zipper over [Value.t] with ancestor stack:
[root] / [focus] / [up] / [top] / [set] / [down_field : int] /
[down_length_delim]. The last re-parses a length-delim leaf as a
message and descends, making multi-level traversals straightforward
once the caller knows a blob is a sub-message.
FieldMask paths: [of_field_mask] / [to_field_mask] parse and
serialise dotted-integer paths (e.g. ["1.3.2"]). The protobuf spec's
[google.protobuf.FieldMask] uses field *names*; our schema-free
cursor only sees tags, so integers replace names.
Top-level [Protobuf] re-exports [module Value = Value] and
[module Cursor = Cursor].
Tests: three Value round-trips (encode via a schema codec, decode as
Value, re-encode, compare bytes) and four Cursor operations (root
focus, down+up, set+rebuild, FieldMask parse + nested length-delim
descent). All 60 unit + 17 fuzz + 2 protoc interop tests pass.
Move the codec type, the 15 scalar codecs, the [Message] combinators,
and [fix] into [lib/codec.ml] / [lib/codec.mli]. The top-level
[Protobuf] module now aliases [type 'a t = 'a Codec.t], re-exports
the scalars and [Message] for ergonomics, and keeps only the
reading/writing entry points ([of_string] / [to_string] /
[of_reader] / [to_writer] plus [_exn] twins).
Motivation: match the one-file-per-concern layout already used in
[ocaml-json], [ocaml-cbor], and [ocaml-xml]. A later split will
extract a [Value.t] AST and a [Cursor] zipper.
[codec.mli] exposes:
- [type 'a t] (abstract)
- the 15 scalar codecs
- [module Message] with required/optional/repeated/packed/map/oneof
- [fix]
The codec's four IO walkers ([encode_string] / [decode_string] /
[encode] / [decode] and the unknown-fields pair) are under a
[(**/**)] internal section because the top-level [Protobuf] module's
[of_*] / [to_*] functions are the stable public surface.
Files:
lib/codec.ml [new, 886 lines]
lib/codec.mli [new]
lib/protobuf.ml [rewritten: 52-line re-export + IO shim]
lib/protobuf.mli [rewritten to match]
All 53 unit + 17 fuzz + 2 protoc interop tests pass.
Match the naming pattern already used elsewhere in the monorepo
(ocaml-json, ocaml-cbor, ocaml-xml): [of_*] decodes, [to_*] encodes,
and each [of_*] has an [_exn] twin that raises {!Loc.Error} on the
error path.
encode_string -> to_string
decode_string -> of_string
encode -> to_writer
decode -> of_reader
New entries:
val of_string_exn : 'a t -> string -> 'a
val of_reader_exn : 'a t -> Bytesrw.Bytes.Reader.t -> 'a
Call sites in [test/test_protobuf.ml], [test/test_wire.ml],
[test/interop/protoc/test.ml] and [fuzz/fuzz_protobuf.ml] are
mechanically updated. The old names are gone; no backwards-compat
aliases.
All 53 unit + 17 fuzz + 2 protoc interop tests pass.
Merlint caps identifiers at 4 underscores; CVE-numbered test names
with year + CVE-number + description crossed it. Drop the year
component, keep just the CVE number prefix:
test_cve_2015_5237_huge_length -> test_cve5237_huge_length
test_cve_2021_22569_many_small_groups -> test_cve22569_many_small
test_cve_2022_3171_group_wire_type_4 -> test_cve3171_group_wt4
...
Also trim:
test_unknowns_empty_when_schema_matches -> test_unknowns_empty_on_match
test_int32_negative_is_10_bytes -> test_int32_neg_10_bytes
Bodies and comments unchanged; CVE references stay in the docstring
comments so the provenance is still visible.
Merlint issue count drops from 57 to 25 (documentation nits remain).
Follow up to the module rename: update the remaining callers that
still referenced [Err] (library [claude.ml{,i}], [client.ml], the test
driver [test.ml]), and fix one stray [^ e] string concatenation in
hermest's CLI that needed [Json.Error.to_string e] now that
[Json.of_string] yields a structured error.
Protobuf [oneof] groups a set of mutually exclusive optional fields at
distinct tags. Encoding emits whichever case matches; decoding picks
the case with the highest wire-order sequence (protobuf "last wins").
API:
val case : int -> 'a t -> inject:('a -> 'b) -> extract:('b -> 'a option)
-> 'b case
val oneof : default:'a -> ('o -> 'a) -> 'a case list -> ('o, 'a) field
Typical usage lifts the oneof alternatives into an OCaml polymorphic
variant:
type payload = [ `None | `Text of string | `Num of int32 ]
let msg_codec =
finish
(let* payload =
oneof ~default:`None (fun r -> r.payload)
[ case 1 string ~inject:(fun s -> `Text s)
~extract:(function `Text s -> Some s | _ -> None);
case 2 int32 ~inject:(fun n -> `Num n)
~extract:(function `Num n -> Some n | _ -> None) ] in
return { payload })
Internals:
- [parse_wire] now stamps each wire entry with a sequence counter so
[take_oneof_last] can find the case whose tag came last in wire
order. Hashtbl buckets still record per-tag wire order (reversed,
prepend-on-insert); the counter adds cross-tag ordering.
- [decode_fields] handles the new [Oneof] GADT constructor.
- [encode_fields] iterates the case list, picks the first whose
[extract] returns [Some], and emits that tag. If every extractor
returns [None] (e.g. value is the default variant), no wire bytes
are written -- matching protoc's behaviour for unset oneofs.
- [take_oneof_last] consumes every case tag from the table on exit
so oneof fields don't leak into the unknown-fields bag.
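
The "last wins" selection described above can be sketched with a toy model. This is not the library's code: the [entry] record, [stamp], and the list-based table are assumptions made for illustration, and only the name [take_oneof_last] is borrowed from the commit.

```ocaml
(* Hypothetical toy model: each parsed wire entry carries its tag, a
   global sequence number, and a raw payload. *)
type entry = { tag : int; seq : int; payload : string }

(* Stand-in for parse_wire: stamp entries with a running counter. *)
let stamp (fields : (int * string) list) : entry list =
  List.mapi (fun seq (tag, payload) -> { tag; seq; payload }) fields

(* Pick the entry with the highest sequence number among the oneof's
   case tags. Every case-tag entry is removed from the remainder, so the
   losing cases never reach the unknown-fields bag. *)
let take_oneof_last (case_tags : int list) (entries : entry list) =
  let mine, rest =
    List.partition (fun e -> List.mem e.tag case_tags) entries
  in
  let last =
    List.fold_left
      (fun acc e ->
        match acc with
        | Some best when best.seq > e.seq -> acc
        | _ -> Some e)
      None mine
  in
  (last, rest)

let () =
  (* Tags 1 and 2 belong to the oneof; tag 7 is unrelated. *)
  let entries = stamp [ (1, "text"); (7, "other"); (2, "num"); (1, "again") ] in
  match take_oneof_last [ 1; 2 ] entries with
  | Some e, rest ->
      assert (e.tag = 1 && e.payload = "again");
      assert (List.map (fun e -> e.tag) rest = [ 7 ])
  | None, _ -> assert false
```

A per-tag bucket alone could not decide between tags 1 and 2 here; the cross-tag counter is what makes the comparison possible.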
Tests: roundtrip through each case variant, empty-wire for the
default variant, and a "last wins" test where three consecutive
oneof tags appear on the wire and the decoder picks the third.
All 53 unit + 17 fuzz + 2 protoc interop tests pass.
Two merlint-driven structural cleanups:
- [Wire.wire_type] -> [Wire.t], [Wire.wire_type_to_int] -> [Wire.to_int],
[Wire.wire_type_of_int] -> [Wire.of_int], [Wire.pp_wire_type] ->
[Wire.pp]. Merlint's E330 rule flags the redundant module prefix;
callers using [Wire.Fixed32]/etc. already disambiguate the sort by
module path, so the type can be the idiomatic [Wire.t]. Labeled
argument [~wire_type:] on [write_tag] / [read_tag] stays.
- Merge [test_hostile.ml] into [test_protobuf.ml] as a
[hostile_cases : unit Alcotest.test_case list] appended onto the
main suite. Matches the user's established convention -- hostile
inputs are tested alongside the happy-path cases, not in a separate
test_<nonexistent-library>.ml that merlint E610 objects to.
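
For reference, the renamed module shape might look roughly like the sketch below. The numeric values are the six wire types from the protobuf encoding; the constructor names other than [Fixed32] / [Fixed64], the [option] return of [of_int], and the printed strings are assumptions, not the library's actual definitions.

```ocaml
module Wire = struct
  (* Constructor names besides Fixed32 / Fixed64 are hypothetical. *)
  type t = Varint | Fixed64 | Length_delim | Start_group | End_group | Fixed32

  let to_int = function
    | Varint -> 0
    | Fixed64 -> 1
    | Length_delim -> 2
    | Start_group -> 3
    | End_group -> 4
    | Fixed32 -> 5

  (* Assumed to return an option; the real signature may differ. *)
  let of_int = function
    | 0 -> Some Varint
    | 1 -> Some Fixed64
    | 2 -> Some Length_delim
    | 3 -> Some Start_group
    | 4 -> Some End_group
    | 5 -> Some Fixed32
    | _ -> None

  let pp ppf t =
    Format.pp_print_string ppf
      (match t with
      | Varint -> "varint"
      | Fixed64 -> "fixed64"
      | Length_delim -> "length-delimited"
      | Start_group -> "start-group"
      | End_group -> "end-group"
      | Fixed32 -> "fixed32")
end

let () =
  (* Callers read naturally without the repeated prefix. *)
  assert (Wire.of_int (Wire.to_int Wire.Fixed32) = Some Wire.Fixed32);
  assert (Wire.of_int 6 = None);
  assert (Format.asprintf "%a" Wire.pp Wire.Length_delim = "length-delimited")
```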
All 49 unit + 17 fuzz + 2 protoc interop tests pass.
Remaining merlint items: long identifier names (CVE-numbered tests
with > 4 underscores), a [Fmt] usage hint, and a small set of doc
tags. Low-signal nits; deferred.
- [Message.map tag get key_codec value_codec] declares a [map<K,V>]
field. On the wire this is sugar for a repeated nested message with
[key = 1] and [value = 2], and the decoder handles both forms.
Internal [map_entry_codec] builds the entry submessage inline
without routing through [let* / finish] -- the entry is an ephemeral
tuple rather than a named record.
- [decode_with_unknowns_string] / [encode_with_unknowns_string] let
forward-compatible pipelines preserve fields whose tag was not in
the schema. Decode returns [Ok (value, unknown_wire)] where the
byte string can be tacked onto a later encode via the matching
[~unknowns] argument. Unknowns are re-serialized in canonical form
and sorted by tag, so round-trip preserves semantics but not
byte-identity. Standard [decode_string] / [encode_string] still
silently drop unknowns.
Implementation: [Message.take_last] and [take_all] now consume the
matched entries from the parse_wire hashtable; what remains after
[decode_fields] returns is exactly the unknown-field set.
- Hostile-input suite is rewritten around CVE identifiers. Each test
cites the upstream vulnerability:
    CVE-2015-5237 (C++ 2015): huge length prefix, over-long varint,
      truncated tag -- integer overflow / DoS
    CVE-2021-22569 (Java 2021): many small groups -- memory
      amplification
    CVE-2022-1941 (C++ 2022): all-unknown-fields schema -- null deref
    CVE-2022-3171 (Java 2022): deprecated group wire types 3 & 4
    CVE-2024-7254 (Go 2024): deep nesting in known and unknown
      message fields
    CVE-2024-47554 (Rust prost 2024): length past end, packed
      corrupt body
Plus spec-conformance tests for reserved tag 0, wire-type mismatch,
non-UTF-8 string content (must accept), empty input (proto3
defaults), overrun rejection, and map duplicate keys (last-wins
but decoder preserves wire order).
- GADT tweak: drop the [_t] suffix from [Fixed32_t] / [Fixed64_t]
codec constructors. OCaml's type-directed constructor
disambiguation resolves the name collision with [Wire.Fixed32] /
[Wire.Fixed64] by context.
- Add [Protobuf.pp : 'a t -> _] printing the codec's sort (for
debugging / merlint E415).
- Add a top-level [.ocamlformat] (version 0.29.0) to match the
monorepo convention.
- Add one-line docstrings to every [Wire.read_*] entry in [wire.mli].
All 49 unit + 17 fuzz + 2 protoc interop tests pass.
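
The unknown-field mechanism from the second bullet (consume what the schema claims, keep the rest) can be illustrated with a small stand-alone sketch. The [parse] and [unknowns] helpers and the string payloads are assumptions for illustration; only the idea of [take_last] removing matched entries from the table comes from the commit.

```ocaml
(* Toy model: the parsed wire is a table from tag to payloads. *)
let parse (fields : (int * string) list) : (int, string) Hashtbl.t =
  let tbl = Hashtbl.create 8 in
  List.iter (fun (tag, v) -> Hashtbl.add tbl tag v) fields;
  tbl

(* Return the last value seen for [tag] and remove every binding for it.
   Hashtbl.add shadows earlier bindings, so Hashtbl.find_all lists the
   most recent one first. *)
let take_last tbl tag =
  match Hashtbl.find_all tbl tag with
  | [] -> None
  | last :: _ as all ->
      List.iter (fun _ -> Hashtbl.remove tbl tag) all;
      Some last

(* Whatever is still in the table was not claimed by the schema;
   sorting gives a canonical, tag-ordered result. *)
let unknowns tbl =
  Hashtbl.fold (fun tag v acc -> (tag, v) :: acc) tbl []
  |> List.sort compare

let () =
  let tbl = parse [ (1, "a"); (9, "x"); (1, "b"); (4, "y") ] in
  assert (take_last tbl 1 = Some "b");   (* schema field, last wins *)
  assert (take_last tbl 2 = None);       (* schema field, absent *)
  assert (unknowns tbl = [ (4, "y"); (9, "x") ])
```

Because the leftovers are sorted by tag, the original interleaving (9 before 4) is not preserved, which is the semantic-but-not-byte-identical round-trip noted above.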
Remaining merlint items (queued for next session): inline
test_hostile.ml into test_protobuf.ml as a [hostile_cases] list per
the user's established pattern; shorten test identifiers to
<= 4 underscores; rename [Wire.wire_type] to [Wire.t].
The previous [type 'a t = { wire_type; write_value; read_wire; ... }]
was a record of closures. Interpreters couldn't be added without
editing every combinator, the structure was opaque to tooling, and
the shape didn't match the ocaml-json / encodings-skill design.
Rewrite as a finally-tagged GADT whose constructors name protobuf's
wire-level alphabet:
    type _ t =
      | Varint : (int64, 'a) base -> 'a t
      | Fixed32_t : (int32, 'a) base -> 'a t
      | Fixed64_t : (int64, 'a) base -> 'a t
      | Length_delim : (string, 'a) base -> 'a t
      | Message : 'a message_spec -> 'a t
      | Rec : 'a t Lazy.t -> 'a t
Each scalar codec now produces a typed GADT node carrying its
[Sort.t] (one of the 15 protobuf scalar types: int32, uint32,
sint32, fixed32, sfixed32, float, ..., bytes; or message). Sort feeds
into error messages: instead of "expected varint, got
length-delimited" the decoder now says "int32: expected wire type
varint, got length-delimited", which is what users want when a schema
says [int32 a = 1] but the wire carries a length-delim.
[fix] switches from a mutable forwarding placeholder to a
[Lazy]-wrapped [Rec] node. Cleaner: the recursive-codec forcing is
explicit in the GADT shape.
encode_value / decode_value are now [type a. a t -> ...] walkers that
pattern-match on the wire sort. Adding a new interpreter (schema
printer, pp, diff) means adding a new walker alongside these, with no
change to the combinator call sites.
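
A cut-down illustration of the walker pattern follows. It is a sketch under assumptions: the fields of [base], the [wire_name] and [show] walkers, and the sample codecs are invented for this example and are not the library's definitions. Only three of the constructors are reproduced.

```ocaml
(* Hypothetical payload record: a sort name plus conversions between
   the wire representation 'w and the user type 'a. *)
type ('w, 'a) base = { sort : string; inj : 'w -> 'a; proj : 'a -> 'w }

type _ t =
  | Varint : (int64, 'a) base -> 'a t
  | Length_delim : (string, 'a) base -> 'a t
  | Rec : 'a t Lazy.t -> 'a t

(* First interpreter: name the wire shape a codec expects. *)
let rec wire_name : type a. a t -> string = function
  | Varint _ -> "varint"
  | Length_delim _ -> "length-delimited"
  | Rec c -> wire_name (Lazy.force c)

(* Second interpreter, added without touching the constructors or the
   codec definitions: render a value with its sort as a prefix. *)
let rec show : type a. a t -> a -> string =
 fun codec v ->
  match codec with
  | Varint b -> Printf.sprintf "%s:%Ld" b.sort (b.proj v)
  | Length_delim b -> Printf.sprintf "%s:%S" b.sort (b.proj v)
  | Rec c -> show (Lazy.force c) v

let int32 : int32 t =
  Varint { sort = "int32"; inj = Int64.to_int32; proj = Int64.of_int32 }

let str : string t =
  Length_delim { sort = "string"; inj = Fun.id; proj = Fun.id }

(* A Rec node forces its lazy codec only when a walker reaches it. *)
let delayed : int32 t = Rec (lazy int32)

let () =
  assert (wire_name delayed = "varint");
  assert (show int32 150l = "int32:150");
  assert (show str "hi" = "string:\"hi\"")
```

With the earlier record-of-closures design, [show] would have required a new field in every codec; here it is one more function over the same data.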
Message combinators (Message.required / optional / repeated / packed
and the [let*] chain) retain their shape at the user-facing level;
internally [Message.finish] now produces a [Message { encode_body;
decode_body; msg_default }] GADT node.
Still to do from the encoding skill:
- Split into separate [value.ml] / [codec.ml] / [error.ml] / [foo.ml]
layer files (this commit keeps everything in protobuf.ml for
minimal diff).
- Expose [Protobuf.Value.t] AST and [Cursor].
- Migrate errors to [Loc.Error.kind].
- Add six-verb API (of_string / to_string / of_reader / to_writer /
decode / encode) with _exn twins.
All 40 unit + 17 fuzz + 2 protoc interop tests pass.
Fixes and extensions building on the scaffolding landed in fd396b81e.
- Encoder: proto3 scalar fields equal to their codec default are now
omitted from the wire. This is the first real interop bug — protoc
output for [Test1 {a = 0}] is empty, but we were emitting "0800".
[Message.encode_fields] checks [v <> codec.default] before writing
required fields.
- Fuzz suite (fuzz/): 17 alcobar invariants covering scalar round-trip,
the kitchen-sink message (every scalar plus optional/repeated/packed/
nested), and decoder robustness against arbitrary bytes (must return
Ok or Error; never raise, loop, or allocate unboundedly).
- Hostile-input tests (test_hostile): eleven regressions for known
protobuf decoder CVE classes — huge length prefix DoS, over-long
varint, truncated tag, reserved tag 0, unsupported wire type, wire
type mismatch, empty input -> defaults, overrun rejected, length past
end, packed corrupt body, and many-repeated scaling. Depth-limited
recursion noted as a TODO (needs a Lazy-wrapped recursive codec and
an explicit depth bound in the decoder).
- Interop test against protoc (test/interop/protoc/): Python oracle
using grpcio-tools 1.73.0 + protobuf 6.31.0, generating two trace
CSVs for a Test1 message and an Everything message covering all 15
scalar types plus optional/repeated/packed/nested. The OCaml test
asserts byte-for-byte equality in both directions (encode matches
protoc, decode reproduces protoc's values). [dune build @regen-traces]
from the package root refreshes traces.
Total test count: 38 unit + 17 fuzz + 2 interop (all passing). The
interop layer is the one that actually proves this speaks protobuf —
the earlier tests just verified self-consistency.
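
The default-omission fix from the encoder bullet can be reproduced in a few lines. This is a stand-alone sketch, not [Message.encode_fields]: the helper names [add_varint], [add_int_field], [to_hex], and [encode_test1] are hypothetical, and only non-negative values are handled.

```ocaml
(* Append the base-128 varint encoding of a non-negative [n] to [buf]. *)
let rec add_varint buf n =
  if n < 0x80 then Buffer.add_char buf (Char.chr n)
  else begin
    (* Low seven bits with the continuation bit set. *)
    Buffer.add_char buf (Char.chr ((n land 0x7f) lor 0x80));
    add_varint buf (n lsr 7)
  end

(* Write a varint field unless the value equals the proto3 default. *)
let add_int_field buf ~tag ~default v =
  if v <> default then begin
    add_varint buf ((tag lsl 3) lor 0); (* key: tag and wire type 0 *)
    add_varint buf v
  end

let to_hex s =
  String.concat ""
    (List.map
       (fun c -> Printf.sprintf "%02x" (Char.code c))
       (List.of_seq (String.to_seq s)))

(* Test1 { int32 a = 1 } with only the one field. *)
let encode_test1 a =
  let buf = Buffer.create 8 in
  add_int_field buf ~tag:1 ~default:0 a;
  to_hex (Buffer.contents buf)

let () =
  assert (encode_test1 0 = "");        (* default value: nothing on the wire *)
  assert (encode_test1 150 = "089601") (* key 0x08, then varint 0x96 0x01 *)
```

Without the [v <> default] guard, the first case would produce "0800", which is the mismatch against protoc described above.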
Warning 69 (unused-field, mutable-never-assigned). Four independent
record fields were flagged. Three are declared mutable, but the code
only mutates their referents in place and never rebinds the record
slot itself; the fourth is set but never read:
- ocaml-wal/lib/wal.ml: [t.file] (the Eio file resource; methods call
Eio.File.pwrite_all etc., the slot is set once at open time).
- ocaml-block/lib/block.ml: [Memory.state.data] (the backing bytes,
written via Bytes.blit_string; [Bytes.t] is already mutable).
- ocaml-sse/lib/sse.ml: [Parser.t.data_buf] (a Buffer.t, written via
Buffer.add_*; the slot never changes).
- ocaml-zephyr/lib/zephyr.ml: drop [mode : Read | Write] entirely —
set at open-time, read nowhere. The open_read / open_write
constructors already distinguish the two call shapes, so mode
tracking was redundant.
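
The distinction behind the three [mutable] removals can be shown in isolation. The record below is a hypothetical example, not the [Parser.t] from ocaml-sse; only the [data_buf] field name is borrowed.

```ocaml
(* [data_buf] needs no [mutable]: the slot is set once at creation and
   all writes go through the Buffer.t it points to. [lines] is rebound
   with [<-], so it does need [mutable]. *)
type sse_parser = { data_buf : Buffer.t; mutable lines : int }

let create () = { data_buf = Buffer.create 16; lines = 0 }

let feed p s =
  Buffer.add_string p.data_buf s; (* mutates the referent, not the slot *)
  p.lines <- p.lines + 1          (* rebinds the slot *)

let () =
  let p = create () in
  feed p "data: a\n";
  feed p "data: b\n";
  assert (Buffer.contents p.data_buf = "data: a\ndata: b\n");
  assert (p.lines = 2)
```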