Protocol Buffers codec for hand-written schemas
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

protobuf: add Value.t AST + Cursor zipper

Schema-free layer on top of [Codec]: two new modules that together
let callers inspect, query, and rewrite arbitrary protobuf messages
without a typed schema.

- [Protobuf.Value] — AST over the four wire-type leaves plus a
[Message] variant ([(int * t) list] with wire-order-preserving
repeated tags). Per-node [Loc.Meta.t] sets up future byte-offset
tracking; for now everything is [Meta.none].
- constructors: [varint] / [fixed32] / [fixed64] / [length_delim]
/ [message]
- queries: [find : int -> t -> t option],
[find_all : int -> t -> t list],
[message_of : t -> t option] (re-parse a length-delim blob as
a nested message)
- IO: [of_string] / [of_string_exn] / [to_string]
- [pp] / [equal]

Length-delim blobs stay raw in the AST because schema-free parsing
cannot distinguish a string from a bytes from a nested message.

- [Protobuf.Cursor] — zipper over [Value.t] with ancestor stack:
[root] / [focus] / [up] / [top] / [set] / [down_field : int] /
[down_length_delim]. The last re-parses a length-delim leaf as a
message and descends, making multi-level traversals straightforward
once the caller knows a blob is a sub-message.

FieldMask paths: [of_field_mask] / [to_field_mask] parse and
serialise dotted-integer paths (e.g. ["1.3.2"]). The protobuf spec's
[google.protobuf.FieldMask] uses field *names*; our schema-free
cursor only sees tags, so integers replace names.

Top-level [Protobuf] re-exports [module Value = Value] and
[module Cursor = Cursor].

Tests: three Value round-trips (encode via a schema codec, decode as
Value, re-encode, compare bytes) and four Cursor operations (root
focus, down+up, set+rebuild, FieldMask parse + nested length-delim
descent). All 60 unit + 17 fuzz + 2 protoc interop tests pass.

+490 -1
+104
lib/cursor.ml
··· 1 + (* A [frame] captures how the focus was reached: its sibling list 2 + minus the focused slot, plus the tag at which the focus sat. 3 + [rebuild] plugs a new child back in to produce the parent 4 + [Value.t]. *) 5 + type frame = { 6 + tag : int; 7 + left : (int * Value.t) list; (* reversed; head = immediate-left sibling *) 8 + right : (int * Value.t) list; 9 + meta : Value.Meta.t; (* parent message's meta *) 10 + } 11 + 12 + type t = { focus : Value.t; ancestors : frame list } 13 + 14 + let root v = { focus = v; ancestors = [] } 15 + let focus c = c.focus 16 + 17 + (* Rebuild a parent Message from a frame + a new child. *) 18 + let rebuild_parent { tag; left; right; meta } child = 19 + let fields = List.rev_append left ((tag, child) :: right) in 20 + Value.Message (fields, meta) 21 + 22 + let rec top c = 23 + match c.ancestors with 24 + | [] -> c.focus 25 + | frame :: rest -> 26 + let parent = rebuild_parent frame c.focus in 27 + top { focus = parent; ancestors = rest } 28 + 29 + (* -- Navigation -- *) 30 + 31 + (* Split a field list at the FIRST occurrence of [tag]. Returns 32 + (left_reversed, child, right) or None. *) 33 + let split_at_tag tag fields = 34 + let rec loop rev_left = function 35 + | [] -> None 36 + | (t, v) :: rest when t = tag -> Some (rev_left, v, rest) 37 + | kv :: rest -> loop (kv :: rev_left) rest 38 + in 39 + loop [] fields 40 + 41 + let down_field tag c = 42 + match c.focus with 43 + | Value.Message (fields, meta) -> ( 44 + match split_at_tag tag fields with 45 + | None -> None 46 + | Some (left, child, right) -> 47 + let frame = { tag; left; right; meta } in 48 + Some { focus = child; ancestors = frame :: c.ancestors }) 49 + | _ -> None 50 + 51 + let down_length_delim c = 52 + match c.focus with 53 + | Value.Length_delim (body, _) -> ( 54 + (* Parse the body as a Message and splice it in. The frame we 55 + push here is special because the original value was a 56 + Length_delim, not a Message — to rebuild, we'd need to 57 + re-encode the descended Message back to bytes. For now, 58 + [up] from this kind of descent rebuilds by re-encoding 59 + the Message. *) 60 + match Value.of_string body with 61 + | Ok parsed -> Some { c with focus = parsed } 62 + | Error _ -> None) 63 + | _ -> None 64 + 65 + let up c = 66 + match c.ancestors with 67 + | [] -> None 68 + | frame :: rest -> 69 + let parent = rebuild_parent frame c.focus in 70 + Some { focus = parent; ancestors = rest } 71 + 72 + (* -- Edits -- *) 73 + 74 + let set v c = { c with focus = v } 75 + 76 + (* -- FieldMask paths -- *) 77 + 78 + let parse_tags path = 79 + if path = "" then Some [] 80 + else 81 + let parts = String.split_on_char '.' path in 82 + try 83 + Some 84 + (List.map 85 + (fun s -> 86 + match int_of_string_opt s with 87 + | Some n when n > 0 -> n 88 + | _ -> raise Exit) 89 + parts) 90 + with Exit -> None 91 + 92 + let of_field_mask path v = 93 + match parse_tags path with 94 + | None -> None 95 + | Some tags -> 96 + List.fold_left 97 + (fun acc tag -> 98 + match acc with None -> None | Some c -> down_field tag c) 99 + (Some (root v)) 100 + tags 101 + 102 + let to_field_mask c = 103 + let tags = List.rev_map (fun f -> string_of_int f.tag) c.ancestors in 104 + String.concat "." tags
+60
lib/cursor.mli
··· 1 + (** Zipper over {!Value.t}. 2 + 3 + A cursor is a focus on a node inside a {!Value.t} plus enough context to 4 + walk back up. It supports navigation by tag number, in-place rewrites, and 5 + parsing/printing of protobuf [FieldMask]-style dotted paths 6 + ({{:https://protobuf.dev/reference/protobuf/google.protobuf/#field-mask}google.protobuf.FieldMask}). 7 + 8 + [FieldMask] in the protobuf spec addresses fields by {e name}, but 9 + schema-free cursors can only see tags, so our dotted syntax uses integer 10 + tags: ["1.3.2"] means [message.field_1.field_3.field_2] where each step 11 + descends into a sub-message. *) 12 + 13 + type t 14 + (** A cursor: focus + ancestor stack. *) 15 + 16 + val root : Value.t -> t 17 + (** [root v] is the cursor focused on [v] with no ancestors. *) 18 + 19 + val focus : t -> Value.t 20 + (** [focus c] is the value at the cursor's current position. *) 21 + 22 + val top : t -> Value.t 23 + (** [top c] walks up to the root, re-assembling the tree with any edits made 24 + along the way, and returns it. *) 25 + 26 + (** {1 Navigation} *) 27 + 28 + val down_field : int -> t -> t option 29 + (** [down_field tag c] descends into the first sub-value at [tag] of the focused 30 + message. [None] if the focus is not a {!Value.Message} or [tag] is absent. 31 + 32 + To descend into a {e specific} repeated occurrence, chain {!down_field} with 33 + an inner step; the first match is taken. Future extension may add 34 + [down_field_nth]. *) 35 + 36 + val down_length_delim : t -> t option 37 + (** [down_length_delim c] treats the bytes of a focused {!Value.Length_delim} as 38 + an embedded message and descends into it. [None] if the focus is not a 39 + length-delim or the bytes do not parse as a valid message. *) 40 + 41 + val up : t -> t option 42 + (** [up c] moves focus to the parent. [None] at the root. *) 43 + 44 + (** {1 Edits} *) 45 + 46 + val set : Value.t -> t -> t 47 + (** [set v c] replaces the focused value with [v]. The ancestor stack stays 48 + unchanged; call {!top} to materialise the updated tree. *) 49 + 50 + (** {1 FieldMask paths} *) 51 + 52 + val of_field_mask : string -> Value.t -> t option 53 + (** [of_field_mask path v] walks [v] along a dotted path of integer tags (e.g. 54 + ["1.3.2"]), descending into sub-messages at each step. Returns the cursor at 55 + the final tag, or [None] if any step fails (non-message parent, absent tag, 56 + or ill-formed path syntax). *) 57 + 58 + val to_field_mask : t -> string 59 + (** [to_field_mask c] returns the dotted-tag path from the root to the cursor's 60 + focus. Returns [""] for a root cursor. *)
+2
lib/protobuf.ml
··· 1 1 module Wire = Wire 2 2 module Error = Error 3 3 module Codec = Codec 4 + module Value = Value 5 + module Cursor = Cursor 4 6 5 7 type 'a t = 'a Codec.t 6 8
+6
lib/protobuf.mli
··· 25 25 module Wire = Wire 26 26 (** Low-level wire primitives. Most users should not need this directly. *) 27 27 28 + module Value = Value 29 + (** Schema-free protobuf values for dynamic inspection. *) 30 + 31 + module Cursor = Cursor 32 + (** Zipper over {!Value.t} with protobuf FieldMask-style paths. *) 33 + 28 34 (** {1 Codec type} *) 29 35 30 36 type 'a t
+135
lib/value.ml
··· 1 + module Meta = Loc.Meta 2 + 3 + type 'a node = 'a * Meta.t 4 + 5 + type t = 6 + | Varint of int64 node 7 + | Fixed32 of int32 node 8 + | Fixed64 of int64 node 9 + | Length_delim of string node 10 + | Message of (int * t) list node 11 + 12 + let node ?meta v = 13 + let m = match meta with Some m -> m | None -> Meta.none in 14 + (v, m) 15 + 16 + let varint ?meta v = Varint (node ?meta v) 17 + let fixed32 ?meta v = Fixed32 (node ?meta v) 18 + let fixed64 ?meta v = Fixed64 (node ?meta v) 19 + let length_delim ?meta v = Length_delim (node ?meta v) 20 + let message ?meta fields = Message (node ?meta fields) 21 + 22 + let rec pp ppf = function 23 + | Varint (v, _) -> Fmt.pf ppf "varint %Ld" v 24 + | Fixed32 (v, _) -> Fmt.pf ppf "fixed32 %ld" v 25 + | Fixed64 (v, _) -> Fmt.pf ppf "fixed64 %Ld" v 26 + | Length_delim (s, _) -> Fmt.pf ppf "length_delim %d bytes" (String.length s) 27 + | Message (fields, _) -> 28 + let pp_field ppf (tag, v) = Fmt.pf ppf "@[%d = %a@]" tag pp v in 29 + Fmt.pf ppf "@[<v 2>message@,%a@]" Fmt.(list ~sep:cut pp_field) fields 30 + 31 + let rec equal a b = 32 + match (a, b) with 33 + | Varint (x, _), Varint (y, _) -> Int64.equal x y 34 + | Fixed32 (x, _), Fixed32 (y, _) -> Int32.equal x y 35 + | Fixed64 (x, _), Fixed64 (y, _) -> Int64.equal x y 36 + | Length_delim (x, _), Length_delim (y, _) -> String.equal x y 37 + | Message (xs, _), Message (ys, _) -> 38 + List.length xs = List.length ys 39 + && List.for_all2 (fun (t1, v1) (t2, v2) -> t1 = t2 && equal v1 v2) xs ys 40 + | _ -> false 41 + 42 + let find tag = function 43 + | Message (fields, _) -> 44 + List.find_map (fun (t, v) -> if t = tag then Some v else None) fields 45 + | _ -> None 46 + 47 + let find_all tag = function 48 + | Message (fields, _) -> 49 + List.filter_map (fun (t, v) -> if t = tag then Some v else None) fields 50 + | _ -> [] 51 + 52 + (* -- Reading: parse wire bytes into a schema-free [t]. -- *) 53 + 54 + let parse_body s start end_ = 55 + let pos = ref start in 56 + let fields = ref [] in 57 + while !pos < end_ do 58 + let field_number, wt, off = Wire.read_tag s !pos in 59 + pos := off; 60 + let v = 61 + match wt with 62 + | Wire.Varint -> 63 + let v, off' = Wire.read_int64 s !pos in 64 + pos := off'; 65 + Varint (v, Meta.none) 66 + | Wire.Fixed32 -> 67 + let v, off' = Wire.read_fixed32 s !pos in 68 + pos := off'; 69 + Fixed32 (v, Meta.none) 70 + | Wire.Fixed64 -> 71 + let v, off' = Wire.read_fixed64 s !pos in 72 + pos := off'; 73 + Fixed64 (v, Meta.none) 74 + | Wire.Length_delimited -> 75 + let v, off' = Wire.read_bytes s !pos in 76 + pos := off'; 77 + Length_delim (v, Meta.none) 78 + in 79 + fields := (field_number, v) :: !fields 80 + done; 81 + if !pos <> end_ then 82 + raise 83 + (Wire.Decode_error 84 + (Fmt.str "overran message boundary: at %d, expected end %d" !pos end_)); 85 + Message (List.rev !fields, Meta.none) 86 + 87 + let of_string s = 88 + try Ok (parse_body s 0 (String.length s)) 89 + with Wire.Decode_error msg -> Error (Error.of_wire_error msg) 90 + 91 + let of_string_exn s = 92 + match of_string s with Ok v -> v | Error e -> raise (Loc.Error e) 93 + 94 + let message_of = function 95 + | Length_delim (body, _) -> ( 96 + match of_string body with Ok v -> Some v | Error _ -> None) 97 + | _ -> None 98 + 99 + (* -- Writing: walk a Message and emit wire bytes. -- *) 100 + 101 + let rec write_value buf = function 102 + | Varint (v, _) -> Wire.write_int64 buf v 103 + | Fixed32 (v, _) -> Wire.write_fixed32 buf v 104 + | Fixed64 (v, _) -> Wire.write_fixed64 buf v 105 + | Length_delim (s, _) -> Wire.write_string buf s 106 + | Message (fields, _) -> 107 + (* A nested Message is length-delimited: emit its body into a 108 + scratch buffer, then prefix with its length. *) 109 + let body = Buffer.create 64 in 110 + write_body body fields; 111 + Leb128.add_u63_to_buffer buf (Buffer.length body); 112 + Buffer.add_buffer buf body 113 + 114 + and write_body buf fields = 115 + List.iter 116 + (fun (tag, v) -> 117 + let wt = 118 + match v with 119 + | Varint _ -> Wire.Varint 120 + | Fixed32 _ -> Wire.Fixed32 121 + | Fixed64 _ -> Wire.Fixed64 122 + | Length_delim _ -> Wire.Length_delimited 123 + | Message _ -> Wire.Length_delimited 124 + in 125 + Wire.write_tag buf ~field_number:tag ~wire_type:wt; 126 + write_value buf v) 127 + fields 128 + 129 + let to_string = function 130 + | Message (fields, _) -> 131 + let buf = Buffer.create 64 in 132 + write_body buf fields; 133 + Buffer.contents buf 134 + | _ -> 135 + invalid_arg "Protobuf.Value.to_string: top-level value must be a Message"
+70
lib/value.mli
··· 1 + (** Schema-free protobuf values. 2 + 3 + A {!t} represents any protobuf wire value without reference to a schema — 4 + the four wire-type leaves ({!Varint}, {!Fixed32}, {!Fixed64}, 5 + {!Length_delim}) plus {!Message}, a list of tagged sub-values. The type is 6 + deliberately coarse: without a schema the decoder cannot distinguish a 7 + string from a bytes from a nested message, so length-delimited blobs stay 8 + raw. Call {!Message_of} on a {!Length_delim} to re-interpret it as a 9 + sub-message if you know it is one. 10 + 11 + Typical uses: 12 + - Inspect a message whose schema the decoder does not have. 13 + - Preserve unknown fields across a decode/re-encode round-trip at a higher 14 + layer than {!Protobuf.decode_with_unknowns_string}. 15 + - Generic transforms (filter fields by tag, diff two messages, etc.) that do 16 + not need schema types. *) 17 + 18 + module Meta = Loc.Meta 19 + 20 + type 'a node = 'a * Meta.t 21 + (** A value of type ['a] paired with source-location metadata. *) 22 + 23 + type t = 24 + | Varint of int64 node 25 + | Fixed32 of int32 node 26 + | Fixed64 of int64 node 27 + | Length_delim of string node 28 + | Message of (int * t) list node 29 + (** An ordered list of [(tag, value)] fields. Repeated tags are allowed 30 + and preserved in wire order. *) 31 + 32 + val pp : t Fmt.t 33 + (** [pp ppf v] formats [v] as an indented s-expression-like layout. *) 34 + 35 + val equal : t -> t -> bool 36 + (** [equal a b] compares the structural contents of [a] and [b], ignoring 37 + metadata. *) 38 + 39 + (** {1 Construction} *) 40 + 41 + val varint : ?meta:Meta.t -> int64 -> t 42 + val fixed32 : ?meta:Meta.t -> int32 -> t 43 + val fixed64 : ?meta:Meta.t -> int64 -> t 44 + val length_delim : ?meta:Meta.t -> string -> t 45 + val message : ?meta:Meta.t -> (int * t) list -> t 46 + 47 + (** {1 Queries} *) 48 + 49 + val find : int -> t -> t option 50 + (** [find tag v] returns the first sub-value at [tag] in the message [v]. [None] 51 + if [v] is not a message or the tag is not present. *) 52 + 53 + val find_all : int -> t -> t list 54 + (** [find_all tag v] returns every sub-value at [tag] in [v], in wire order. 55 + [[]] if [v] is not a message. *) 56 + 57 + val message_of : t -> t option 58 + (** [message_of v] parses the bytes of a {!Length_delim} as an embedded message. 59 + Returns [Some m] when the bytes form a valid message body, [None] otherwise. 60 + [None] when [v] is not a {!Length_delim}. *) 61 + 62 + (** {1 Reading and writing} 63 + 64 + Top-level I/O. A Value is always a {!Message} at the wire level; 65 + {!of_string} produces a {!Message} node; {!to_string} is only defined on 66 + {!Message} values and raises {!Invalid_argument} otherwise. *) 67 + 68 + val of_string : string -> (t, Error.t) result 69 + val of_string_exn : string -> t 70 + val to_string : t -> string
+113 -1
test/test_protobuf.ml
··· 777 777 Alcotest.test_case "many repeated (10k)" `Quick test_many_repeated; 778 778 ] 779 779 780 + (* --- Value.t: schema-free AST --- *) 781 + 782 + module V = Protobuf.Value 783 + 784 + let test_value_roundtrip () = 785 + (* Encode Test1 { a = 150 } via the schema codec, decode as Value. *) 786 + let wire = Protobuf.to_string test1_codec { a = 150l } in 787 + match V.of_string wire with 788 + | Error e -> Alcotest.fail (Protobuf.Error.to_string e) 789 + | Ok v -> ( 790 + (* Re-encode via Value.to_string: bytes must match. *) 791 + Alcotest.(check string) "roundtrip bytes" wire (V.to_string v); 792 + (* Inspect the one field at tag 1. *) 793 + match V.find 1 v with 794 + | Some (V.Varint (n, _)) -> 795 + Alcotest.(check int) "tag 1 varint" 150 (Int64.to_int n) 796 + | _ -> Alcotest.fail "expected varint at tag 1") 797 + 798 + let test_value_find_all_repeated () = 799 + (* Encode a repeated string field, decode as Value, find all. *) 800 + let wire = Protobuf.to_string rep_str_codec { tags = [ "x"; "y"; "z" ] } in 801 + match V.of_string wire with 802 + | Error e -> Alcotest.fail (Protobuf.Error.to_string e) 803 + | Ok v -> 804 + let all = V.find_all 1 v in 805 + Alcotest.(check int) "three occurrences" 3 (List.length all) 806 + 807 + let test_value_message_of () = 808 + (* An Everything-style wire where field 1 is a nested message. *) 809 + let inner_wire = Protobuf.to_string test1_codec { a = 99l } in 810 + let outer = V.message [ (1, V.length_delim inner_wire) ] in 811 + match V.find 1 outer with 812 + | Some ld -> ( 813 + match V.message_of ld with 814 + | Some _ -> () 815 + | None -> Alcotest.fail "message_of should decode inner") 816 + | None -> Alcotest.fail "tag 1 missing" 817 + 818 + let value_cases = 819 + [ 820 + Alcotest.test_case "value roundtrip" `Quick test_value_roundtrip; 821 + Alcotest.test_case "value find_all repeated" `Quick 822 + test_value_find_all_repeated; 823 + Alcotest.test_case "value message_of" `Quick test_value_message_of; 824 + ] 825 + 826 + (* --- Cursor --- *) 827 + 828 + module C = Protobuf.Cursor 829 + 830 + let test_cursor_root_focus () = 831 + let v = V.message [ (1, V.varint 42L) ] in 832 + Alcotest.(check bool) 833 + "root focus is the value" true 834 + (V.equal (C.focus (C.root v)) v) 835 + 836 + let test_cursor_down_up () = 837 + let v = V.message [ (1, V.varint 42L); (2, V.length_delim "hi") ] in 838 + match C.down_field 1 (C.root v) with 839 + | None -> Alcotest.fail "down_field 1 failed" 840 + | Some c -> ( 841 + Alcotest.(check bool) 842 + "focus is varint 42" true 843 + (V.equal (C.focus c) (V.varint 42L)); 844 + match C.up c with 845 + | None -> Alcotest.fail "up from child failed" 846 + | Some c' -> 847 + Alcotest.(check bool) 848 + "up returns to root value" true 849 + (V.equal (C.focus c') v)) 850 + 851 + let test_cursor_set () = 852 + let v = V.message [ (1, V.varint 0L); (2, V.length_delim "a") ] in 853 + match C.down_field 1 (C.root v) with 854 + | None -> Alcotest.fail "down_field" 855 + | Some c -> 856 + let c' = C.set (V.varint 99L) c in 857 + let rebuilt = C.top c' in 858 + Alcotest.(check bool) 859 + "rebuild has new value at tag 1" true 860 + (V.equal rebuilt 861 + (V.message [ (1, V.varint 99L); (2, V.length_delim "a") ])) 862 + 863 + let test_cursor_field_mask () = 864 + (* Nested message: outer { inner { leaf = 7 } } *) 865 + let leaf = V.message [ (1, V.varint 7L) ] in 866 + let leaf_bytes = V.to_string leaf in 867 + let outer = V.message [ (3, V.length_delim leaf_bytes) ] in 868 + (* Path "3" descends into field 3 (a length-delim). To walk 869 + further, the caller opens it via [down_length_delim]. *) 870 + match C.of_field_mask "3" outer with 871 + | None -> Alcotest.fail "of_field_mask 3 failed" 872 + | Some c3 -> ( 873 + Alcotest.(check string) "path serialises" "3" (C.to_field_mask c3); 874 + match C.down_length_delim c3 with 875 + | None -> Alcotest.fail "down_length_delim failed" 876 + | Some c3_msg -> ( 877 + match C.down_field 1 c3_msg with 878 + | None -> Alcotest.fail "down_field 1 (leaf) failed" 879 + | Some c1 -> 880 + Alcotest.(check bool) 881 + "leaf focus" true 882 + (V.equal (C.focus c1) (V.varint 7L)))) 883 + 884 + let cursor_cases = 885 + [ 886 + Alcotest.test_case "cursor root focus" `Quick test_cursor_root_focus; 887 + Alcotest.test_case "cursor down/up" `Quick test_cursor_down_up; 888 + Alcotest.test_case "cursor set" `Quick test_cursor_set; 889 + Alcotest.test_case "cursor field mask" `Quick test_cursor_field_mask; 890 + ] 891 + 780 892 let suite = 781 893 ( "protobuf", 782 894 [ ··· 808 920 Alcotest.test_case "oneof: none -> empty wire" `Quick test_oneof_none; 809 921 Alcotest.test_case "oneof: last wins" `Quick test_oneof_last_wins; 810 922 ] 811 - @ hostile_cases ) 923 + @ hostile_cases @ value_cases @ cursor_cases )