languages/ziglang/0.15/binary.md at main

# binary encoding

patterns for encoding/decoding binary wire formats (CBOR, CAR, protocol frames). distinct from JSON - you're working with raw bytes and need to handle endianness, varints, and content addressing.

## anytype writer for encoders

the core pattern: an encoder function that accepts any writer via `anytype`. this lets the same encoder write to fixed buffers, ArrayLists, or any other writer:

```zig
pub fn encode(allocator: Allocator, writer: anytype, value: Value) !void {
    switch (value) {
        .unsigned => |v| try writeArgument(writer, 0, v), // major type 0: unsigned int
        .text => |t| {
            try writeArgument(writer, 3, t.len); // major type 3: text string, length first
            try writer.writeAll(t);
        },
        .map => |entries| {
            // sort keys (DAG-CBOR determinism), needs allocator
            const sorted = try allocator.dupe(MapEntry, entries);
            defer allocator.free(sorted);
            std.mem.sort(MapEntry, sorted, {}, keyLessThan);
            // ...
        },
        // ...
    }
}
```

the allocator parameter is separate from the writer - needed for temporary allocations during encoding (sorting map keys, building intermediate buffers), not for the output itself.

usage with different writers:

```zig
// fixed buffer (no allocation for output)
var buf: [1024]u8 = undefined;
var stream = std.io.fixedBufferStream(&buf);
try encode(alloc, stream.writer(), value);
const result = stream.getWritten();

// growable buffer
var list: std.ArrayList(u8) = .empty;
defer list.deinit(alloc);
try encode(alloc, list.writer(alloc), value);
```
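
because `anytype` is duck-typed, the writer doesn't have to come from std at all: any type with the methods the encoder calls will compile. a minimal sketch, assuming the encoder only needs `writeByte` and `writeAll` (`CountingWriter` and `encodeShortText` are hypothetical names, not part of zat):

```zig
const std = @import("std");

// hypothetical sink: implements only writeByte and writeAll, and counts
// bytes instead of storing them
const CountingWriter = struct {
    count: usize = 0,

    pub fn writeByte(self: *CountingWriter, byte: u8) !void {
        _ = byte;
        self.count += 1;
    }

    pub fn writeAll(self: *CountingWriter, bytes: []const u8) !void {
        self.count += bytes.len;
    }
};

// hypothetical stand-in for `encode`: a CBOR text string shorter than 24 bytes
fn encodeShortText(writer: anytype, text: []const u8) !void {
    std.debug.assert(text.len < 24); // keeps the header to a single byte
    try writer.writeByte(0x60 | @as(u8, @intCast(text.len)));
    try writer.writeAll(text);
}

test "anytype accepts a hand-rolled writer" {
    var counter: CountingWriter = .{};
    try encodeShortText(&counter, "hello");
    // 1 header byte + 5 payload bytes
    try std.testing.expectEqual(@as(usize, 6), counter.count);
}
```

a counting sink like this is one way to size a buffer before encoding for real.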

**note**: `std.io.fixedBufferStream` is deprecated in 0.15; the stdlib says to use `std.Io.Writer.fixed` / `std.Io.Reader.fixed` instead. the old API still compiles (zat uses it in 3 files) but new code should prefer the non-deprecated form. the `anytype` writer pattern itself is fine either way: the encoder doesn't care which writer type backs it.
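
a sketch of the non-deprecated form, assuming 0.15's `std.Io.Writer.fixed` and `Writer.buffered()` behave as the stdlib docs describe (`encodeShortText` is a hypothetical stand-in for `encode`). the new writer is a concrete struct that holds its own state, so it is passed by pointer:

```zig
const std = @import("std");

// hypothetical stand-in for `encode`: a CBOR text string shorter than 24 bytes
fn encodeShortText(writer: anytype, text: []const u8) !void {
    std.debug.assert(text.len < 24); // keeps the header to a single byte
    try writer.writeByte(0x60 | @as(u8, @intCast(text.len)));
    try writer.writeAll(text);
}

test "encode into a fixed buffer via std.Io.Writer.fixed" {
    var buf: [1024]u8 = undefined;
    var w: std.Io.Writer = .fixed(&buf);
    try encodeShortText(&w, "hi");
    // buffered() returns the bytes written so far
    try std.testing.expectEqualSlices(u8, &.{ 0x62, 'h', 'i' }, w.buffered());
}
```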

see: [zat/cbor.zig](https://tangled.sh/@zzstoatzz.io/zat/tree/main/src/internal/cbor.zig)

## encodeAlloc convenience

wrap the growable-buffer pattern into a helper:

```zig
pub fn encodeAlloc(allocator: Allocator, value: Value) ![]u8 {
    var list: std.ArrayList(u8) = .empty;
    errdefer list.deinit(allocator);
    try encode(allocator, list.writer(allocator), value);
    return try list.toOwnedSlice(allocator);
}
```

caller owns the returned slice and must free it with the same allocator. `errdefer` ensures cleanup if encoding fails partway through.

## big-endian integers without writeInt

when writing fixed-width big-endian integers to an `anytype` writer, build the bytes manually rather than depending on `writeInt` (which may not be available on all writer types):

```zig
fn writeArgument(writer: anytype, major: u3, val: u64) !void {
    const prefix: u8 = @as(u8, major) << 5;
    // only the 16-bit form (additional info 25) is shown; a complete encoder
    // tries the shorter forms first (see "deterministic encoding")
    if (val <= 0xffff) {
        try writer.writeByte(prefix | 25);
        const v: u16 = @intCast(val);
        // high byte first = big-endian
        try writer.writeAll(&[2]u8{ @truncate(v >> 8), @truncate(v) });
    }
    // ...
}
```

`@truncate` on a shifted value keeps only the low byte, which is the idiomatic way to extract individual bytes.
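
`std.mem.writeInt` is a different function from the writer method of the same name: it fills a fixed-size byte array, so it works no matter which writer sits behind `anytype`. a sketch of all five argument widths built that way, shortest form first (`encodeHead` and `writeArgumentFull` are hypothetical names; zat's `writeArgument` may be structured differently):

```zig
const std = @import("std");

/// hypothetical helper: encode a CBOR head (major type + argument) into a
/// 9-byte scratch buffer and return the bytes used
fn encodeHead(scratch: *[9]u8, major: u3, val: u64) []const u8 {
    const prefix: u8 = @as(u8, major) << 5;
    if (val < 24) {
        // argument fits in the low 5 bits of the first byte
        scratch[0] = prefix | @as(u8, @intCast(val));
        return scratch[0..1];
    } else if (val <= 0xff) {
        scratch[0] = prefix | 24;
        scratch[1] = @intCast(val);
        return scratch[0..2];
    } else if (val <= 0xffff) {
        scratch[0] = prefix | 25;
        std.mem.writeInt(u16, scratch[1..3], @as(u16, @intCast(val)), .big);
        return scratch[0..3];
    } else if (val <= 0xffff_ffff) {
        scratch[0] = prefix | 26;
        std.mem.writeInt(u32, scratch[1..5], @as(u32, @intCast(val)), .big);
        return scratch[0..5];
    } else {
        scratch[0] = prefix | 27;
        std.mem.writeInt(u64, scratch[1..9], val, .big);
        return scratch[0..9];
    }
}

// with any writer: one writeAll per head
fn writeArgumentFull(writer: anytype, major: u3, val: u64) !void {
    var scratch: [9]u8 = undefined;
    try writer.writeAll(encodeHead(&scratch, major, val));
}

test "500 is 0x01f4, high byte first" {
    var scratch: [9]u8 = undefined;
    try std.testing.expectEqualSlices(u8, &.{ 0x19, 0x01, 0xf4 }, encodeHead(&scratch, 0, 500));
}
```

building the head in a stack buffer also means the writer sees a single `writeAll` per head instead of one call per byte.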

## unsigned varint (LEB128)

used by CID, CAR, and other IPLD formats for variable-length integers:

```zig
// write
pub fn writeUvarint(writer: anytype, val: u64) !void {
    var v = val;
    while (v >= 0x80) {
        try writer.writeByte(@as(u8, @truncate(v)) | 0x80);
        v >>= 7;
    }
    try writer.writeByte(@as(u8, @truncate(v)));
}

// read
fn readUvarint(data: []const u8, pos: *usize) ?u64 {
    var result: u64 = 0;
    var shift: u6 = 0;
    while (pos.* < data.len) {
        const byte = data[pos.*];
        pos.* += 1;
        result |= @as(u64, byte & 0x7f) << shift;
        if (byte & 0x80 == 0) return result;
        // u6 maxes out at 63: reaching it means this was the 10th byte,
        // the most a u64 can use
        if (shift == 63) return null;
        shift +|= 7;
    }
    return null;
}
```

note `+|=` (saturating add) keeps the `u6` shift counter from overflowing, and the `shift == 63` check rejects encodings longer than 10 bytes. the reader returns `null` for both overlong and truncated input.
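
a worked example of the byte layout, using a hypothetical buffer-based variant (`encodeUvarint` is not in zat). a `u64` needs at most 10 bytes, since each byte carries 7 bits of payload:

```zig
const std = @import("std");

/// hypothetical buffer-based variant of writeUvarint
fn encodeUvarint(scratch: *[10]u8, val: u64) []const u8 {
    var v = val;
    var i: usize = 0;
    while (v >= 0x80) : (i += 1) {
        scratch[i] = @as(u8, @truncate(v)) | 0x80; // low 7 bits + continuation bit
        v >>= 7;
    }
    scratch[i] = @truncate(v); // final byte: continuation bit clear
    return scratch[0 .. i + 1];
}

test "300 encodes as 0xac 0x02" {
    var scratch: [10]u8 = undefined;
    // 300 = 0b10_0101100
    //   low 7 bits 0101100 (0x2c) | 0x80 = 0xac
    //   remaining bits 0b10           = 0x02
    try std.testing.expectEqualSlices(u8, &.{ 0xac, 0x02 }, encodeUvarint(&scratch, 300));
}
```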

## arena per message

for streaming protocols, create an arena per incoming message. all decoding allocations go into it, then free everything at once:

```zig
pub fn serverMessage(self: *Self, data: []const u8) !void {
    var arena = std.heap.ArenaAllocator.init(self.allocator);
    defer arena.deinit();

    const event = decodeFrame(arena.allocator(), data) catch |err| {
        log.debug("decode error: {s}", .{@errorName(err)});
        return;
    };

    self.handler.onEvent(event);
    // arena freed here: all decoded data is gone
}
```

this means the handler's `onEvent` must not hold references to event data past the call. if it needs to, it must copy into its own allocator.
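
a sketch of the copy-out rule, with a hypothetical `Event` and `Handler` that keep the most recent event text in the handler's own allocator (unlike the snippet above, this `onEvent` can fail, because copying allocates):

```zig
const std = @import("std");

// hypothetical event and handler, for illustration only
const Event = struct { text: []const u8 };

const Handler = struct {
    allocator: std.mem.Allocator,
    last_text: ?[]u8 = null,

    // copy whatever must outlive the call into the handler's own allocator
    pub fn onEvent(self: *Handler, event: Event) !void {
        const copy = try self.allocator.dupe(u8, event.text);
        if (self.last_text) |old| self.allocator.free(old);
        self.last_text = copy;
    }

    pub fn deinit(self: *Handler) void {
        if (self.last_text) |old| self.allocator.free(old);
    }
};

test "handler copy survives the per-message arena" {
    var handler: Handler = .{ .allocator = std.testing.allocator };
    defer handler.deinit();

    {
        var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
        defer arena.deinit();
        const decoded = try arena.allocator().dupe(u8, "commit");
        try handler.onEvent(.{ .text = decoded });
    } // arena freed here; `decoded` is now dangling

    try std.testing.expectEqualStrings("commit", handler.last_text.?);
}
```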

see: [zat/firehose.zig](https://tangled.sh/@zzstoatzz.io/zat/tree/main/src/internal/firehose.zig), [zat/jetstream.zig](https://tangled.sh/@zzstoatzz.io/zat/tree/main/src/internal/jetstream.zig)

## specialized decoders

when generic decoding is too expensive, write a purpose-built parser for a known schema. the generic path builds `Value` unions, `MapEntry` arrays, and handles every CBOR type. if you know the exact shape, skip all that.

example: MST nodes are always `map(2) { "e": array[entries...], "l": CID|null }`. instead of `cbor.decodeAll()` → extract fields from Value unions, parse the CBOR bytes directly:

```zig
pub fn decodeMstNode(allocator: Allocator, data: []const u8) MstDecodeError!MstNodeData {
    // expect map(2), key "e", array(n) — known byte sequence
    // parse entries inline, zero-copy slicing into input buffer
    // only allocation: the entries array itself
}

pub const MstNodeData = struct {
    left: ?[]const u8,       // raw CID bytes (borrowed from input)
    entries: []MstEntryData, // heap-allocated array
};

pub const MstEntryData = struct {
    prefix_len: usize,
    key_suffix: []const u8,  // borrowed from input
    value_cid: []const u8,   // borrowed from input
    tree: ?[]const u8,       // borrowed from input
};
```

the result: MST walk went from 45.5ms (generic decode per node) to 39.3ms (specialized decode) on 243k blocks. the bigger win was avoiding the full tree rebuild (218ms → 39ms total) by verifying structure during the walk.

when to use this pattern:
- you decode the same schema thousands of times (MST nodes, CBOR blocks)
- the schema is stable and well-known
- profiling shows decode time dominates

when NOT to use it:
- the schema varies or is user-defined
- you only decode a handful of times
- generic decode is fast enough
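
the zero-copy slicing can be sketched on a single field. `readBytesBorrowed` is a hypothetical helper, not zat's code: it reads one CBOR byte string and returns a slice into the input instead of allocating. only the two shortest length forms are handled, to keep the sketch short:

```zig
const std = @import("std");

/// hypothetical helper: read a CBOR byte string (major type 2) at `pos` and
/// return a slice that borrows from `data`. advances `pos` only on success.
fn readBytesBorrowed(data: []const u8, pos: *usize) ?[]const u8 {
    if (pos.* >= data.len) return null;
    const head = data[pos.*];
    if (head >> 5 != 2) return null; // not a byte string
    const info = head & 0x1f;
    var i = pos.* + 1;
    var len: usize = info; // lengths 0-23 are stored inline
    if (info == 24) {
        // length in the next byte
        if (i >= data.len) return null;
        len = data[i];
        i += 1;
    } else if (info > 24) {
        return null; // 2/4/8-byte length forms omitted from this sketch
    }
    if (data.len - i < len) return null; // truncated input
    pos.* = i + len;
    return data[i..][0..len];
}

test "result points into the input, no copy" {
    const data: []const u8 = &.{ 0x43, 1, 2, 3 };
    var pos: usize = 0;
    const bytes = readBytesBorrowed(data, &pos).?;
    try std.testing.expectEqualSlices(u8, &.{ 1, 2, 3 }, bytes);
    try std.testing.expect(bytes.ptr == data.ptr + 1);
}
```

the returned slice is only valid while the input buffer is alive, which is the same lifetime rule as the `borrowed from input` fields above.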

see: [zat/mst.zig decodeMstNode](https://tangled.sh/@zzstoatzz.io/zat/tree/main/src/internal/repo/mst.zig)

## deterministic encoding

DAG-CBOR requires deterministic output (same value → same bytes). the main rules:

- **shortest integer encoding**: 0-23 inline, 24-255 in 1 byte, etc.
- **map keys sorted**: by byte length first, then lexicographically
- **no floats, no indefinite lengths**

sorting map keys during encoding:

```zig
fn dagCborKeyLessThan(_: void, a: MapEntry, b: MapEntry) bool {
    if (a.key.len != b.key.len) return a.key.len < b.key.len;
    return std.mem.order(u8, a.key, b.key) == .lt;
}

// in encoder:
const sorted = try allocator.dupe(MapEntry, entries);
defer allocator.free(sorted);
std.mem.sort(MapEntry, sorted, {}, dagCborKeyLessThan);
```

the dupe + sort pattern avoids mutating the input: the caller's `entries` slice stays unchanged.
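
a quick check of what that ordering means in practice, with a hypothetical one-field `MapEntry` (the real one also carries a value). note that `"aa"` sorts after `"b"` because length is compared first:

```zig
const std = @import("std");

// hypothetical minimal entry type, for illustration only
const MapEntry = struct { key: []const u8 };

fn dagCborKeyLessThan(_: void, a: MapEntry, b: MapEntry) bool {
    if (a.key.len != b.key.len) return a.key.len < b.key.len;
    return std.mem.order(u8, a.key, b.key) == .lt;
}

test "shorter keys sort first, ties break bytewise" {
    var entries = [_]MapEntry{ .{ .key = "b" }, .{ .key = "aa" }, .{ .key = "a" } };
    std.mem.sort(MapEntry, &entries, {}, dagCborKeyLessThan);
    try std.testing.expectEqualStrings("a", entries[0].key);
    try std.testing.expectEqualStrings("b", entries[1].key);
    try std.testing.expectEqualStrings("aa", entries[2].key);
}
```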