CCSDS 121.0-B-3 Lossless Data Compression (Rice/Golomb coding)
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

rice: revert to exhaustive k scan — interop test requires libaec parity

Commit c680a7407 switched [best_split_k] to use the [select_k]
heuristic (floor(log2(mean))), reasoning that this is what Rice
coding normally does per CCSDS 121.0-B-3. That's mathematically
true but practically wrong for this library:

libaec (our CCSDS 121 reference oracle in test/interop/libaec/)
implements its own [best_split_k] via an exhaustive scan with a
first-found-wins tiebreak. On tie blocks — inputs where multiple k
values produce the same total bit length, e.g. an 8-bit uniform
ramp of 16 samples where k=2 and k=3 both yield 72 bits — libaec
picks whichever k comes first in [0..kmax]. The heuristic picks
floor(log2(mean)), which disagrees on those ties.

The interop test [compress] case [ramp_bs16_bps8] fails byte-for-
byte after the heuristic switch, emitting a 1-bit-shifted output
that phase-propagates through the entire stream. Interop-exact
match against libaec is non-negotiable for a reference-aligned
library.

Restore the exhaustive scan and keep [select_k]'s reasoning as a
comment so the next maintainer knows why we search instead of
computing. [select_k] itself is deleted (no callers).

Test result:
interop/libaec: 8/8 pass (compress + decompress × 4 vectors)
fuzz: 6/6 pass

+33 -27
+33 -27
lib/rice.ml
··· 277 277 (** kmax = 2^id_len - 3. *) 278 278 let kmax_of_id_len id_len = (1 lsl id_len) - 3 279 279 280 - (** Select the optimal split parameter k for a block of residuals. 281 - [k = floor(log2(sum / len))] clamped to [0, kmax] — the classic 282 - Rice-coding heuristic that minimises the expected code length 283 - for a geometric distribution. *) 284 - let select_k residuals ofs len kmax = 285 - let sum = ref 0 in 286 - for i = ofs to ofs + len - 1 do 287 - sum := !sum + residuals.(i) 288 - done; 289 - if !sum = 0 then 0 290 - else 291 - let ratio = float_of_int !sum /. float_of_int len in 292 - let k = int_of_float (Float.floor (log ratio /. log 2.0)) in 293 - let k = max 0 k in 294 - min k kmax 280 + (* The floor(log2(mean)) heuristic was tried here as [select_k], since 281 + it's mathematically optimal for geometric distributions and is what 282 + Rice coding "normally" does. libaec — our CCSDS 121 reference 283 + oracle — doesn't use it: its [best_split_k] below does an 284 + exhaustive scan with a first-found-wins tiebreak, and on tie blocks 285 + (e.g. 8-bit uniform ramps where k=2 and k=3 produce the same 286 + total bit length) it picks a different k than the heuristic. The 287 + interop test requires byte-for-byte match against libaec, so we 288 + follow libaec's scan; the heuristic is recorded here for the next 289 + maintainer who wonders why we search instead. *) 295 290 296 291 (** Check if all values in a sub-array are zero. *) 297 292 let all_zero arr ofs len = ··· 358 353 in 359 354 Bitwriter.write_unary bw fs 360 355 361 - (* Pick the split parameter [k] and compute the resulting encoded 362 - length. Per CCSDS 121.0-B-3 §5.1.3, k is floor(log2(block_sum / J)) 363 - clamped to [0, kmax] — that's what [select_k] returns. This is 364 - exactly the Rice/Golomb optimum for the geometric distribution the 365 - residuals follow, so no search is needed. *) 356 + (* Pick the split parameter [k] that produces the shortest encoding. 357 + 358 + Rice coding admits multiple k values with identical total encoded 359 + length (e.g. for a uniform 8-bit ramp of 16 samples, k=2 and k=3 360 + both produce 72 bits). The [select_k] heuristic — floor(log2(mean)) 361 + — picks a mathematically optimal k for the geometric 362 + distribution, but libaec (our CCSDS 121 reference oracle) uses an 363 + exhaustive search and picks a specific tiebreak. The interop test 364 + requires byte-for-byte matching against libaec, so we do the same 365 + exhaustive scan. On typical inputs both approaches agree; 366 + exhaustive only differs on tie blocks, and preserves exact 367 + parity with libaec's output. *) 366 368 let best_split_k residuals res_ofs count id_len bps is_ref kmax = 367 - let k = select_k residuals res_ofs count kmax in 368 - let len = 369 - id_len 370 - + (if is_ref then bps else 0) 371 - + split_encoded_len residuals res_ofs count k 372 - in 373 - (k, len) 369 + let best_k = ref 0 in 370 + let best_len = ref max_int in 371 + for k = 0 to kmax do 372 + let l = split_encoded_len residuals res_ofs count k in 373 + let total = id_len + (if is_ref then bps else 0) + l in 374 + if total < !best_len then begin 375 + best_k := k; 376 + best_len := total 377 + end 378 + done; 379 + (!best_k, !best_len) 374 380 375 381 let encode_second_extension bw residuals res_ofs count bps id_len is_ref 376 382 ref_sample =