dupfind
Duplicate code detection for OCaml. Finds structurally identical functions across packages by normalizing ASTs and comparing structural hashes.
How it works
- Parse: OCaml source files are parsed into the compiler's Parsetree.
- Normalize: Each binding's expression is converted to a simplified AST with locations erased and local variables alpha-renamed to canonical names (`_0`, `_1`, ...), so `fun x -> x + 1` and `fun y -> y + 1` produce identical representations.
- Hash: The normalized AST is structurally hashed (tag-length-value encoding into a buffer, then MD5).
- Cluster: Bindings with the same hash are grouped; cross-package clusters are reported as clones.
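The normalize and hash steps can be illustrated with a minimal sketch. It is not dupfind's implementation: the `expr` type is a toy stand-in for the Parsetree, the canonical names are derived from binder depth, and the tag bytes and 4-byte length prefix are assumptions chosen for the example. It needs OCaml 4.08 or later for `Buffer.add_int32_be`.

```ocaml
(* Toy expression type standing in for the compiler's Parsetree (assumption). *)
type expr =
  | Var of string
  | Int of int
  | Fun of string * expr
  | App of expr * expr list

(* Alpha-rename bound variables to _0, _1, ... based on binder depth.
   Free variables (such as the operator "+") keep their names. *)
let normalize e =
  let rec go env = function
    | Var x -> Var (Option.value (List.assoc_opt x env) ~default:x)
    | Int n -> Int n
    | Fun (x, body) ->
      let canonical = Printf.sprintf "_%d" (List.length env) in
      Fun (canonical, go ((x, canonical) :: env) body)
    | App (f, args) -> App (go env f, List.map (go env) args)
  in
  go [] e

(* Tag-length-value encoding: one tag byte, a 4-byte big-endian payload
   length, then the payload. Children are encoded into the payload. *)
let rec encode e =
  let tlv tag payload =
    let buf = Buffer.create (String.length payload + 5) in
    Buffer.add_char buf tag;
    Buffer.add_int32_be buf (Int32.of_int (String.length payload));
    Buffer.add_string buf payload;
    Buffer.contents buf
  in
  match e with
  | Var x -> tlv 'V' x
  | Int n -> tlv 'I' (string_of_int n)
  | Fun (x, body) -> tlv 'F' (tlv 'V' x ^ encode body)
  | App (f, args) -> tlv 'A' (String.concat "" (List.map encode (f :: args)))

(* MD5 of the encoded, normalized expression, as 32 hex characters. *)
let structural_hash e = Digest.to_hex (Digest.string (encode (normalize e)))

(* fun x -> x + 1 and fun y -> y + 1 in the toy representation. *)
let f = Fun ("x", App (Var "+", [ Var "x"; Int 1 ]))
let g = Fun ("y", App (Var "+", [ Var "y"; Int 1 ]))

let () =
  Printf.printf "%s\n%s\n" (structural_hash f) (structural_hash g)
```

Both printed digests are the same 32-character hex string. They will not match the hashes dupfind prints, since the encoding here is invented for illustration.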
Installation
Install with opam:
$ opam install dupfind
If opam cannot find the package, it may not yet be released in the public
opam-repository. Add the overlay repository, then install it:
$ opam repo add samoht https://tangled.org/gazagnaire.org/opam-overlay.git
$ opam update
$ opam install dupfind
Usage
Find cross-package duplicates
dupfind scan [--no-intra] [--min-size N] [--top N] [--format cli|json] [--exclude PATTERN] DIR...
Scans directories for `.ml` files and reports duplicate functions. By
default, both cross-package and within-package duplicates are reported.
Use `--no-intra` to show only cross-package duplicates.
$ dupfind scan src/lib-a src/lib-b
Clone #1 (21 nodes, 2 occurrences)
src/lib-b/b.ml:1 encode
src/lib-a/a.ml:1 encode
Found 1 clone clusters across 2 packages (42 total duplicated nodes)
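The grouping behind this report can be sketched as follows. This is an illustration under assumptions, not dupfind's code: the `binding` record and its field names are hypothetical, and how dupfind decides which package a file belongs to is not described here, so the package is supplied directly.

```ocaml
(* Hypothetical record for one hashed binding; field names are assumptions. *)
type binding = {
  package : string;
  file : string;
  name : string;
  hash : string;
}

(* Group bindings by hash and keep only groups with at least two members. *)
let clusters bindings =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun b ->
      let prev = Option.value (Hashtbl.find_opt tbl b.hash) ~default:[] in
      Hashtbl.replace tbl b.hash (b :: prev))
    bindings;
  Hashtbl.fold
    (fun _ group acc ->
      if List.length group >= 2 then List.rev group :: acc else acc)
    tbl []

(* A cluster is cross-package when its members span more than one package. *)
let is_cross_package group =
  let packages = List.map (fun b -> b.package) group in
  List.length (List.sort_uniq String.compare packages) > 1

(* Mirrors the effect described for --no-intra: drop within-package clusters. *)
let no_intra cs = List.filter is_cross_package cs

let sample =
  [
    { package = "lib-a"; file = "src/lib-a/a.ml"; name = "encode"; hash = "h1" };
    { package = "lib-b"; file = "src/lib-b/b.ml"; name = "encode"; hash = "h1" };
    { package = "lib-a"; file = "src/lib-a/a.ml"; name = "id1"; hash = "h2" };
    { package = "lib-a"; file = "src/lib-a/c.ml"; name = "id2"; hash = "h2" };
    { package = "lib-b"; file = "src/lib-b/b.ml"; name = "solo"; hash = "h3" };
  ]

let () =
  Printf.printf "all clusters: %d, cross-package only: %d\n"
    (List.length (clusters sample))
    (List.length (no_intra (clusters sample)))
```

In the sample, `h1` spans two packages, `h2` stays inside `lib-a`, and `h3` has a single member, so only the first group survives the cross-package filter.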
Find duplicates of a specific function
dupfind find QUERY [--min-size N] [--top N] [--format cli|json] DIR...
QUERY can be:
- A qualified name like `Module.func`. dupfind looks up the binding in the scanned files, then finds all structural matches.
- A hex hash (32 chars). dupfind finds all bindings with that exact hash.
- An inline OCaml expression like `"let f x = x + 1"`. dupfind normalizes and hashes it, then searches for matches.
$ dupfind find Encode.to_bytes src/
$ dupfind find db4909941d07713e84ced7187adab6c8 src/
$ dupfind find "let f x = x + 1" src/
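One way to tell the three query forms apart is sketched below. The rules dupfind actually uses are not documented here, so the heuristics (32 hex characters, or a dotted identifier starting with an uppercase letter, otherwise an expression) and the `query` type are assumptions. `String.for_all` requires OCaml 4.13 or later.

```ocaml
(* Hypothetical classification of a QUERY argument. *)
type query =
  | Hash of string        (* 32 hex characters *)
  | Qualified of string   (* Module.func *)
  | Expression of string  (* inline OCaml source *)

let is_hex = function
  | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
  | _ -> false

let is_path_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '\'' | '.' -> true
  | _ -> false

let starts_uppercase s =
  s <> "" && (match s.[0] with 'A' .. 'Z' -> true | _ -> false)

let classify raw =
  let q = String.trim raw in
  if String.length q = 32 && String.for_all is_hex q then
    Hash (String.lowercase_ascii q)
  else if starts_uppercase q && String.contains q '.'
          && String.for_all is_path_char q then
    Qualified q
  else
    Expression q

let () =
  List.iter
    (fun q ->
      match classify q with
      | Hash _ -> Printf.printf "%s: hash\n" q
      | Qualified _ -> Printf.printf "%s: qualified name\n" q
      | Expression _ -> Printf.printf "%s: expression\n" q)
    [ "Encode.to_bytes"; "db4909941d07713e84ced7187adab6c8"; "let f x = x + 1" ]
```

Checking the hash form first keeps a 32-character hex string from ever being parsed as source code; anything that is neither a hash nor a dotted path falls through to the expression case.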
Inspect the normalized AST and hash
dupfind hash EXPR
Shows the normalized AST and structural hash for an inline OCaml expression. Useful for understanding how normalization works and debugging.
$ dupfind hash "let f x = x + 1"
AST: fun _0 -> + _0 1
Hash: db4909941d07713e84ced7187adab6c8
$ dupfind hash "let g y = y + 1"
AST: fun _0 -> + _0 1
Hash: db4909941d07713e84ced7187adab6c8
Note how `f x` and `g y` produce identical ASTs: local variable names are
replaced by canonical ones during normalization, and the binding's own name
does not appear in the normalized expression.
Options
| Flag | Description |
|---|---|
| `--min-size N` | Minimum AST node count (default: 30) |
| `--no-intra` | Only show cross-package duplicates (intra shown by default) |
| `--top N` | Show only top N clone clusters |
| `--format cli\|json` | Output format |
| `--exclude PATTERN` | Exclude paths matching glob |
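To make the `--min-size` threshold concrete, here is a small sketch of node counting on a toy AST. How dupfind counts nodes is not specified here, so both the `expr` type and the rule of one node per constructor are assumptions; the default of 30 is taken from the table above.

```ocaml
(* Toy expression type; dupfind's real normalized AST is assumed to be richer. *)
type expr =
  | Var of string
  | Int of int
  | Fun of string * expr
  | App of expr * expr list

(* Count one node per constructor (assumed counting rule). *)
let rec size = function
  | Var _ | Int _ -> 1
  | Fun (_, body) -> 1 + size body
  | App (f, args) -> List.fold_left (fun n a -> n + size a) (1 + size f) args

(* Keep a binding only if it has at least min_size nodes. *)
let keep ?(min_size = 30) e = size e >= min_size

(* fun _0 -> + _0 1 *)
let small = Fun ("_0", App (Var "+", [ Var "_0"; Int 1 ]))

let () =
  Printf.printf "size = %d, kept at default = %b, kept at 5 = %b\n"
    (size small) (keep small) (keep ~min_size:5 small)
```

Under this counting rule the expression has 5 nodes, so it is filtered out at the default threshold and kept once the threshold is lowered to 5.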