Duplicate code detection across OCaml packages
dupfind

Duplicate code detection for OCaml. Finds structurally identical functions across packages by normalizing ASTs and comparing structural hashes.

How it works

  1. Parse — OCaml source files are parsed into the compiler's Parsetree
  2. Normalize — Each binding's expression is converted to a simplified AST with locations erased and local variables alpha-renamed to canonical names (_0, _1, ...), so fun x -> x + 1 and fun y -> y + 1 produce identical representations
  3. Hash — The normalized AST is structurally hashed (tag-length-value encoding into a buffer, then MD5)
  4. Cluster — Bindings with the same hash are grouped; cross-package clusters are reported as clones
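The normalize and hash steps can be sketched on a toy expression type. This is a minimal illustration, not dupfind's implementation: the `expr` type stands in for the compiler's Parsetree, and the names `normalize`, `encode`, and `structural_hash`, as well as the exact tag bytes, are assumptions made for the example. The standard library's `Digest` module provides the MD5 step.

```ocaml
(* Toy expression type standing in for the compiler's Parsetree
   (hypothetical; the real tool works on Parsetree nodes). *)
type expr =
  | Var of string
  | Int of int
  | Fun of string * expr
  | App of expr * expr list

(* Alpha-rename bound variables to _0, _1, ... in binding order.
   Free names such as "+" are left untouched. *)
let normalize e =
  let counter = ref 0 in
  let rec go env = function
    | Var x -> Var (try List.assoc x env with Not_found -> x)
    | Int n -> Int n
    | Fun (x, body) ->
        let fresh = Printf.sprintf "_%d" !counter in
        incr counter;
        Fun (fresh, go ((x, fresh) :: env) body)
    | App (f, args) ->
        let f' = go env f in
        App (f', List.map (go env) args)
  in
  go [] e

(* Write one tag-length-value field: tag byte, payload length, payload. *)
let add_field buf tag payload =
  Buffer.add_char buf tag;
  Buffer.add_string buf (string_of_int (String.length payload));
  Buffer.add_char buf ':';
  Buffer.add_string buf payload

(* Encode a tree; for App the "length" is the number of arguments. *)
let rec encode buf = function
  | Var x -> add_field buf 'V' x
  | Int n -> add_field buf 'I' (string_of_int n)
  | Fun (x, body) ->
      add_field buf 'F' x;
      encode buf body
  | App (f, args) ->
      Buffer.add_char buf 'A';
      Buffer.add_string buf (string_of_int (List.length args));
      Buffer.add_char buf ':';
      encode buf f;
      List.iter (encode buf) args

(* Normalize, encode into a buffer, then take the MD5 digest as hex. *)
let structural_hash e =
  let buf = Buffer.create 64 in
  encode buf (normalize e);
  Digest.to_hex (Digest.string (Buffer.contents buf))

(* fun x -> x + 1 and fun y -> y + 1 in the toy representation *)
let f = Fun ("x", App (Var "+", [ Var "x"; Int 1 ]))
let g = Fun ("y", App (Var "+", [ Var "y"; Int 1 ]))
```

With these definitions, `structural_hash f` and `structural_hash g` are equal because both normalize to the same tree, while changing the constant `1` to `2` yields a different hash.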

Installation

Install with opam:

$ opam install dupfind

If opam cannot find the package, it may not yet be released in the public opam-repository. Add the overlay repository, then install it:

$ opam repo add samoht https://tangled.org/gazagnaire.org/opam-overlay.git
$ opam update
$ opam install dupfind

Usage

Find cross-package duplicates

dupfind scan [--no-intra] [--min-size N] [--top N] [--format cli|json] [--exclude PATTERN] DIR...

Scan directories for .ml files and report duplicate functions across packages. By default, both cross-package and within-package duplicates are reported. Use --no-intra to show only cross-package duplicates.

$ dupfind scan src/lib-a src/lib-b
 Clone #1 (21 nodes, 2 occurrences)
   src/lib-b/b.ml:1  encode
   src/lib-a/a.ml:1  encode

Found 1 clone clusters across 2 packages (42 total duplicated nodes)
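The grouping behind this report can be sketched as follows. This is an illustrative sketch under stated assumptions: the `binding` record, its field names, and the `cluster`, `is_cross_package`, and `report` functions are hypothetical and are not taken from dupfind's source. It groups bindings by hash and, when `no_intra` is set, keeps only clusters that span at least two packages.

```ocaml
(* One scanned binding; the record and its field names are hypothetical. *)
type binding = {
  package : string;
  file : string;
  name : string;
  hash : string;
}

module SMap = Map.Make (String)

(* Group bindings by hash and keep groups with at least two members. *)
let cluster bindings =
  let groups =
    List.fold_left
      (fun m b ->
        SMap.update b.hash
          (function None -> Some [ b ] | Some bs -> Some (b :: bs))
          m)
      SMap.empty bindings
  in
  SMap.fold
    (fun _ bs acc -> if List.length bs >= 2 then List.rev bs :: acc else acc)
    groups []

(* A cluster is cross-package when its members span two or more packages. *)
let is_cross_package bs =
  let packages = List.sort_uniq String.compare (List.map (fun b -> b.package) bs) in
  List.length packages >= 2

(* Mirrors the documented default: within-package clusters are reported
   unless no_intra is set. *)
let report ~no_intra bindings =
  let clusters = cluster bindings in
  if no_intra then List.filter is_cross_package clusters else clusters
```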

Find duplicates of a specific function

dupfind find QUERY [--min-size N] [--top N] [--format cli|json] DIR...

QUERY can be:

  • A qualified name like Module.func — looks up the binding in the scanned files, then finds all structural matches
  • A hex hash (32 chars) — finds all bindings with that exact hash
  • An inline OCaml expression like "let f x = x + 1" — normalizes and hashes it, then searches for matches

$ dupfind find Encode.to_bytes src/
$ dupfind find db4909941d07713e84ced7187adab6c8 src/
$ dupfind find "let f x = x + 1" src/
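One way to tell the three QUERY forms apart is sketched below. The README does not say how dupfind disambiguates them, so the heuristics here are assumptions: a hash is exactly 32 hex characters, a qualified name starts with an uppercase letter, contains a dot, and uses only identifier characters, and anything else is treated as an inline expression. The `query` type and the `classify` function are hypothetical names.

```ocaml
(* The three documented query forms (type and constructor names are
   hypothetical). *)
type query =
  | Hash of string
  | Qualified of string
  | Expression of string

(* True when every character of s satisfies p. *)
let for_all p s =
  let rec go i = i >= String.length s || (p s.[i] && go (i + 1)) in
  go 0

let is_hex_digit = function
  | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
  | _ -> false

let is_ident_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '\'' | '.' -> true
  | _ -> false

(* A hex hash is exactly 32 hex characters (an MD5 digest in hex). *)
let is_hash s = String.length s = 32 && for_all is_hex_digit s

(* Heuristic for Module.func: uppercase first letter, a dot, and
   nothing but identifier characters. *)
let is_qualified s =
  s <> ""
  && (match s.[0] with 'A' .. 'Z' -> true | _ -> false)
  && String.contains s '.'
  && for_all is_ident_char s

let classify raw =
  let s = String.trim raw in
  if is_hash s then Hash (String.lowercase_ascii s)
  else if is_qualified s then Qualified s
  else Expression s
```

Checking the hash form first matters because a 32-character hex string is never a valid qualified name under this heuristic, but it could otherwise be mistaken for an expression.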

Inspect the normalized AST and hash

dupfind hash EXPR

Shows the normalized AST and structural hash for an inline OCaml expression. Useful for understanding how normalization works and debugging.

$ dupfind hash "let f x = x + 1"
AST:  fun _0 -> + _0 1
Hash: db4909941d07713e84ced7187adab6c8

$ dupfind hash "let g y = y + 1"
AST:  fun _0 -> + _0 1
Hash: db4909941d07713e84ced7187adab6c8

Note how f x and g y produce identical ASTs — local variable names are erased during normalization.

Options

Flag               Description
--min-size N       Minimum AST node count (default: 30)
--no-intra         Only show cross-package duplicates (within-package duplicates are shown by default)
--top N            Show only the top N clone clusters
--format cli|json  Output format
--exclude PATTERN  Exclude paths matching a glob pattern
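How --min-size and --top might combine can be sketched as a filter followed by a ranking. This is an assumption-laden illustration: the `cluster` record, the `select` function, and in particular the ranking by total duplicated nodes (nodes times occurrences) are hypothetical, since the README does not state how clusters are ordered.

```ocaml
(* Summary of one clone cluster; the record is hypothetical. *)
type cluster = { nodes : int; occurrences : int }

(* Total duplicated nodes, e.g. 21 nodes * 2 occurrences = 42. *)
let total c = c.nodes * c.occurrences

(* First n elements of a list, or fewer if the list is shorter. *)
let rec take n = function
  | x :: rest when n > 0 -> x :: take (n - 1) rest
  | _ -> []

(* Drop clusters below min_size, rank the rest by total duplicated
   nodes (assumed ordering), and optionally keep only the top N. *)
let select ?(min_size = 30) ?top clusters =
  let kept = List.filter (fun c -> c.nodes >= min_size) clusters in
  let ranked = List.sort (fun a b -> compare (total b) (total a)) kept in
  match top with
  | None -> ranked
  | Some n -> take n ranked
```

For example, `select ~min_size:20 ~top:1 clusters` would keep clusters of at least 20 nodes and return only the largest one under this ranking.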