Detect which human language a document uses from OCaml, from the Nu Html validator
# langdetect

Language detection library for OCaml using n-gram frequency analysis.

This is an OCaml port of the [Cybozu
langdetect](https://github.com/shuyo/language-detection) algorithm. It detects
the natural language of text using n-gram frequency profiles. It was ported
from <https://github.com/validator/validator>.

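To make "n-gram frequency profile" concrete, here is a minimal sketch of counting character n-grams in a string. It is an illustration only, not the library's implementation: the names `ngram_counts` and `frequency` are hypothetical, and the sketch works on bytes, so it is only meaningful for ASCII input, whereas a real detector would work on Unicode characters.

```ocaml
(* Illustrative sketch, not part of Langdetect: count all n-grams of
   length 1..max_n. Works on bytes, so only meaningful for ASCII text. *)
let ngram_counts ?(max_n = 3) (text : string) : (string, int) Hashtbl.t =
  let counts = Hashtbl.create 64 in
  let len = String.length text in
  for i = 0 to len - 1 do
    for n = 1 to max_n do
      if i + n <= len then begin
        let gram = String.sub text i n in
        let seen = Option.value (Hashtbl.find_opt counts gram) ~default:0 in
        Hashtbl.replace counts gram (seen + 1)
      end
    done
  done;
  counts

(* Relative frequency of [gram] among all counted n-grams of the same length. *)
let frequency (counts : (string, int) Hashtbl.t) (gram : string) : float =
  let n = String.length gram in
  let total =
    Hashtbl.fold
      (fun g c acc -> if String.length g = n then acc + c else acc)
      counts 0
  in
  if total = 0 then 0.0
  else
    float_of_int (Option.value (Hashtbl.find_opt counts gram) ~default:0)
    /. float_of_int total

let () =
  let counts = ngram_counts "abab" in
  Printf.printf "count(ab) = %d, freq(a) = %.2f\n"
    (Hashtbl.find counts "ab") (frequency counts "a")
```

A language profile is, roughly, a table of such frequencies gathered from a large corpus in that language; detection compares the n-grams of the input against each profile.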
## Features

- Detects 49 languages including English, Chinese, Japanese, Arabic, and many European languages
- Fast probabilistic detection using n-gram frequency analysis
- Configurable detection parameters (smoothing, convergence thresholds)
- Reproducible results with optional random seed control
- Pure OCaml implementation with minimal dependencies

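As a rough intuition for what the smoothing and convergence parameters control, the sketch below shows a simplified naive-Bayes-style update loop. It is an assumption-laden illustration, not the library's code: `classify`, `alpha`, `conv_threshold`, `n_langs`, and `lookup` are hypothetical names, and the real algorithm differs in detail (for example, it samples n-grams at random, which is why a seed matters for reproducibility).

```ocaml
(* Illustrative sketch, not part of Langdetect. *)

(* Rescale scores so they sum to 1. *)
let normalize (probs : float array) : float array =
  let sum = Array.fold_left ( +. ) 0.0 probs in
  if sum > 0.0 then Array.map (fun p -> p /. sum) probs else probs

(* [lookup gram] is a hypothetical profile lookup returning, for each
   language index, how frequent [gram] is in that language.
   [alpha] is the smoothing term; [conv_threshold] stops the loop early
   once one language is confident enough. *)
let classify ~alpha ~conv_threshold ~n_langs
    (lookup : string -> float array option) (grams : string list) : float array =
  let uniform = Array.make n_langs (1.0 /. float_of_int n_langs) in
  let rec loop probs = function
    | [] -> probs
    | gram :: rest ->
      let probs =
        match lookup gram with
        | None -> probs (* unknown n-gram: carries no evidence *)
        | Some freqs ->
          normalize (Array.mapi (fun i p -> p *. (alpha +. freqs.(i))) probs)
      in
      if Array.exists (fun p -> p > conv_threshold) probs then probs
      else loop probs rest
  in
  loop uniform grams

(* Toy two-language profile: index 0 favours "th", index 1 favours "le". *)
let toy_lookup = function
  | "th" -> Some [| 0.9; 0.1 |]
  | "le" -> Some [| 0.2; 0.8 |]
  | _ -> None

let () =
  let probs =
    classify ~alpha:0.0 ~conv_threshold:0.99 ~n_langs:2 toy_lookup [ "th"; "zz" ]
  in
  Printf.printf "lang0=%.2f lang1=%.2f\n" probs.(0) probs.(1)
```

A larger smoothing term keeps any single n-gram from driving a language's score to zero, while a lower convergence threshold trades accuracy for an earlier exit.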
## Installation

```bash
opam install langdetect
```

## Usage

```ocaml
(* Create a detector with all built-in profiles *)
let detector = Langdetect.create_default ()

(* Detect the best matching language *)
let () =
  match Langdetect.detect_best detector "Hello, world!" with
  | Some lang -> Printf.printf "Detected: %s\n" lang
  | None -> print_endline "Could not detect language"

(* Get all possible languages with probabilities *)
let () =
  let results = Langdetect.detect detector "Bonjour le monde" in
  List.iter (fun r ->
    Printf.printf "%s: %.2f\n" r.Langdetect.lang r.Langdetect.prob
  ) results

(* Use custom configuration *)
let config = { Langdetect.default_config with prob_threshold = 0.3 }
let detector = Langdetect.create_default ~config ()
```

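Callers often want a single answer only when the detector is reasonably sure. The sketch below shows one way to post-process a result list. To stay self-contained it defines a local stand-in record with the same `lang` and `prob` fields used in the example above; the type `result` and the helper `best_above` are hypothetical and not part of the library.

```ocaml
(* Stand-in for the library's result record; an assumption that mirrors
   the [lang] and [prob] fields used in the usage example. *)
type result = { lang : string; prob : float }

(* Return the most probable result whose probability is at least [cutoff],
   or None if no result qualifies. *)
let best_above ~cutoff (results : result list) : result option =
  results
  |> List.filter (fun r -> r.prob >= cutoff)
  |> List.fold_left
       (fun best r ->
         match best with
         | Some b when b.prob >= r.prob -> best
         | _ -> Some r)
       None

let () =
  let results = [ { lang = "fr"; prob = 0.7 }; { lang = "en"; prob = 0.2 } ] in
  match best_above ~cutoff:0.5 results with
  | Some r -> Printf.printf "Confident match: %s\n" r.lang
  | None -> print_endline "No confident match"
```

With real detector output, the same function could presumably be applied after mapping each result to this record, or rewritten against the library's own type.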
## Supported Languages

Arabic, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English,
Estonian, Farsi, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi,
Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian,
Macedonian, Malayalam, Norwegian, Panjabi, Polish, Portuguese, Romanian,
Russian, Sinhalese, Albanian, Spanish, Swedish, Tamil, Telugu, Thai, Tagalog,
Turkish, Ukrainian, Urdu, Vietnamese, Chinese (Simplified), Chinese
(Traditional).

## License

MIT License - see LICENSE file for details.

Based on the Cybozu langdetect algorithm. Copyright (c) 2007-2016 Mozilla Foundation and 2025 Anil Madhavapeddy.