Detect which human language a document uses from OCaml; ported from the Nu Html validator
# langdetect

Language detection library for OCaml using n-gram frequency analysis.

This is an OCaml port of the [Cybozu
langdetect](https://github.com/shuyo/language-detection) algorithm. It detects
the natural language of text using n-gram frequency profiles. It was ported
from <https://github.com/validator/validator>.

## Features

- Detects 49 languages including English, Chinese, Japanese, Arabic, and many European languages
- Fast probabilistic detection using n-gram frequency analysis
- Configurable detection parameters (smoothing, convergence thresholds)
- Reproducible results with optional random seed control
- Pure OCaml implementation with minimal dependencies

## Installation

```bash
opam install langdetect
```

## Usage

```ocaml
(* Create a detector with all built-in profiles *)
let detector = Langdetect.create_default ()

(* Detect the best matching language *)
let () =
  match Langdetect.detect_best detector "Hello, world!" with
  | Some lang -> Printf.printf "Detected: %s\n" lang
  | None -> print_endline "Could not detect language"

(* Get all possible languages with probabilities *)
let () =
  let results = Langdetect.detect detector "Bonjour le monde" in
  List.iter (fun r ->
    Printf.printf "%s: %.2f\n" r.Langdetect.lang r.Langdetect.prob
  ) results

(* Use custom configuration *)
let config = { Langdetect.default_config with prob_threshold = 0.3 }
let detector = Langdetect.create_default ~config ()
```

## Supported Languages

Arabic, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English,
Estonian, Farsi, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi,
Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian,
Macedonian, Malayalam, Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian,
Russian, Sinhalese, Albanian, Spanish, Swedish, Tamil, Telugu, Thai, Tagalog,
Turkish, Ukrainian, Urdu, Vietnamese, Chinese (Simplified), Chinese
(Traditional).

## License

MIT License - see LICENSE file for details.

Based on the Cybozu langdetect algorithm. Copyright (c) 2007-2016 Mozilla Foundation and 2025 Anil Madhavapeddy.
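
## Appendix: n-gram counting sketch

The n-gram frequency analysis mentioned under Features can be illustrated with a small, self-contained sketch. It is not the library's implementation: the helpers `ngrams` and `frequencies` are hypothetical names introduced here, and the sketch works on raw bytes, whereas a real detector would need to handle Unicode text and compare counts against per-language profiles.

```ocaml
(* Hypothetical helper, not part of Langdetect:
   all byte-level n-grams of length [n] in [s], in order of appearance. *)
let ngrams n s =
  let len = String.length s in
  if n <= 0 || len < n then []
  else List.init (len - n + 1) (fun i -> String.sub s i n)

(* Hypothetical helper: count how often each n-gram occurs. *)
let frequencies grams =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun g ->
      let count = Option.value (Hashtbl.find_opt tbl g) ~default:0 in
      Hashtbl.replace tbl g (count + 1))
    grams;
  tbl

(* Print the bigram counts of a short string (iteration order is unspecified). *)
let () =
  let tbl = frequencies (ngrams 2 "banana") in
  Hashtbl.iter (fun g c -> Printf.printf "%s: %d\n" g c) tbl
```

A frequency table of this kind, built per language from training text, is roughly what an n-gram profile stores. Detection then amounts to scoring the n-grams of the input against each profile.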