Detect which human language a document uses from OCaml, from the Nu Html validator
languages unicode ocaml
metadata

+191 -7
+21
LICENSE
+ MIT License
+
+ Copyright (c) 2024 Anil Madhavapeddy
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+153
README.md
+ # langdetect-jsoo
+
+ Language detection for JavaScript/WebAssembly, compiled from OCaml using
+ `js_of_ocaml/wasm_of_ocaml`. This is via an OCaml port of the [Cybozu langdetect](https://github.com/shuyo/language-detection) algorithm that uses
+ n-gram frequency profiles to detect the natural language of text.
+
+ Supports 47 languages including English, Chinese, Japanese, Arabic, and many European languages.
+
+ ## Installation
+
+ ```bash
+ npm install langdetect-jsoo
+ ```
+
+ ## Quick Start
+
+ ### Browser (Script Tag)
+
+ #### Pure JavaScript Version (~7.6MB)
+
+ ```html
+ <script src="node_modules/langdetect-jsoo/langdetect.js"></script>
+ <script>
+   // Wait for library to load
+   document.addEventListener('langdetectReady', () => {
+     const lang = langdetect.detect("Hello, world!");
+     console.log(lang); // "en"
+   });
+ </script>
+ ```
+
+ #### WebAssembly Version (~7.5MB WASM + ~12KB loader)
+
+ The WASM version offers better performance for repeated detections:
+
+ ```html
+ <script src="node_modules/langdetect-jsoo/langdetect_js_main.bc.wasm.js"></script>
+ <script>
+   document.addEventListener('langdetectReady', () => {
+     const lang = langdetect.detect("Bonjour le monde!");
+     console.log(lang); // "fr"
+   });
+ </script>
+ ```
+
+ ## API Reference
+
+ ### `langdetect.detect(text)`
+
+ Detect the most likely language of the input text.
+
+ ```javascript
+ langdetect.detect("The quick brown fox jumps over the lazy dog.")
+ // Returns: "en"
+
+ langdetect.detect("こんにちは世界")
+ // Returns: "ja"
+
+ langdetect.detect("")
+ // Returns: null (text too short)
+ ```
+
+ **Parameters:**
+ - `text` (string): The text to analyze
+
+ **Returns:**
+ - `string | null`: ISO 639-1 language code (e.g., "en", "fr", "zh-cn") or `null` if detection fails
+
+ ### `langdetect.detectWithProb(text)`
+
+ Detect the language with confidence score.
+
+ ```javascript
+ langdetect.detectWithProb("Bonjour le monde!")
+ // Returns: { lang: "fr", prob: 0.9999 }
+
+ langdetect.detectWithProb("a")
+ // Returns: null (text too short)
+ ```
+
+ **Parameters:**
+ - `text` (string): The text to analyze
+
+ **Returns:**
+ - `{ lang: string, prob: number } | null`: Object with language code and probability (0-1), or `null` if detection fails
+
+ ### `langdetect.detectAll(text)`
+
+ Get all candidate languages with their probabilities.
+
+ ```javascript
+ langdetect.detectAll("Hello world")
+ // Returns: [
+ //   { lang: "en", prob: 0.857 },
+ //   { lang: "de", prob: 0.095 },
+ //   { lang: "nl", prob: 0.023 },
+ //   ...
+ // ]
+ ```
+
+ **Parameters:**
+ - `text` (string): The text to analyze
+
+ **Returns:**
+ - `Array<{ lang: string, prob: number }>`: Array of language candidates sorted by probability (highest first)
+
+ ### `langdetect.languages()`
+
+ Get the list of supported language codes.
+
+ ```javascript
+ langdetect.languages()
+ // Returns: ["ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", ...]
+ ```
+
+ **Returns:**
+ - `string[]`: Array of ISO 639-1 language codes
+
+ ## Demo
+
+ Open `langdetect.html` in a browser to try the interactive demo. It supports switching between JavaScript and WebAssembly runtimes.
+
+ ## Events
+
+ The library dispatches a `langdetectReady` event on `document` when fully loaded:
+
+ ```javascript
+ document.addEventListener('langdetectReady', () => {
+   // langdetect API is now available
+   console.log('Loaded', langdetect.languages().length, 'languages');
+ });
+ ```
+
+ ## Algorithm
+
+ This library uses the Cybozu langdetect algorithm which:
+
+ 1. Extracts n-grams (1-3 characters) from the input text
+ 2. Compares against pre-computed frequency profiles for 47 languages
+ 3. Uses a probabilistic model with Bayesian inference
+ 4. Applies text normalization for consistent detection
+
+ The language profiles contain ~172,000 unique n-grams across all supported languages.
+
+ ## License
+
+ MIT
+
+ ## Links
+
+ - [Homepage](https://tangled.org/anil.recoil.org/ocaml-langdetect)
+ - [Source Repository](https://tangled.org/anil.recoil.org/ocaml-langdetect)
+ - [Original Cybozu langdetect](https://github.com/shuyo/language-detection)
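The Algorithm section of the README above describes n-gram extraction followed by Bayesian scoring against per-language frequency profiles. A minimal sketch of that idea follows. It is not the library's implementation: `extractNgrams`, `scoreLanguages`, `toyProfiles`, and the `floor` smoothing value are all hypothetical names and numbers chosen for illustration, and the sketch omits the text normalization step and any refinements the real Cybozu algorithm applies.

```javascript
// Illustrative sketch only: hand-made toy profiles, not the library's real data.

// Collect every character n-gram of length 1..maxN from the text.
function extractNgrams(text, maxN = 3) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const grams = [];
  for (let i = 0; i < chars.length; i++) {
    for (let n = 1; n <= maxN && i + n <= chars.length; n++) {
      grams.push(chars.slice(i, i + n).join(""));
    }
  }
  return grams;
}

// Naive-Bayes-style scoring: uniform prior, one log-likelihood term per
// n-gram, then normalise so the probabilities sum to 1.
// `floor` (an assumed value) stands in for n-grams missing from a profile.
function scoreLanguages(text, profiles, floor = 1e-6) {
  const langs = Object.keys(profiles);
  const logScores = langs.map(() => 0); // equal priors cancel out
  for (const gram of extractNgrams(text)) {
    langs.forEach((lang, i) => {
      logScores[i] += Math.log(profiles[lang][gram] ?? floor);
    });
  }
  // Subtract the maximum before exponentiating to avoid underflow.
  const max = Math.max(...logScores);
  const weights = logScores.map((s) => Math.exp(s - max));
  const total = weights.reduce((a, b) => a + b, 0);
  return langs
    .map((lang, i) => ({ lang, prob: weights[i] / total }))
    .sort((a, b) => b.prob - a.prob);
}

// Hypothetical relative frequencies for a handful of n-grams.
const toyProfiles = {
  en: { t: 0.09, h: 0.06, e: 0.12, th: 0.03, he: 0.03, the: 0.02 },
  de: { d: 0.05, e: 0.16, r: 0.07, de: 0.02, er: 0.04, der: 0.01 },
};

console.log(scoreLanguages("the", toyProfiles)[0].lang); // "en"
```

The result has the same shape as the documented return value of `langdetect.detectAll(text)`: an array of `{ lang, prob }` objects sorted by descending probability.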
+17 -7
package.json
···
    "name": "langdetect-jsoo",
    "version": "1.0.0",
    "description": "OCaml/JS port of the Cybozu langdetect algorithm. Detects the natural language of text using n-gram frequency profiles. Supports 47 languages including English, Chinese, Japanese, Arabic, and many European languages.",
-   "main": "index.js",
+   "browser": "langdetect.js",
    "homepage": "https://tangled.org/anil.recoil.org/ocaml-langdetect",
    "scripts": {
-     "test": "echo \"Error: no test specified\" && exit 1"
+     "test": "echo 'Open langdetect.html in a browser to run tests'"
    },
    "author": "Anil Madhavapeddy",
    "license": "MIT",
···
      "url": "git+https://tangled.org/anil.recoil.org/ocaml-langdetect.git"
    },
    "keywords": [
-     "detection"
+     "language",
+     "detection",
+     "langdetect",
+     "nlp",
+     "natural-language",
+     "i18n",
+     "internationalization",
+     "ocaml",
+     "js_of_ocaml",
+     "wasm",
+     "webassembly"
    ],
    "files": [
+     "langdetect.js",
      "langdetect_js_main.bc.wasm.js",
-     "langdetect_js_tests.bc.wasm.js",
+     "langdetect_js_main.bc.wasm.assets/",
      "langdetect-tests.js",
+     "langdetect_js_tests.bc.wasm.js",
+     "langdetect_js_tests.bc.wasm.assets/",
      "langdetect.html",
-     "langdetect.js",
-     "langdetect_js_tests.bc.wasm.assets/code-8b8dbddbdecc901ea77e.wasm",
-     "langdetect_js_main.bc.wasm.assets/code-0b304599a8b3a5f712cf.wasm",
      "README.md",
      "LICENSE"
    ]