+MIT License
+
+Copyright (c) 2024 Anil Madhavapeddy
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+153
README.md
+# langdetect-jsoo
+
+Language detection for JavaScript/WebAssembly, compiled from OCaml using
+`js_of_ocaml/wasm_of_ocaml`. This is via an OCaml port of the [Cybozu langdetect](https://github.com/shuyo/language-detection) algorithm that uses
+n-gram frequency profiles to detect the natural language of text.
+
+Supports 47 languages including English, Chinese, Japanese, Arabic, and many European languages.
+
+## Installation
+
+```bash
+npm install langdetect-jsoo
+```
+
+## Quick Start
+
+### Browser (Script Tag)
+
+#### Pure JavaScript Version (~7.6MB)
+
+```html
+<script src="node_modules/langdetect-jsoo/langdetect.js"></script>
+<script>
+  // Wait for library to load
+  document.addEventListener('langdetectReady', () => {
+    const lang = langdetect.detect("Hello, world!");
+    console.log(lang); // "en"
+  });
+</script>
+```
+
+#### WebAssembly Version (~7.5MB WASM + ~12KB loader)
+
+The WASM version offers better performance for repeated detections:
+
+```html
+<script src="node_modules/langdetect-jsoo/langdetect_js_main.bc.wasm.js"></script>
+<script>
+  document.addEventListener('langdetectReady', () => {
+    const lang = langdetect.detect("Bonjour le monde!");
+    console.log(lang); // "fr"
+  });
+</script>
+```
+
+## API Reference
+
+### `langdetect.detect(text)`
+
+Detect the most likely language of the input text.
+
+```javascript
+langdetect.detect("The quick brown fox jumps over the lazy dog.")
+// Returns: "en"
+
+langdetect.detect("こんにちは世界")
+// Returns: "ja"
+
+langdetect.detect("")
+// Returns: null (text too short)
+```
+
+**Parameters:**
+- `text` (string): The text to analyze
+
+**Returns:**
+- `string | null`: ISO 639-1 language code (e.g., "en", "fr", "zh-cn") or `null` if detection fails
+
+### `langdetect.detectWithProb(text)`
+
+Detect the language with confidence score.
+
+```javascript
+langdetect.detectWithProb("Bonjour le monde!")
+// Returns: { lang: "fr", prob: 0.9999 }
+
+langdetect.detectWithProb("a")
+// Returns: null (text too short)
+```
+
+**Parameters:**
+- `text` (string): The text to analyze
+
+**Returns:**
+- `{ lang: string, prob: number } | null`: Object with language code and probability (0-1), or `null` if detection fails
+
+### `langdetect.detectAll(text)`
+
+Get all candidate languages with their probabilities.
+
+```javascript
+langdetect.detectAll("Hello world")
+// Returns: [
+//   { lang: "en", prob: 0.857 },
+//   { lang: "de", prob: 0.095 },
+//   { lang: "nl", prob: 0.023 },
+//   ...
+// ]
+```
+
+**Parameters:**
+- `text` (string): The text to analyze
+
+**Returns:**
+- `Array<{ lang: string, prob: number }>`: Array of language candidates sorted by probability (highest first)
+
+### `langdetect.languages()`
+
+Get the list of supported language codes.
+
+```javascript
+langdetect.languages()
+// Returns: ["ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", ...]
+```
+
+**Returns:**
+- `string[]`: Array of ISO 639-1 language codes
+
+## Demo
+
+Open `langdetect.html` in a browser to try the interactive demo. It supports switching between JavaScript and WebAssembly runtimes.
+
+## Events
+
+The library dispatches a `langdetectReady` event on `document` when fully loaded:
+
+```javascript
+document.addEventListener('langdetectReady', () => {
+  // langdetect API is now available
+  console.log('Loaded', langdetect.languages().length, 'languages');
+});
+```
+
+## Algorithm
+
+This library uses the Cybozu langdetect algorithm which:
+
+1. Extracts n-grams (1-3 characters) from the input text
+2. Compares against pre-computed frequency profiles for 47 languages
+3. Uses a probabilistic model with Bayesian inference
+4. Applies text normalization for consistent detection
+
+The language profiles contain ~172,000 unique n-grams across all supported languages.
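
Reader's note (not part of the diff): the four steps listed in the Algorithm section can be sketched in a few lines of JavaScript. This is only an illustration of n-gram extraction followed by a naive Bayesian update. The `TOY_PROFILES` table, the `floor` smoothing value, and the function names are all assumptions invented for the example, lower-casing stands in for the library's text normalization, and none of it reflects the actual profiles or code shipped in `langdetect-jsoo`.

```javascript
// Hypothetical toy profiles: n-gram -> relative frequency (numbers are made up).
const TOY_PROFILES = new Map([
  ["en", new Map([["t", 0.09], ["e", 0.12], ["th", 0.03], ["he", 0.03], ["the", 0.02]])],
  ["fr", new Map([["e", 0.15], ["u", 0.06], ["le", 0.03], ["es", 0.03], ["les", 0.02]])],
]);

// Step 1: collect every character n-gram of length 1..maxN.
function extractNgrams(text, maxN = 3) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const grams = [];
  for (let i = 0; i < chars.length; i++) {
    for (let n = 1; n <= maxN && i + n <= chars.length; n++) {
      grams.push(chars.slice(i, i + n).join(""));
    }
  }
  return grams;
}

// Steps 2-3: start from a uniform prior, add each n-gram's log-likelihood
// per language, then normalize so the probabilities sum to 1.
function scoreLanguages(text, profiles = TOY_PROFILES, floor = 1e-5) {
  const langs = [...profiles.keys()];
  const logProb = new Map(langs.map((lang) => [lang, 0]));
  // Step 4 stand-in: lower-casing is the only normalization applied here.
  for (const gram of extractNgrams(text.toLowerCase())) {
    // N-grams unknown to every profile carry no evidence, so skip them.
    if (!langs.some((lang) => profiles.get(lang).has(gram))) continue;
    for (const lang of langs) {
      const freq = profiles.get(lang).get(gram) ?? floor;
      logProb.set(lang, logProb.get(lang) + Math.log(freq));
    }
  }
  const maxLog = Math.max(...logProb.values());
  const weights = langs.map((lang) => Math.exp(logProb.get(lang) - maxLog));
  const total = weights.reduce((sum, w) => sum + w, 0);
  return langs
    .map((lang, i) => ({ lang, prob: weights[i] / total }))
    .sort((a, b) => b.prob - a.prob);
}
```

The result has the same shape as the README's `detectAll` return value (an array of `{ lang, prob }` sorted highest first), which is a deliberate choice for the sketch rather than a claim about how the library computes it.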
+
+## License
+
+MIT
+
+## Links
+
+- [Homepage](https://tangled.org/anil.recoil.org/ocaml-langdetect)
+- [Source Repository](https://tangled.org/anil.recoil.org/ocaml-langdetect)
+- [Original Cybozu langdetect](https://github.com/shuyo/language-detection)
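
Reader's note (not part of the diff): the README documents a `langdetectReady` event and a `detectWithProb` call that returns `{ lang, prob }` or `null`. A small sketch of how a caller might combine the two follows. Both helper names, the `0.9` threshold, and the `"und"` fallback code are assumptions chosen for illustration and are not part of the library.

```javascript
// Hypothetical helper: resolve once the `langdetectReady` event fires.
// `target` defaults to `document`, where the README says the event is dispatched.
function whenLangdetectReady(target = document) {
  return new Promise((resolve) => {
    target.addEventListener("langdetectReady", () => resolve(), { once: true });
  });
}

// Hypothetical helper: return a language code only when the top candidate
// clears `minProb`; otherwise return `fallback`.
// `api` is expected to expose `detectWithProb` as described in the README.
function detectConfident(api, text, minProb = 0.9, fallback = "und") {
  const best = api.detectWithProb(text); // { lang, prob } or null
  return best !== null && best.prob >= minProb ? best.lang : fallback;
}
```

In a browser this would be used as `whenLangdetectReady().then(() => detectConfident(langdetect, "Bonjour le monde!"))`. The listener has to be registered before the event fires, so the helper should be called before or immediately after the library's script tag; whether the event can be missed in practice depends on loading behavior this diff does not describe.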
+17-7
package.json
···
   "name": "langdetect-jsoo",
   "version": "1.0.0",
   "description": "OCaml/JS port of the Cybozu langdetect algorithm. Detects the natural language of text using n-gram frequency profiles. Supports 47 languages including English, Chinese, Japanese, Arabic, and many European languages.",
-  "main": "index.js",
+  "browser": "langdetect.js",
   "homepage": "https://tangled.org/anil.recoil.org/ocaml-langdetect",
   "scripts": {
-    "test": "echo \"Error: no test specified\" && exit 1"
+    "test": "echo 'Open langdetect.html in a browser to run tests'"
   },
   "author": "Anil Madhavapeddy",
   "license": "MIT",
···
     "url": "git+https://tangled.org/anil.recoil.org/ocaml-langdetect.git"
   },
   "keywords": [
-    "detection"
+    "language",
+    "detection",
+    "langdetect",
+    "nlp",
+    "natural-language",
+    "i18n",
+    "internationalization",
+    "ocaml",
+    "js_of_ocaml",
+    "wasm",
+    "webassembly"
   ],
   "files": [
+    "langdetect.js",
     "langdetect_js_main.bc.wasm.js",
-    "langdetect_js_tests.bc.wasm.js",
+    "langdetect_js_main.bc.wasm.assets/",
     "langdetect-tests.js",
+    "langdetect_js_tests.bc.wasm.js",
+    "langdetect_js_tests.bc.wasm.assets/",
     "langdetect.html",
-    "langdetect.js",
-    "langdetect_js_tests.bc.wasm.assets/code-8b8dbddbdecc901ea77e.wasm",
-    "langdetect_js_main.bc.wasm.assets/code-0b304599a8b3a5f712cf.wasm",
     "README.md",
     "LICENSE"
   ]