(* OCaml HTML5 parser/serialiser based on Python's JustHTML *)
(*---------------------------------------------------------------------------
   Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
   SPDX-License-Identifier: MIT
  ---------------------------------------------------------------------------*)

(** Language detection and validation checker.

    This checker validates that the document's [lang] attribute matches the
    detected language of the content, and that the [dir] attribute is correct
    for right-to-left (RTL) languages.

    {2 Detection Algorithm}

    The checker:
    1. Collects text content from the document body (up to 30720 characters)
    2. Skips text from certain elements (scripts, navigation, form controls)
    3. Skips foreign namespace content (SVG, MathML)
    4. Runs statistical language detection and accepts a result only when its
       confidence exceeds 90%
    5. Distinguishes Traditional from Simplified Chinese

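    The steps above can be sketched roughly as follows. This is an
    illustrative sketch under stated assumptions, not this module's
    implementation: the [node] type, the [skipped] list and every function
    name are hypothetical, and the cap is counted in bytes rather than
    characters for brevity.

    {[
      (* Hypothetical miniature DOM, for illustration only. *)
      type node =
        | Text of string
        | Element of { name : string; foreign : bool; children : node list }

      let max_chars = 30720
      let confidence_threshold = 0.9

      (* Assumed skip list; the real set of skipped elements may differ. *)
      let skipped = [ "script"; "style"; "nav"; "input"; "select"; "textarea" ]

      (* Gather text in document order, ignoring skipped and foreign
         subtrees, and stop adding once the cap is reached. The cut is
         byte-based, so it may split a multi-byte character. *)
      let collect_text root =
        let buf = Buffer.create 1024 in
        let rec go = function
          | Text s ->
              let room = max_chars - Buffer.length buf in
              if room > 0 then
                Buffer.add_string buf
                  (String.sub s 0 (min room (String.length s)))
          | Element { name; foreign; children } ->
              if not (foreign || List.mem name skipped) then
                List.iter go children
        in
        go root;
        Buffer.contents buf

      (* Only trust a guess whose confidence exceeds the threshold. *)
      let accept (lang, confidence) =
        if confidence > confidence_threshold then Some lang else None
    ]}

    In this sketch, [accept ("fr", 0.95)] yields [Some "fr"], while a guess at
    exactly [0.9] is rejected because the comparison is strict.
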
    {2 Validation Rules}

    - Documents should have a [lang] attribute on the [<html>] element
    - The declared language should match the detected content language
    - RTL languages (Arabic, Hebrew, Persian, Urdu, etc.) should have [dir="rtl"]

    {2 Error Messages}

    - [Wrong_lang]: The declared language doesn't match the detected content
      language
    - [Missing_dir_rtl]: An RTL language is detected but the [dir] attribute is
      missing
    - [Wrong_dir]: The [dir] attribute doesn't match the detected RTL language

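    The rules and error names above can be read as a small decision
    procedure. The sketch below is one possible reading, not the checker's
    actual code: the [issue] type, [rtl_langs], [primary] and [validate] are
    hypothetical names, the RTL set is limited to the four languages named
    above, and any [dir] value other than ["rtl"] is assumed to count as
    wrong. Comparing primary subtags only is a simplification that ignores
    the Traditional vs Simplified Chinese distinction.

    {[
      (* Hypothetical result type mirroring the error names above. *)
      type issue = Wrong_lang | Missing_dir_rtl | Wrong_dir

      (* Assumed RTL set: Arabic, Hebrew, Persian and Urdu. *)
      let rtl_langs = [ "ar"; "he"; "fa"; "ur" ]

      (* Primary subtag, lowercased, so "en-GB" compares equal to "en". *)
      let primary tag =
        String.lowercase_ascii (List.hd (String.split_on_char '-' tag))

      let validate ~declared ~dir ~detected =
        let lang_issue =
          match declared with
          | Some d when primary d <> primary detected -> [ Wrong_lang ]
          | _ -> []
        in
        let dir_issue =
          if not (List.mem (primary detected) rtl_langs) then []
          else
            match dir with
            | None -> [ Missing_dir_rtl ]
            | Some "rtl" -> []
            | Some _ -> [ Wrong_dir ]
        in
        lang_issue @ dir_issue
    ]}

    For a page declared as [lang="en"] with no [dir] whose content is detected
    as Arabic, this sketch reports both [Wrong_lang] and [Missing_dir_rtl].
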
    @see <https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes>
      HTML Standard: The lang attribute
*)

val checker : Checker.t
(** The language detection checker instance.

    This checker collects text during DOM traversal and performs language
    detection at document end. *)