OCaml HTML5 parser/serialiser based on Python's JustHTML
(*---------------------------------------------------------------------------
   Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
   SPDX-License-Identifier: MIT
  ---------------------------------------------------------------------------*)

(** Language detection and validation checker.

    This checker validates that the document's [lang] attribute matches the
    detected language of the content, and that the [dir] attribute is correct
    for right-to-left (RTL) languages.

    {2 Detection Algorithm}

    The checker:
    1. Collects text content from the document body (up to 30720 characters)
    2. Skips text from certain elements (scripts, navigation, form controls)
    3. Skips foreign namespace content (SVG, MathML)
    4. Uses statistical language detection with >90% confidence threshold
    5. Handles Traditional vs Simplified Chinese detection

    {2 Validation Rules}

    - Documents should have a [lang] attribute on the [<html>] element
    - The declared language should match the detected content language
    - RTL languages (Arabic, Hebrew, Persian, Urdu, etc.) should have [dir="rtl"]

    {2 Error Messages}

    - [Wrong_lang]: The declared language doesn't match detected content
    - [Missing_dir_rtl]: An RTL language is detected but no [dir] attribute
    - [Wrong_dir]: The [dir] attribute doesn't match the detected RTL language

    @see <https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes>
      HTML Standard: The lang attribute
*)

val checker : Checker.t
(** The language detection checker instance.

    This checker collects text during DOM traversal and performs language
    detection at document end. *)
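
(** The validation rules listed above can be read as a small pure function.
    The following is only a minimal sketch under stated assumptions, not the
    checker's actual implementation: the [error] type, [is_rtl],
    [primary_subtag] and [validate] are hypothetical names, the RTL set is
    limited to the four languages named above, and comparison looks only at
    the primary subtag, so it ignores the Traditional vs Simplified Chinese
    distinction. A missing [lang] attribute is not reported here because no
    error name for that case is listed above.

    {[
      (* Hypothetical error type reusing the three names documented above. *)
      type error = Wrong_lang | Missing_dir_rtl | Wrong_dir

      (* Assumed RTL set: only the languages named in the rules above. *)
      let is_rtl lang = List.mem lang [ "ar"; "he"; "fa"; "ur" ]

      (* Lowercased primary subtag, so that "en-GB" becomes "en". *)
      let primary_subtag tag =
        String.lowercase_ascii (List.hd (String.split_on_char '-' tag))

      (* Compare the declared [lang] and [dir] attributes against the
         detected language and return the errors that apply. *)
      let validate ~declared_lang ~dir ~detected =
        let lang_errors =
          match declared_lang with
          | Some l when primary_subtag l <> primary_subtag detected ->
              [ Wrong_lang ]
          | _ -> []
        in
        let dir_errors =
          if not (is_rtl (primary_subtag detected)) then []
          else
            match dir with
            | None -> [ Missing_dir_rtl ]
            | Some d when String.lowercase_ascii d <> "rtl" -> [ Wrong_dir ]
            | Some _ -> []
        in
        lang_errors @ dir_errors

      (* Usage: Arabic content declared as English with no dir attribute. *)
      let () =
        assert (
          validate ~declared_lang:(Some "en") ~dir:None ~detected:"ar"
          = [ Wrong_lang; Missing_dir_rtl ])
    ]}

    Keeping the rules in a pure function like this would let them be tested
    separately from text collection and statistical detection. *)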