(*---------------------------------------------------------------------------
   Copyright (c) 2025 Anil Madhavapeddy. All rights reserved.
   SPDX-License-Identifier: MIT
  ---------------------------------------------------------------------------*)

(** HTML5 Parser - Low-Level API

    This module provides the core HTML5 parsing functionality implementing
    the {{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG
    HTML5 parsing specification}. It handles tokenization, tree construction,
    and error recovery, and produces a DOM tree.

    For most uses, prefer the top-level {!Html5rw} module, which provides
    a simpler interface. This module is for advanced use cases that need
    access to parser internals.

    {2 How HTML5 Parsing Works}

    The HTML5 parsing algorithm is unusual compared to most parsers. It was
    reverse-engineered from browser behavior rather than designed from a
    formal grammar. This ensures the parser handles malformed HTML exactly
    like web browsers do.

    The algorithm has three main phases:

    {3 1. Encoding Detection}

    Before parsing begins, the character encoding must be determined. The
    WHATWG specification defines a "sniffing" algorithm:

    1. Check for a BOM (Byte Order Mark) at the start
    2. Look for [<meta charset>] in the first 1024 bytes
    3. Use HTTP Content-Type header hint if available
    4. Fall back to UTF-8

    @see WHATWG: Determining the character encoding

    {3 2. Tokenization}

    The tokenizer converts the input stream into a sequence of tokens.
    It implements a state machine with over 80 states to handle:

    - Data (text content)
    - Tags (start tags, end tags, self-closing tags)
    - Comments
    - DOCTYPEs
    - Character references ([&amp;], [&#60;], [&#x3C;])
    - CDATA sections (in SVG/MathML)

    The tokenizer has special handling for:

    - {b Raw text elements}: [