···11+# These are the default owners for everything in the repo. They will
22+# be requested for review when someone opens a pull request.
33+* @alainfrisch @Drup @pmetzger
···11+# 3.7 (2025-10-06)
22+- Update to unicode 17.0.0
33+44+# 3.6 (2025-01-05)
55+- Fixed one of the ranges implementing
66+ Corrigendum #1: UTF-8 Shortest Form
77+ for 4-byte characters (#171)
88+99+# 3.5 (2025-05-29)
1010+- Implement Corrigendum #1: UTF-8 Shortest Form
1111+- Add utf8 support for string literal (#127)
1212+1313+# 3.4 (2025-03-28)
1414+- Make the library compatible with ppxlib.0.36 (#166)
1515+1616+# 3.3 (2024-10-29)
1717+- Add support for unicode `16.0.0` (#157)
1818+- Add API for retrieving start and stop positions separately (#155)
1919+2020+# 3.2 (2023-06-28):
2121+- Restore compatibility with OCaml 4.08
2222+- Use `Sedlexing.{Utf8,Utf16}.from_gen` to initialize UTF8 (resp. UTF16) lexing buffers from
2323+ string.
2424+- Delay raising Malformed until actually reading the malformed part of the input. (#140)
2525+- Count lines in all cases (#130). Previously, certain functions for initiating the
2626+ lexical buffer would disable line counting.
2727+- Check and fix invariants from Cset. The codebase was not respecting
2828+ invariants documented in the Cset module which could break code
2929+ relying on it. The code generated by sedlex.ppx could be affected.
3030+- Do not rely on comments from unicode UCD files
3131+- Add API to track position in bytes. Should be opt-in and backward compatible. (#146)
3232+3333+# 3.1:
3434+- Fix directly nested sedlex matches (@smuenzel, PR #117, fixes: #12)
3535+- Use explicit stdlib in generated code (@hhugo, PR #122, fixes: #115)
3636+- Preserve location of lexbuf (@hhugo, PR #118, fixes: #19)
3737+- Don't use gen to consume channels (@hhugo, PR #124, fixes: #45)
3838+- New expect_test testsuite (@hhugo, PR #124)
3939+- Properly recognize malformed truncated input (@hhugo, PR #124)
4040+- Raise `Malformed` instead of `Invalid_arg` (@hhugo, PR #126, fixes: #91)
4141+- Updated unicode support to `15.0.0`
4242+4343+# 3.0:
4444+- Dropped `Stream` api which was removed in `4.14.0` ahead of the `5.0`
4545+ release.
4646+4747+# 2.6:
4848+- Adapted to ppxlib `0.26`, thanks to @pitag-ha
4949+5050+# 2.5:
5151+- Fix exponential compilation time, thanks to @mnxn for reporting in #97
5252+ and @fangyi-zhou for fixing in #106
5353+- Update unicode support for `14.0.0`.
5454+5555+# 2.4
5656+- Update `dune` support to `2.8`, add auto-generated `opam` files.
5757+- Optimize generated code, thanks to @bobzhang
5858+- Update unicode version to 13.0.0
5959+6060+# 2.3
6161+- Switch to ppxlib
6262+6363+# 2.2
6464+- Support for OCaml 4.08
6565+6666+# 2.1
6767+- GPR#78: Auto-generate unicode data
6868+6969+# 2.0
7070+- GPR#70: Switch to dune, opam v2
7171+- GPR#60: Breaking change: switch from int codepoints to Uchar.t
7272+ codepoints
7373+- GPR#59: Track lexing position
7474+7575+# 1.99.4
7676+- GPR#47: Switch to ocaml-migrate-parsetree (contributed by Adrien Guatto)
7777+- GPR#42: Added 'Rep' (repeat operator) (contributed by jpathy)
7878+7979+# 1.99.3
8080+- Update to work with 4.03 (4.02 still supported)
8181+8282+# 1.99.2
8383+- First official release of sedlex
8484+8585+# 1.99.1
8686+- Support for new Ast_mapper registration API, follow OCaml trunk after
8787+ the inclusion of the extension_point branch
8888+8989+# 1.99
9090+- First version of sedlex. The history below refers to ulex, the ancestor
9191+ of sedlex implemented with Camlp4.
9292+9393+# 1.1
9494+- Generate (more) globally unique identifiers to avoid conflicts when open'ing another module
9595+ processed by ulex (issue reported by Gerd Stolpmann)
9696+9797+# 1.0
9898+- Update to the new Camlp4 and to ocamlbuild (release for OCaml 3.10
9999+ only), by Nicolas Pouillard.
100100+101101+# 0.8
102102+- Really make it work with OCaml 3.09.
103103+- Support for Utf-16.
104104+105105+# 0.7 released May 24 2005
106106+- Bug fixes
107107+- Update to OCaml 3.09 (currently CVS). Still works with OCaml 3.08.
108108+- MIT-like license (used to be LGPL)
109109+110110+# 0.5 released Jul. 8 2004
111111+- Document how to use a custom implementation for lex buffers
112112+- Update to OCaml 3.08
113113+114114+# 0.4 released Jan. 10 2004
115115+- Bug fix (accept 1114111 as valid Unicode code point)
116116+- Add the rollback function
117117+# 0.3 released Oct. 8 2003
118118+- Bug fix
119119+- Add a new predefined class for ISO identifiers
120120+121121+# 0.2 released Sep. 22 2003
122122+- Changed the names of predefined regexp
123123+- Fix max_code = 0x10ffff
124124+- Lexers that change encoding on the fly
125125+- Documentation of the interface Ulexing
126126+127127+# 0.1 released Sep. 20 2003
128128+- Initial release
+22
vendor/opam/sedlex/LICENSE
···11+The MIT License (MIT)
22+33+Copyright 2005, 2014 by Alain Frisch and LexiFi.
44+55+Permission is hereby granted, free of charge, to any person obtaining
66+a copy of this software and associated documentation files (the
77+"Software"), to deal in the Software without restriction, including
88+without limitation the rights to use, copy, modify, merge, publish,
99+distribute, sublicense, and/or sell copies of the Software, and to
1010+permit persons to whom the Software is furnished to do so, subject to
1111+the following conditions:
1212+1313+The above copyright notice and this permission notice shall be
1414+included in all copies or substantial portions of the Software.
1515+1616+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
1717+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
1818+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
1919+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
2020+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
2121+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
2222+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+27
vendor/opam/sedlex/Makefile
···11+# The package sedlex is released under the terms of an MIT-like license.
22+# See the attached LICENSE file.
33+# Copyright 2005, 2013 by Alain Frisch and LexiFi.
44+55+INSTALL_ARGS := $(if $(PREFIX),--prefix $(PREFIX),)
66+77+.PHONY: build install uninstall clean doc test all
88+99+build:
1010+ dune build @install
1111+1212+install:
1313+ dune install $(INSTALL_ARGS)
1414+1515+uninstall:
1616+ dune uninstall $(INSTALL_ARGS)
1717+1818+clean:
1919+ dune clean
2020+2121+doc:
2222+ dune build @doc
2323+2424+test:
2525+ dune build @runtest
2626+2727+all: build test doc
+250
vendor/opam/sedlex/README.md
···11+# sedlex
22+33+[](https://github.com/ocaml-community/sedlex/actions/workflows/build.yml)
44+55+Unicode-friendly lexer generator for OCaml.
66+77+This package is licensed by LexiFi under the terms of the MIT license.
88+99+sedlex was originally written by Alain Frisch
1010+<alain.frisch@lexifi.com> and is now maintained as part of the
1111+ocaml-community repositories on github.
1212+1313+## API
1414+The API is documented [here](https://ocaml-community.github.io/sedlex).
1515+1616+## Overview
1717+1818+sedlex is a lexer generator for OCaml, similar to ocamllex, but
1919+supporting Unicode. Contrary to ocamllex, lexer specifications for
2020+sedlex are embedded in regular OCaml source files.
2121+2222+The lexers work with a new kind of "lexbuf", similar to ocamllex
2323+Lexing lexbufs, but designed to support Unicode, and abstracting from
2424+a specific encoding. A single lexer can work with arbitrary encodings
2525+of the input stream.
2626+2727+sedlex is the successor of the ulex project. Contrary to ulex which
2828+was implemented as a Camlp4 syntax extension, sedlex is based on the
2929+new "-ppx" technology of OCaml, which allows rewriting OCaml parse
3030+trees through external rewriters. (And what better name than "sed"
3131+for a rewriter?)
3232+3333+Like any -ppx rewriter, sedlex does not touch the concrete syntax of the
3434+language: lexer specifications are written in source files which comply
3535+with the standard grammar of OCaml programs. sedlex reuses the syntax
3636+for pattern matching in order to describe lexers (regular expressions
3737+are encoded within OCaml patterns). A nice consequence is that your
3838+editor (vi, emacs, ...) won't get confused (indentation, coloring) and
3939+you don't need to learn new priority rules. Moreover, sedlex is
4040+compatible with any front-end parsing technology: it works fine even
4141+if you use camlp4 or camlp5, with the standard or revised syntax.
4242+4343+4444+## Lexer specifications
4545+4646+4747+sedlex adds a new kind of expression to OCaml: lexer definitions.
4848+The syntax for the new construction is:
4949+5050+```ocaml
5151+ match%sedlex lexbuf with
5252+ | R1 -> e1
5353+ ...
5454+ | Rn -> en
5555+ | _ -> def
5656+```
5757+5858+or:
5959+6060+```ocaml
6161+ [%sedlex match lexbuf with
6262+ | R1 -> e1
6363+ ...
6464+ | Rn -> en
6565+ | _ -> def
6666+ ]
6767+```
6868+6969+(The first vertical bar is optional as in any OCaml pattern matching.
7070+Guard expressions are not allowed.)
7171+7272+where:
7373+- lexbuf is an arbitrary lowercase identifier, which must refer to
7474+ an existing value of type `Sedlexing.lexbuf`.
7575+- the Ri are regular expressions (see below);
7676+- the ei and def are OCaml expressions (called actions) of the same type
7777+ (the type for the whole lexer definition).
7878+7979+Unlike ocamllex, lexers work on a stream of Unicode codepoints, not
8080+bytes.
8181+8282+The actions can call functions from the Sedlexing module to extract
8383+(parts of) the matched lexeme, in the desired encoding.
8484+8585+Regular expressions are syntactically OCaml patterns:
8686+8787+- `"...."` (string constant): recognize the specified string.
8888+- `'....'` (character constant) : recognize the specified character
8989+- `i` (integer constant) : recognize the specified codepoint
9090+- `'...' .. '...'`: character range
9191+- `i1 .. i2`: range between two codepoints
9292+- `R1 | R2` : alternation
9393+- `R1, R2, ..., Rn` : concatenation
9494+- `Star R` : Kleene star (0 or more repetitions)
9595+- `Plus R` : equivalent to `R, R*`
9696+- `Opt R` : equivalent to `("" | R)`
9797+- `Rep (R, n)` : equivalent to `R{n}`
9898+- `Rep (R, n .. m)` : equivalent to `R{n, m}`
9999+- `Chars "..."` : recognize any character in the string
100100+- `Compl R` : assume that R is a single-character length regexp (see below)
101101+ and recognize the complement set
102102+- `Sub (R1,R2)` : assume that `R1` and `R2` are single-character length regexps (see below)
103103+ and recognize the set of items in `R1` but not in `R2` ("subtract")
104104+- `Intersect (R1,R2)` : assume that `R1` and `R2` are single-character length regexps (see
105105+ below) and recognize the set of items which are in both `R1` and `R2`
106106+- `Utf8 R` : string literals inside R are assumed to be utf-8 encoded.
107107+- `Latin1 R` : string literals inside R are assumed to be latin1 encoded.
108108+- `Ascii R` : string literals inside R are assumed to be ascii encoded.
109109+- `lid` (lowercase identifier) : reference a named regexp (see below)
110110+111111+A single-character length regexp is a regexp which does not contain (after
112112+expansion of references) concatenation, Star, Plus, Opt or string constants
113113+with a length different from one.
114114+115115+116116+117117+Note:
118118+ - The OCaml source is assumed to be encoded in UTF-8.
119119+ - String and char literals will be interpreted in ASCII unless otherwise
120120+ specified by the `Latin1`,`Ascii` and `Utf8` constructors in patterns.
121121+122122+123123+It is possible to define named regular expressions with the following
124124+construction, that can appear in place of a structure item:
125125+126126+```ocaml
127127+ let lid = [%sedlex.regexp? R]
128128+```
129129+130130+where lid is the regexp name to be defined and R its definition. The
131131+scope of the "lid" regular expression is the rest of the structure,
132132+after the definition.
133133+134134+The same syntax can be used for local binding:
135135+136136+```ocaml
137137+ let lid = [%sedlex.regexp? R] in
138138+ body
139139+```
140140+141141+The scope of "lid" is the body expression.
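As a rough illustration of the constructs described above, here is a minimal tokenizer sketch. It assumes the file is preprocessed with `sedlex.ppx` and linked against the sedlex runtime; the `token` type and the names `digit`, `letter`, `tokens` and `tokenize` are invented for this example and are not part of sedlex.

```ocaml
(* Hypothetical token type, for illustration only. *)
type token = Int of int | Ident of string

(* Named regexps, defined as structure items. *)
let digit = [%sedlex.regexp? '0' .. '9']
let letter = [%sedlex.regexp? 'a' .. 'z' | 'A' .. 'Z']

(* Accumulate tokens until the end of input is reached. *)
let rec tokens buf acc =
  match%sedlex buf with
  | Plus digit ->
      tokens buf (Int (int_of_string (Sedlexing.Utf8.lexeme buf)) :: acc)
  | letter, Star (letter | digit) ->
      tokens buf (Ident (Sedlexing.Utf8.lexeme buf) :: acc)
  | Plus white_space -> tokens buf acc
  | eof -> List.rev acc
  | _ -> failwith "unexpected character"

let tokenize s = tokens (Sedlexing.Utf8.from_string s) []
```

Calling `tokenize "foo 42 x1"` should yield the identifier `foo`, the integer `42` and the identifier `x1`, in that order.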
142142+143143+144144+## Predefined regexps
145145+146146+sedlex provides a set of predefined regexps:
147147+- any: any character
148148+- eof: the virtual end-of-file character
149149+- xml_letter, xml_digit, xml_extender, xml_base_char, xml_ideographic,
150150+ xml_combining_char, xml_blank: as defined by the XML recommendation
151151+- tr8876_ident_char: characters names in identifiers from ISO TR8876
152152+- cc, cf, cn, co, cs, ll, lm, lo, lt, lu, mc, me, mn, nd, nl, no, pc, pd,
153153+ pe, pf, pi, po, ps, sc, sk, sm, so, zl, zp, zs: as defined by the
154154+ Unicode standard (categories)
155155+- alphabetic, ascii_hex_digit, hex_digit, id_continue, id_start,
156156+ lowercase, math, other_alphabetic, other_lowercase, other_math,
157157+ other_uppercase, uppercase, white_space, xid_continue, xid_start: as
158158+ defined by the Unicode standard (properties)
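The predefined property classes can be combined like any other regexp. The sketch below, which assumes `sedlex.ppx` is applied to the file, checks whether a whole UTF-8 string is a Unicode identifier; `ident` and `is_ident` are names chosen for this example only.

```ocaml
(* An identifier: one XID_Start code point, then any number of XID_Continue. *)
let ident = [%sedlex.regexp? xid_start, Star xid_continue]

(* True when the longest identifier prefix covers the whole string. *)
let is_ident s =
  let buf = Sedlexing.Utf8.from_string s in
  match%sedlex buf with
  | ident -> String.equal (Sedlexing.Utf8.lexeme buf) s
  | _ -> false
```

Because the lexer takes the longest match, comparing the lexeme with the input is enough to reject strings with trailing characters such as `"a b"`.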
159159+160160+161161+## Running a lexer
162162+163163+See the interface of the Sedlexing module for a description of how to
164164+create lexbuf values (from strings, stream or channels encoded in
165165+Latin1, utf8 or utf16, or from integer arrays or streams representing
166166+Unicode code points).
167167+168168+It is possible to work with a custom implementation for lex buffers.
169169+To do this, you just have to ensure that a module called Sedlexing is
170170+in scope of your lexer specifications, and that it defines at least
171171+the following functions: start, next, mark, backtrack. See the interface
172172+of the Sedlexing module for more information.
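For orientation, the sketch below builds a lexbuf from an array of code points and walks it by hand with `start` and `next`, the low-level functions that generated lexers normally call for you. It only needs the sedlex runtime library (no ppx); the names `drain`, `consumed`, `first`, `last` and `text` are illustrative.

```ocaml
(* Code points for "hi\n" followed by U+1F600. *)
let lexbuf = Sedlexing.from_int_array [| 0x68; 0x69; 0x0A; 0x1F600 |]

(* Mark the start of a lexeme, as a generated lexer does before matching. *)
let () = Sedlexing.start lexbuf

(* Consume every code point; [next] returns [None] at end of input. *)
let rec drain n =
  match Sedlexing.next lexbuf with Some _ -> drain (n + 1) | None -> n

let consumed = drain 0

(* Start and end offsets of the current lexeme, counted in code points. *)
let first, last = Sedlexing.loc lexbuf

(* The same lexeme, re-encoded as a UTF-8 string. *)
let text = Sedlexing.Utf8.lexeme lexbuf
```

Offsets returned by `Sedlexing.loc` count code points, so the UTF-8 string is longer in bytes than the lexeme is in code points.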
173173+174174+175175+176176+## Using sedlex
177177+178178+The quick way:
179179+180180+```
181181+ opam install sedlex
182182+```
183183+184184+185185+Otherwise, the first thing to do is to compile and install sedlex.
186186+You need a recent version of OCaml and [dune](https://dune.build/).
187187+188188+```
189189+ make
190190+```
191191+192192+### With findlib
193193+194194+If you have findlib, you can use it to install and use sedlex.
195195+The name of the findlib package is "sedlex".
196196+197197+Installation (after "make"):
198198+199199+```
200200+ make install
201201+```
202202+203203+Compilation of OCaml files with lexer specifications:
204204+205205+```
206206+ ocamlfind ocamlc -c -package sedlex.ppx my_file.ml
207207+```
208208+209209+When linking, you must also include the sedlex package:
210210+211211+```
212212+ ocamlfind ocamlc -o my_prog -linkpkg -package sedlex.ppx my_file.cmo
213213+```
214214+215215+216216+There is also a sedlex.ppx subpackage containing the code of the ppx
217217+filter. This can be used to build custom drivers (combining several ppx
218218+transformations in a single process).
219219+220220+221221+### Without findlib
222222+223223+You can use sedlex without findlib. To compile, you need to run the
224224+source file through the -ppx rewriter ppx_sedlex. Moreover, you need to
225225+link the application with the runtime support library for sedlex
226226+(sedlexing.cma / sedlexing.cmxa).
227227+228228+### With utop
229229+230230+Once sedlex is installed as per above, simply type
231231+232232+```
233233+#require "sedlex.ppx";;
234234+```
235235+236236+## Examples
237237+238238+The `examples/` subdirectory contains several samples of sedlex in use.
239239+240240+## Contributors
241241+242242+- Benus Becker: implementation of Utf16
243243+- sghost: for Unicode 6.3 categories and properties
244244+- Peter Zotov:
245245+ - improvements to the build system
246246+ - switched parts of ppx_sedlex to using concrete syntax (with ppx_metaquot)
247247+- Steffen Smolka: port to dune
248248+- Romain Beauxis:
249249+ - Implementation of the unicode table extractors
250250+ - General maintenance
+26
vendor/opam/sedlex/dune-project
···11+(lang dune 3.0)
22+(version 3.7)
33+(name sedlex)
44+(source (github ocaml-community/sedlex))
55+(license MIT)
66+(authors "Alain Frisch <alain.frisch@lexifi.com>"
77+ "https://github.com/ocaml-community/sedlex/graphs/contributors")
88+(maintainers "Alain Frisch <alain.frisch@lexifi.com>")
99+(homepage "https://github.com/ocaml-community/sedlex")
1010+1111+(generate_opam_files true)
1212+(executables_implicit_empty_intf true)
1313+1414+(package
1515+ (name sedlex)
1616+ (synopsis "An OCaml lexer generator for Unicode")
1717+ (description "sedlex is a lexer generator for OCaml. It is similar to ocamllex, but supports
1818+Unicode. Unlike ocamllex, sedlex allows lexer specifications within regular
1919+OCaml source files. Lexing specific constructs are provided via a ppx syntax
2020+extension.")
2121+ (depends
2222+ (ocaml (>= 4.08))
2323+ dune
2424+ (ppxlib (>= 0.26.0))
2525+ gen
2626+ (ppx_expect :with-test)))
···11+let rec token buf =
22+ match%sedlex buf with any -> token buf | eof -> () | _ -> assert false
33+44+let time f x =
55+ let rec acc f x = function
66+ | 0 -> f x
77+ | n ->
88+ f x |> ignore;
99+ acc f x (n - 1)
1010+ in
1111+ let t = Sys.time () in
1212+ let fx = acc f x 10 in
1313+ Printf.printf "Execution time: %fs\n" (Sys.time () -. t);
1414+ fx
1515+1616+let () =
1717+ let long_str = String.make 1000000 '\n' in
1818+ let token_from _ =
1919+ let lexbuf = Sedlexing.Latin1.from_string long_str in
2020+ (* let () = Sedlexing.set_curr_p lexbuf Lexing.dummy_pos in *)
2121+ token lexbuf
2222+ in
2323+ time token_from long_str
+58
vendor/opam/sedlex/examples/regressions.ml
···11+(* This tests that unicode_old.ml is a strict subset of the new unicode.ml. *)
22+33+module CSet = Sedlex_ppx.Sedlex_cset
44+module Unicode = Sedlex_ppx.Unicode
55+66+let test_versions = ("16.0.0", "17.0.0")
77+88+let regressions =
99+ [ (* Example *)
1010+ (* ("lt", CSet.union (CSet.singleton 0x1c5) (CSet.singleton (0x0001))) *) ]
1111+1212+let compare name (old_ : CSet.t) (new_ : CSet.t) =
1313+ let diff = CSet.difference old_ new_ in
1414+ let regressions =
1515+ match List.assoc name regressions with
1616+ | exception Not_found -> CSet.empty
1717+ | x -> x
1818+ in
1919+ let regressions_intersect = CSet.intersection regressions old_ in
2020+ let regressions = CSet.difference regressions regressions_intersect in
2121+ let regressions_useless = CSet.difference regressions new_ in
2222+ let diff = CSet.difference diff regressions in
2323+ Seq.iter
2424+ (fun x ->
2525+ Printf.printf
2626+ "Invalid regression for 0x%x in %s: already present in old set.\n" x
2727+ name)
2828+ (CSet.to_seq regressions_intersect);
2929+ Seq.iter
3030+ (fun x ->
3131+ Printf.printf "Invalid regression for 0x%x in %s: absent in new set.\n" x
3232+ name)
3333+ (CSet.to_seq regressions_useless);
3434+ Seq.iter
3535+ (fun x -> Printf.printf "Code point 0x%x missing in %s!\n" x name)
3636+ (CSet.to_seq diff)
3737+3838+let test new_l (name, old_l) =
3939+ (* Cn is for unassigned code points, which are allowed to be
4040+ * used in future versions. *)
4141+ let old_l = Sedlex_utils.Cset.to_list old_l in
4242+ if name <> "cn" then (
4343+ let old_l =
4444+ List.fold_left
4545+ (fun acc (a, b) -> CSet.union acc (CSet.interval a b))
4646+ CSet.empty old_l
4747+ in
4848+ compare name old_l (List.assoc name new_l))
4949+5050+let () =
5151+ if (Unicode_old.version, Unicode.version) <> test_versions then
5252+ failwith
5353+ (Printf.sprintf "Test written for versions: %s => %s\n%!"
5454+ Unicode_old.version Unicode.version);
5555+ Printf.printf "Testing Unicode regression: %s => %s\n%!" Unicode_old.version
5656+ Unicode.version;
5757+ List.iter (test Unicode.Categories.list) Unicode_old.Categories.list;
5858+ List.iter (test Unicode.Properties.list) Unicode_old.Properties.list
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+(* Character sets are represented as lists of intervals. The
66+ intervals must be non-overlapping and not collapsible, and the list
77+ must be ordered in increasing order. *)
88+99+type t = (int * int) list
1010+1111+let rec range_to_seq a b next () =
1212+ if a = b then Seq.Cons (a, next) else Seq.Cons (a, range_to_seq (a + 1) b next)
1313+1414+let rec to_seq x () =
1515+ match x with [] -> Seq.Nil | (a, b) :: xs -> range_to_seq a b (to_seq xs) ()
1616+1717+let check_invariant l =
1818+ let rec loop prev = function
1919+ | [] -> ()
2020+ | (a, b) :: xs ->
2121+ if a < prev then
2222+ failwith
2323+ (Printf.sprintf
2424+ "Sedlex_cset.of_list: not in increasing order or overlapping. \
2525+ [_-%d]-[%d-%d]"
2626+ prev a b);
2727+ if a = prev then
2828+ failwith
2929+ (Printf.sprintf
3030+ "Sedlex_cset.of_list: adjacent range. [_-%d]-[%d-%d]" prev a b);
3131+ if a > b then
3232+ failwith
3333+ (Printf.sprintf "Sedlex_cset.of_list: malformed range. [%d-%d]" a b);
3434+ loop b xs
3535+ in
3636+ loop (-1) l
3737+3838+let of_list l =
3939+ check_invariant l;
4040+ l
4141+4242+let to_list l = l
4343+let max_code = 0x10ffff (* must be < max_int *)
4444+let min_code = -1
4545+let empty = []
4646+let singleton i = [(i, i)]
4747+let is_empty = function [] -> true | _ -> false
4848+let interval i j = if i <= j then [(i, j)] else [(j, i)]
4949+let eof = singleton (-1)
5050+let any = interval 0 max_code
5151+5252+let rec union c1 c2 =
5353+ match (c1, c2) with
5454+ | [], _ -> c2
5555+ | _, [] -> c1
5656+ | ((i1, j1) as s1) :: r1, (i2, j2) :: r2 ->
5757+ if i1 <= i2 then
5858+ if j1 + 1 < i2 then s1 :: union r1 c2
5959+ else if j1 < j2 then union r1 ((i1, j2) :: r2)
6060+ else union c1 r2
6161+ else union c2 c1
6262+6363+let union_list : t list -> t = function
6464+ | [] -> empty
6565+ | [x] -> x
6666+ | l ->
6767+ List.concat l
6868+ |> List.sort (fun a b -> compare b a)
6969+ |> List.fold_left (fun (acc : t) (x : int * int) -> union [x] acc) empty
7070+7171+let complement c =
7272+ let rec aux start = function
7373+ | [] -> if start <= max_code then [(start, max_code)] else []
7474+ | (i, j) :: l -> (start, i - 1) :: aux (succ j) l
7575+ in
7676+ match c with (-1, j) :: l -> aux (succ j) l | l -> aux (-1) l
7777+7878+let intersection c1 c2 = complement (union (complement c1) (complement c2))
7979+let difference c1 c2 = complement (union (complement c1) c2)
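To make the representation invariant concrete (sorted, non-overlapping, non-adjacent intervals), here is a standalone sketch using only the standard library. `normalize` and `mem` are hypothetical helpers written for this illustration; they are not part of the `Cset` module.

```ocaml
(* Hypothetical helper: turn an arbitrary list of intervals into the
   canonical form, merging intervals that overlap or touch. *)
let normalize (l : (int * int) list) : (int * int) list =
  let ordered (a, b) = if a <= b then (a, b) else (b, a) in
  let sorted = List.sort compare (List.map ordered l) in
  let rec merge acc = function
    | [] -> List.rev acc
    | (a, b) :: rest -> (
        match acc with
        | (a', b') :: acc' when a <= b' + 1 ->
            (* Overlapping or adjacent: extend the previous interval. *)
            merge ((a', max b b') :: acc') rest
        | _ -> merge ((a, b) :: acc) rest)
  in
  merge [] sorted

(* Hypothetical helper: membership test on a canonical interval list.
   The search can stop early because the list is sorted. *)
let rec mem x = function
  | [] -> false
  | (a, b) :: rest -> if x < a then false else x <= b || mem x rest
```

With this form, two sets are equal exactly when their interval lists are equal.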
+26
vendor/opam/sedlex/src/common/cset.mli
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+(** Representation of sets of unicode code points. *)
66+77+(** Character sets are represented as lists of intervals. The intervals must be
88+ non-overlapping and not collapsible, and the list must be ordered in
99+ increasing order. *)
1010+type t = private (int * int) list
1111+1212+val of_list : (int * int) list -> t
1313+val to_list : t -> (int * int) list
1414+val min_code : int
1515+val max_code : int
1616+val empty : t
1717+val any : t
1818+val union : t -> t -> t
1919+val union_list : t list -> t
2020+val difference : t -> t -> t
2121+val intersection : t -> t -> t
2222+val is_empty : t -> bool
2323+val eof : t
2424+val singleton : int -> t
2525+val interval : int -> int -> t
2626+val to_seq : t -> int Seq.t
···11+# Unicode specification extraction
22+33+The file `src/syntax/unicode.ml` is generated using the data available at
44+[unicode.org](https://www.unicode.org/Public/).
55+66+The rule with `targets unicode.ml` at `src/syntax/dune` is the main entry point for this process.
77+It specifies how `unicode.ml` should be generated when running `dune @build` and triggers:
88+* download of data files at `src/generator/data`
99+* build of `src/generator/gen_unicode.exe`
1010+* generation of `unicode.ml`, with one copy placed in the source tree and one in the build tree
1111+1212+The rule is ignored when using the `--ignore-promoted-rules` option. This option is also implied
1313+by `-p`/`--for-release-of-packages`, which is used for production builds, so production builds
1414+neither download the text data nor re-generate `unicode.ml`.
1515+1616+However, each development build re-generates a `unicode.ml` file which is placed into the source
1717+tree and, thus, can be easily committed when it is updated.
1818+1919+See: [dune documentation](https://dune.readthedocs.io/en/latest/dune-files.html#modes) for more
2020+information.
2121+2222+## Update to new Unicode versions
2323+2424+To update the supported version, update the URL at `src/generator/data/base_url`. Make sure to
2525+not include a trailing new line so that it is properly read in `src/generator/data/dune`.
2626+2727+Finally, place a copy of the old `unicode.ml` at `examples/unicode_old.ml` and update
2828+`test_versions` and `regressions` in `examples/regressions.ml`.
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+exception InvalidCodepoint of int
66+exception MalFormed
77+88+module Uchar = struct
99+ (* This is for compatibility with OCaml < 4.14.0 *)
1010+ let utf_8_byte_length u =
1111+ match Uchar.to_int u with
1212+ | u when u < 0 -> assert false
1313+ | u when u <= 0x007F -> 1
1414+ | u when u <= 0x07FF -> 2
1515+ | u when u <= 0xFFFF -> 3
1616+ | u when u <= 0x10FFFF -> 4
1717+ | _ -> assert false
1818+1919+ let utf_16_byte_length u =
2020+ match Uchar.to_int u with
2121+ | u when u < 0 -> assert false
2222+ | u when u <= 0xFFFF -> 2
2323+ | u when u <= 0x10FFFF -> 4
2424+ | _ -> assert false
2525+2626+ let () =
2727+ ignore utf_8_byte_length;
2828+ ignore utf_16_byte_length
2929+3030+ include Uchar
3131+3232+ let of_int x =
3333+ if Uchar.is_valid x then Uchar.unsafe_of_int x else raise MalFormed
3434+end
3535+3636+(* shadow polymorphic equal *)
3737+let ( = ) (a : int) b = a = b
3838+let ( >>| ) o f = match o with Some x -> Some (f x) | None -> None
3939+4040+(* Absolute position from the beginning of the stream *)
4141+type apos = int
4242+4343+type lexbuf = {
4444+ refill : Uchar.t array -> int -> int -> int;
4545+ bytes_per_char : Uchar.t -> int;
4646+ mutable buf : Uchar.t array;
4747+ mutable len : int;
4848+ (* Number of meaningful uchar in buffer *)
4949+ mutable offset : apos;
5050+ (* Number of meaningful bytes in buffer *)
5151+ mutable bytes_offset : apos;
5252+ (* Position of the first uchar in buffer
5353+ in the input stream *)
5454+ mutable pos : int;
5555+ (* Position of the first byte in buffer
5656+ in the input stream *)
5757+ mutable bytes_pos : int;
5858+ (* Position of the beginning of the line in the buffer, in uchar *)
5959+ mutable curr_bol : int;
6060+ (* Position of the beginning of the line in the buffer, in bytes *)
6161+ mutable curr_bytes_bol : int;
6262+ (* Index of the current line in the input stream. *)
6363+ mutable curr_line : int;
6464+ (* starting position, in uchar. *)
6565+ mutable start_pos : int;
6666+ (* starting position, in bytes. *)
6767+ mutable start_bytes_pos : int;
6868+ (* First uchar we need to keep visible *)
6969+ mutable start_bol : int;
7070+ (* First byte we need to keep visible *)
7171+ mutable start_bytes_bol : int;
7272+ (* start from 1 *)
7373+ mutable start_line : int;
7474+ mutable marked_pos : int;
7575+ mutable marked_bytes_pos : int;
7676+ mutable marked_bol : int;
7777+ mutable marked_bytes_bol : int;
7878+ mutable marked_line : int;
7979+ mutable marked_val : int;
8080+ mutable filename : string;
8181+ mutable finished : bool;
8282+}
8383+8484+let chunk_size = 512
8585+8686+let empty_lexbuf bytes_per_char =
8787+ {
8888+ refill = (fun _ _ _ -> assert false);
8989+ bytes_per_char;
9090+ buf = [||];
9191+ len = 0;
9292+ offset = 0;
9393+ bytes_offset = 0;
9494+ pos = 0;
9595+ bytes_pos = 0;
9696+ curr_bol = 0;
9797+ curr_bytes_bol = 0;
9898+ curr_line = 1;
9999+ start_pos = 0;
100100+ start_bytes_pos = 0;
101101+ start_bol = 0;
102102+ start_bytes_bol = 0;
103103+ start_line = 0;
104104+ marked_pos = 0;
105105+ marked_bytes_pos = 0;
106106+ marked_bol = 0;
107107+ marked_bytes_bol = 0;
108108+ marked_line = 0;
109109+ marked_val = 0;
110110+ filename = "";
111111+ finished = false;
112112+ }
113113+114114+let dummy_uchar = Uchar.of_int 0
115115+let nl_uchar = Uchar.of_int 10
116116+117117+let create ?(bytes_per_char = fun _ -> 1) refill =
118118+ {
119119+ (empty_lexbuf bytes_per_char) with
120120+ refill;
121121+ buf = Array.make chunk_size dummy_uchar;
122122+ }
123123+124124+let set_position ?bytes_position lexbuf position =
125125+ lexbuf.offset <- position.Lexing.pos_cnum - lexbuf.pos;
126126+ lexbuf.curr_bol <- position.Lexing.pos_bol;
127127+ lexbuf.curr_line <- position.Lexing.pos_lnum;
128128+ let bytes_position = Option.value ~default:position bytes_position in
129129+ lexbuf.bytes_offset <- bytes_position.Lexing.pos_cnum - lexbuf.bytes_pos;
130130+ lexbuf.curr_bytes_bol <- bytes_position.Lexing.pos_bol
131131+132132+let set_filename lexbuf fname = lexbuf.filename <- fname
133133+134134+let from_gen ?bytes_per_char gen =
135135+ let malformed = ref false in
136136+ let refill buf pos len =
137137+ let rec loop i =
138138+ if !malformed then raise MalFormed;
139139+ if i >= len then len
140140+ else (
141141+ match gen () with
142142+ | Some c ->
143143+ buf.(pos + i) <- c;
144144+ loop (i + 1)
145145+ | None -> i
146146+ | exception MalFormed when i <> 0 ->
147147+ malformed := true;
148148+ i)
149149+ in
150150+ loop 0
151151+ in
152152+ create ?bytes_per_char refill
153153+154154+let from_int_array ?bytes_per_char a =
155155+ from_gen ?bytes_per_char
156156+ (Gen.init ~limit:(Array.length a) (fun i -> Uchar.of_int a.(i)))
157157+158158+let from_uchar_array ?(bytes_per_char = fun _ -> 1) a =
159159+ let len = Array.length a in
160160+ {
161161+ (empty_lexbuf bytes_per_char) with
162162+ buf = Array.init len (fun i -> a.(i));
163163+ len;
164164+ finished = true;
165165+ }
166166+167167+let refill lexbuf =
168168+ if lexbuf.len + chunk_size > Array.length lexbuf.buf then begin
169169+ let s = lexbuf.start_pos in
170170+ let s_bytes = lexbuf.start_bytes_pos in
171171+ let ls = lexbuf.len - s in
172172+ if ls + chunk_size <= Array.length lexbuf.buf then
173173+ Array.blit lexbuf.buf s lexbuf.buf 0 ls
174174+ else begin
175175+ let newlen = (Array.length lexbuf.buf + chunk_size) * 2 in
176176+ let newbuf = Array.make newlen dummy_uchar in
177177+ Array.blit lexbuf.buf s newbuf 0 ls;
178178+ lexbuf.buf <- newbuf
179179+ end;
180180+ lexbuf.len <- ls;
181181+ lexbuf.offset <- lexbuf.offset + s;
182182+ lexbuf.bytes_offset <- lexbuf.bytes_offset + s_bytes;
183183+ lexbuf.pos <- lexbuf.pos - s;
184184+ lexbuf.bytes_pos <- lexbuf.bytes_pos - s_bytes;
185185+ lexbuf.marked_pos <- lexbuf.marked_pos - s;
186186+ lexbuf.marked_bytes_pos <- lexbuf.marked_bytes_pos - s_bytes;
187187+ lexbuf.start_pos <- 0;
188188+ lexbuf.start_bytes_pos <- 0
189189+ end;
190190+ let n = lexbuf.refill lexbuf.buf lexbuf.pos chunk_size in
191191+ if n = 0 then lexbuf.finished <- true else lexbuf.len <- lexbuf.len + n
192192+193193+let new_line lexbuf =
194194+ lexbuf.curr_line <- lexbuf.curr_line + 1;
195195+ lexbuf.curr_bol <- lexbuf.pos + lexbuf.offset;
196196+ lexbuf.curr_bytes_bol <- lexbuf.bytes_pos + lexbuf.bytes_offset
197197+198198+let[@inline always] next_aux some none lexbuf =
199199+ if (not lexbuf.finished) && lexbuf.pos = lexbuf.len then refill lexbuf;
200200+ if lexbuf.finished && lexbuf.pos = lexbuf.len then none
201201+ else begin
202202+ let ret = lexbuf.buf.(lexbuf.pos) in
203203+ lexbuf.pos <- lexbuf.pos + 1;
204204+ lexbuf.bytes_pos <- lexbuf.bytes_pos + lexbuf.bytes_per_char ret;
205205+ if Uchar.equal ret nl_uchar then new_line lexbuf;
206206+ some ret
207207+ end
208208+209209+let next lexbuf = (next_aux [@inlined]) (fun x -> Some x) None lexbuf
210210+let __private__next_int lexbuf = (next_aux [@inlined]) Uchar.to_int (-1) lexbuf
211211+212212+let mark lexbuf i =
213213+ lexbuf.marked_pos <- lexbuf.pos;
214214+ lexbuf.marked_bytes_pos <- lexbuf.bytes_pos;
215215+ lexbuf.marked_bol <- lexbuf.curr_bol;
216216+ lexbuf.marked_bytes_bol <- lexbuf.curr_bytes_bol;
217217+ lexbuf.marked_line <- lexbuf.curr_line;
218218+ lexbuf.marked_val <- i
219219+220220+let start lexbuf =
221221+ lexbuf.start_pos <- lexbuf.pos;
222222+ lexbuf.start_bytes_pos <- lexbuf.bytes_pos;
223223+ lexbuf.start_bol <- lexbuf.curr_bol;
224224+ lexbuf.start_bytes_bol <- lexbuf.curr_bytes_bol;
225225+ lexbuf.start_line <- lexbuf.curr_line;
226226+ mark lexbuf (-1)
227227+228228+let backtrack lexbuf =
229229+ lexbuf.pos <- lexbuf.marked_pos;
230230+ lexbuf.bytes_pos <- lexbuf.marked_bytes_pos;
231231+ lexbuf.curr_bol <- lexbuf.marked_bol;
232232+ lexbuf.curr_bytes_bol <- lexbuf.marked_bytes_bol;
233233+ lexbuf.curr_line <- lexbuf.marked_line;
234234+ lexbuf.marked_val
235235+236236+let rollback lexbuf =
237237+ lexbuf.pos <- lexbuf.start_pos;
238238+ lexbuf.bytes_pos <- lexbuf.start_bytes_pos;
239239+ lexbuf.curr_bol <- lexbuf.start_bol;
240240+ lexbuf.curr_bytes_bol <- lexbuf.start_bytes_bol;
241241+ lexbuf.curr_line <- lexbuf.start_line
242242+243243+let lexeme_start lexbuf = lexbuf.start_pos + lexbuf.offset
244244+let lexeme_bytes_start lexbuf = lexbuf.start_bytes_pos + lexbuf.bytes_offset
245245+let lexeme_end lexbuf = lexbuf.pos + lexbuf.offset
246246+let lexeme_bytes_end lexbuf = lexbuf.bytes_pos + lexbuf.bytes_offset
247247+let loc lexbuf = (lexbuf.start_pos + lexbuf.offset, lexbuf.pos + lexbuf.offset)
248248+249249+let bytes_loc lexbuf =
250250+ ( lexbuf.start_bytes_pos + lexbuf.bytes_offset,
251251+ lexbuf.bytes_pos + lexbuf.bytes_offset )
252252+253253+let lexeme_length lexbuf = lexbuf.pos - lexbuf.start_pos
254254+let lexeme_bytes_length lexbuf = lexbuf.bytes_pos - lexbuf.start_bytes_pos
255255+256256+let sub_lexeme lexbuf pos len =
257257+ Array.sub lexbuf.buf (lexbuf.start_pos + pos) len
258258+259259+let lexeme lexbuf =
260260+ Array.sub lexbuf.buf lexbuf.start_pos (lexbuf.pos - lexbuf.start_pos)
261261+262262+let lexeme_char lexbuf pos = lexbuf.buf.(lexbuf.start_pos + pos)
263263+264264+let lexing_position_start lexbuf =
265265+ {
266266+ Lexing.pos_fname = lexbuf.filename;
267267+ pos_lnum = lexbuf.start_line;
268268+ pos_cnum = lexbuf.start_pos + lexbuf.offset;
269269+ pos_bol = lexbuf.start_bol;
270270+ }
271271+272272+let lexing_position_curr lexbuf =
273273+ {
274274+ Lexing.pos_fname = lexbuf.filename;
275275+ pos_lnum = lexbuf.curr_line;
276276+ pos_cnum = lexbuf.pos + lexbuf.offset;
277277+ pos_bol = lexbuf.curr_bol;
278278+ }
279279+280280+let lexing_positions lexbuf =
281281+ let start_p = lexing_position_start lexbuf
282282+ and curr_p = lexing_position_curr lexbuf in
283283+ (start_p, curr_p)
284284+285285+let lexing_bytes_position_start lexbuf =
286286+ {
287287+ Lexing.pos_fname = lexbuf.filename;
288288+ pos_lnum = lexbuf.start_line;
289289+ pos_cnum = lexbuf.start_bytes_pos + lexbuf.bytes_offset;
290290+ pos_bol = lexbuf.start_bytes_bol;
291291+ }
292292+293293+let lexing_bytes_position_curr lexbuf =
294294+ {
295295+ Lexing.pos_fname = lexbuf.filename;
296296+ pos_lnum = lexbuf.curr_line;
297297+ pos_cnum = lexbuf.bytes_pos + lexbuf.bytes_offset;
298298+ pos_bol = lexbuf.curr_bytes_bol;
299299+ }
300300+301301+let lexing_bytes_positions lexbuf =
302302+ let start_p = lexing_bytes_position_start lexbuf
303303+ and curr_p = lexing_bytes_position_curr lexbuf in
304304+ (start_p, curr_p)
305305+306306+let with_tokenizer lexer' lexbuf =
307307+ let lexer () =
308308+ let token = lexer' lexbuf in
309309+ let start_p, curr_p = lexing_positions lexbuf in
310310+ (token, start_p, curr_p)
311311+ in
312312+ lexer
313313+314314+module Chan = struct
315315+ exception Missing_input
316316+317317+ type t = {
318318+ b : Bytes.t;
319319+ ic : in_channel;
320320+ mutable len : int;
321321+ mutable pos : int;
322322+ }
323323+324324+ let min_buffer_size = 64
325325+326326+ let create ic len : t =
327327+ let len = max len min_buffer_size in
328328+ { b = Bytes.create len; ic; len = 0; pos = 0 }
329329+330330+ let available (t : t) = t.len - t.pos
331331+332332+ let rec ensure_bytes_available (t : t) ~can_refill n =
333333+ if available t >= n then ()
334334+ else if can_refill then (
335335+ let len = t.len - t.pos in
336336+ if len > 0 then Bytes.blit t.b t.pos t.b 0 len;
337337+ let read = input t.ic t.b len (Bytes.length t.b - len) in
338338+ t.len <- len + read;
339339+ t.pos <- 0;
340340+ if read = 0 then raise Missing_input
341341+ else ensure_bytes_available t ~can_refill n)
342342+ else raise Missing_input
343343+344344+ let ensure_bytes_available t ~can_refill n =
345345+ (* [n] should not exceed the size of the buffer. Here we are
346346+ conservative and make sure it doesn't exceed the minimum size
347347+ for the buffer. *)
348348+ if n <= 0 || n > min_buffer_size then invalid_arg "Sedlexing.Chan.ensure";
349349+ ensure_bytes_available t ~can_refill n
350350+351351+ let get (t : t) i = Bytes.get t.b (t.pos + i)
352352+353353+ let advance (t : t) n =
354354+ if t.pos + n > t.len then invalid_arg "advance";
355355+ t.pos <- t.pos + n
356356+357357+ let raw_buf (t : t) = t.b
358358+ let raw_pos (t : t) = t.pos
359359+end
360360+361361+let make_from_channel ?bytes_per_char ic ~max_bytes_per_uchar
362362+ ~min_bytes_per_uchar ~read_uchar =
363363+ let t = Chan.create ic (chunk_size * max_bytes_per_uchar) in
364364+ let malformed = ref false in
365365+ let refill buf pos len =
366366+ let rec loop i =
367367+ if !malformed then raise MalFormed;
368368+ if i = len then i
369369+ else (
370370+ match
371371+ (* we refill our bytes buffer only if we haven't refilled any uchar yet. *)
372372+ let can_refill = i = 0 in
373373+ Chan.ensure_bytes_available t ~can_refill min_bytes_per_uchar;
374374+ read_uchar ~can_refill t
375375+ with
376376+ | c ->
377377+ buf.(pos + i) <- c;
378378+ loop (i + 1)
379379+ | exception MalFormed when i <> 0 ->
380380+ malformed := true;
381381+ i
382382+ | exception Chan.Missing_input ->
383383+ if i = 0 && Chan.available t > 0 then raise MalFormed;
384384+ i)
385385+ in
386386+ loop 0
387387+ in
388388+ create ?bytes_per_char refill
389389+390390+module Latin1 = struct
391391+ let from_gen s =
392392+ from_gen ~bytes_per_char:(fun _ -> 1) (Gen.map Uchar.of_char s)
393393+394394+ let from_string s =
395395+ let len = String.length s in
396396+ {
397397+ (empty_lexbuf (fun _ -> 1)) with
398398+ buf = Array.init len (fun i -> Uchar.of_char s.[i]);
399399+ len;
400400+ finished = true;
401401+ }
402402+403403+ let from_channel ic =
404404+ make_from_channel ic
405405+ ~bytes_per_char:(fun _ -> 1)
406406+ ~min_bytes_per_uchar:1 ~max_bytes_per_uchar:1
407407+ ~read_uchar:(fun ~can_refill:_ t ->
408408+ let c = Chan.get t 0 in
409409+ Chan.advance t 1;
410410+ Uchar.of_char c)
411411+412412+ let to_latin1 c =
413413+ if Uchar.is_char c then Uchar.to_char c
414414+ else raise (InvalidCodepoint (Uchar.to_int c))
415415+416416+ let lexeme_char lexbuf pos = to_latin1 (lexeme_char lexbuf pos)
417417+418418+ let sub_lexeme lexbuf pos len =
419419+ let s = Bytes.create len in
420420+ for i = 0 to len - 1 do
421421+ Bytes.set s i (to_latin1 lexbuf.buf.(lexbuf.start_pos + pos + i))
422422+ done;
423423+ Bytes.to_string s
424424+425425+ let lexeme lexbuf = sub_lexeme lexbuf 0 (lexbuf.pos - lexbuf.start_pos)
426426+end
427427+428428+module Utf8 = struct
429429+ module Helper = struct
430430+ (* http://www.faqs.org/rfcs/rfc3629.html *)
431431+432432+ let width = function
433433+ | '\000' .. '\127' -> 1
434434+ | '\192' .. '\223' -> 2
435435+ | '\224' .. '\239' -> 3
436436+ | '\240' .. '\247' -> 4
437437+ | _ -> raise MalFormed
438438+439439+ (* https://www.unicode.org/versions/corrigendum1.html *)
440440+ let check_two n1 n2 =
441441+ if n1 < 0xc2 || 0xdf < n1 then raise MalFormed;
442442+ if n2 < 0x80 || 0xbf < n2 then raise MalFormed;
443443+ if n2 lsr 6 != 0b10 then raise MalFormed;
444444+ ((n1 land 0x1f) lsl 6) lor (n2 land 0x3f)
445445+446446+ let check_three n1 n2 n3 =
447447+ if n1 = 0xe0 then (
448448+ if n2 < 0xa0 || 0xbf < n2 then raise MalFormed;
449449+ if n3 < 0x80 || 0xbf < n3 then raise MalFormed)
450450+ else (
451451+ if n1 < 0xe1 || 0xef < n1 then raise MalFormed;
452452+ if n2 < 0x80 || 0xbf < n2 then raise MalFormed;
453453+ if n3 < 0x80 || 0xbf < n3 then raise MalFormed);
454454+ if n2 lsr 6 != 0b10 || n3 lsr 6 != 0b10 then raise MalFormed;
455455+ let p =
456456+ ((n1 land 0x0f) lsl 12) lor ((n2 land 0x3f) lsl 6) lor (n3 land 0x3f)
457457+ in
458458+ if p >= 0xd800 && p <= 0xdfff then raise MalFormed;
459459+ p
460460+461461+ let check_four n1 n2 n3 n4 =
462462+ if n1 = 0xf0 then (
463463+ if n2 < 0x90 || 0xbf < n2 then raise MalFormed;
464464+ if n3 < 0x80 || 0xbf < n3 then raise MalFormed;
465465+ if n4 < 0x80 || 0xbf < n4 then raise MalFormed)
466466+ else if n1 = 0xf4 then (
467467+ if n2 < 0x80 || 0x8f < n2 then raise MalFormed;
468468+ if n3 < 0x80 || 0xbf < n3 then raise MalFormed;
469469+ if n4 < 0x80 || 0xbf < n4 then raise MalFormed)
470470+ else (
471471+ if n1 < 0xf1 || 0xf3 < n1 then raise MalFormed;
472472+ if n2 < 0x80 || 0xbf < n2 then raise MalFormed;
473473+ if n3 < 0x80 || 0xbf < n3 then raise MalFormed;
474474+ if n4 < 0x80 || 0xbf < n4 then raise MalFormed);
475475+ if n2 lsr 6 != 0b10 || n3 lsr 6 != 0b10 || n4 lsr 6 != 0b10 then
476476+ raise MalFormed;
477477+ ((n1 land 0x07) lsl 18)
478478+ lor ((n2 land 0x3f) lsl 12)
479479+ lor ((n3 land 0x3f) lsl 6)
480480+ lor (n4 land 0x3f)
481481+482482+ let next s i =
483483+ let c1 = s.[i] in
484484+ match width c1 with
485485+ | 1 -> Char.code c1
486486+ | 2 ->
487487+ let n1 = Char.code c1 in
488488+ let n2 = Char.code s.[i + 1] in
489489+ check_two n1 n2
490490+ | 3 ->
491491+ let n1 = Char.code c1 in
492492+ let n2 = Char.code s.[i + 1] in
493493+ let n3 = Char.code s.[i + 2] in
494494+ check_three n1 n2 n3
495495+ | 4 ->
496496+ let n1 = Char.code c1 in
497497+ let n2 = Char.code s.[i + 1] in
498498+ let n3 = Char.code s.[i + 2] in
499499+ let n4 = Char.code s.[i + 3] in
500500+ check_four n1 n2 n3 n4
501501+ | _ -> assert false
502502+503503+ let gen_from_char_gen s =
504504+ let next_or_fail () =
505505+ match Gen.next s with None -> raise MalFormed | Some x -> Char.code x
506506+ in
507507+ fun () ->
508508+ Gen.next s >>| fun c1 ->
509509+ match width c1 with
510510+ | 1 -> Uchar.of_char c1
511511+ | 2 ->
512512+ let n1 = Char.code c1 in
513513+ let n2 = next_or_fail () in
514514+ Uchar.of_int (check_two n1 n2)
515515+ | 3 ->
516516+ let n1 = Char.code c1 in
517517+ let n2 = next_or_fail () in
518518+ let n3 = next_or_fail () in
519519+ Uchar.of_int (check_three n1 n2 n3)
520520+ | 4 ->
521521+ let n1 = Char.code c1 in
522522+ let n2 = next_or_fail () in
523523+ let n3 = next_or_fail () in
524524+ let n4 = next_or_fail () in
525525+ Uchar.of_int (check_four n1 n2 n3 n4)
526526+ | _ -> raise MalFormed
527527+528528+ (**************************)
529529+530530+ let to_buffer a apos len b =
531531+ for i = apos to apos + len - 1 do
532532+ Buffer.add_utf_8_uchar b a.(i)
533533+ done
534534+ end
535535+536536+ let from_channel ic =
537537+ make_from_channel ic ~bytes_per_char:Uchar.utf_8_byte_length
538538+ ~min_bytes_per_uchar:1 ~max_bytes_per_uchar:4
539539+ ~read_uchar:(fun ~can_refill t ->
540540+ let w = Helper.width (Chan.get t 0) in
541541+ Chan.ensure_bytes_available t ~can_refill w;
542542+ let c =
543543+ Helper.next (Bytes.unsafe_to_string (Chan.raw_buf t)) (Chan.raw_pos t)
544544+ in
545545+ Chan.advance t w;
546546+ Uchar.of_int c)
547547+548548+ let from_gen s =
549549+ from_gen ~bytes_per_char:Uchar.utf_8_byte_length
550550+ (Helper.gen_from_char_gen s)
551551+552552+ let from_string s =
553553+ from_gen (Gen.init ~limit:(String.length s) (fun i -> String.get s i))
554554+555555+ let sub_lexeme lexbuf pos len =
556556+ let buf = Buffer.create (len * 4) in
557557+ Helper.to_buffer lexbuf.buf (lexbuf.start_pos + pos) len buf;
558558+ Buffer.contents buf
559559+560560+ let lexeme lexbuf = sub_lexeme lexbuf 0 (lexbuf.pos - lexbuf.start_pos)
561561+end
562562+563563+module Utf16 = struct
564564+ type byte_order = Little_endian | Big_endian
565565+566566+ module Helper = struct
567567+ (* http://www.ietf.org/rfc/rfc2781.txt *)
568568+569569+ let number_of_pair bo c1 c2 =
570570+ match bo with
571571+ | Little_endian -> (c2 lsl 8) + c1
572572+ | Big_endian -> (c1 lsl 8) + c2
573573+574574+ let get_bo bo c1 c2 =
575575+ match !bo with
576576+ | Some o -> o
577577+ | None ->
578578+ let o =
579579+ match (c1, c2) with
580580+ | 0xff, 0xfe -> Little_endian
581581+ | _ -> Big_endian
582582+ in
583583+ bo := Some o;
584584+ o
585585+586586+ let gen_from_char_gen opt_bo s =
587587+ let next_or_fail () =
588588+ match Gen.next s with None -> raise MalFormed | Some x -> Char.code x
589589+ in
590590+ let bo = ref opt_bo in
591591+ fun () ->
592592+ Gen.next s >>| fun c1 ->
593593+ let n1 = Char.code c1 in
594594+ let n2 = next_or_fail () in
595595+ let o = get_bo bo n1 n2 in
596596+ let w1 = number_of_pair o n1 n2 in
597597+ if w1 = 0xfffe then raise (InvalidCodepoint w1);
598598+ if w1 < 0xd800 || 0xdfff < w1 then Uchar.of_int w1
599599+ else if w1 <= 0xdbff then (
600600+ let n3 = next_or_fail () in
601601+ let n4 = next_or_fail () in
602602+ let w2 = number_of_pair o n3 n4 in
603603+ if w2 < 0xdc00 || w2 > 0xdfff then raise MalFormed;
604604+ let upper10 = (w1 land 0x3ff) lsl 10 and lower10 = w2 land 0x3ff in
605605+ Uchar.of_int (0x10000 + upper10 + lower10))
606606+ else raise MalFormed
607607+608608+ let to_buffer bo a apos len bom b =
609609+ let store =
610610+ match bo with
611611+ | Big_endian -> Buffer.add_utf_16be_uchar b
612612+ | Little_endian -> Buffer.add_utf_16le_uchar b
613613+ in
614614+ if bom then store (Uchar.of_int 0xfeff);
615615+ (* first, store the BOM *)
616616+ for i = apos to apos + len - 1 do
617617+ store a.(i)
618618+ done
619619+ end
620620+621621+ let from_channel ic opt_bo =
622622+ let bo = ref opt_bo in
623623+ make_from_channel ic ~bytes_per_char:Uchar.utf_16_byte_length
624624+ ~min_bytes_per_uchar:2 ~max_bytes_per_uchar:4
625625+ ~read_uchar:(fun ~can_refill t ->
626626+ let n1 = Char.code (Chan.get t 0) in
627627+ let n2 = Char.code (Chan.get t 1) in
628628+ let o = Helper.get_bo bo n1 n2 in
629629+ let w1 = Helper.number_of_pair o n1 n2 in
630630+ if w1 = 0xfffe then raise (InvalidCodepoint w1);
631631+ if w1 < 0xd800 || 0xdfff < w1 then (
632632+ Chan.advance t 2;
633633+ Uchar.of_int w1)
634634+ else if w1 <= 0xdbff then (
635635+ Chan.ensure_bytes_available t ~can_refill 4;
636636+ let n3 = Char.code (Chan.get t 2) in
637637+ let n4 = Char.code (Chan.get t 3) in
638638+ let w2 = Helper.number_of_pair o n3 n4 in
639639+ if w2 < 0xdc00 || w2 > 0xdfff then raise MalFormed;
640640+ let upper10 = (w1 land 0x3ff) lsl 10 and lower10 = w2 land 0x3ff in
641641+ Chan.advance t 4;
642642+ Uchar.of_int (0x10000 + upper10 + lower10))
643643+ else raise MalFormed)
644644+645645+ let from_gen s opt_bo =
646646+ from_gen ~bytes_per_char:Uchar.utf_16_byte_length
647647+ (Helper.gen_from_char_gen opt_bo s)
648648+649649+ let from_string s =
650650+ from_gen (Gen.init ~limit:(String.length s) (fun i -> String.get s i))
651651+652652+ let sub_lexeme lb pos len bo bom =
653653+ let buf = Buffer.create ((len * 4) + 2) in
654654+ (* +2 for the BOM *)
655655+ Helper.to_buffer bo lb.buf (lb.start_pos + pos) len bom buf;
656656+ Buffer.contents buf
657657+658658+ let lexeme lb bo bom = sub_lexeme lb 0 (lb.pos - lb.start_pos) bo bom
659659+end
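
The surrogate-pair arithmetic used in `Utf16.Helper.gen_from_char_gen` and `Utf16.from_channel` above can be checked in isolation. The following is a minimal sketch using only the standard library; the function name `combine_surrogates` is hypothetical and not part of sedlex, and it signals errors with `invalid_arg` where the library code raises `MalFormed`.

```ocaml
(* Hypothetical helper, not part of sedlex: combine a UTF-16 high
   surrogate [w1] and low surrogate [w2] into one code point, using the
   same masks and shifts as the Utf16 code above. *)
let combine_surrogates w1 w2 =
  if w1 < 0xd800 || w1 > 0xdbff then invalid_arg "high surrogate expected";
  if w2 < 0xdc00 || w2 > 0xdfff then invalid_arg "low surrogate expected";
  let upper10 = (w1 land 0x3ff) lsl 10 and lower10 = w2 land 0x3ff in
  0x10000 + upper10 + lower10

let () =
  (* U+1F600 is encoded in UTF-16 as the pair D83D DE00. *)
  assert (combine_surrogates 0xd83d 0xde00 = 0x1f600);
  (* The smallest and largest pairs map to U+10000 and U+10FFFF. *)
  assert (combine_surrogates 0xd800 0xdc00 = 0x10000);
  assert (combine_surrogates 0xdbff 0xdfff = 0x10ffff)
```

Each surrogate contributes 10 payload bits, so the pair covers exactly the range above the Basic Multilingual Plane.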
+301
vendor/opam/sedlex/src/lib/sedlexing.mli
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+(** Runtime support for lexers generated by [sedlex]. *)
66+77+(** This module is roughly equivalent to the module Lexing from the OCaml
88+ standard library, except that its lexbuffers handle Unicode code points
99+ (OCaml type: {!Uchar.t} in the range [0..0x10ffff]) instead of bytes (OCaml
1010+ type: [char]).
1111+1212+ It is possible to have sedlex-generated lexers work on a custom
1313+ implementation for lex buffers. To do this, define a module [L] which
1414+ implements the [start], [next], [mark] and [backtrack] functions (See the
1515+ Internal Interface section below for a specification). They need not work on
1616+ a type named [lexbuf]: you can use any type name you want. Then, in
1717+ your sedlex-processed source, bind this module to the name [Sedlexing] (for
1818+ instance, with a local module definition: [let module Sedlexing = L in ...]).
1919+2020+ Of course, you'll probably want to define functions like [lexeme] to be used
2121+ in the lexers semantic actions. *)
2222+2323+(** The type of lexer buffers. A lexer buffer is the argument passed to the
2424+ scanning functions defined by the generated lexers. The lexer buffer holds
2525+ the internal information for the scanners, including the code points of the
2626+ token currently scanned, its position from the beginning of the input
2727+ stream, and the current position of the lexer. *)
2828+type lexbuf
2929+3030+(** Raised by some functions to signal that some code point is not compatible
3131+ with a specified encoding. *)
3232+exception InvalidCodepoint of int
3333+3434+(** Raised by functions in the [Utf8] and [Utf16] modules to report strings
3535+ which do not comply with the encoding. *)
3636+exception MalFormed
3737+3838+(** {6 Creating generic lexbufs} *)
3939+4040+(** Create a generic lexer buffer. When the lexer needs more characters, it will
4141+ call the given function, giving it an array of Uchars [a], a position [pos]
4242+ and a code point count [n]. The function should put [n] code points or less
4343+ in [a], starting at position [pos], and return the number of characters
4444+ provided. A return value of 0 means end of input. The [bytes_per_char]
4545+ argument is optional. If unspecified, byte positions are the same as code
4646+ point positions. *)
4747+val create :
4848+ ?bytes_per_char:(Uchar.t -> int) ->
4949+ (Uchar.t array -> int -> int -> int) ->
5050+ lexbuf
5252+(** Set the initial tracked input position, in code points, for [lexbuf]. If
5353+ unspecified, the byte position is set to the same value as the code point position.
5454+*)
5555+val set_position :
5656+ ?bytes_position:Lexing.position -> lexbuf -> Lexing.position -> unit
5757+5858+(** [set_filename lexbuf file] sets the filename to [file] in [lexbuf]. It also
5959+ sets the {!Lexing.pos_fname} field in returned {!Lexing.position} records.
6060+*)
6161+val set_filename : lexbuf -> string -> unit
6262+6363+(** Create a lexbuf from a stream of Unicode code points. [bytes_per_char] is
6464+ optional. If unspecified, byte positions are the same as code point
6565+ positions. *)
6666+val from_gen : ?bytes_per_char:(Uchar.t -> int) -> Uchar.t Gen.t -> lexbuf
6767+6868+(** Create a lexbuf from an array of Unicode code points. [bytes_per_char] is
6969+ optional. If unspecified, byte positions are the same as code point
7070+ positions. *)
7171+val from_int_array : ?bytes_per_char:(Uchar.t -> int) -> int array -> lexbuf
7272+7373+(** Create a lexbuf from an array of Unicode code points. [bytes_per_char] is
7474+ optional. If unspecified, byte positions are the same as code point
7575+ positions. *)
7676+val from_uchar_array :
7777+ ?bytes_per_char:(Uchar.t -> int) -> Uchar.t array -> lexbuf
7878+7979+(** {6 Interface for lexers semantic actions} *)
8080+8181+(** The following functions can be called from the semantic actions of lexer
8282+ definitions. They give access to the character string matched by the regular
8383+ expression associated with the semantic action. *)
8484+8585+(** [Sedlexing.lexeme_start lexbuf] returns the offset in the input stream of
8686+ the first code point of the matched string. The first code point of the
8787+ stream has offset 0. *)
8888+val lexeme_start : lexbuf -> int
9090+(** [Sedlexing.lexeme_bytes_start lexbuf] returns the offset in the input stream
9191+ of the first byte of the matched string. The first byte of the stream has
9292+ offset 0. *)
9393+val lexeme_bytes_start : lexbuf -> int
9494+9595+(** [Sedlexing.lexeme_end lexbuf] returns the offset in the input stream of the
9696+ character following the last code point of the matched string. The first
9797+ character of the stream has offset 0. *)
9898+val lexeme_end : lexbuf -> int
100100+(** [Sedlexing.lexeme_bytes_end lexbuf] returns the offset in the input stream of
101101+ the byte following the last code point of the matched string. The first
102102+ byte of the stream has offset 0. *)
103103+val lexeme_bytes_end : lexbuf -> int
104104+105105+(** [Sedlexing.loc lexbuf] returns the pair
106106+ [(Sedlexing.lexeme_start lexbuf,Sedlexing.lexeme_end lexbuf)]. *)
107107+val loc : lexbuf -> int * int
108108+109109+(** [Sedlexing.bytes_loc lexbuf] returns the pair
110110+ [(Sedlexing.lexeme_bytes_start lexbuf,Sedlexing.lexeme_bytes_end lexbuf)].
111111+*)
112112+val bytes_loc : lexbuf -> int * int
113113+114114+(** [Sedlexing.lexeme_length lexbuf] returns the difference
115115+ [(Sedlexing.lexeme_end lexbuf) - (Sedlexing.lexeme_start lexbuf)], that is,
116116+ the length (in code points) of the matched string. *)
117117+val lexeme_length : lexbuf -> int
119119+(** [Sedlexing.lexeme_bytes_length lexbuf] returns the difference
120120+ [(Sedlexing.lexeme_bytes_end lexbuf) - (Sedlexing.lexeme_bytes_start
121121+ lexbuf)], that is, the length (in bytes) of the matched string. *)
122122+val lexeme_bytes_length : lexbuf -> int
123123+124124+(** [Sedlexing.lexing_positions lexbuf] returns the start and end positions, in
125125+ code points, of the current token, using a record of type [Lexing.position].
126126+ This is intended for consumption by parsers like those generated by
127127+ [Menhir]. *)
128128+val lexing_positions : lexbuf -> Lexing.position * Lexing.position
129129+130130+(** [Sedlexing.lexing_position_start lexbuf] returns the start position, in code
131131+ points, of the current token. *)
132132+val lexing_position_start : lexbuf -> Lexing.position
133133+134134+(** [Sedlexing.lexing_position_curr lexbuf] returns the end position, in code
135135+ points, of the current token. *)
136136+val lexing_position_curr : lexbuf -> Lexing.position
137137+138138+(** [Sedlexing.lexing_bytes_positions lexbuf] returns the start and end
139139+ positions, in bytes, of the current token, using a record of type
140140+ [Lexing.position]. This is intended for consumption by parsers like those
141141+ generated by [Menhir]. *)
142142+val lexing_bytes_positions : lexbuf -> Lexing.position * Lexing.position
143143+144144+(** [Sedlexing.lexing_bytes_position_start lexbuf] returns the start position,
145145+ in bytes, of the current token. *)
146146+val lexing_bytes_position_start : lexbuf -> Lexing.position
147147+148148+(** [Sedlexing.lexing_bytes_position_curr lexbuf] returns the end position, in
149149+ bytes, of the current token. *)
150150+val lexing_bytes_position_curr : lexbuf -> Lexing.position
151151+152152+(** [Sedlexing.new_line lexbuf] increments the line count and sets the beginning
153153+ of line to the current position, as though a newline character had been
154154+ encountered in the input. *)
155155+val new_line : lexbuf -> unit
156156+157157+(** [Sedlexing.lexeme lexbuf] returns the string matched by the regular
158158+ expression as an array of Unicode code points. *)
159159+val lexeme : lexbuf -> Uchar.t array
160160+161161+(** [Sedlexing.lexeme_char lexbuf pos] returns code point number [pos] in the
162162+ matched string. *)
163163+val lexeme_char : lexbuf -> int -> Uchar.t
164164+165165+(** [Sedlexing.sub_lexeme lexbuf pos len] returns a substring of the string
166166+ matched by the regular expression as an array of Unicode code points. *)
167167+val sub_lexeme : lexbuf -> int -> int -> Uchar.t array
168168+169169+(** [Sedlexing.rollback lexbuf] puts [lexbuf] back in its configuration before
170170+ the last lexeme was matched. It is then possible to use another lexer to
171171+ parse the same characters again. The other functions above in this section
172172+ should not be used in the semantic action after a call to
173173+ [Sedlexing.rollback]. *)
174174+val rollback : lexbuf -> unit
175175+176176+(** {6 Internal interface} *)
177177+178178+(** These functions are used internally by the lexers. They could be used to
179179+ write lexers by hand, or with a lexer generator different from [sedlex]. The
180180+ lexer buffers have a unique internal slot that can store an integer. They
181181+ also store a "backtrack" position. *)
182182+183183+(** [start t] informs the lexer buffer that any code points until the current
184184+ position can be discarded. The current position becomes the "start" position
185185+ as returned by [Sedlexing.lexeme_start]. Moreover, the internal slot is set
186186+ to [-1] and the backtrack position is set to the current position. *)
187187+val start : lexbuf -> unit
188188+189189+(** [next lexbuf] extracts the next code point from the lexer buffer and
190190+ increments the current position. If the input stream is exhausted, the
191191+ function returns [None]. If a ['\n'] is encountered, the tracked line number
192192+ is incremented. *)
193193+val next : lexbuf -> Uchar.t option
194194+195195+(** [__private__next_int lexbuf] extracts the next code point from the lexer
196196+ buffer and increments the current position. If the input stream is exhausted,
197197+ the function returns -1. If a ['\n'] is encountered, the tracked line number
198198+ is incremented.
199199+200200+ This is a private API, it should not be used by code using this module's API
201201+ and can be removed at any time. *)
202202+val __private__next_int : lexbuf -> int
203203+204204+(** [mark lexbuf i] stores the integer [i] in the internal slot. The backtrack
205205+ position is set to the current position. *)
206206+val mark : lexbuf -> int -> unit
207207+208208+(** [backtrack lexbuf] returns the value stored in the internal slot of the
209209+ buffer, and performs backtracking (the current position is set to the value
210210+ of the backtrack position). *)
211211+val backtrack : lexbuf -> int
212212+213213+(** [with_tokenizer tokenizer lexbuf] given a lexer and a lexbuf, returns a
214214+ generator of tokens annotated with positions. This generator can be used
215215+ with the Menhir parser generator's incremental API. *)
216216+val with_tokenizer :
217217+ (lexbuf -> 'token) ->
218218+ lexbuf ->
219219+ unit ->
220220+ 'token * Lexing.position * Lexing.position
221221+222222+(** {6 Support for common encodings} *)
223223+224224+module Latin1 : sig
225225+ (** Create a lexbuf from a Latin1 encoded stream (ie a stream of Unicode code
226226+ points in the range [0..255]) *)
227227+ val from_gen : char Gen.t -> lexbuf
228228+229229+ (** Create a lexbuf from a Latin1 encoded input channel. The client is
230230+ responsible for closing the channel. *)
231231+ val from_channel : in_channel -> lexbuf
232232+233233+ (** Create a lexbuf from a Latin1 encoded string. *)
234234+ val from_string : string -> lexbuf
235235+236236+ (** As [Sedlexing.lexeme] with a result encoded in Latin1. This function
237237+ throws an exception [InvalidCodepoint] if it is not possible to encode the
238238+ result in Latin1. *)
239239+ val lexeme : lexbuf -> string
240240+241241+ (** As [Sedlexing.sub_lexeme] with a result encoded in Latin1. This function
242242+ throws an exception [InvalidCodepoint] if it is not possible to encode the
243243+ result in Latin1. *)
244244+ val sub_lexeme : lexbuf -> int -> int -> string
245245+246246+ (** As [Sedlexing.lexeme_char] with a result encoded in Latin1. This function
247247+ throws an exception [InvalidCodepoint] if it is not possible to encode the
248248+ result in Latin1. *)
249249+ val lexeme_char : lexbuf -> int -> char
250250+end
251251+252252+module Utf8 : sig
253253+ (** Create a lexbuf from a UTF-8 encoded stream. *)
254254+ val from_gen : char Gen.t -> lexbuf
255255+256256+ (** Create a lexbuf from a UTF-8 encoded input channel. *)
257257+ val from_channel : in_channel -> lexbuf
258258+259259+ (** Create a lexbuf from a UTF-8 encoded string. *)
260260+ val from_string : string -> lexbuf
261261+262262+ (** As [Sedlexing.lexeme] with a result encoded in UTF-8. *)
263263+ val lexeme : lexbuf -> string
264264+265265+ (** As [Sedlexing.sub_lexeme] with a result encoded in UTF-8. *)
266266+ val sub_lexeme : lexbuf -> int -> int -> string
267267+268268+ module Helper : sig
269269+ val width : char -> int
270270+ val check_two : int -> int -> int
271271+ val check_three : int -> int -> int -> int
272272+ val check_four : int -> int -> int -> int -> int
273273+ end
274274+end
275275+276276+module Utf16 : sig
277277+ type byte_order = Little_endian | Big_endian
279279+ (** [Utf16.from_gen s opt_bo] creates a lexbuf from a UTF-16 encoded stream.
280280+ If [opt_bo] matches with [None] the function expects a BOM (Byte Order
281281+ Mark), and takes the byte order as [Utf16.Big_endian] if it cannot find
282282+ one. When [opt_bo] matches with [Some bo], [bo] is taken as byte order. In
283283+ this case a leading BOM is kept in the stream - the lexer has to ignore it
284284+ and a `wrong' BOM ([0xfffe]) will raise [InvalidCodepoint]. *)
285285+ val from_gen : char Gen.t -> byte_order option -> lexbuf
286286+287287+ (** Works as [Utf16.from_gen] with an [in_channel]. *)
288288+ val from_channel : in_channel -> byte_order option -> lexbuf
289289+290290+ (** Works as [Utf16.from_gen] with a [string]. *)
291291+ val from_string : string -> byte_order option -> lexbuf
293293+ (** [lexeme lb bo bom] as [Sedlexing.lexeme] with a result encoded in
294294+ UTF-16 with byte order [bo] and starting with a BOM if [bom = true]. *)
295295+ val lexeme : lexbuf -> byte_order -> bool -> string
296296+297297+ (** [sub_lexeme lb pos len bo bom] as [Sedlexing.sub_lexeme] with a result
298298+ encoded in UTF-16 with byte order [bo] and starting with a BOM if
299299+ [bom=true] *)
300300+ val sub_lexeme : lexbuf -> int -> int -> byte_order -> bool -> string
301301+end
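
The module documentation in `sedlexing.mli` above says a custom lex buffer only needs `start`, `next`, `mark` and `backtrack`, following the Internal interface section. Here is a minimal sketch of such a module, assuming a byte-oriented buffer over a plain string. The module name `L`, its record fields, and the extra `of_string` and `lexeme` helpers are all hypothetical and not part of sedlex. Whether generated lexers also call other functions declared in this interface, such as `__private__next_int`, depends on the sedlex version and is not checked here.

```ocaml
(* Hypothetical custom lex buffer; each byte is treated as one code point. *)
module L = struct
  type t = {
    src : string;
    mutable pos : int;        (* current position *)
    mutable start_pos : int;  (* start of the current lexeme *)
    mutable marked_pos : int; (* backtrack position *)
    mutable marked_val : int; (* the internal integer slot *)
  }

  let of_string src =
    { src; pos = 0; start_pos = 0; marked_pos = 0; marked_val = -1 }

  (* Store [i] in the slot and remember the current position. *)
  let mark b i =
    b.marked_pos <- b.pos;
    b.marked_val <- i

  (* The current position becomes the start position; the slot is reset to -1. *)
  let start b =
    b.start_pos <- b.pos;
    mark b (-1)

  (* Return the next code point, or [None] at end of input. *)
  let next b =
    if b.pos >= String.length b.src then None
    else begin
      let c = b.src.[b.pos] in
      b.pos <- b.pos + 1;
      Some (Uchar.of_char c)
    end

  (* Go back to the marked position and return the slot value. *)
  let backtrack b =
    b.pos <- b.marked_pos;
    b.marked_val

  (* Convenience for semantic actions: the text matched so far. *)
  let lexeme b = String.sub b.src b.start_pos (b.pos - b.start_pos)
end

let () =
  let b = L.of_string "ab" in
  L.start b;
  ignore (L.next b);  (* consume 'a' *)
  L.mark b 7;         (* pretend rule 7 accepts here *)
  ignore (L.next b);  (* read ahead over 'b' *)
  assert (L.backtrack b = 7);
  assert (L.lexeme b = "a")
```

The sketch keeps no line or byte-offset tracking. A real replacement would add those fields if semantic actions rely on position functions.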
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+44+open Sedlex_cset
55+66+(** Letters to be used in identifiers, as specified by ISO .... *)
77+88+(* Data provided by John M. Skaller *)
99+1010+val tr8876_ident_char : t
+579
vendor/opam/sedlex/src/syntax/ppx_sedlex.ml
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+open Ppxlib
66+open Ast_builder.Default
77+open Ast_helper
88+99+(* let ocaml_version = Versions.ocaml_408 *)
1010+1111+module Cset = Sedlex_cset
1212+1313+(* Decision tree for partitions *)
1414+1515+let default_loc = Location.none
1616+1717+type decision_tree =
1818+ | Lte of int * decision_tree * decision_tree
1919+ | Table of int * int array
2020+ | Return of int
2121+2222+let rec simplify_decision_tree (x : decision_tree) =
2323+ match x with
2424+ | Table _ | Return _ -> x
2525+ | Lte (_, (Return a as l), Return b) when a = b -> l
2626+ | Lte (i, l, r) -> (
2727+ let l = simplify_decision_tree l in
2828+ let r = simplify_decision_tree r in
2929+ match (l, r) with
3030+ | Return a, Return b when a = b -> l
3131+ | _ -> Lte (i, l, r))
3232+3333+let decision l =
3434+ let l = List.map (fun (a, b, i) -> (a, b, Return i)) l in
3535+ let rec merge2 = function
3636+ | (a1, b1, d1) :: (a2, b2, d2) :: rest ->
3737+ let x = if b1 + 1 = a2 then d2 else Lte (a2 - 1, Return (-1), d2) in
3838+ (a1, b2, Lte (b1, d1, x)) :: merge2 rest
3939+ | rest -> rest
4040+ in
4141+ let rec aux = function
4242+ | [(a, b, d)] -> Lte (a - 1, Return (-1), Lte (b, d, Return (-1)))
4343+ | [] -> Return (-1)
4444+ | l -> aux (merge2 l)
4545+ in
4646+ aux l
4747+4848+let limit = 8192
4949+5050+let decision_table l =
5151+ let rec aux m accu = function
5252+ | ((a, b, i) as x) :: rem when b < limit && i < 255 ->
5353+ aux (min a m) (x :: accu) rem
5454+ | rem -> (m, accu, rem)
5555+ in
5656+ let min, table, rest = aux max_int [] l in
5757+ match table with
5858+ | [] -> decision l
5959+ | [(min, max, i)] ->
6060+ Lte (min - 1, Return (-1), Lte (max, Return i, decision rest))
6161+ | (_, max, _) :: _ ->
6262+ let arr = Array.make (max - min + 1) 0 in
6363+ let set (a, b, i) =
6464+ for j = a to b do
6565+ arr.(j - min) <- i + 1
6666+ done
6767+ in
6868+ List.iter set table;
6969+ Lte (min - 1, Return (-1), Lte (max, Table (min, arr), decision rest))
7070+7171+let rec simplify min max = function
7272+ | Lte (i, yes, no) ->
7373+ if i >= max then simplify min max yes
7474+ else if i < min then simplify min max no
7575+ else Lte (i, simplify min i yes, simplify (i + 1) max no)
7676+ | x -> x
7777+7878+let segments_of_partition p =
7979+ let seg = ref [] in
8080+ Array.iteri
8181+ (fun i c ->
8282+ List.iter
8383+ (fun (a, b) -> seg := (a, b, i) :: !seg)
8484+ (c : Sedlex_cset.t :> (int * int) list))
8585+ p;
8686+ List.sort (fun (a1, _, _) (a2, _, _) -> compare a1 a2) !seg
8787+8888+let decision_table p =
8989+ simplify (-1) Cset.max_code (decision_table (segments_of_partition p))
9090+9191+(* Helpers to build AST *)
9292+9393+let appfun s l =
9494+ let loc = default_loc in
9595+ eapply ~loc (evar ~loc s) l
9696+9797+let glb_value name def =
9898+ let loc = default_loc in
9999+ pstr_value ~loc Nonrecursive
100100+ [value_binding ~loc ~pat:(pvar ~loc name) ~expr:def]
101101+102102+(* Named regexps *)
103103+104104+module StringMap = Map.Make (struct
105105+ type t = string
106106+107107+ let compare = compare
108108+end)
109109+110110+let builtin_regexps =
111111+ List.fold_left
112112+ (fun acc (n, c) -> StringMap.add n (Sedlex.chars c) acc)
113113+ StringMap.empty
114114+ ([
115115+ ("any", Cset.any);
116116+ ("eof", Cset.eof);
117117+ ("xml_letter", Xml.letter);
118118+ ("xml_digit", Xml.digit);
119119+ ("xml_extender", Xml.extender);
120120+ ("xml_base_char", Xml.base_char);
121121+ ("xml_ideographic", Xml.ideographic);
122122+ ("xml_combining_char", Xml.combining_char);
123123+ ("xml_blank", Xml.blank);
124124+ ("tr8876_ident_char", Iso.tr8876_ident_char);
125125+ ]
126126+ @ Unicode.Categories.list @ Unicode.Properties.list)
127127+128128+(* Tables (indexed mapping: codepoint -> next state) *)
129129+130130+let tables = Hashtbl.create 31
131131+let table_counter = ref 0
132132+let get_tables () = Hashtbl.fold (fun key x accu -> (x, key) :: accu) tables []
133133+134134+let table_name x =
135135+ try Hashtbl.find tables x
136136+ with Not_found ->
137137+ incr table_counter;
138138+ let s = Printf.sprintf "__sedlex_table_%i" !table_counter in
139139+ Hashtbl.add tables x s;
140140+ s
141141+142142+let table (name, v) =
143143+ let n = Array.length v in
144144+ let s = Bytes.create n in
145145+ for i = 0 to n - 1 do
146146+ Bytes.set s i (Char.chr v.(i))
147147+ done;
148148+ glb_value name (estring ~loc:default_loc (Bytes.to_string s))
149149+150150+(* Partition (function: codepoint -> next state) *)
151151+152152+let partitions = Hashtbl.create 31
153153+let partition_counter = ref 0
154154+155155+let get_partitions () =
156156+ Hashtbl.fold (fun key x accu -> (x, key) :: accu) partitions []
157157+158158+let partition_name x =
159159+ try Hashtbl.find partitions x
160160+ with Not_found ->
161161+ incr partition_counter;
162162+ let s = Printf.sprintf "__sedlex_partition_%i" !partition_counter in
163163+ Hashtbl.add partitions x s;
164164+ s
165165+166166+(* We duplicate the body for the EOF (-1) case rather than creating
167167+ an interior utility function. *)
168168+let partition (name, p) =
169169+ let loc = default_loc in
170170+ let rec gen_tree = function
171171+ | Lte (i, yes, no) ->
172172+ [%expr
173173+ if c <= [%e eint ~loc i] then [%e gen_tree yes] else [%e gen_tree no]]
174174+ | Return i -> eint ~loc:default_loc i
175175+ | Table (offset, t) ->
176176+ let c =
177177+ if offset = 0 then [%expr c] else [%expr c - [%e eint ~loc offset]]
178178+ in
179179+ [%expr
180180+ Char.code (String.unsafe_get [%e evar ~loc (table_name t)] [%e c]) - 1]
181181+ in
182182+ let body = gen_tree (simplify_decision_tree (decision_table p)) in
183183+ glb_value name
184184+ [%expr
185185+ fun c ->
186186+ let open! Stdlib in
187187+ [%e body]]
188188+189189+(* Code generation for the automata *)
190190+191191+let best_final final =
192192+ let fin = ref None in
193193+ for i = Array.length final - 1 downto 0 do
194194+ if final.(i) then fin := Some i
195195+ done;
196196+ !fin
197197+198198+let state_fun state = Printf.sprintf "__sedlex_state_%i" state
199199+200200+let call_state lexbuf auto state =
201201+ let trans, final = auto.(state) in
202202+ if Array.length trans = 0 then (
203203+ match best_final final with
204204+ | Some i -> eint ~loc:default_loc i
205205+ | None -> assert false)
206206+ else appfun (state_fun state) [lexbuf]
207207+208208+let gen_state (lexbuf_name, lexbuf) auto i (trans, final) =
209209+ let loc = default_loc in
210210+ let partition = Array.map fst trans in
211211+ let cases =
212212+ Array.mapi
213213+ (fun i (_, j) ->
214214+ case ~lhs:(pint ~loc i) ~guard:None ~rhs:(call_state lexbuf auto j))
215215+ trans
216216+ in
217217+ let cases = Array.to_list cases in
218218+ let body () =
219219+ pexp_match ~loc
220220+ (appfun (partition_name partition)
221221+ [[%expr Sedlexing.__private__next_int [%e lexbuf]]])
222222+ (cases
223223+ @ [
224224+ case
225225+ ~lhs:[%pat? _]
226226+ ~guard:None
227227+ ~rhs:[%expr Sedlexing.backtrack [%e lexbuf]];
228228+ ])
229229+ in
230230+ let ret body =
231231+ let lhs = pvar ~loc:lexbuf.pexp_loc lexbuf_name in
232232+ [
233233+ value_binding ~loc
234234+ ~pat:(pvar ~loc (state_fun i))
235235+ ~expr:(Exp.fun_ ~loc Nolabel None lhs body);
236236+ ]
237237+ in
238238+ match best_final final with
239239+ | None -> ret (body ())
240240+ | Some _ when Array.length trans = 0 -> []
241241+ | Some i ->
242242+ ret
243243+ [%expr
244244+ Sedlexing.mark [%e lexbuf] [%e eint ~loc i];
245245+ [%e body ()]]
246246+247247+let gen_recflag auto =
248248+ (* The generated function is not recursive if the transitions end
249249+ in states with no further transitions. *)
250250+ try
251251+ Array.iter
252252+ (fun (trans_i, _) ->
253253+ Array.iter
254254+ (fun (_, j) ->
255255+ let trans_j, _ = auto.(j) in
256256+ if Array.length trans_j > 0 then raise Exit)
257257+ trans_i)
258258+ auto;
259259+ Nonrecursive
260260+ with Exit -> Recursive
261261+262262+let gen_definition ((_, lexbuf) as lexbuf_with_name) l error =
263263+ let loc = default_loc in
264264+ let brs = Array.of_list l in
265265+ let auto = Sedlex.compile (Array.map fst brs) in
266266+ let cases =
267267+ Array.to_list
268268+ (Array.mapi
269269+ (fun i (_, e) -> case ~lhs:(pint ~loc i) ~guard:None ~rhs:e)
270270+ brs)
271271+ in
272272+ let states = Array.mapi (gen_state lexbuf_with_name auto) auto in
273273+ let states = List.flatten (Array.to_list states) in
274274+ pexp_let ~loc (gen_recflag auto) states
275275+ (pexp_sequence ~loc
276276+ [%expr Sedlexing.start [%e lexbuf]]
277277+ (pexp_match ~loc
278278+ (appfun (state_fun 0) [lexbuf])
279279+ (cases @ [case ~lhs:(ppat_any ~loc) ~guard:None ~rhs:error])))
280280+281281+(* Lexer specification parser *)
282282+283283+let codepoint i =
284284+ if i < 0 || i > Cset.max_code then
285285+ failwith (Printf.sprintf "Invalid Unicode code point: %i" i);
286286+ i
287287+288288+let char c = Cset.singleton (Char.code c)
289289+let uchar c = Cset.singleton (Uchar.to_int c)
290290+291291+let err loc fmt =
292292+ Printf.ksprintf
293293+ (fun s ->
294294+ raise (Location.Error (Location.Error.createf ~loc "Sedlex: %s" s)))
295295+ fmt
296296+297297+type encoding = Utf8 | Latin1 | Ascii
298298+299299+let string_of_encoding = function
300300+ | Utf8 -> "UTF-8"
301301+ | Latin1 -> "Latin-1"
302302+ | Ascii -> "ASCII"
303303+304304+let rev_csets_of_string ~loc ~encoding s =
305305+ match encoding with
306306+ | Utf8 ->
307307+ Utf8.fold
308308+ ~f:(fun acc _ x ->
309309+ match x with
310310+ | `Malformed _ ->
311311+ err loc "Malformed %s string" (string_of_encoding encoding)
312312+ | `Uchar c -> uchar c :: acc)
313313+ [] s
314314+ | Latin1 ->
315315+ let l = ref [] in
316316+ for i = 0 to String.length s - 1 do
317317+ l := char s.[i] :: !l
318318+ done;
319319+ !l
320320+ | Ascii ->
321321+ let l = ref [] in
322322+ for i = 0 to String.length s - 1 do
323323+ match s.[i] with
324324+ | '\x00' .. '\x7F' as c -> l := char c :: !l
325325+ | _ -> err loc "Malformed %s string" (string_of_encoding encoding)
326326+ done;
327327+ !l
328328+329329+let rec repeat r = function
330330+ | 0, 0 -> Sedlex.eps
331331+ | 0, m -> Sedlex.alt Sedlex.eps (Sedlex.seq r (repeat r (0, m - 1)))
332332+ | n, m -> Sedlex.seq r (repeat r (n - 1, m - 1))
333333+334334+let regexp_of_pattern env =
335335+ let rec char_pair_op func name ~encoding p tuple =
336336+ (* Construct something like Sub(a,b) *)
337337+ match tuple with
338338+ | Some { ppat_desc = Ppat_tuple [p0; p1] } -> begin
339339+ match func (aux ~encoding p0) (aux ~encoding p1) with
340340+ | Some r -> r
341341+ | None ->
342342+ err p.ppat_loc
343343+ "the %s operator can only be applied to single-character length \
344344+ regexps"
345345+ name
346346+ end
347347+ | _ ->
348348+ err p.ppat_loc "the %s operator requires two arguments, like %s(a,b)"
349349+ name name
350350+ and aux ~encoding p =
351351+ (* interpret one pattern node *)
352352+ match p.ppat_desc with
353353+ | Ppat_or (p1, p2) -> Sedlex.alt (aux ~encoding p1) (aux ~encoding p2)
354354+ | Ppat_tuple (p :: pl) ->
355355+ List.fold_left
356356+ (fun r p -> Sedlex.seq r (aux ~encoding p))
357357+ (aux ~encoding p) pl
358358+ | Ppat_construct ({ txt = Lident "Star" }, Some (_, p)) ->
359359+ Sedlex.rep (aux ~encoding p)
360360+ | Ppat_construct ({ txt = Lident "Plus" }, Some (_, p)) ->
361361+ Sedlex.plus (aux ~encoding p)
362362+ | Ppat_construct ({ txt = Lident "Utf8" }, Some (_, p)) ->
363363+ aux ~encoding:Utf8 p
364364+ | Ppat_construct ({ txt = Lident "Latin1" }, Some (_, p)) ->
365365+ aux ~encoding:Latin1 p
366366+ | Ppat_construct ({ txt = Lident "Ascii" }, Some (_, p)) ->
367367+ aux ~encoding:Ascii p
368368+ | Ppat_construct
369369+ ( { txt = Lident "Rep" },
370370+ Some
371371+ ( _,
372372+ {
373373+ ppat_desc =
374374+ Ppat_tuple
375375+ [
376376+ p0;
377377+ {
378378+ ppat_desc =
379379+ Ppat_constant (i1 as i2) | Ppat_interval (i1, i2);
380380+ };
381381+ ];
382382+ } ) ) -> begin
383383+ match (i1, i2) with
384384+ | Pconst_integer (i1, _), Pconst_integer (i2, _) ->
385385+ let i1 = int_of_string i1 in
386386+ let i2 = int_of_string i2 in
387387+ if 0 <= i1 && i1 <= i2 then repeat (aux ~encoding p0) (i1, i2)
388388+ else err p.ppat_loc "Invalid range for Rep operator"
389389+ | _ ->
390390+ err p.ppat_loc "Rep must take an integer constant or interval"
391391+ end
392392+ | Ppat_construct ({ txt = Lident "Rep" }, _) ->
393393+ err p.ppat_loc "the Rep operator takes 2 arguments"
394394+ | Ppat_construct ({ txt = Lident "Opt" }, Some (_, p)) ->
395395+ Sedlex.alt Sedlex.eps (aux ~encoding p)
396396+ | Ppat_construct ({ txt = Lident "Compl" }, arg) -> begin
397397+ match arg with
398398+ | Some (_, p0) -> begin
399399+ match Sedlex.compl (aux ~encoding p0) with
400400+ | Some r -> r
401401+ | None ->
402402+ err p.ppat_loc
403403+ "the Compl operator can only be applied to a \
404404+ single-character length regexp"
405405+ end
406406+ | _ -> err p.ppat_loc "the Compl operator requires an argument"
407407+ end
408408+ | Ppat_construct ({ txt = Lident "Sub" }, arg) ->
409409+ char_pair_op ~encoding Sedlex.subtract "Sub" p
410410+ (Option.map (fun (_, arg) -> arg) arg)
411411+ | Ppat_construct ({ txt = Lident "Intersect" }, arg) ->
412412+ char_pair_op ~encoding Sedlex.intersection "Intersect" p
413413+ (Option.map (fun (_, arg) -> arg) arg)
414414+ | Ppat_construct ({ txt = Lident "Chars" }, arg) -> (
415415+ let const =
416416+ match arg with
417417+ | Some (_, { ppat_desc = Ppat_constant const }) -> Some const
418418+ | _ -> None
419419+ in
420420+ match const with
421421+ | Some (Pconst_string (s, _, _)) ->
422422+ let l = rev_csets_of_string ~loc:p.ppat_loc ~encoding s in
423423+ let chars = List.fold_left Cset.union Cset.empty l in
424424+ Sedlex.chars chars
425425+ | _ ->
426426+ err p.ppat_loc "the Chars operator requires a string argument")
427427+ | Ppat_interval (i_start, i_end) -> begin
428428+ match (i_start, i_end) with
429429+ | Pconst_char c1, Pconst_char c2 ->
430430+ let valid =
431431+ match encoding with
432432+ (* utf8 char interval can only match ascii because
433433+ of the OCaml lexer. *)
434434+ | Ascii | Utf8 -> (
435435+ function '\x00' .. '\x7f' -> true | _ -> false)
436436+ | Latin1 -> ( function _ -> true)
437437+ in
438438+ if not (valid c1 && valid c2) then
439439+ err p.ppat_loc
440440+ "this pattern is not a valid %s interval regexp"
441441+ (string_of_encoding encoding);
442442+ Sedlex.chars (Cset.interval (Char.code c1) (Char.code c2))
443443+ | Pconst_integer (i1, _), Pconst_integer (i2, _) ->
444444+ Sedlex.chars
445445+ (Cset.interval
446446+ (codepoint (int_of_string i1))
447447+ (codepoint (int_of_string i2)))
448448+ | _ -> err p.ppat_loc "this pattern is not a valid interval regexp"
449449+ end
450450+ | Ppat_constant const -> begin
451451+ match const with
452452+ | Pconst_string (s, _, _) ->
453453+ let rev_l = rev_csets_of_string s ~loc:p.ppat_loc ~encoding in
454454+ List.fold_left
455455+ (fun acc cset -> Sedlex.seq (Sedlex.chars cset) acc)
456456+ Sedlex.eps rev_l
457457+ | Pconst_char c -> Sedlex.chars (char c)
458458+ | Pconst_integer (i, _) ->
459459+ Sedlex.chars (Cset.singleton (codepoint (int_of_string i)))
460460+ | _ -> err p.ppat_loc "this pattern is not a valid regexp"
461461+ end
462462+ | Ppat_var { txt = x } -> begin
463463+ try StringMap.find x env
464464+ with Not_found -> err p.ppat_loc "unbound regexp %s" x
465465+ end
466466+ | _ -> err p.ppat_loc "this pattern is not a valid regexp"
467467+ in
468468+ aux ~encoding:Ascii
469469+470470+let previous = ref []
471471+let regexps = ref []
472472+let should_set_cookies = ref false
473473+474474+let mapper =
475475+ object (this)
476476+ inherit Ast_traverse.map as super
477477+ val env = builtin_regexps
478478+479479+ method define_regexp name p =
480480+ {<env = StringMap.add name (regexp_of_pattern env p) env>}
481481+482482+ method! expression e =
483483+ match e with
484484+ | [%expr [%sedlex [%e? { pexp_desc = Pexp_match (lexbuf, cases) }]]] ->
485485+ let lexbuf =
486486+ match lexbuf with
487487+ | { pexp_desc = Pexp_ident { txt = Lident txt } } ->
488488+ (txt, lexbuf)
489489+ | _ ->
490490+ err lexbuf.pexp_loc
491491+ "the matched expression must be a single identifier"
492492+ in
493493+ let cases = List.rev cases in
494494+ let error =
495495+ match List.hd cases with
496496+ | { pc_lhs = [%pat? _]; pc_rhs = e; pc_guard = None } ->
497497+ this#expression e
498498+ | { pc_lhs = p } ->
499499+ err p.ppat_loc
500500+ "the last branch must be a catch-all error case"
501501+ in
502502+ let cases = List.rev (List.tl cases) in
503503+ let cases =
504504+ List.map
505505+ (function
506506+ | { pc_lhs = p; pc_rhs = e; pc_guard = None } ->
507507+ (regexp_of_pattern env p, this#expression e)
508508+ | { pc_guard = Some e } ->
509509+ err e.pexp_loc "'when' guards are not supported")
510510+ cases
511511+ in
512512+ gen_definition lexbuf cases error
513513+ | [%expr
514514+ let [%p? { ppat_desc = Ppat_var { txt = name } }] =
515515+ [%sedlex.regexp? [%p? p]]
516516+ in
517517+ [%e? body]] ->
518518+ (this#define_regexp name p)#expression body
519519+ | [%expr [%sedlex [%e? _]]] ->
520520+ err e.pexp_loc
521521+ "the %%sedlex extension is only recognized on match expressions"
522522+ | _ -> super#expression e
523523+524524+ val toplevel = true
525525+526526+ method structure_with_regexps l =
527527+ let mapper = ref this in
528528+ let regexps = ref [] in
529529+ let l =
530530+ List.concat
531531+ (List.map
532532+ (function
533533+ | [%stri
534534+ let [%p? { ppat_desc = Ppat_var { txt = name } }] =
535535+ [%sedlex.regexp? [%p? p]]] as i ->
536536+ regexps := i :: !regexps;
537537+ mapper := !mapper#define_regexp name p;
538538+ []
539539+ | i -> [!mapper#structure_item i])
540540+ l)
541541+ in
542542+ (l, List.rev !regexps)
543543+544544+ method! structure l =
545545+ if toplevel then (
546546+ let sub = {<toplevel = false>} in
547547+ let l, regexps' = sub#structure_with_regexps (!previous @ l) in
548548+ let parts = List.map partition (get_partitions ()) in
549549+ let tables = List.map table (get_tables ()) in
550550+ regexps := regexps';
551551+ should_set_cookies := true;
552552+ tables @ parts @ l)
553553+ else fst (this#structure_with_regexps l)
554554+ end
555555+556556+let pre_handler cookies =
557557+ previous :=
558558+ match Driver.Cookies.get cookies "sedlex.regexps" Ast_pattern.__ with
559559+ | Some { pexp_desc = Pexp_extension (_, PStr l) } -> l
560560+ | Some _ -> assert false
561561+ | None -> []
562562+563563+let post_handler cookies =
564564+ if !should_set_cookies then (
565565+ let loc = default_loc in
566566+ Driver.Cookies.set cookies "sedlex.regexps"
567567+ (pexp_extension ~loc ({ loc; txt = "regexps" }, PStr !regexps)))
568568+569569+let extensions =
570570+ [
571571+ Extension.declare "sedlex" Extension.Context.expression
572572+ Ast_pattern.(single_expr_payload __)
573573+ (fun ~loc:_ ~path:_ expr -> mapper#expression expr);
574574+ ]
575575+576576+let () =
577577+ Driver.Cookies.add_handler pre_handler;
578578+ Driver.Cookies.add_post_handler post_handler;
579579+ Driver.register_transformation "sedlex" ~impl:mapper#structure
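The `decision_tree` type in `ppx_sedlex.ml` above is only ever turned into generated code by `gen_tree`, so it can be hard to see what a tree computes. The following is a minimal standalone sketch, not part of sedlex: the type and `simplify_decision_tree` are copied from the file above, while `eval` and the sample trees are hypothetical helpers added here for illustration. `eval` mirrors the three cases of `gen_tree`, except that it indexes an `int array` directly where the generated code reads a byte from a string table.

```ocaml
(* Copied from ppx_sedlex.ml above. *)
type decision_tree =
  | Lte of int * decision_tree * decision_tree
  | Table of int * int array
  | Return of int

(* Copied from ppx_sedlex.ml above: collapses a test whose two
   branches return the same class. *)
let rec simplify_decision_tree (x : decision_tree) =
  match x with
  | Table _ | Return _ -> x
  | Lte (_, (Return a as l), Return b) when a = b -> l
  | Lte (i, l, r) -> (
      let l = simplify_decision_tree l in
      let r = simplify_decision_tree r in
      match (l, r) with
      | Return a, Return b when a = b -> l
      | _ -> Lte (i, l, r))

(* Hypothetical interpreter (an assumption, not sedlex API): maps a code
   point [c] to a partition class, or -1 when no class applies. *)
let rec eval tree c =
  match tree with
  | Lte (i, yes, no) -> if c <= i then eval yes c else eval no c
  | Return i -> i
  (* Table entries store class + 1, so an untouched 0 decodes to -1. *)
  | Table (offset, t) -> t.(c - offset) - 1

(* The shape [decision] builds for a single segment ('0'..'9' -> class 0). *)
let digits = Lte (47, Return (-1), Lte (57, Return 0, Return (-1)))

(* A table covering code points 10..12, with a gap at 11. *)
let small_table = Table (10, [| 1; 0; 2 |])

let () =
  assert (eval digits (Char.code '5') = 0);
  assert (eval digits (Char.code 'A') = -1);
  assert (eval small_table 11 = -1)
```

A result of -1 corresponds to the catch-all `_` branch that `gen_state` emits, which calls `Sedlexing.backtrack`.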
+141
vendor/opam/sedlex/src/syntax/sedlex.ml
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+module Cset = Sedlex_cset
66+77+(* NFA *)
88+99+type node = {
1010+ id : int;
1111+ mutable eps : node list;
1212+ mutable trans : (Cset.t * node) list;
1313+}
1414+1515+(* Compilation regexp -> NFA *)
1616+1717+type regexp = node -> node
1818+1919+let cur_id = ref 0
2020+2121+let new_node () =
2222+ incr cur_id;
2323+ { id = !cur_id; eps = []; trans = [] }
2424+2525+let seq r1 r2 succ = r1 (r2 succ)
2626+2727+let is_chars final = function
2828+ | { eps = []; trans = [(c, f)] } when f == final -> Some c
2929+ | _ -> None
3030+3131+let chars c succ =
3232+ let n = new_node () in
3333+ n.trans <- [(c, succ)];
3434+ n
3535+3636+let alt r1 r2 succ =
3737+ let nr1 = r1 succ and nr2 = r2 succ in
3838+ match (is_chars succ nr1, is_chars succ nr2) with
3939+ | Some c1, Some c2 -> chars (Cset.union c1 c2) succ
4040+ | _ ->
4141+ let n = new_node () in
4242+ n.eps <- [nr1; nr2];
4343+ n
4444+4545+let rep r succ =
4646+ let n = new_node () in
4747+ n.eps <- [r n; succ];
4848+ n
4949+5050+let plus r succ =
5151+ let n = new_node () in
5252+ let nr = r n in
5353+ n.eps <- [nr; succ];
5454+ nr
5555+5656+let eps succ = succ (* eps for epsilon *)
5757+5858+let compl r =
5959+ let n = new_node () in
6060+ match is_chars n (r n) with
6161+ | Some c -> Some (chars (Cset.difference Cset.any c))
6262+ | _ -> None
6363+6464+let pair_op f r0 r1 =
6565+ (* Construct subtract or intersection *)
6666+ let n = new_node () in
6767+ let to_chars r = is_chars n (r n) in
6868+ match (to_chars r0, to_chars r1) with
6969+ | Some c0, Some c1 -> Some (chars (f c0 c1))
7070+ | _ -> None
7171+7272+let subtract = pair_op Cset.difference
7373+let intersection = pair_op Cset.intersection
7474+7575+let compile_re re =
7676+ let final = new_node () in
7777+ (re final, final)
7878+7979+(* Determinization *)
8080+8181+type state = node list
8282+(* A state of the DFA corresponds to a set of nodes in the NFA. *)
8383+8484+let rec add_node state node =
8585+ if List.memq node state then state else add_nodes (node :: state) node.eps
8686+8787+and add_nodes state nodes = List.fold_left add_node state nodes
8888+8989+let transition (state : state) =
9090+ (* Merge transitions with the same target *)
9191+ let rec norm = function
9292+ | (c1, n1) :: ((c2, n2) :: q as l) ->
9393+ if n1 == n2 then norm ((Cset.union c1 c2, n1) :: q)
9494+ else (c1, n1) :: norm l
9595+ | l -> l
9696+ in
9797+ let t = List.concat (List.map (fun n -> n.trans) state) in
9898+ let t = norm (List.sort (fun (_, n1) (_, n2) -> n1.id - n2.id) t) in
9999+100100+ (* Split char sets so as to make them disjoint *)
101101+ let split (all, t) (c0, n0) =
102102+ let t =
103103+ (Cset.difference c0 all, [n0])
104104+ :: List.map (fun (c, ns) -> (Cset.intersection c c0, n0 :: ns)) t
105105+ @ List.map (fun (c, ns) -> (Cset.difference c c0, ns)) t
106106+ in
107107+ (Cset.union all c0, List.filter (fun (c, _) -> not (Cset.is_empty c)) t)
108108+ in
109109+110110+ let _, t = List.fold_left split (Cset.empty, []) t in
111111+112112+ (* Epsilon closure of targets *)
113113+ let t = List.map (fun (c, ns) -> (c, add_nodes [] ns)) t in
114114+115115+ (* Canonical ordering *)
116116+ let t = Array.of_list t in
117117+ Array.sort (fun (c1, _) (c2, _) -> compare c1 c2) t;
118118+ t
119119+120120+let compile rs =
121121+ let rs = Array.map compile_re rs in
122122+ let counter = ref 0 in
123123+ let states = Hashtbl.create 31 in
124124+ let states_def = Hashtbl.create 31 in
125125+ let rec aux state =
126126+ try Hashtbl.find states state
127127+ with Not_found ->
128128+ let i = !counter in
129129+ incr counter;
130130+ Hashtbl.add states state i;
131131+ let trans = transition state in
132132+ let trans = Array.map (fun (p, t) -> (p, aux t)) trans in
133133+ let finals = Array.map (fun (_, f) -> List.memq f state) rs in
134134+ Hashtbl.add states_def i (trans, finals);
135135+ i
136136+ in
137137+ let init = ref [] in
138138+ Array.iter (fun (i, _) -> init := add_node !init i) rs;
139139+ let i = aux !init in
140140+ assert (i = 0);
141141+ Array.init !counter (Hashtbl.find states_def)
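In `sedlex.ml` above, a regexp is a function `node -> node`: given the node to continue to on success, it returns the entry node of its own NFA fragment. The sketch below shows that style in isolation. It is a simplification under stated assumptions, not the library's implementation: character sets are plain predicates on code points instead of `Sedlex_cset.t`, `alt` omits the `is_chars` merging optimisation, and `matches` is a hypothetical direct NFA simulation where sedlex instead determinizes with `compile`.

```ocaml
(* Assumption: predicates stand in for Sedlex_cset.t. *)
type node = {
  id : int;
  mutable eps : node list;
  mutable trans : ((int -> bool) * node) list;
}

let cur_id = ref 0

let new_node () =
  incr cur_id;
  { id = !cur_id; eps = []; trans = [] }

(* Each combinator takes the success node [succ] last. *)
let chars p succ =
  let n = new_node () in
  n.trans <- [(p, succ)];
  n

let seq r1 r2 succ = r1 (r2 succ)

let alt r1 r2 succ =
  let n = new_node () in
  n.eps <- [r1 succ; r2 succ];
  n

let rep r succ =
  let n = new_node () in
  n.eps <- [r n; succ];
  n

let plus r succ =
  let n = new_node () in
  let nr = r n in
  n.eps <- [nr; succ];
  nr

(* Epsilon closure, as in sedlex.ml. *)
let rec add_node state node =
  if List.memq node state then state else add_nodes (node :: state) node.eps

and add_nodes state nodes = List.fold_left add_node state nodes

(* Hypothetical helper: does [re] match the whole byte string [s]? *)
let matches re s =
  let final = new_node () in
  let step state c =
    List.fold_left
      (fun acc n ->
        List.fold_left
          (fun acc (p, target) -> if p c then add_node acc target else acc)
          acc n.trans)
      [] state
  in
  let state = ref (add_node [] (re final)) in
  String.iter (fun ch -> state := step !state (Char.code ch)) s;
  List.memq final !state

let digit = chars (fun c -> c >= Char.code '0' && c <= Char.code '9')
let letter = chars (fun c -> c >= Char.code 'a' && c <= Char.code 'z')
let number = plus digit
let ident = seq letter (rep (alt letter digit))

let () =
  assert (matches number "123");
  assert (not (matches number ""));
  assert (matches ident "a1b");
  assert (not (matches ident "1a"))
```

Passing the continuation node explicitly is what lets `rep` and `plus` tie the loop back to their own node `n` without a separate patching step.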
+25
vendor/opam/sedlex/src/syntax/sedlex.mli
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+type regexp
66+77+val chars : Sedlex_cset.t -> regexp
88+val seq : regexp -> regexp -> regexp
99+val alt : regexp -> regexp -> regexp
1010+val rep : regexp -> regexp
1111+val plus : regexp -> regexp
1212+val eps : regexp
1313+val compl : regexp -> regexp option
1414+1515+(* If the argument is a single [chars] regexp, returns a regexp
1616+ which matches the complement set. Otherwise returns [None]. *)
1717+val subtract : regexp -> regexp -> regexp option
1818+1919+(* If each argument is a single [chars] regexp, returns a regexp
2020+ which matches the set (arg1 - arg2). Otherwise returns [None]. *)
2121+val intersection : regexp -> regexp -> regexp option
2222+(* If each argument is a single [chars] regexp, returns a regexp
2323+ which matches the intersection set. Otherwise returns [None]. *)
2424+2525+val compile : regexp array -> ((Sedlex_cset.t * int) array * bool array) array
+5
vendor/opam/sedlex/src/syntax/sedlex_cset.ml
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+(* Copyright 2005, 2013 by Alain Frisch and LexiFi. *)
44+55+include Sedlex_utils.Cset
···11+(* Version is automatically generated from unicode data at
22+ * src/generator/data *)
33+val version : string
44+55+module Categories : sig
66+ val cc : Sedlex_cset.t
77+ val cf : Sedlex_cset.t
88+ val cn : Sedlex_cset.t
99+ val co : Sedlex_cset.t
1010+ val cs : Sedlex_cset.t
1111+ val ll : Sedlex_cset.t
1212+ val lm : Sedlex_cset.t
1313+ val lo : Sedlex_cset.t
1414+ val lt : Sedlex_cset.t
1515+ val lu : Sedlex_cset.t
1616+ val mc : Sedlex_cset.t
1717+ val me : Sedlex_cset.t
1818+ val mn : Sedlex_cset.t
1919+ val nd : Sedlex_cset.t
2020+ val nl : Sedlex_cset.t
2121+ val no : Sedlex_cset.t
2222+ val pc : Sedlex_cset.t
2323+ val pd : Sedlex_cset.t
2424+ val pe : Sedlex_cset.t
2525+ val pf : Sedlex_cset.t
2626+ val pi : Sedlex_cset.t
2727+ val po : Sedlex_cset.t
2828+ val ps : Sedlex_cset.t
2929+ val sc : Sedlex_cset.t
3030+ val sk : Sedlex_cset.t
3131+ val sm : Sedlex_cset.t
3232+ val so : Sedlex_cset.t
3333+ val zl : Sedlex_cset.t
3434+ val zp : Sedlex_cset.t
3535+ val zs : Sedlex_cset.t
3636+ val list : (string * Sedlex_cset.t) list
3737+end
3838+3939+module Properties : sig
4040+ val alphabetic : Sedlex_cset.t
4141+ val ascii_hex_digit : Sedlex_cset.t
4242+ val hex_digit : Sedlex_cset.t
4343+ val id_continue : Sedlex_cset.t
4444+ val id_start : Sedlex_cset.t
4545+ val lowercase : Sedlex_cset.t
4646+ val math : Sedlex_cset.t
4747+ val other_alphabetic : Sedlex_cset.t
4848+ val other_lowercase : Sedlex_cset.t
4949+ val other_math : Sedlex_cset.t
5050+ val other_uppercase : Sedlex_cset.t
5151+ val uppercase : Sedlex_cset.t
5252+ val white_space : Sedlex_cset.t
5353+ val xid_continue : Sedlex_cset.t
5454+ val xid_start : Sedlex_cset.t
5555+ val list : (string * Sedlex_cset.t) list
5656+end
+49
vendor/opam/sedlex/src/syntax/utf8.ml
···11+open Sedlexing
22+33+let unsafe_byte s j = Char.code (String.unsafe_get s j)
44+let malformed s j l = `Malformed (String.sub s j l)
55+66+let r_utf_8 s j l =
77+ (* assert (0 <= j && 0 <= l && j + l <= String.length s); *)
88+ let uchar c = `Uchar (Uchar.unsafe_of_int c) in
99+ match l with
1010+ | 1 -> uchar (unsafe_byte s j)
1111+ | 2 -> (
1212+ let b0 = unsafe_byte s j in
1313+ let b1 = unsafe_byte s (j + 1) in
1414+ match Utf8.Helper.check_two b0 b1 with
1515+ | i -> uchar i
1616+ | exception MalFormed -> malformed s j l)
1717+ | 3 -> (
1818+ let b0 = unsafe_byte s j in
1919+ let b1 = unsafe_byte s (j + 1) in
2020+ let b2 = unsafe_byte s (j + 2) in
2121+ match Utf8.Helper.check_three b0 b1 b2 with
2222+ | i -> uchar i
2323+ | exception MalFormed -> malformed s j l)
2424+ | 4 -> (
2525+ let b0 = unsafe_byte s j in
2626+ let b1 = unsafe_byte s (j + 1) in
2727+ let b2 = unsafe_byte s (j + 2) in
2828+ let b3 = unsafe_byte s (j + 3) in
2929+ match Utf8.Helper.check_four b0 b1 b2 b3 with
3030+ | i -> uchar i
3131+ | exception MalFormed -> malformed s j l)
3232+ | _ -> assert false
3333+3434+let fold ~f acc s =
3535+ let rec loop acc f s i last =
3636+ if i > last then acc
3737+ else (
3838+ match Utf8.Helper.width (String.unsafe_get s i) with
3939+ | exception MalFormed ->
4040+ loop (f acc i (malformed s i 1)) f s (i + 1) last
4141+ | need ->
4242+ let rem = last - i + 1 in
4343+ if rem < need then f acc i (malformed s i rem)
4444+ else loop (f acc i (r_utf_8 s i need)) f s (i + need) last)
4545+ in
4646+ let pos = 0 in
4747+ let len = String.length s in
4848+ let last = pos + len - 1 in
4949+ loop acc f s pos last
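The `fold` in `utf8.ml` above walks a string and hands each decoded code point, or each malformed byte sequence, to `f` together with its byte offset. A rough standalone equivalent can be sketched with the standard library alone (OCaml 4.14 or later), using the same `String.get_utf_8_uchar` and `Uchar.utf_decode_*` functions that `test/basic.ml` relies on. This is a different technique, not the vendored code: `fold_utf8` and `code_points` are hypothetical names, and the stdlib decoder may group malformed bytes differently from `Utf8.Helper`. In particular it keeps going after truncated input, whereas the `fold` above stops there.

```ocaml
(* Hypothetical stand-in for [Utf8.fold], built on the stdlib decoder. *)
let fold_utf8 ~f acc s =
  let len = String.length s in
  let rec loop acc i =
    if i >= len then acc
    else (
      let d = String.get_utf_8_uchar s i in
      (* Number of bytes consumed, also for invalid input (at least 1). *)
      let n = Uchar.utf_decode_length d in
      let item =
        if Uchar.utf_decode_is_valid d then `Uchar (Uchar.utf_decode_uchar d)
        else `Malformed (String.sub s i n)
      in
      loop (f acc i item) (i + n))
  in
  loop acc 0

(* Usage: collect code points, using -1 as a marker for malformed input. *)
let code_points s =
  List.rev
    (fold_utf8
       ~f:(fun acc _ x ->
         match x with
         | `Uchar u -> Uchar.to_int u :: acc
         | `Malformed _ -> -1 :: acc)
       [] s)

let () =
  (* "a" followed by U+00E9, encoded as the two bytes C3 A9. *)
  assert (code_points "a\xc3\xa9" = [97; 233]);
  assert (code_points "\xff" = [-1])
```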
+5
vendor/opam/sedlex/src/syntax/utf8.mli
···11+val fold :
22+ f:('a -> int -> [> `Malformed of string | `Uchar of Uchar.t ] -> 'a) ->
33+ 'a ->
44+ string ->
55+ 'a
···11+(* The package sedlex is released under the terms of an MIT-like license. *)
22+(* See the attached LICENSE file. *)
33+44+(** Unicode classes from XML *)
55+66+open Sedlex_cset
77+88+val letter : t
99+val digit : t
1010+val extender : t
1111+val base_char : t
1212+val ideographic : t
1313+val combining_char : t
1414+val blank : t
vendor/opam/sedlex/test/UTF-8-test.txt
This is a binary file and will not be displayed.
+1062
vendor/opam/sedlex/test/basic.ml
···11+let () = set_binary_mode_out stdout true
22+let digit = [%sedlex.regexp? '0' .. '9']
33+let number = [%sedlex.regexp? Plus digit]
44+55+let print_pos buf =
66+ let f { Lexing.pos_lnum; pos_bol; pos_cnum; _ } =
77+ Printf.sprintf "line=%d:bol=%d:cnum=%d" pos_lnum pos_bol pos_cnum
88+ in
99+ let f ~prefix (startp, endp) =
1010+ Printf.printf "%s pos: [%s;%s]\n" prefix (f startp) (f endp)
1111+ in
1212+ f ~prefix:"code point" (Sedlexing.lexing_positions buf);
1313+ f ~prefix:"bytes" (Sedlexing.lexing_bytes_positions buf)
1414+1515+let rec token buf =
1616+ match%sedlex buf with
1717+ | number ->
1818+ print_pos buf;
1919+ Printf.printf "Number %s\n" (Sedlexing.Utf8.lexeme buf);
2020+ token buf
2121+ | id_start, Star id_continue ->
2222+ print_pos buf;
2323+ Printf.printf "Ident %s\n" (Sedlexing.Utf8.lexeme buf);
2424+ token buf
2525+ | Plus xml_blank -> token buf
2626+ | Plus (Chars "+*-/") ->
2727+ print_pos buf;
2828+ Printf.printf "Op %s\n" (Sedlexing.Utf8.lexeme buf);
2929+ token buf
3030+ | eof ->
3131+ print_pos buf;
3232+ print_endline "EOF"
3333+ | any ->
3434+ print_pos buf;
3535+ Printf.printf "Any %s\n" (Sedlexing.Utf8.lexeme buf);
3636+ token buf
3737+ | _ -> assert false
3838+3939+let utf16_of_utf8 ?(endian = Sedlexing.Utf16.Big_endian) s =
4040+ let b = Buffer.create (String.length s * 4) in
4141+ let rec loop pos =
4242+ if pos >= String.length s then ()
4343+ else (
4444+ let c = String.get_utf_8_uchar s pos in
4545+ let u = Uchar.utf_decode_uchar c in
4646+ (match endian with
4747+ | Big_endian -> Buffer.add_utf_16be_uchar b u
4848+ | Little_endian -> Buffer.add_utf_16le_uchar b u);
4949+ loop (pos + Uchar.utf_decode_length c))
5050+ in
5151+ loop 0;
5252+ Buffer.contents b
5353+5454+let remove_last s n = String.sub s 0 (String.length s - n)
5555+5656+let gen_from_string s =
5757+ let i = ref 0 in
5858+ fun () ->
5959+ if !i >= String.length s then None
6060+ else (
6161+ let c = String.get s !i in
6262+ incr i;
6363+ Some c)
6464+6565+let channel_from_string s =
6666+ let name, oc = Filename.open_temp_file "" "" in
6767+ output_string oc s;
6868+ close_out oc;
6969+ open_in_bin name
7070+7171+let test_latin s f =
7272+ print_endline "== from_string ==";
7373+ (try f (Sedlexing.Latin1.from_string s)
7474+ with Sedlexing.MalFormed -> print_endline "MalFormed");
7575+ print_endline "== from_gen ==";
7676+ (try f (Sedlexing.Latin1.from_gen (gen_from_string s))
7777+ with Sedlexing.MalFormed -> print_endline "MalFormed");
7878+ print_endline "== from_channel ==";
7979+ try f (Sedlexing.Latin1.from_channel (channel_from_string s))
8080+ with Sedlexing.MalFormed -> print_endline "MalFormed"
8181+8282+let test_utf8 s f =
8383+ print_endline "== from_string ==";
8484+ (try f (Sedlexing.Utf8.from_string s)
8585+ with Sedlexing.MalFormed -> print_endline "MalFormed");
8686+ print_endline "== from_gen ==";
8787+ (try f (Sedlexing.Utf8.from_gen (gen_from_string s))
8888+ with Sedlexing.MalFormed -> print_endline "MalFormed");
8989+ print_endline "== from_channel ==";
9090+ try f (Sedlexing.Utf8.from_channel (channel_from_string s))
9191+ with Sedlexing.MalFormed -> print_endline "MalFormed"
9292+9393+let test_utf16 s bo f =
9494+ print_endline "== from_string ==";
9595+ (try f (Sedlexing.Utf16.from_string s bo)
9696+ with Sedlexing.MalFormed -> print_endline "MalFormed");
9797+ print_endline "== from_gen ==";
9898+ (try f (Sedlexing.Utf16.from_gen (gen_from_string s) bo)
9999+ with Sedlexing.MalFormed -> print_endline "MalFormed");
100100+ print_endline "== from_channel ==";
101101+ try f (Sedlexing.Utf16.from_channel (channel_from_string s) bo)
102102+ with Sedlexing.MalFormed -> print_endline "MalFormed"
103103+104104+let%expect_test "latin1" =
105105+ let s = "asas 123 + 2asd" in
106106+ test_latin s (fun lb -> token lb);
107107+ [%expect
108108+ {|
109109+ == from_string ==
110110+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
111111+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
112112+ Ident asas
113113+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
114114+ bytes pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
115115+ Number 123
116116+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
117117+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
118118+ Op +
119119+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
120120+ bytes pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
121121+ Number 2
122122+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
123123+ bytes pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
124124+ Ident asd
125125+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
126126+ bytes pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
127127+ EOF
128128+ == from_gen ==
129129+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
130130+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
131131+ Ident asas
132132+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
133133+ bytes pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
134134+ Number 123
135135+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
136136+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
137137+ Op +
138138+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
139139+ bytes pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
140140+ Number 2
141141+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
142142+ bytes pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
143143+ Ident asd
144144+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
145145+ bytes pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
146146+ EOF
147147+ == from_channel ==
148148+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
149149+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
150150+ Ident asas
151151+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
152152+ bytes pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
153153+ Number 123
154154+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
155155+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
156156+ Op +
157157+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
158158+ bytes pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
159159+ Number 2
160160+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
161161+ bytes pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
162162+ Ident asd
163163+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
164164+ bytes pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
165165+ EOF |}];
166166+ let s = "asas 123 + 2\129" in
167167+ test_latin s (fun lb -> token lb);
168168+ [%expect
169169+ {|
170170+ == from_string ==
171171+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
172172+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
173173+ Ident asas
174174+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
175175+ bytes pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
176176+ Number 123
177177+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
178178+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
179179+ Op +
180180+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
181181+ bytes pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
182182+ Number 2
183183+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
184184+ bytes pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
185185+ Any
186186+ code point pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=13]
187187+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=13]
188188+ EOF
189189+ == from_gen ==
190190+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
191191+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
192192+ Ident asas
193193+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
194194+ bytes pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
195195+ Number 123
196196+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
197197+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
198198+ Op +
199199+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
200200+ bytes pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
201201+ Number 2
202202+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
203203+ bytes pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
204204+ Any
205205+ code point pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=13]
206206+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=13]
207207+ EOF
208208+ == from_channel ==
209209+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
210210+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
211211+ Ident asas
212212+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
213213+ bytes pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
214214+ Number 123
215215+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
216216+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
217217+ Op +
218218+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
219219+ bytes pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
220220+ Number 2
221221+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
222222+ bytes pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
223223+ Any
224224+ code point pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=13]
225225+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=13]
226226+ EOF |}]
227227+
228228+let%expect_test "utf8" =
229229+ let s = {|as🎉as 123 + 2asd|} in
230230+ test_utf8 s (fun lb -> token lb);
231231+ [%expect
232232+ {|
233233+ == from_string ==
234234+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
235235+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
236236+ Ident as
237237+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=3]
238238+ bytes pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=6]
239239+ Any 🎉
240240+ code point pos: [line=1:bol=0:cnum=3;line=1:bol=0:cnum=5]
241241+ bytes pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=8]
242242+ Ident as
243243+ code point pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=9]
244244+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=12]
245245+ Number 123
246246+ code point pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=11]
247247+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=14]
248248+ Op +
249249+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
250250+ bytes pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=16]
251251+ Number 2
252252+ code point pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=16]
253253+ bytes pos: [line=1:bol=0:cnum=16;line=1:bol=0:cnum=19]
254254+ Ident asd
255255+ code point pos: [line=1:bol=0:cnum=16;line=1:bol=0:cnum=16]
256256+ bytes pos: [line=1:bol=0:cnum=19;line=1:bol=0:cnum=19]
257257+ EOF
258258+ == from_gen ==
259259+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
260260+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
261261+ Ident as
262262+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=3]
263263+ bytes pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=6]
264264+ Any 🎉
265265+ code point pos: [line=1:bol=0:cnum=3;line=1:bol=0:cnum=5]
266266+ bytes pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=8]
267267+ Ident as
268268+ code point pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=9]
269269+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=12]
270270+ Number 123
271271+ code point pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=11]
272272+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=14]
273273+ Op +
274274+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
275275+ bytes pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=16]
276276+ Number 2
277277+ code point pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=16]
278278+ bytes pos: [line=1:bol=0:cnum=16;line=1:bol=0:cnum=19]
279279+ Ident asd
280280+ code point pos: [line=1:bol=0:cnum=16;line=1:bol=0:cnum=16]
281281+ bytes pos: [line=1:bol=0:cnum=19;line=1:bol=0:cnum=19]
282282+ EOF
283283+ == from_channel ==
284284+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
285285+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
286286+ Ident as
287287+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=3]
288288+ bytes pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=6]
289289+ Any 🎉
290290+ code point pos: [line=1:bol=0:cnum=3;line=1:bol=0:cnum=5]
291291+ bytes pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=8]
292292+ Ident as
293293+ code point pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=9]
294294+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=12]
295295+ Number 123
296296+ code point pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=11]
297297+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=14]
298298+ Op +
299299+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=13]
300300+ bytes pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=16]
301301+ Number 2
302302+ code point pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=16]
303303+ bytes pos: [line=1:bol=0:cnum=16;line=1:bol=0:cnum=19]
304304+ Ident asd
305305+ code point pos: [line=1:bol=0:cnum=16;line=1:bol=0:cnum=16]
306306+ bytes pos: [line=1:bol=0:cnum=19;line=1:bol=0:cnum=19]
307307+ EOF |}];
308308+ let s = {|as🎉as 123 + 2|} ^ "\129" in
309309+ test_utf8 s (fun lb -> token lb);
310310+ [%expect
311311+ {|
312312+ == from_string ==
313313+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
314314+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
315315+ Ident as
316316+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=3]
317317+ bytes pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=6]
318318+ Any 🎉
319319+ code point pos: [line=1:bol=0:cnum=3;line=1:bol=0:cnum=5]
320320+ bytes pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=8]
321321+ Ident as
322322+ code point pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=9]
323323+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=12]
324324+ Number 123
325325+ code point pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=11]
326326+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=14]
327327+ Op +
328328+ MalFormed
329329+ == from_gen ==
330330+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
331331+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
332332+ Ident as
333333+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=3]
334334+ bytes pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=6]
335335+ Any 🎉
336336+ code point pos: [line=1:bol=0:cnum=3;line=1:bol=0:cnum=5]
337337+ bytes pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=8]
338338+ Ident as
339339+ code point pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=9]
340340+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=12]
341341+ Number 123
342342+ code point pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=11]
343343+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=14]
344344+ Op +
345345+ MalFormed
346346+ == from_channel ==
347347+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
348348+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
349349+ Ident as
350350+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=3]
351351+ bytes pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=6]
352352+ Any 🎉
353353+ code point pos: [line=1:bol=0:cnum=3;line=1:bol=0:cnum=5]
354354+ bytes pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=8]
355355+ Ident as
356356+ code point pos: [line=1:bol=0:cnum=6;line=1:bol=0:cnum=9]
357357+ bytes pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=12]
358358+ Number 123
359359+ code point pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=11]
360360+ bytes pos: [line=1:bol=0:cnum=13;line=1:bol=0:cnum=14]
361361+ Op +
362362+ MalFormed |}]
363363+
364364+let%expect_test "utf16" =
365365+ let bo = None in
366366+ let s = utf16_of_utf8 "asas 123 + 2asd" in
367367+ test_utf16 s bo (fun lb -> token lb);
368368+ [%expect
369369+ {|
370370+ == from_string ==
371371+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
372372+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
373373+ Ident asas
374374+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
375375+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
376376+ Number 123
377377+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
378378+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
379379+ Op +
380380+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
381381+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
382382+ Number 2
383383+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
384384+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
385385+ Ident asd
386386+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
387387+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
388388+ EOF
389389+ == from_gen ==
390390+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
391391+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
392392+ Ident asas
393393+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
394394+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
395395+ Number 123
396396+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
397397+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
398398+ Op +
399399+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
400400+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
401401+ Number 2
402402+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
403403+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
404404+ Ident asd
405405+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
406406+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
407407+ EOF
408408+ == from_channel ==
409409+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
410410+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
411411+ Ident asas
412412+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
413413+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
414414+ Number 123
415415+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
416416+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
417417+ Op +
418418+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
419419+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
420420+ Number 2
421421+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
422422+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
423423+ Ident asd
424424+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
425425+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
426426+ EOF |}];
427427+ let s = utf16_of_utf8 "asas 123 + 2" ^ "a" in
428428+ test_utf16 s bo (fun lb -> token lb);
429429+ [%expect
430430+ {|
431431+ == from_string ==
432432+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
433433+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
434434+ Ident asas
435435+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
436436+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
437437+ Number 123
438438+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
439439+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
440440+ Op +
441441+ MalFormed
442442+ == from_gen ==
443443+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
444444+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
445445+ Ident asas
446446+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
447447+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
448448+ Number 123
449449+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
450450+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
451451+ Op +
452452+ MalFormed
453453+ == from_channel ==
454454+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
455455+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
456456+ Ident asas
457457+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
458458+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
459459+ Number 123
460460+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
461461+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
462462+ Op +
463463+ MalFormed |}];
464464+ let s1 = "12asd12\u{1F6F3}" in
465465+ let s = utf16_of_utf8 s1 in
466466+ test_utf16 s bo (fun lb -> token lb);
467467+ [%expect
468468+ {|
469469+ == from_string ==
470470+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
471471+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
472472+ Number 12
473473+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
474474+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
475475+ Ident asd12
476476+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
477477+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
478478+ Any 🛳
479479+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
480480+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
481481+ EOF
482482+ == from_gen ==
483483+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
484484+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
485485+ Number 12
486486+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
487487+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
488488+ Ident asd12
489489+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
490490+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
491491+ Any 🛳
492492+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
493493+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
494494+ EOF
495495+ == from_channel ==
496496+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
497497+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
498498+ Number 12
499499+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
500500+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
501501+ Ident asd12
502502+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
503503+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
504504+ Any 🛳
505505+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
506506+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
507507+ EOF |}];
508508+ test_utf16 (remove_last s 1) bo (fun lb -> token lb);
509509+ [%expect
510510+ {|
511511+ == from_string ==
512512+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
513513+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
514514+ Number 12
515515+ MalFormed
516516+ == from_gen ==
517517+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
518518+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
519519+ Number 12
520520+ MalFormed
521521+ == from_channel ==
522522+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
523523+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
524524+ Number 12
525525+ MalFormed |}];
526526+ test_utf16 (remove_last s 2) bo (fun lb -> token lb);
527527+ [%expect
528528+ {|
529529+ == from_string ==
530530+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
531531+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
532532+ Number 12
533533+ MalFormed
534534+ == from_gen ==
535535+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
536536+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
537537+ Number 12
538538+ MalFormed
539539+ == from_channel ==
540540+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
541541+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
542542+ Number 12
543543+ MalFormed |}];
544544+ test_utf16 (remove_last s 3) bo (fun lb -> token lb);
545545+ [%expect
546546+ {|
547547+ == from_string ==
548548+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
549549+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
550550+ Number 12
551551+ MalFormed
552552+ == from_gen ==
553553+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
554554+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
555555+ Number 12
556556+ MalFormed
557557+ == from_channel ==
558558+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
559559+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
560560+ Number 12
561561+ MalFormed |}];
562562+ test_utf16 (remove_last s 4) bo (fun lb -> token lb);
563563+ [%expect
564564+ {|
565565+ == from_string ==
566566+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
567567+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
568568+ Number 12
569569+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
570570+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
571571+ Ident asd12
572572+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
573573+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
574574+ EOF
575575+ == from_gen ==
576576+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
577577+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
578578+ Number 12
579579+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
580580+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
581581+ Ident asd12
582582+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
583583+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
584584+ EOF
585585+ == from_channel ==
586586+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
587587+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
588588+ Number 12
589589+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
590590+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
591591+ Ident asd12
592592+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
593593+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
594594+ EOF |}]
595595+
596596+let%expect_test "utf16-be" =
597597+ let endian = Sedlexing.Utf16.Big_endian in
598598+ let utf16_of_utf8 = utf16_of_utf8 ~endian in
599599+ let bo = Some endian in
600600+ let s = utf16_of_utf8 "asas 123 + 2asd" in
601601+ test_utf16 s bo (fun lb -> token lb);
602602+ [%expect
603603+ {|
604604+ == from_string ==
605605+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
606606+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
607607+ Ident asas
608608+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
609609+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
610610+ Number 123
611611+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
612612+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
613613+ Op +
614614+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
615615+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
616616+ Number 2
617617+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
618618+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
619619+ Ident asd
620620+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
621621+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
622622+ EOF
623623+ == from_gen ==
624624+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
625625+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
626626+ Ident asas
627627+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
628628+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
629629+ Number 123
630630+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
631631+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
632632+ Op +
633633+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
634634+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
635635+ Number 2
636636+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
637637+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
638638+ Ident asd
639639+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
640640+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
641641+ EOF
642642+ == from_channel ==
643643+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
644644+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
645645+ Ident asas
646646+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
647647+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
648648+ Number 123
649649+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
650650+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
651651+ Op +
652652+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
653653+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
654654+ Number 2
655655+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
656656+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
657657+ Ident asd
658658+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
659659+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
660660+ EOF |}];
661661+ let s = utf16_of_utf8 "asas 123 + 2" ^ "a" in
662662+ test_utf16 s bo (fun lb -> token lb);
663663+ [%expect
664664+ {|
665665+ == from_string ==
666666+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
667667+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
668668+ Ident asas
669669+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
670670+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
671671+ Number 123
672672+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
673673+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
674674+ Op +
675675+ MalFormed
676676+ == from_gen ==
677677+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
678678+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
679679+ Ident asas
680680+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
681681+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
682682+ Number 123
683683+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
684684+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
685685+ Op +
686686+ MalFormed
687687+ == from_channel ==
688688+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
689689+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
690690+ Ident asas
691691+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
692692+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
693693+ Number 123
694694+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
695695+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
696696+ Op +
697697+ MalFormed |}];
698698+ let s1 = "12asd12\u{1F6F3}" in
699699+ let s = utf16_of_utf8 s1 in
700700+ test_utf16 s bo (fun lb -> token lb);
701701+ [%expect
702702+ {|
703703+ == from_string ==
704704+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
705705+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
706706+ Number 12
707707+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
708708+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
709709+ Ident asd12
710710+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
711711+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
712712+ Any 🛳
713713+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
714714+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
715715+ EOF
716716+ == from_gen ==
717717+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
718718+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
719719+ Number 12
720720+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
721721+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
722722+ Ident asd12
723723+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
724724+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
725725+ Any 🛳
726726+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
727727+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
728728+ EOF
729729+ == from_channel ==
730730+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
731731+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
732732+ Number 12
733733+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
734734+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
735735+ Ident asd12
736736+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
737737+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
738738+ Any 🛳
739739+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
740740+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
741741+ EOF |}];
742742+ test_utf16 (remove_last s 1) bo (fun lb -> token lb);
743743+ [%expect
744744+ {|
745745+ == from_string ==
746746+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
747747+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
748748+ Number 12
749749+ MalFormed
750750+ == from_gen ==
751751+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
752752+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
753753+ Number 12
754754+ MalFormed
755755+ == from_channel ==
756756+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
757757+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
758758+ Number 12
759759+ MalFormed |}];
760760+ test_utf16 (remove_last s 2) bo (fun lb -> token lb);
761761+ [%expect
762762+ {|
763763+ == from_string ==
764764+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
765765+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
766766+ Number 12
767767+ MalFormed
768768+ == from_gen ==
769769+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
770770+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
771771+ Number 12
772772+ MalFormed
773773+ == from_channel ==
774774+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
775775+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
776776+ Number 12
777777+ MalFormed |}];
778778+ test_utf16 (remove_last s 3) bo (fun lb -> token lb);
779779+ [%expect
780780+ {|
781781+ == from_string ==
782782+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
783783+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
784784+ Number 12
785785+ MalFormed
786786+ == from_gen ==
787787+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
788788+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
789789+ Number 12
790790+ MalFormed
791791+ == from_channel ==
792792+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
793793+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
794794+ Number 12
795795+ MalFormed |}];
796796+ test_utf16 (remove_last s 4) bo (fun lb -> token lb);
797797+ [%expect
798798+ {|
799799+ == from_string ==
800800+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
801801+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
802802+ Number 12
803803+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
804804+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
805805+ Ident asd12
806806+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
807807+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
808808+ EOF
809809+ == from_gen ==
810810+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
811811+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
812812+ Number 12
813813+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
814814+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
815815+ Ident asd12
816816+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
817817+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
818818+ EOF
819819+ == from_channel ==
820820+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
821821+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
822822+ Number 12
823823+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
824824+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
825825+ Ident asd12
826826+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
827827+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
828828+ EOF |}]
829829+
830830+let%expect_test "utf16-le" =
831831+ let endian = Sedlexing.Utf16.Little_endian in
832832+ let utf16_of_utf8 = utf16_of_utf8 ~endian in
833833+ let bo = Some endian in
834834+ let s = utf16_of_utf8 "asas 123 + 2asd" in
835835+ test_utf16 s bo (fun lb -> token lb);
836836+ [%expect
837837+ {|
838838+ == from_string ==
839839+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
840840+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
841841+ Ident asas
842842+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
843843+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
844844+ Number 123
845845+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
846846+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
847847+ Op +
848848+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
849849+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
850850+ Number 2
851851+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
852852+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
853853+ Ident asd
854854+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
855855+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
856856+ EOF
857857+ == from_gen ==
858858+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
859859+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
860860+ Ident asas
861861+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
862862+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
863863+ Number 123
864864+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
865865+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
866866+ Op +
867867+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
868868+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
869869+ Number 2
870870+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
871871+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
872872+ Ident asd
873873+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
874874+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
875875+ EOF
876876+ == from_channel ==
877877+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
878878+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
879879+ Ident asas
880880+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
881881+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
882882+ Number 123
883883+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
884884+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
885885+ Op +
886886+ code point pos: [line=1:bol=0:cnum=11;line=1:bol=0:cnum=12]
887887+ bytes pos: [line=1:bol=0:cnum=22;line=1:bol=0:cnum=24]
888888+ Number 2
889889+ code point pos: [line=1:bol=0:cnum=12;line=1:bol=0:cnum=15]
890890+ bytes pos: [line=1:bol=0:cnum=24;line=1:bol=0:cnum=30]
891891+ Ident asd
892892+ code point pos: [line=1:bol=0:cnum=15;line=1:bol=0:cnum=15]
893893+ bytes pos: [line=1:bol=0:cnum=30;line=1:bol=0:cnum=30]
894894+ EOF |}];
895895+ let s = utf16_of_utf8 "asas 123 + 2" ^ "a" in
896896+ test_utf16 s bo (fun lb -> token lb);
897897+ [%expect
898898+ {|
899899+ == from_string ==
900900+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
901901+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
902902+ Ident asas
903903+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
904904+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
905905+ Number 123
906906+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
907907+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
908908+ Op +
909909+ MalFormed
910910+ == from_gen ==
911911+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
912912+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
913913+ Ident asas
914914+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
915915+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
916916+ Number 123
917917+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
918918+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
919919+ Op +
920920+ MalFormed
921921+ == from_channel ==
922922+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
923923+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=8]
924924+ Ident asas
925925+ code point pos: [line=1:bol=0:cnum=5;line=1:bol=0:cnum=8]
926926+ bytes pos: [line=1:bol=0:cnum=10;line=1:bol=0:cnum=16]
927927+ Number 123
928928+ code point pos: [line=1:bol=0:cnum=9;line=1:bol=0:cnum=10]
929929+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=20]
930930+ Op +
931931+ MalFormed |}];
932932+ let s1 = "12asd12\u{1F6F3}" in
933933+ let s = utf16_of_utf8 s1 in
934934+ test_utf16 s bo (fun lb -> token lb);
935935+ [%expect
936936+ {|
937937+ == from_string ==
938938+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
939939+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
940940+ Number 12
941941+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
942942+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
943943+ Ident asd12
944944+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
945945+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
946946+ Any 🛳
947947+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
948948+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
949949+ EOF
950950+ == from_gen ==
951951+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
952952+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
953953+ Number 12
954954+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
955955+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
956956+ Ident asd12
957957+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
958958+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
959959+ Any 🛳
960960+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
961961+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
962962+ EOF
963963+ == from_channel ==
964964+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
965965+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
966966+ Number 12
967967+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
968968+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
969969+ Ident asd12
970970+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=8]
971971+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=18]
972972+ Any 🛳
973973+ code point pos: [line=1:bol=0:cnum=8;line=1:bol=0:cnum=8]
974974+ bytes pos: [line=1:bol=0:cnum=18;line=1:bol=0:cnum=18]
975975+ EOF |}];
976976+ test_utf16 (remove_last s 1) bo (fun lb -> token lb);
977977+ [%expect
978978+ {|
979979+ == from_string ==
980980+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
981981+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
982982+ Number 12
983983+ MalFormed
984984+ == from_gen ==
985985+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
986986+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
987987+ Number 12
988988+ MalFormed
989989+ == from_channel ==
990990+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
991991+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
992992+ Number 12
993993+ MalFormed |}];
994994+ test_utf16 (remove_last s 2) bo (fun lb -> token lb);
995995+ [%expect
996996+ {|
997997+ == from_string ==
998998+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
999999+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10001000+ Number 12
10011001+ MalFormed
10021002+ == from_gen ==
10031003+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10041004+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10051005+ Number 12
10061006+ MalFormed
10071007+ == from_channel ==
10081008+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10091009+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10101010+ Number 12
10111011+ MalFormed |}];
10121012+ test_utf16 (remove_last s 3) bo (fun lb -> token lb);
10131013+ [%expect
10141014+ {|
10151015+ == from_string ==
10161016+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10171017+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10181018+ Number 12
10191019+ MalFormed
10201020+ == from_gen ==
10211021+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10221022+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10231023+ Number 12
10241024+ MalFormed
10251025+ == from_channel ==
10261026+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10271027+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10281028+ Number 12
10291029+ MalFormed |}];
10301030+ test_utf16 (remove_last s 4) bo (fun lb -> token lb);
10311031+ [%expect
10321032+ {|
10331033+ == from_string ==
10341034+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10351035+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10361036+ Number 12
10371037+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
10381038+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
10391039+ Ident asd12
10401040+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
10411041+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
10421042+ EOF
10431043+ == from_gen ==
10441044+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10451045+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10461046+ Number 12
10471047+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
10481048+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
10491049+ Ident asd12
10501050+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
10511051+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
10521052+ EOF
10531053+ == from_channel ==
10541054+ code point pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=2]
10551055+ bytes pos: [line=1:bol=0:cnum=0;line=1:bol=0:cnum=4]
10561056+ Number 12
10571057+ code point pos: [line=1:bol=0:cnum=2;line=1:bol=0:cnum=7]
10581058+ bytes pos: [line=1:bol=0:cnum=4;line=1:bol=0:cnum=14]
10591059+ Ident asd12
10601060+ code point pos: [line=1:bol=0:cnum=7;line=1:bol=0:cnum=7]
10611061+ bytes pos: [line=1:bol=0:cnum=14;line=1:bol=0:cnum=14]
10621062+ EOF |}]
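
(* Illustrative sketch only, not part of the sedlex test suite. The expected
   output above pairs each code point position with a byte position. For
   UTF-16 input the two appear to be related as follows: a code point below
   U+10000 takes 2 bytes, and anything above (such as U+1F6F3) takes 4 bytes
   as a surrogate pair. The helper names [utf16_width] and
   [utf16_byte_offset] are hypothetical and do not exist in Sedlexing. *)

(* Bytes occupied by one code point when encoded as UTF-16. *)
let utf16_width (u : Uchar.t) : int =
  if Uchar.to_int u < 0x10000 then 2 else 4

(* Byte offset reached after the first [n] code points of [cps]. *)
let utf16_byte_offset (cps : Uchar.t list) (n : int) : int =
  let rec go acc i = function
    | u :: rest when i < n -> go (acc + utf16_width u) (i + 1) rest
    | _ -> acc
  in
  go 0 0 cps

(* Usage: the code points of "12asd12\u{1F6F3}". Seven 2-byte units give
   byte offset 14; the final 4-byte surrogate pair brings it to 18, which
   agrees with the [cnum] values recorded in the expectations above. *)
let () =
  let cps =
    List.map Uchar.of_int
      [ 0x31; 0x32; 0x61; 0x73; 0x64; 0x31; 0x32; 0x1F6F3 ]
  in
  assert (utf16_byte_offset cps 7 = 14);
  assert (utf16_byte_offset cps 8 = 18)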