v1.0.4 2025-03-10 La Forclaz (VS)
---------------------------------

- Implement `Uutf.Buffer.*` with `Stdlib.Buffer.*`, no need to bloat
  these executables with dozens of different UTF encoders.
- Require OCaml >= 4.08.
- `utftrip`, handle `cmdliner` deprecations.

v1.0.3 2022-02-03 La Forclaz (VS)
---------------------------------

- Support for OCaml 5.00, thanks to Kate (@kit-ty-kate) for
  the patch.

v1.0.2 2019-02-05 La Forclaz (VS)
---------------------------------

- Fix the substring folding functionality introduced in v1.0.0.
  It never worked correctly.

v1.0.1 2017-03-07 La Forclaz (VS)
---------------------------------

- OCaml 4.05.0 compatibility (removal of `Uchar.dump`).

v1.0.0 2016-11-23 Zagreb
------------------------

- `Uutf.String.fold_utf_{8,16be,16le}`, allow substring folding via
  optional arguments. Thanks to Raphaël Proust for the idea and the
  patch.
- OCaml standard library `Uchar.t` support.
  - Removes and substitutes `type Uutf.uchar = int` by the (abstract)
    `Uchar.t` type. `Uchar.{of,to}_int` allows to recover the previous
    representation.
  - Removes `Uutf.{is_uchar,cp_to_string,pp_cp}`. `Uchar.{is_valid,dump}`
    can be used instead.
- Safe string support. Manual sources and destinations now work on bytes
  rather than strings.
- Build depend on topkg.
- Relicense from BSD3 to ISC.

v0.9.4 2015-01-23 La Forclaz (VS)
---------------------------------

- Add `Uutf.decoder_byte_count` returning the bytes decoded so far.
- The `utftrip` cli utility now uses `Cmdliner` which becomes an
  optional dependency of the package. The cli interface is not
  compatible with previous versions.

v0.9.3 2013-08-10 Cambridge (UK)
--------------------------------

- Fix wrong decoding sequence when an UTF-8 encoding guess is based on
  a two byte UTF-8 sequence. Thanks to Edwin Török for the report.
- OPAM friendly workflow and drop OASIS support.

v0.9.2 2013-01-04 La Forclaz (VS)
---------------------------------

- utftrip, better tool help.
- Fix `Uutf.is_uchar` always returning false. Thanks to Edwin Török
  for reporting and providing the fix and test.

v0.9.1 2012-08-05 Lausanne
--------------------------

- OASIS 0.3.0 support.

v0.9.0 2012-05-05 La Forclaz (VS)
---------------------------------

First release.
+6
vendor/opam/uutf/DEVEL.md
This project uses (perhaps the development version of) [`b0`] for
development. Consult [b0 occasionally] for quick hints on how to
perform common development tasks.

[`b0`]: https://erratique.ch/software/b0
[b0 occasionally]: https://erratique.ch/software/b0/doc/occasionally.html
+13
vendor/opam/uutf/LICENSE.md
Copyright (c) 2016 The uutf programmers

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+46
vendor/opam/uutf/README.md
Uutf — Non-blocking streaming Unicode codec for OCaml
=====================================================

**Warning.** You are encouraged not to use this library.

- As of OCaml 4.14, both UTF encoding and decoding are available
  in the standard library, see the `String` and `Buffer` modules.
- If you are looking for a stream abstraction compatible with
  effect based concurrency look into [`bytesrw`] package.
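
The standard library route mentioned in the first bullet can be sketched as follows. This is a minimal illustration, assuming OCaml 4.14 or later; the helper names `uchars_of_utf_8` and `utf_8_of_uchars` are hypothetical and not part of any library.

```ocaml
(* Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)

(* Decode a UTF-8 string into its list of Unicode characters. Invalid
   byte sequences decode to U+FFFD (Uchar.rep), which is what
   Uchar.utf_decode_uchar returns for an invalid decode. *)
let uchars_of_utf_8 (s : string) : Uchar.t list =
  let rec loop i acc =
    if i >= String.length s then List.rev acc else
    let d = String.get_utf_8_uchar s i in
    loop (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  loop 0 []

(* Encode a list of Unicode characters back to a UTF-8 string. *)
let utf_8_of_uchars (us : Uchar.t list) : string =
  let b = Buffer.create 16 in
  List.iter (Buffer.add_utf_8_uchar b) us;
  Buffer.contents b
```

For valid input the two functions round-trip; for invalid input the result contains U+FFFD in place of the offending bytes.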

---

Uutf is a non-blocking streaming codec to decode and encode the UTF-8,
UTF-16, UTF-16LE and UTF-16BE encoding schemes. It can efficiently
work character by character without blocking on IO. Decoders perform
character position tracking and support newline normalization.
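
To make the position tracking concrete, here is a small standalone sketch of the line and column rules the vendored decoder applies when no newline normalization is requested. It is not Uutf's API: the `pos` type and the `advance` and `position_of_ascii` functions are hypothetical names, and `String.fold_left` assumes OCaml 4.13 or later.

```ocaml
(* Lines start at 1 and columns at 0. CR, LF, CRLF, NEL, FF, LS and PS
   each count as a single line break; any other character advances the
   column by one. *)
type pos = { line : int; col : int; last_cr : bool }

let start = { line = 1; col = 0; last_cr = false }

let advance (p : pos) (u : Uchar.t) : pos =
  match Uchar.to_int u with
  | 0x000A (* LF *) ->
      (* In a CRLF pair the line break was already counted at the CR. *)
      if p.last_cr then { p with last_cr = false }
      else { line = p.line + 1; col = 0; last_cr = false }
  | 0x000D (* CR *) -> { line = p.line + 1; col = 0; last_cr = true }
  | 0x0085 | 0x000C | 0x2028 | 0x2029 (* NEL | FF | LS | PS *) ->
      { line = p.line + 1; col = 0; last_cr = false }
  | _ -> { p with col = p.col + 1; last_cr = false }

(* Convenience for ASCII input: fold [advance] over each byte. *)
let position_of_ascii (s : string) : pos =
  String.fold_left (fun p c -> advance p (Uchar.of_char c)) start s
```

The `last_cr` flag is what keeps a CRLF pair from being counted as two line breaks.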

Functions are also provided to fold over the characters of UTF encoded
OCaml string values and to directly encode characters in OCaml
Buffer.t values.
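
The idea of folding over the characters of an encoded string can be illustrated with the standard library alone. The sketch below is an approximation written for this purpose, not Uutf's own function: the name `fold_utf_8`, its callback shape and the `sanitize` helper are assumptions, and it needs OCaml 4.14 or later.

```ocaml
(* Fold [f] over the characters of the UTF-8 string [s]. [f] receives the
   accumulator, the byte index of the decode, and either a decoded
   character or the malformed bytes found at that index. *)
let fold_utf_8 f acc s =
  let rec loop acc i =
    if i >= String.length s then acc else
    let d = String.get_utf_8_uchar s i in
    let n = Uchar.utf_decode_length d in
    let v =
      if Uchar.utf_decode_is_valid d
      then `Uchar (Uchar.utf_decode_uchar d)
      else `Malformed (String.sub s i n)
    in
    loop (f acc i v) (i + n)
  in
  loop acc 0

(* Re-encode [s] into a buffer, replacing malformed bytes by U+FFFD. *)
let sanitize s =
  let b = Buffer.create (String.length s) in
  let add () _ = function
    | `Uchar u -> Buffer.add_utf_8_uchar b u
    | `Malformed _ -> Buffer.add_utf_8_uchar b Uchar.rep
  in
  fold_utf_8 add () s;
  Buffer.contents b
```

Passing the byte index to the callback makes it possible to report where malformed input occurs without a second pass over the string.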

Uutf has no dependency and is distributed under the ISC license.

Home page: <http://erratique.ch/software/uutf>

[`bytesrw`]: https://erratique.ch/software/bytesrw


## Installation

Uutf can be installed with `opam`:

    opam install uutf

If you don't use `opam` consult the [`opam`](opam) file for build
instructions.

## Documentation

The documentation can be consulted [online] or via `odig doc uutf`.

Questions are welcome but better asked on the [OCaml forum] than on the
issue tracker.

[online]: http://erratique.ch/software/uutf/doc/
[OCaml forum]: https://discuss.ocaml.org/

{0 Uutf {%html: <span class="version">%%VERSION%%</span>%}}

{b Warning.} You are encouraged not to use this library.

{ul
{- As of OCaml 4.14, both UTF encoding and decoding are available
   in the standard library, see the {!String} and {!Buffer} modules.}
{- If you are looking for a stream abstraction compatible with
   effect based concurrency look into the
   {{:https://erratique.ch/software/bytesrw}bytesrw} package.}}

Uutf is a non-blocking streaming codec to decode and encode the UTF-8,
UTF-16, UTF-16LE and UTF-16BE encoding schemes. It can efficiently
work character by character without blocking on IO. Decoders perform
character position tracking and support newline normalization.

Functions are also provided to fold over the characters of UTF encoded
OCaml string values and to directly encode characters in OCaml
{!Buffer.t} values.

{1:library_uutf Library [uutf]}

{!modules:
Uutf
}

opam-version: "2.0"
name: "uutf"
synopsis: "Non-blocking streaming Unicode codec for OCaml"
description: """\
**Warning.** You are encouraged not to use this library.

- As of OCaml 4.14, both UTF encoding and decoding are available
  in the standard library, see the `String` and `Buffer` modules.
- If you are looking for a stream abstraction compatible with
  effect based concurrency look into [`bytesrw`] package."""
maintainer: "Daniel Bünzli <daniel.buenzl i@erratique.ch>"
authors: "The uutf programmers"
license: "ISC"
tags: ["unicode" "text" "utf-8" "utf-16" "codec" "org:erratique"]
homepage: "https://erratique.ch/software/uutf"
doc: "https://erratique.ch/software/uutf/doc/"
bug-reports: "https://github.com/dbuenzli/uutf/issues"
depends: [
  "ocaml" {>= "4.08.0"}
  "ocamlfind" {build}
  "ocamlbuild" {build}
  "topkg" {build & >= "1.1.0"}
]
depopts: ["cmdliner"]
conflicts: [
  "cmdliner" {< "1.3.0"}
]
build: [
  "ocaml"
  "pkg/pkg.ml"
  "build"
  "--dev-pkg"
  "%{dev}%"
  "--with-cmdliner"
  "%{cmdliner:installed}%"
]
dev-repo: "git+https://erratique.ch/repos/uutf.git"
x-maintenance-intent: ["(latest)"]

(*---------------------------------------------------------------------------
   Copyright (c) 2012 The uutf programmers. All rights reserved.
   SPDX-License-Identifier: ISC
  ---------------------------------------------------------------------------*)

let io_buffer_size = 65536 (* IO_BUFFER_SIZE 4.0.0 *)

let pp = Format.fprintf
let invalid_encode () = invalid_arg "expected `Await encode"
let invalid_bounds j l =
  invalid_arg (Printf.sprintf "invalid bounds (index %d, length %d)" j l)

(* Unsafe string byte manipulations. If you don't believe the author's
   invariants, replacing with safe versions makes everything safe in
   the module. He won't be upset. *)

let unsafe_chr = Char.unsafe_chr
let unsafe_blit = Bytes.unsafe_blit
let unsafe_array_get = Array.unsafe_get
let unsafe_byte s j = Char.code (Bytes.unsafe_get s j)
let unsafe_set_byte s j byte = Bytes.unsafe_set s j (Char.unsafe_chr byte)

(* Unicode characters *)

let u_bom = Uchar.unsafe_of_int 0xFEFF (* BOM. *)
let u_rep = Uchar.unsafe_of_int 0xFFFD (* replacement character. *)

(* Unicode encoding schemes *)

type encoding = [ `UTF_8 | `UTF_16 | `UTF_16BE | `UTF_16LE ]
type decoder_encoding = [ encoding | `US_ASCII | `ISO_8859_1 ]

let encoding_of_string s = match String.uppercase_ascii s with (* IANA names. *)
| "UTF-8" -> Some `UTF_8
| "UTF-16" -> Some `UTF_16
| "UTF-16LE" -> Some `UTF_16LE
| "UTF-16BE" -> Some `UTF_16BE
| "ANSI_X3.4-1968" | "ISO-IR-6" | "ANSI_X3.4-1986" | "ISO_646.IRV:1991"
| "ASCII" | "ISO646-US" | "US-ASCII" | "US" | "IBM367" | "CP367" | "CSASCII" ->
    Some `US_ASCII
| "ISO_8859-1:1987" | "ISO-IR-100" | "ISO_8859-1" | "ISO-8859-1"
| "LATIN1" | "L1" | "IBM819" | "CP819" | "CSISOLATIN1" ->
    Some `ISO_8859_1
| _ -> None

let encoding_to_string = function
| `UTF_8 -> "UTF-8" | `UTF_16 -> "UTF-16" | `UTF_16BE -> "UTF-16BE"
| `UTF_16LE -> "UTF-16LE" | `US_ASCII -> "US-ASCII"
| `ISO_8859_1 -> "ISO-8859-1"

(* Base character decoders. They assume enough data. *)

let malformed s j l = `Malformed (Bytes.sub_string s j l)
let malformed_pair be hi s j l = (* missing or half low surrogate at eoi. *)
  let bs1 = Bytes.(sub s j l) in
  let bs0 = Bytes.create 2 in
  let j0, j1 = if be then (0, 1) else (1, 0) in
  unsafe_set_byte bs0 j0 (hi lsr 8);
  unsafe_set_byte bs0 j1 (hi land 0xFF);
  `Malformed Bytes.(unsafe_to_string (cat bs0 bs1))

let r_us_ascii s j =
  (* assert (0 <= j && j < String.length s); *)
  let b0 = unsafe_byte s j in
  if b0 <= 127 then `Uchar (Uchar.unsafe_of_int b0) else malformed s j 1

let r_iso_8859_1 s j =
  (* assert (0 <= j && j < String.length s); *)
  `Uchar (Uchar.unsafe_of_int @@ unsafe_byte s j)

let utf_8_len = [| (* uchar byte length according to first UTF-8 byte. *)
  1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1;
  1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1;
  1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1;
  1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1;
  1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1;
  1; 1; 1; 1; 1; 1; 1; 1; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0;
  0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0;
  0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0;
  0; 0; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2;
  2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3;
  4; 4; 4; 4; 4; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0 |]

let r_utf_8 s j l =
  (* assert (0 <= j && 0 <= l && j + l <= String.length s); *)
  let uchar c = `Uchar (Uchar.unsafe_of_int c) in
  match l with
  | 1 -> uchar (unsafe_byte s j)
  | 2 ->
      let b0 = unsafe_byte s j in let b1 = unsafe_byte s (j + 1) in
      if b1 lsr 6 != 0b10 then malformed s j l else
      uchar (((b0 land 0x1F) lsl 6) lor (b1 land 0x3F))
  | 3 ->
      let b0 = unsafe_byte s j in let b1 = unsafe_byte s (j + 1) in
      let b2 = unsafe_byte s (j + 2) in
      let c = ((b0 land 0x0F) lsl 12) lor
              ((b1 land 0x3F) lsl 6) lor
              (b2 land 0x3F)
      in
      if b2 lsr 6 != 0b10 then malformed s j l else
      begin match b0 with
      | 0xE0 -> if b1 < 0xA0 || 0xBF < b1 then malformed s j l else uchar c
      | 0xED -> if b1 < 0x80 || 0x9F < b1 then malformed s j l else uchar c
      | _ -> if b1 lsr 6 != 0b10 then malformed s j l else uchar c
      end
  | 4 ->
      let b0 = unsafe_byte s j in let b1 = unsafe_byte s (j + 1) in
      let b2 = unsafe_byte s (j + 2) in let b3 = unsafe_byte s (j + 3) in
      let c = (((b0 land 0x07) lsl 18) lor
               ((b1 land 0x3F) lsl 12) lor
               ((b2 land 0x3F) lsl 6) lor
               (b3 land 0x3F))
      in
      if b3 lsr 6 != 0b10 || b2 lsr 6 != 0b10 then malformed s j l else
      begin match b0 with
      | 0xF0 -> if b1 < 0x90 || 0xBF < b1 then malformed s j l else uchar c
      | 0xF4 -> if b1 < 0x80 || 0x8F < b1 then malformed s j l else uchar c
      | _ -> if b1 lsr 6 != 0b10 then malformed s j l else uchar c
      end
  | _ -> assert false

let r_utf_16 s j0 j1 = (* May return a high surrogate. *)
  (* assert (0 <= j0 && 0 <= j1 && max j0 j1 < String.length s); *)
  let b0 = unsafe_byte s j0 in let b1 = unsafe_byte s j1 in
  let u = (b0 lsl 8) lor b1 in
  if u < 0xD800 || u > 0xDFFF then `Uchar (Uchar.unsafe_of_int u) else
  if u > 0xDBFF then malformed s (min j0 j1) 2 else `Hi u

let r_utf_16_lo hi s j0 j1 = (* Combines [hi] with a low surrogate. *)
  (* assert (0 <= j0 && 0 <= j1 && max j0 j1 < String.length s); *)
  let b0 = unsafe_byte s j0 in
  let b1 = unsafe_byte s j1 in
  let lo = (b0 lsl 8) lor b1 in
  if lo < 0xDC00 || lo > 0xDFFF
  then malformed_pair (j0 < j1 (* true => be *)) hi s (min j0 j1) 2
  else `Uchar (Uchar.unsafe_of_int ((((hi land 0x3FF) lsl 10) lor
                                     (lo land 0x3FF)) + 0x10000))

let r_encoding s j l = (* guess encoding with max. 3 bytes. *)
  (* assert (0 <= j && 0 <= l && j + l <= String.length s) *)
  let some i = if i < l then Some (unsafe_byte s (j + i)) else None in
  match (some 0), (some 1), (some 2) with
  | Some 0xEF, Some 0xBB, Some 0xBF -> `UTF_8 `BOM
  | Some 0xFE, Some 0xFF, _ -> `UTF_16BE `BOM
  | Some 0xFF, Some 0xFE, _ -> `UTF_16LE `BOM
  | Some 0x00, Some p, _ when p > 0 -> `UTF_16BE (`ASCII p)
  | Some p, Some 0x00, _ when p > 0 -> `UTF_16LE (`ASCII p)
  | Some u, _, _ when utf_8_len.(u) <> 0 -> `UTF_8 `Decode
  | Some _, Some _, _ -> `UTF_16BE `Decode
  | Some _, None , None -> `UTF_8 `Decode
  | None , None , None -> `UTF_8 `End
  | None , Some _, _ -> assert false
  | Some _, None , Some _ -> assert false
  | None , None , Some _ -> assert false

(* Decode *)

type src = [ `Channel of in_channel | `String of string | `Manual ]
type nln = [ `ASCII of Uchar.t | `NLF of Uchar.t | `Readline of Uchar.t ]
type decode = [ `Await | `End | `Malformed of string | `Uchar of Uchar.t]

let pp_decode ppf = function
| `Uchar u -> pp ppf "@[`Uchar U+%04X@]" (Uchar.to_int u)
| `End -> pp ppf "`End"
| `Await -> pp ppf "`Await"
| `Malformed bs ->
    let l = String.length bs in
    pp ppf "@[`Malformed (";
    if l > 0 then pp ppf "%02X" (Char.code (bs.[0]));
    for i = 1 to l - 1 do pp ppf " %02X" (Char.code (bs.[i])) done;
    pp ppf ")@]"

type decoder =
  { src : src;                                              (* input source. *)
    mutable encoding : decoder_encoding;                (* decoded encoding. *)
    nln : nln option;                     (* newline normalization (if any). *)
    nl : Uchar.t;                        (* newline normalization character. *)
    mutable i : Bytes.t;                              (* current input chunk. *)
    mutable i_pos : int;                          (* input current position. *)
    mutable i_max : int;                          (* input maximal position. *)
    t : Bytes.t;      (* four bytes temporary buffer for overlapping reads. *)
    mutable t_len : int;                      (* current byte length of [t]. *)
    mutable t_need : int;                (* number of bytes needed in [t]. *)
    mutable removed_bom : bool;     (* [true] if an initial BOM was removed. *)
    mutable last_cr : bool;                  (* [true] if last char was CR. *)
    mutable line : int;                                      (* line number. *)
    mutable col : int;                                     (* column number. *)
    mutable byte_count : int;                                 (* byte count. *)
    mutable count : int;                                      (* char count. *)
    mutable pp :        (* decoder post-processor for BOM, position and nln. *)
      decoder -> [ `Malformed of string | `Uchar of Uchar.t ] -> decode;
    mutable k : decoder -> decode }                (* decoder continuation. *)

(* On decodes that overlap two (or more) [d.i] buffers, we use [t_fill] to copy
   the input data to [d.t] and decode from there. If the [d.i] buffers are not
   too small this is faster than continuation based byte per byte writes.

   End of input (eoi) is signalled by [d.i_pos = 0] and [d.i_max = min_int]
   which implies that [i_rem d < 0] is [true]. *)

let i_rem d = d.i_max - d.i_pos + 1 (* remaining bytes to read in [d.i]. *)
let eoi d =
  d.i <- Bytes.empty; d.i_pos <- 0; d.i_max <- min_int (* set eoi in [d]. *)

let src d s j l = (* set [d.i] with [s]. *)
  if (j < 0 || l < 0 || j + l > Bytes.length s) then invalid_bounds j l else
  if (l = 0) then eoi d else
  (d.i <- s; d.i_pos <- j; d.i_max <- j + l - 1)

let refill k d = match d.src with (* get new input in [d.i] and [k]ontinue. *)
| `Manual -> d.k <- k; `Await
| `String _ -> eoi d; k d
| `Channel ic ->
    let rc = input ic d.i 0 (Bytes.length d.i) in
    (src d d.i 0 rc; k d)

let t_need d need = d.t_len <- 0; d.t_need <- need
let rec t_fill k d = (* get [d.t_need] bytes (or less if eoi) in [d.t]. *)
  let blit d l =
    unsafe_blit d.i d.i_pos d.t d.t_len (* write pos. *) l;
    d.i_pos <- d.i_pos + l; d.t_len <- d.t_len + l;
  in
  let rem = i_rem d in
  if rem < 0 (* eoi *) then k d else
  let need = d.t_need - d.t_len in
  if rem < need then (blit d rem; refill (t_fill k) d) else (blit d need; k d)

let ret k v byte_count d = (* return post-processed [v]. *)
  d.k <- k; d.byte_count <- d.byte_count + byte_count; d.pp d v

(* Decoders. *)

let rec decode_us_ascii d =
  let rem = i_rem d in
  if rem <= 0 then (if rem < 0 then `End else refill decode_us_ascii d) else
  let j = d.i_pos in
  d.i_pos <- d.i_pos + 1; ret decode_us_ascii (r_us_ascii d.i j) 1 d

let rec decode_iso_8859_1 d =
  let rem = i_rem d in
  if rem <= 0 then (if rem < 0 then `End else refill decode_iso_8859_1 d) else
  let j = d.i_pos in
  d.i_pos <- d.i_pos + 1; ret decode_iso_8859_1 (r_iso_8859_1 d.i j) 1 d

(* UTF-8 decoder *)

let rec t_decode_utf_8 d = (* decode from [d.t]. *)
  if d.t_len < d.t_need
  then ret decode_utf_8 (malformed d.t 0 d.t_len) d.t_len d
  else ret decode_utf_8 (r_utf_8 d.t 0 d.t_len) d.t_len d

and decode_utf_8 d =
  let rem = i_rem d in
  if rem <= 0 then (if rem < 0 then `End else refill decode_utf_8 d) else
  let need = unsafe_array_get utf_8_len (unsafe_byte d.i d.i_pos) in
  if rem < need then (t_need d need; t_fill t_decode_utf_8 d) else
  let j = d.i_pos in
  if need = 0
  then (d.i_pos <- d.i_pos + 1; ret decode_utf_8 (malformed d.i j 1) 1 d)
  else (d.i_pos <- d.i_pos + need; ret decode_utf_8 (r_utf_8 d.i j need) need d)

(* UTF-16BE decoder *)

let rec t_decode_utf_16be_lo hi d = (* decode from [d.t]. *)
  let bcount = d.t_len + 2 (* hi count *) in
  if d.t_len < d.t_need
  then ret decode_utf_16be (malformed_pair true hi d.t 0 d.t_len) bcount d
  else ret decode_utf_16be (r_utf_16_lo hi d.t 0 1) bcount d

and t_decode_utf_16be d = (* decode from [d.t]. *)
  if d.t_len < d.t_need
  then ret decode_utf_16be (malformed d.t 0 d.t_len) d.t_len d
  else decode_utf_16be_lo (r_utf_16 d.t 0 1) d

and decode_utf_16be_lo v d = match v with
| `Uchar _ | `Malformed _ as v -> ret decode_utf_16be v 2 d
| `Hi hi ->
    let rem = i_rem d in
    if rem < 2 then (t_need d 2; t_fill (t_decode_utf_16be_lo hi) d) else
    let j = d.i_pos in
    d.i_pos <- d.i_pos + 2;
    ret decode_utf_16be (r_utf_16_lo hi d.i j (j + 1)) 4 d

and decode_utf_16be d =
  let rem = i_rem d in
  if rem <= 0 then (if rem < 0 then `End else refill decode_utf_16be d) else
  if rem < 2 then (t_need d 2; t_fill t_decode_utf_16be d) else
  let j = d.i_pos in
  d.i_pos <- d.i_pos + 2; decode_utf_16be_lo (r_utf_16 d.i j (j + 1)) d

(* UTF-16LE decoder, same as UTF-16BE with byte swapped. *)

let rec t_decode_utf_16le_lo hi d = (* decode from [d.t]. *)
  let bcount = d.t_len + 2 (* hi count *) in
  if d.t_len < d.t_need
  then ret decode_utf_16le (malformed_pair false hi d.t 0 d.t_len) bcount d
  else ret decode_utf_16le (r_utf_16_lo hi d.t 1 0) bcount d

and t_decode_utf_16le d = (* decode from [d.t]. *)
  if d.t_len < d.t_need
  then ret decode_utf_16le (malformed d.t 0 d.t_len) d.t_len d
  else decode_utf_16le_lo (r_utf_16 d.t 1 0) d

and decode_utf_16le_lo v d = match v with
| `Uchar _ | `Malformed _ as v -> ret decode_utf_16le v 2 d
| `Hi hi ->
    let rem = i_rem d in
    if rem < 2 then (t_need d 2; t_fill (t_decode_utf_16le_lo hi) d) else
    let j = d.i_pos in
    d.i_pos <- d.i_pos + 2;
    ret decode_utf_16le (r_utf_16_lo hi d.i (j + 1) j) 4 d

and decode_utf_16le d =
  let rem = i_rem d in
  if rem <= 0 then (if rem < 0 then `End else refill decode_utf_16le d) else
  if rem < 2 then (t_need d 2; t_fill t_decode_utf_16le d) else
  let j = d.i_pos in
  d.i_pos <- d.i_pos + 2; decode_utf_16le_lo (r_utf_16 d.i (j + 1) j) d

(* Encoding guessing. The guess is simple but starting the decoder
   after is tedious, uutf's decoders are not designed to put bytes
   back in the stream. *)

let guessed_utf_8 d = (* start decoder after `UTF_8 guess. *)
  let b3 d = (* handles the third read byte. *)
    let b3 = unsafe_byte d.t 2 in
    match utf_8_len.(b3) with
    | 0 -> ret decode_utf_8 (malformed d.t 2 1) 1 d
    | n ->
        d.t_need <- n; d.t_len <- 1; unsafe_set_byte d.t 0 b3;
        t_fill t_decode_utf_8 d
  in
  let b2 d = (* handle second read byte. *)
    let b2 = unsafe_byte d.t 1 in
    let b3 = if d.t_len > 2 then b3 else decode_utf_8 (* decodes `End *) in
    match utf_8_len.(b2) with
    | 0 -> ret b3 (malformed d.t 1 1) 1 d
    | 1 -> ret b3 (r_utf_8 d.t 1 1) 1 d
    | n -> (* copy d.t.(1-2) to d.t.(0-1) and decode *)
        d.t_need <- n;
        unsafe_set_byte d.t 0 b2;
        if (d.t_len < 3) then d.t_len <- 1 else
        (d.t_len <- 2; unsafe_set_byte d.t 1 (unsafe_byte d.t 2); );
        t_fill t_decode_utf_8 d
  in
  let b1 = unsafe_byte d.t 0 in (* handle first read byte. *)
  let b2 = if d.t_len > 1 then b2 else decode_utf_8 (* decodes `End *) in
  match utf_8_len.(b1) with
  | 0 -> ret b2 (malformed d.t 0 1) 1 d
  | 1 -> ret b2 (r_utf_8 d.t 0 1) 1 d
  | 2 ->
      if d.t_len < 2 then ret decode_utf_8 (malformed d.t 0 1) 1 d else
      if d.t_len < 3 then ret decode_utf_8 (r_utf_8 d.t 0 2) 2 d else
      ret b3 (r_utf_8 d.t 0 2) 2 d
  | 3 ->
      if d.t_len < 3
      then ret decode_utf_8 (malformed d.t 0 d.t_len) d.t_len d
      else ret decode_utf_8 (r_utf_8 d.t 0 3) 3 d
  | 4 ->
      if d.t_len < 3
      then ret decode_utf_8 (malformed d.t 0 d.t_len) d.t_len d
      else (d.t_need <- 4; t_fill t_decode_utf_8 d)
  | n -> assert false

let guessed_utf_16 d be v = (* start decoder after `UTF_16{BE,LE} guess. *)
  let decode_utf_16, t_decode_utf_16, t_decode_utf_16_lo, j0, j1 =
    if be then decode_utf_16be, t_decode_utf_16be, t_decode_utf_16be_lo, 0, 1
    else decode_utf_16le, t_decode_utf_16le, t_decode_utf_16le_lo, 1, 0
  in
  let b3 k d =
    if d.t_len < 3 then decode_utf_16 d (* decodes `End *) else
    begin (* copy d.t.(2) to d.t.(0) and decode. *)
      d.t_need <- 2; d.t_len <- 1;
      unsafe_set_byte d.t 0 (unsafe_byte d.t 2);
      t_fill k d
    end
  in
  match v with
  | `BOM -> ret (b3 t_decode_utf_16) (`Uchar u_bom) 2 d
  | `ASCII u -> ret (b3 t_decode_utf_16) (`Uchar (Uchar.unsafe_of_int u)) 2 d
  | `Decode ->
      match r_utf_16 d.t j0 j1 with
      | `Malformed _ | `Uchar _ as v -> ret (b3 t_decode_utf_16) v 2 d
      | `Hi hi ->
          if d.t_len < 3
          then ret decode_utf_16 (malformed_pair be hi Bytes.empty 0 0) d.t_len d
          else (b3 (t_decode_utf_16_lo hi)) d

let guess_encoding d = (* guess encoding and start decoder. *)
  let setup d = match r_encoding d.t 0 d.t_len with
  | `UTF_8 r ->
      d.encoding <- `UTF_8; d.k <- decode_utf_8;
      begin match r with
      | `BOM -> ret decode_utf_8 (`Uchar u_bom) 3 d
      | `Decode -> guessed_utf_8 d
      | `End -> `End
      end
  | `UTF_16BE r ->
      d.encoding <- `UTF_16BE; d.k <- decode_utf_16be; guessed_utf_16 d true r
  | `UTF_16LE r ->
      d.encoding <- `UTF_16LE; d.k <- decode_utf_16le; guessed_utf_16 d false r

  in
  (t_need d 3; t_fill setup d)

(* Character post-processors. Used for BOM handling, newline
   normalization and position tracking. The [pp_remove_bom] is only
   used for the first character to remove a possible initial BOM and
   handle UTF-16 endianness recognition. *)

let nline d = d.col <- 0; d.line <- d.line + 1 (* inlined. *)
let ncol d = d.col <- d.col + 1 (* inlined. *)
let ncount d = d.count <- d.count + 1 (* inlined. *)
let cr d b = d.last_cr <- b (* inlined. *)

let pp_remove_bom utf16 pp d = function (* removes init. BOM, handles UTF-16. *)
| `Malformed _ as v -> d.removed_bom <- false; d.pp <- pp; d.pp d v
| `Uchar u as v ->
    match Uchar.to_int u with
    | 0xFEFF (* BOM *) ->
        if utf16 then (d.encoding <- `UTF_16BE; d.k <- decode_utf_16be);
        d.removed_bom <- true; d.pp <- pp; d.k d
    | 0xFFFE (* BOM reversed from decode_utf_16be *) when utf16 ->
        d.encoding <- `UTF_16LE; d.k <- decode_utf_16le;
        d.removed_bom <- true; d.pp <- pp; d.k d
    | _ ->
        d.removed_bom <- false; d.pp <- pp; d.pp d v

let pp_nln_none d = function
| `Malformed _ as v -> cr d false; ncount d; ncol d; v
| `Uchar u as v ->
    match Uchar.to_int u with
    | 0x000A (* LF *) ->
        let last_cr = d.last_cr in
        cr d false; ncount d; if last_cr then v else (nline d; v)
    | 0x000D (* CR *) -> cr d true; ncount d; nline d; v
    | (0x0085 | 0x000C | 0x2028 | 0x2029) (* NEL | FF | LS | PS *) ->
        cr d false; ncount d; nline d; v
    | _ ->
        cr d false; ncount d; ncol d; v

let pp_nln_readline d = function
| `Malformed _ as v -> cr d false; ncount d; ncol d; v
| `Uchar u as v ->
    match Uchar.to_int u with
    | 0x000A (* LF *) ->
        let last_cr = d.last_cr in
        cr d false; if last_cr then d.k d else (ncount d; nline d; `Uchar d.nl)
    | 0x000D (* CR *) -> cr d true; ncount d; nline d; `Uchar d.nl
    | (0x0085 | 0x000C | 0x2028 | 0x2029) (* NEL | FF | LS | PS *) ->
        cr d false; ncount d; nline d; `Uchar d.nl
    | _ ->
        cr d false; ncount d; ncol d; v

let pp_nln_nlf d = function
| `Malformed _ as v -> cr d false; ncount d; ncol d; v
| `Uchar u as v ->
    match Uchar.to_int u with
    | 0x000A (* LF *) ->
        let last_cr = d.last_cr in
        cr d false; if last_cr then d.k d else (ncount d; nline d; `Uchar d.nl)
    | 0x000D (* CR *) -> cr d true; ncount d; nline d; `Uchar d.nl
    | 0x0085 (* NEL *) -> cr d false; ncount d; nline d; `Uchar d.nl
    | (0x000C | 0x2028 | 0x2029) (* FF | LS | PS *) ->
        cr d false; ncount d; nline d; v
    | _ ->
        cr d false; ncount d; ncol d; v

let pp_nln_ascii d = function
| `Malformed _ as v -> cr d false; ncount d; ncol d; v
| `Uchar u as v ->
    match Uchar.to_int u with
    | 0x000A (* LF *) ->
        let last_cr = d.last_cr in
        cr d false; if last_cr then d.k d else (ncount d; nline d; `Uchar d.nl)
    | 0x000D (* CR *) -> cr d true; ncount d; nline d; `Uchar d.nl
    | (0x0085 | 0x000C | 0x2028 | 0x2029) (* NEL | FF | LS | PS *) ->
        cr d false; ncount d; nline d; v
    | _ ->
        cr d false; ncount d; ncol d; v

let decode_fun = function
| `UTF_8 -> decode_utf_8
| `UTF_16 -> decode_utf_16be (* see [pp_remove_bom]. *)
| `UTF_16BE -> decode_utf_16be
| `UTF_16LE -> decode_utf_16le
| `US_ASCII -> decode_us_ascii
| `ISO_8859_1 -> decode_iso_8859_1

let decoder ?nln ?encoding src =
  let pp, nl = match nln with
  | None -> pp_nln_none, Uchar.unsafe_of_int 0x000A (* not used. *)
  | Some (`ASCII nl) -> pp_nln_ascii, nl
  | Some (`NLF nl) -> pp_nln_nlf, nl
  | Some (`Readline nl) -> pp_nln_readline, nl
  in
  let encoding, k = match encoding with
  | None -> `UTF_8, guess_encoding
  | Some e -> (e :> decoder_encoding), decode_fun e
  in
  let i, i_pos, i_max = match src with
  | `Manual -> Bytes.empty, 1, 0 (* implies i_rem d = 0. *)
  | `Channel _ -> Bytes.create io_buffer_size, 1, 0 (* idem. *)
  | `String s -> Bytes.unsafe_of_string s, 0, String.length s - 1
  in
  { src = (src :> src); encoding; nln = (nln :> nln option); nl;
    i; i_pos; i_max; t = Bytes.create 4; t_len = 0; t_need = 0;
    removed_bom = false; last_cr = false; line = 1; col = 0;
    byte_count = 0; count = 0;
    pp = pp_remove_bom (encoding = `UTF_16) pp; k }

let decode d = d.k d
let decoder_line d = d.line
let decoder_col d = d.col
let decoder_byte_count d = d.byte_count
let decoder_count d = d.count
let decoder_removed_bom d = d.removed_bom
let decoder_src d = d.src
let decoder_nln d = d.nln
let decoder_encoding d = d.encoding
let set_decoder_encoding d e =
  d.encoding <- (e :> decoder_encoding); d.k <- decode_fun e
523523+524524+(* Encode *)
525525+526526+type dst = [ `Channel of out_channel | `Buffer of Buffer.t | `Manual ]
527527+type encode = [ `Await | `End | `Uchar of Uchar.t ]
528528+type encoder =
529529+ { dst : dst; (* output destination. *)
530530+ encoding : encoding; (* encoded encoding. *)
531531+ mutable o : Bytes.t; (* current output chunk. *)
532532+ mutable o_pos : int; (* next output position to write. *)
533533+ mutable o_max : int; (* maximal output position to write. *)
534534+ t : Bytes.t; (* four bytes buffer for overlapping writes. *)
535535+ mutable t_pos : int; (* next position to read in [t]. *)
536536+ mutable t_max : int; (* maximal position to read in [t]. *)
537537+ mutable k : (* encoder continuation. *)
538538+ encoder -> encode -> [ `Ok | `Partial ] }
539539+540540+(* On encodes that overlap two (or more) [e.o] buffers, we encode the
541541+ character to the temporary buffer [e.t] and continue with
542542+ [t_flush] to write this data on the different [e.o] buffers. If
543543+ the [e.o] buffers are not too small this is faster than
544544+ continuation based byte per byte writes. *)
545545+546546+let o_rem e = e.o_max - e.o_pos + 1 (* remaining bytes to write in [e.o]. *)
547547+let dst e s j l = (* set [e.o] with [s]. *)
548548+ if (j < 0 || l < 0 || j + l > Bytes.length s) then invalid_bounds j l;
549549+ e.o <- s; e.o_pos <- j; e.o_max <- j + l - 1
550550+551551+let partial k e = function `Await -> k e | `Uchar _ | `End -> invalid_encode ()
552552+let flush k e = match e.dst with (* get free storage in [e.o] and [k]ontinue. *)
553553+| `Manual -> e.k <- partial k; `Partial
554554+| `Channel oc -> output oc e.o 0 e.o_pos; e.o_pos <- 0; k e
555555+| `Buffer b ->
556556+ let o = Bytes.unsafe_to_string e.o in
557557+ Buffer.add_substring b o 0 e.o_pos; e.o_pos <- 0; k e
558558+559559+560560+let t_range e max = e.t_pos <- 0; e.t_max <- max
561561+let rec t_flush k e = (* flush [e.t] up to [e.t_max] in [e.o]. *)
562562+ let blit e l =
563563+ unsafe_blit e.t e.t_pos e.o e.o_pos l;
564564+ e.o_pos <- e.o_pos + l; e.t_pos <- e.t_pos + l
565565+ in
566566+ let rem = o_rem e in
567567+ let len = e.t_max - e.t_pos + 1 in
568568+ if rem < len then (blit e rem; flush (t_flush k) e) else (blit e len; k e)
569569+570570+(* Encoders. *)
571571+572572+let rec encode_utf_8 e v =
573573+ let k e = e.k <- encode_utf_8; `Ok in
574574+ match v with
575575+ | `Await -> k e
576576+ | `End -> flush k e
577577+ | `Uchar u as v ->
578578+ let u = Uchar.to_int u in
579579+ let rem = o_rem e in
580580+ if u <= 0x007F then
581581+ if rem < 1 then flush (fun e -> encode_utf_8 e v) e else
582582+ (unsafe_set_byte e.o e.o_pos u; e.o_pos <- e.o_pos + 1; k e)
583583+ else if u <= 0x07FF then
584584+ begin
585585+ let s, j, k =
586586+ if rem < 2 then (t_range e 1; e.t, 0, t_flush k) else
587587+ let j = e.o_pos in (e.o_pos <- e.o_pos + 2; e.o, j, k)
588588+ in
589589+ unsafe_set_byte s j (0xC0 lor (u lsr 6));
590590+ unsafe_set_byte s (j + 1) (0x80 lor (u land 0x3F));
591591+ k e
592592+ end
593593+ else if u <= 0xFFFF then
594594+ begin
595595+ let s, j, k =
596596+ if rem < 3 then (t_range e 2; e.t, 0, t_flush k) else
597597+ let j = e.o_pos in (e.o_pos <- e.o_pos + 3; e.o, j, k)
598598+ in
599599+ unsafe_set_byte s j (0xE0 lor (u lsr 12));
600600+ unsafe_set_byte s (j + 1) (0x80 lor ((u lsr 6) land 0x3F));
601601+ unsafe_set_byte s (j + 2) (0x80 lor (u land 0x3F));
602602+ k e
603603+ end
604604+ else
605605+ begin
606606+ let s, j, k =
607607+ if rem < 4 then (t_range e 3; e.t, 0, t_flush k) else
608608+ let j = e.o_pos in (e.o_pos <- e.o_pos + 4; e.o, j, k)
609609+ in
610610+ unsafe_set_byte s j (0xF0 lor (u lsr 18));
611611+ unsafe_set_byte s (j + 1) (0x80 lor ((u lsr 12) land 0x3F));
612612+ unsafe_set_byte s (j + 2) (0x80 lor ((u lsr 6) land 0x3F));
613613+ unsafe_set_byte s (j + 3) (0x80 lor (u land 0x3F));
614614+ k e
615615+ end
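The four branches of `encode_utf_8` correspond to the one- to four-byte UTF-8 forms. As a minimal sketch, assuming only the OCaml standard library (OCaml >= 4.07), the same thresholds and masks can be written as a pure function. `utf_8_bytes` and `stdlib_utf_8_bytes` are hypothetical helper names for illustration, not part of Uutf.

```ocaml
(* Sketch only: [utf_8_bytes] is a hypothetical helper, not part of Uutf.
   It returns the UTF-8 bytes of a scalar value using the same
   thresholds and masks as the branches of [encode_utf_8]. *)
let utf_8_bytes (u : Uchar.t) : int list =
  let u = Uchar.to_int u in
  if u <= 0x007F then [u]
  else if u <= 0x07FF then
    [0xC0 lor (u lsr 6); 0x80 lor (u land 0x3F)]
  else if u <= 0xFFFF then
    [0xE0 lor (u lsr 12);
     0x80 lor ((u lsr 6) land 0x3F);
     0x80 lor (u land 0x3F)]
  else
    [0xF0 lor (u lsr 18);
     0x80 lor ((u lsr 12) land 0x3F);
     0x80 lor ((u lsr 6) land 0x3F);
     0x80 lor (u land 0x3F)]

(* Reference result obtained from the standard library encoder. *)
let stdlib_utf_8_bytes (u : Uchar.t) : int list =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  List.map Char.code (List.of_seq (String.to_seq (Buffer.contents b)))
```

Comparing the two functions on a few scalar values, one per branch, is a cheap way to check the bit manipulation.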
616616+617617+let rec encode_utf_16be e v =
618618+ let k e = e.k <- encode_utf_16be; `Ok in
619619+ match v with
620620+ | `Await -> k e
621621+ | `End -> flush k e
622622+ | `Uchar u ->
623623+ let u = Uchar.to_int u in
624624+ let rem = o_rem e in
625625+ if u < 0x10000 then
626626+ begin
627627+ let s, j, k =
628628+ if rem < 2 then (t_range e 1; e.t, 0, t_flush k) else
629629+ let j = e.o_pos in (e.o_pos <- e.o_pos + 2; e.o, j, k)
630630+ in
631631+ unsafe_set_byte s j (u lsr 8);
632632+ unsafe_set_byte s (j + 1) (u land 0xFF);
633633+ k e
634634+ end else begin
635635+ let s, j, k =
636636+ if rem < 4 then (t_range e 3; e.t, 0, t_flush k) else
637637+ let j = e.o_pos in (e.o_pos <- e.o_pos + 4; e.o, j, k)
638638+ in
639639+ let u' = u - 0x10000 in
640640+ let hi = (0xD800 lor (u' lsr 10)) in
641641+ let lo = (0xDC00 lor (u' land 0x3FF)) in
642642+ unsafe_set_byte s j (hi lsr 8);
643643+ unsafe_set_byte s (j + 1) (hi land 0xFF);
644644+ unsafe_set_byte s (j + 2) (lo lsr 8);
645645+ unsafe_set_byte s (j + 3) (lo land 0xFF);
646646+ k e
647647+ end
648648+649649+let rec encode_utf_16le e v = (* encode_utf_16be with bytes swapped. *)
650650+ let k e = e.k <- encode_utf_16le; `Ok in
651651+ match v with
652652+ | `Await -> k e
653653+ | `End -> flush k e
654654+ | `Uchar u ->
655655+ let u = Uchar.to_int u in
656656+ let rem = o_rem e in
657657+ if u < 0x10000 then
658658+ begin
659659+ let s, j, k =
660660+ if rem < 2 then (t_range e 1; e.t, 0, t_flush k) else
661661+ let j = e.o_pos in (e.o_pos <- e.o_pos + 2; e.o, j, k)
662662+ in
663663+ unsafe_set_byte s j (u land 0xFF);
664664+ unsafe_set_byte s (j + 1) (u lsr 8);
665665+ k e
666666+ end
667667+ else
668668+ begin
669669+ let s, j, k =
670670+ if rem < 4 then (t_range e 3; e.t, 0, t_flush k) else
671671+ let j = e.o_pos in (e.o_pos <- e.o_pos + 4; e.o, j, k)
672672+ in
673673+ let u' = u - 0x10000 in
674674+ let hi = (0xD800 lor (u' lsr 10)) in
675675+ let lo = (0xDC00 lor (u' land 0x3FF)) in
676676+ unsafe_set_byte s j (hi land 0xFF);
677677+ unsafe_set_byte s (j + 1) (hi lsr 8);
678678+ unsafe_set_byte s (j + 2) (lo land 0xFF);
679679+ unsafe_set_byte s (j + 3) (lo lsr 8);
680680+ k e
681681+ end
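Both UTF-16 encoders compute the same code units and differ only in byte order. A minimal sketch of that split, assuming only the OCaml standard library; `utf_16_units`, `be_bytes` and `le_bytes` are hypothetical helper names, not part of Uutf.

```ocaml
(* Sketch only: these helpers are hypothetical, not part of Uutf.
   [utf_16_units] returns the UTF-16 code units of a scalar value,
   computed as in [encode_utf_16be] and [encode_utf_16le]. *)
let utf_16_units (u : Uchar.t) : int list =
  let u = Uchar.to_int u in
  if u < 0x10000 then [u] else
  let u' = u - 0x10000 in
  (* High surrogate carries the top 10 bits, low surrogate the rest. *)
  [0xD800 lor (u' lsr 10); 0xDC00 lor (u' land 0x3FF)]

(* Serialize code units most significant byte first (UTF-16BE). *)
let be_bytes units =
  List.concat (List.map (fun c -> [c lsr 8; c land 0xFF]) units)

(* Serialize code units least significant byte first (UTF-16LE). *)
let le_bytes units =
  List.concat (List.map (fun c -> [c land 0xFF; c lsr 8]) units)
```

For example, U+1F600 lies above U+FFFF and therefore yields a surrogate pair, while U+20AC is a single code unit.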
682682+683683+let encode_fun = function
684684+| `UTF_8 -> encode_utf_8
685685+| `UTF_16 -> encode_utf_16be
686686+| `UTF_16BE -> encode_utf_16be
687687+| `UTF_16LE -> encode_utf_16le
688688+689689+let encoder encoding dst =
690690+ let o, o_pos, o_max = match dst with
691691+ | `Manual -> Bytes.empty, 1, 0 (* implies o_rem e = 0. *)
692692+ | `Buffer _
693693+ | `Channel _ -> Bytes.create io_buffer_size, 0, io_buffer_size - 1
694694+ in
695695+ { dst = (dst :> dst); encoding = (encoding :> encoding); o; o_pos; o_max;
696696+ t = Bytes.create 4; t_pos = 1; t_max = 0; k = encode_fun encoding}
697697+698698+let encode e v = e.k e (v :> encode)
699699+let encoder_encoding e = e.encoding
700700+let encoder_dst e = e.dst
701701+702702+(* Manual sources and destinations. *)
703703+704704+module Manual = struct
705705+ let src = src
706706+ let dst = dst
707707+ let dst_rem = o_rem
708708+end
709709+710710+(* String folders and Buffer encoders *)
711711+712712+module String = struct
713713+ let encoding_guess s =
714714+ let s = Bytes.unsafe_of_string s in
715715+ match r_encoding s 0 (min (Bytes.length s) 3) with
716716+ | `UTF_8 d -> `UTF_8, (d = `BOM)
717717+ | `UTF_16BE d -> `UTF_16BE, (d = `BOM)
718718+ | `UTF_16LE d -> `UTF_16LE, (d = `BOM)
719719+720720+ type 'a folder =
721721+ 'a -> int -> [ `Uchar of Uchar.t | `Malformed of string ] -> 'a
722722+723723+ let fold_utf_8 ?(pos = 0) ?len f acc s =
724724+ let rec loop acc f s i last =
725725+ if i > last then acc else
726726+ let need = unsafe_array_get utf_8_len (unsafe_byte s i) in
727727+ if need = 0 then loop (f acc i (malformed s i 1)) f s (i + 1) last else
728728+ let rem = last - i + 1 in
729729+ if rem < need then f acc i (malformed s i rem) else
730730+ loop (f acc i (r_utf_8 s i need)) f s (i + need) last
731731+ in
732732+ let len = match len with None -> String.length s - pos | Some l -> l in
733733+ let last = pos + len - 1 in
734734+ loop acc f (Bytes.unsafe_of_string s) pos last
735735+736736+ let fold_utf_16be ?(pos = 0) ?len f acc s =
737737+ let rec loop acc f s i last =
738738+ if i > last then acc else
739739+ let rem = last - i + 1 in
740740+ if rem < 2 then f acc i (malformed s i 1) else
741741+ match r_utf_16 s i (i + 1) with
742742+ | `Uchar _ | `Malformed _ as v -> loop (f acc i v) f s (i + 2) last
743743+ | `Hi hi ->
744744+ if rem < 4 then f acc i (malformed s i rem) else
745745+ loop (f acc i (r_utf_16_lo hi s (i + 2) (i + 3))) f s (i + 4) last
746746+ in
747747+ let len = match len with None -> String.length s - pos | Some l -> l in
748748+ let last = pos + len - 1 in
749749+ loop acc f (Bytes.unsafe_of_string s) pos last
750750+751751+ let fold_utf_16le ?(pos = 0) ?len f acc s =
752752+ (* [fold_utf_16be], bytes swapped. *)
753753+ let rec loop acc f s i last =
754754+ if i > last then acc else
755755+ let rem = last - i + 1 in
756756+ if rem < 2 then f acc i (malformed s i 1) else
757757+ match r_utf_16 s (i + 1) i with
758758+ | `Uchar _ | `Malformed _ as v -> loop (f acc i v) f s (i + 2) last
759759+ | `Hi hi ->
760760+ if rem < 4 then f acc i (malformed s i rem) else
761761+ loop (f acc i (r_utf_16_lo hi s (i + 3) (i + 2))) f s (i + 4) last
762762+ in
763763+ let len = match len with None -> String.length s - pos | Some l -> l in
764764+ let last = pos + len - 1 in
765765+ loop acc f (Bytes.unsafe_of_string s) pos last
766766+end
767767+768768+module Buffer = struct
769769+ let add_utf_8 = Buffer.add_utf_8_uchar
770770+ let add_utf_16be = Buffer.add_utf_16be_uchar
771771+ let add_utf_16le = Buffer.add_utf_16le_uchar
772772+end
+494
vendor/opam/uutf/src/uutf.mli
···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2012 The uutf programmers. All rights reserved.
33+ SPDX-License-Identifier: ISC
44+ ---------------------------------------------------------------------------*)
55+66+(** Non-blocking streaming Unicode codec.
77+88+ [Uutf] is a non-blocking streaming codec to {{:#decode}decode} and
99+ {{:#encode}encode} the {{:http://www.ietf.org/rfc/rfc3629.txt}
1010+ UTF-8}, {{:http://www.ietf.org/rfc/rfc2781.txt} UTF-16}, UTF-16LE
1111+ and UTF-16BE encoding schemes. It can efficiently work character by
1212+ character without blocking on IO. Decoders perform
1313+ character position tracking and support {{!nln}newline normalization}.
1414+1515+ Functions are also provided to {{!String} fold over} the characters
1616+ of UTF encoded OCaml string values and to {{!Buffer}directly encode}
1717+ characters in OCaml {!Stdlib.Buffer.t} values. {b Note} that since OCaml
1818+ 4.14, that functionality can be found in {!Stdlib.String} and
1919+ {!Stdlib.Buffer} and you are encouraged to migrate to it.
2020+2121+ See {{:#examples}examples} of use.
2222+2323+ {b References}
2424+ {ul
2525+ {- The Unicode Consortium.
2626+ {e {{:http://www.unicode.org/versions/latest}The Unicode Standard}}.
2727+ (latest version)}}
2828+*)
2929+3030+ (** {1:ucharcsts Special Unicode characters} *)
3131+3232+val u_bom : Uchar.t
3333+(** [u_bom] is the {{:http://unicode.org/glossary/#byte_order_mark}byte
3434+ order mark} (BOM) character ([U+FEFF]). From OCaml 4.06 on, use
3535+ {!Uchar.bom}. *)
3636+3737+val u_rep : Uchar.t
3838+(** [u_rep] is the
3939+ {{:http://unicode.org/glossary/#replacement_character}replacement}
4040+ character ([U+FFFD]). From OCaml 4.06 on, use
4141+ {!Uchar.rep}. *)
4242+4343+4444+(** {1:schemes Unicode encoding schemes} *)
4545+4646+type encoding = [ `UTF_16 | `UTF_16BE | `UTF_16LE | `UTF_8 ]
4747+(** The type for Unicode
4848+ {{:http://unicode.org/glossary/#character_encoding_scheme}encoding
4949+ schemes}. *)
5050+5151+type decoder_encoding = [ encoding | `US_ASCII | `ISO_8859_1 ]
5252+(** The type for encoding schemes {e decoded} by [Uutf]. Unicode encoding
5353+ schemes plus {{:http://tools.ietf.org/html/rfc20}US-ASCII} and
5454+ {{:http://www.ecma-international.org/publications/standards/Ecma-094.htm}
5555+ ISO/IEC 8859-1} (latin-1). *)
5656+5757+val encoding_of_string : string -> decoder_encoding option
5858+(** [encoding_of_string s] converts a (case insensitive)
5959+ {{:http://www.iana.org/assignments/character-sets}IANA character set name}
6060+ to an encoding. *)
6161+6262+val encoding_to_string : [< decoder_encoding] -> string
6363+(** [encoding_to_string e] is a
6464+ {{:http://www.iana.org/assignments/character-sets}IANA character set name}
6565+ for [e]. *)
6666+6767+(** {1:decode Decode} *)
6868+6969+type src = [ `Channel of in_channel | `String of string | `Manual ]
7070+(** The type for input sources. With a [`Manual] source the client
7171+ must provide input with {!Manual.src}. *)
7272+7373+type nln = [ `ASCII of Uchar.t | `NLF of Uchar.t | `Readline of Uchar.t ]
7474+(** The type for newline normalizations. The variant argument is the
7575+ normalization character.
7676+ {ul
7777+ {- [`ASCII], normalizes CR ([U+000D]), LF ([U+000A]) and CRLF
7878+ (<[U+000D], [U+000A]>).}
7979+ {- [`NLF], normalizes the Unicode newline function (NLF). This is
8080+ NEL ([U+0085]) and the normalizations of [`ASCII].}
8181+ {- [`Readline], normalizes for a Unicode readline function. This is FF
8282+ ([U+000C]), LS ([U+2028]), PS ([U+2029]), and the normalizations
8383+ of [`NLF].}}
8484+ Used with an appropriate normalization character the [`NLF] and
8585+ [`Readline] normalizations allow implementing all the different
8686+ recommendations of Unicode's newline guidelines (section 5.8 in
8787+ Unicode 9.0.0). *)
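To make the [`ASCII] normalization concrete, here is a minimal sketch on a list of code points. It assumes only the OCaml standard library; `normalize_ascii` is a hypothetical helper, not part of Uutf, which performs this normalization incrementally while decoding rather than on a whole list.

```ocaml
(* Sketch only: [normalize_ascii] is a hypothetical helper, not part of
   Uutf. It replaces CR, LF and CRLF by the code point [nl]. *)
let normalize_ascii ~nl (cps : int list) : int list =
  let rec loop acc = function
    | 0x000D :: 0x000A :: rest -> loop (nl :: acc) rest   (* CRLF *)
    | (0x000D | 0x000A) :: rest -> loop (nl :: acc) rest  (* lone CR or LF *)
    | c :: rest -> loop (c :: acc) rest
    | [] -> List.rev acc
  in
  loop [] cps
```

The CRLF case has to come first so that the pair collapses to a single `nl` instead of two.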
8888+8989+type decoder
9090+(** The type for decoders. *)
9191+9292+val decoder : ?nln:[< nln] -> ?encoding:[< decoder_encoding] -> [< src] ->
9393+ decoder
9494+(** [decoder nln encoding src] is a decoder that inputs from [src].
9595+9696+ {b Byte order mark.}
9797+ {{:http://unicode.org/glossary/#byte_order_mark}Byte order mark}
9898+ (BOM) constraints are application dependent and prone to
9999+ misunderstandings (see the
100100+ {{:http://www.unicode.org/faq/utf_bom.html#BOM}FAQ}). Hence,
101101+ [Uutf] decoders have a simple rule: an {e initial BOM is always
102102+ removed from the input and not counted in character position
103103+ tracking}. The function {!decoder_removed_bom} does however return
104104+ [true] if a BOM was removed so that all the information can be
105105+ recovered if needed.
106106+107107+ For UTF-16BE and UTF-16LE the above rule is a violation of
108108+ conformance D96 and D97 of the standard. [Uutf] favors the idea
109109+ that if there's a BOM, decoding with [`UTF_16] or the [`UTF_16XX]
110110+ corresponding to the BOM should decode the same character sequence
111111+ (this is not the case if you stick to the standard). The client
112112+ can however regain conformance by consulting the result of
113113+ {!decoder_removed_bom} and take appropriate action.
114114+115115+ {b Encoding.} [encoding] specifies the decoded encoding
116116+ scheme. If [`UTF_16] is used the endianness is determined
117117+ according to the standard: from a
118118+ {{:http://unicode.org/glossary/#byte_order_mark}BOM}
119119+ if there is one, [`UTF_16BE] otherwise.
120120+121121+ If [encoding] is unspecified it is guessed. The result of a guess
122122+ can only be [`UTF_8], [`UTF_16BE] or [`UTF_16LE]. The heuristic
123123+ looks at the first three bytes of input (or less if impossible)
124124+ and takes the {e first} matching byte pattern in the table below.
125125+{v
126126+xx = any byte
127127+.. = any byte or no byte (input too small)
128128+pp = positive byte
129129+uu = valid UTF-8 first byte
130130+131131+Bytes | Guess | Rationale
132132+---------+-----------+-----------------------------------------------
133133+EF BB BF | `UTF_8 | UTF-8 BOM
134134+FE FF .. | `UTF_16BE | UTF-16BE BOM
135135+FF FE .. | `UTF_16LE | UTF-16LE BOM
136136+00 pp .. | `UTF_16BE | ASCII UTF-16BE and U+0000 is often forbidden
137137+pp 00 .. | `UTF_16LE | ASCII UTF-16LE and U+0000 is often forbidden
138138+uu .. .. | `UTF_8 | ASCII UTF-8 or valid UTF-8 first byte.
139139+xx xx .. | `UTF_16BE | Not UTF-8 => UTF-16, no BOM => UTF-16BE
140140+.. .. .. | `UTF_8 | Single malformed UTF-8 byte or no input.
141141+v}
142142+ This heuristic is compatible both with BOM based
143143+ recognition and
144144+ {{:http://tools.ietf.org/html/rfc4627#section-3}JSON-like encoding
145145+ recognition} that relies on ASCII being present at the beginning
146146+ of the stream. Also, {!decoder_removed_bom} will tell the client
147147+ if the guess was BOM based.
148148+149149+ {b Newline normalization.} If [nln] is specified, the given
150150+ newline normalization is performed, see {!nln}. Otherwise
151151+ all newlines are returned as found in the input.
152152+153153+ {b Character position.} The line number, column number, byte count
154154+ and character count of the last decoded character (including
155155+ [`Malformed] ones) are respectively returned by {!decoder_line},
156156+ {!decoder_col}, {!decoder_byte_count} and {!decoder_count}. Before
157157+ the first call to {!val-decode} the line number is [1] and the column
158158+ is [0]. Each {!val-decode} returning [`Uchar] or [`Malformed]
159159+ increments the column until a newline. On a newline, the line
160160+ number is incremented and the column set to zero. For example the
161161+ line is [2] and column [0] after the first newline was
162162+ decoded. This can be understood as if {!val-decode} was moving an
163163+ insertion point to the right in the data. A {e newline} is
164164+ anything normalized by [`Readline], see {!nln}.
165165+166166+ [Uutf] assumes that each Unicode scalar value has a column width
167167+ of 1. The same assumption may not be made by the display program
168168+ (e.g. for [emacs]' compilation mode you need to set
169169+ [compilation-error-screen-columns] to [nil]). The problem is in
170170+ general difficult to solve without interaction or convention with the
171171+ display program's rendering engine. Depending on the context better column
172172+ increments can be implemented by using {!Uucp.Break.tty_width_hint} or
173173+ {{:http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries}
174174+ grapheme cluster boundaries} (see {!Uuseg}). *)
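The encoding guess table in the documentation of `decoder` can be read as a first-match pattern over at most three bytes. The following is a minimal sketch of that reading on a whole string, assuming only the OCaml standard library. `guess` and `is_utf_8_first` are hypothetical names and this is not Uutf's internal implementation. Two points are assumptions of the sketch: a "positive byte" is taken to mean a non-zero byte, and a "valid UTF-8 first byte" is taken to mean an ASCII byte or a lead byte in the range 0xC2 to 0xF4.

```ocaml
(* Sketch only: [guess] is a hypothetical helper, not Uutf's internal
   implementation. It applies the documented table to the first bytes
   of a string and returns the first matching row. *)
type guess = [ `UTF_8 | `UTF_16BE | `UTF_16LE ]

(* Assumption: ASCII bytes and lead bytes 0xC2-0xF4 count as valid
   UTF-8 first bytes. *)
let is_utf_8_first b = b <= 0x7F || (0xC2 <= b && b <= 0xF4)

let guess (s : string) : guess =
  let byte i = if i < String.length s then Some (Char.code s.[i]) else None in
  match byte 0, byte 1, byte 2 with
  | Some 0xEF, Some 0xBB, Some 0xBF -> `UTF_8      (* EF BB BF: UTF-8 BOM *)
  | Some 0xFE, Some 0xFF, _ -> `UTF_16BE           (* FE FF: UTF-16BE BOM *)
  | Some 0xFF, Some 0xFE, _ -> `UTF_16LE           (* FF FE: UTF-16LE BOM *)
  | Some 0x00, Some p, _ when p > 0 -> `UTF_16BE   (* 00 pp *)
  | Some p, Some 0x00, _ when p > 0 -> `UTF_16LE   (* pp 00 *)
  | Some u, _, _ when is_utf_8_first u -> `UTF_8   (* uu *)
  | Some _, Some _, _ -> `UTF_16BE                 (* xx xx *)
  | _ -> `UTF_8                                    (* lone bad byte or no input *)
```

Row order matters: the BOM rows must be tried before the ASCII rows, and the `uu` row before the `xx xx` fallback.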
175175+176176+val decode : decoder ->
177177+ [ `Await | `Uchar of Uchar.t | `End | `Malformed of string]
178178+(** [decode d] is:
179179+ {ul
180180+ {- [`Await] if [d] has a [`Manual] input source and awaits
181181+ more input. The client must use {!Manual.src} to provide it.}
182182+ {- [`Uchar u] if a Unicode scalar value [u] was decoded.}
183183+ {- [`End] if the end of input was reached.}
184184+ {- [`Malformed bytes] if the [bytes] sequence is malformed according to
185185+ the decoded encoding scheme. If you are interested in a best-effort
186186+ decoding you can still continue to decode after an error until the
187187+ decoder synchronizes again on valid bytes. It may however be a good
188188+ idea to signal the malformed characters by adding a {!u_rep}
189189+ character to the parsed data, see the {{:#examples}examples}.}}
190190+191191+ {b Note.} Repeated invocation always eventually returns [`End], even
192192+ in case of errors. *)
193193+194194+val decoder_encoding : decoder -> decoder_encoding
195195+(** [decoder_encoding d] is the decoded encoding scheme of [d].
196196+197197+ {b Warning.} If the decoder guesses the encoding or uses [`UTF_16],
198198+ rely on this value only after the first [`Uchar] was decoded. *)
199199+200200+(**/**)
201201+202202+(* This function is dangerous, it may destroy the current continuation.
203203+ But it's needed for things like XML parsers. *)
204204+205205+val set_decoder_encoding : decoder -> [< decoder_encoding] -> unit
206206+(** [set_decoder_encoding d enc] changes the decoded encoding
207207+ to [enc] after decoding started.
208208+209209+ {b Warning.} Call only after {!val-decode} was called on [d] and the
210210+ last call to it returned something other than [`Await], or data may
211211+ be lost. After an encoding guess wait for at least three [`Uchar]s. *)
212212+213213+(**/**)
214214+215215+val decoder_line : decoder -> int
216216+(** [decoder_line d] is the line number of the last
217217+ decoded (or malformed) character. See {!val-decoder} for details. *)
218218+219219+val decoder_col : decoder -> int
220220+(** [decoder_col d] is the column number of the last decoded
221221+ (or malformed) character. See {!val-decoder} for details. *)
222222+223223+val decoder_byte_count : decoder -> int
224224+(** [decoder_byte_count d] is the number of bytes already decoded on
225225+ [d] (including malformed ones). This is the last {!val-decode}'s
226226+ end byte offset counting from the beginning of the stream. *)
227227+228228+val decoder_count : decoder -> int
229229+(** [decoder_count d] is the number of characters already decoded on [d]
230230+ (including malformed ones). See {!val-decoder} for details. *)
231231+232232+val decoder_removed_bom : decoder -> bool
233233+(** [decoder_removed_bom d] is [true] iff an {e initial}
234234+ {{:http://unicode.org/glossary/#byte_order_mark}BOM} was
235235+ removed from the input stream. See {!val-decoder} for details. *)
236236+237237+val decoder_src : decoder -> src
238238+(** [decoder_src d] is [d]'s input source. *)
239239+240240+val decoder_nln : decoder -> nln option
241241+(** [decoder_nln d] returns [d]'s newline normalization (if any). *)
242242+243243+val pp_decode : Format.formatter ->
244244+ [< `Await | `Uchar of Uchar.t | `End | `Malformed of string] -> unit
245245+(** [pp_decode ppf v] prints an unspecified representation of [v] on
246246+ [ppf]. *)
247247+248248+(** {1:encode Encode} *)
249249+250250+type dst = [ `Channel of out_channel | `Buffer of Buffer.t | `Manual ]
251251+(** The type for output destinations. With a [`Manual] destination the client
252252+ must provide output storage with {!Manual.dst}. *)
253253+254254+type encoder
255255+(** The type for Unicode encoders. *)
256256+257257+val encoder : [< encoding] -> [< dst] -> encoder
258258+(** [encoder encoding dst] is an encoder for [encoding] that outputs
259259+ to [dst].
260260+261261+ {b Note.} No initial
262262+ {{:http://unicode.org/glossary/#byte_order_mark}BOM}
263263+ is encoded. If needed, this duty is left to the client. *)
264264+265265+val encode :
266266+ encoder -> [<`Await | `End | `Uchar of Uchar.t ] -> [`Ok | `Partial ]
267267+(** [encode e v] is:
268268+ {ul
269269+ {- [`Partial] iff [e] has a [`Manual] destination and needs more output
270270+ storage. The client must use {!Manual.dst} to provide a new buffer
271271+ and then call {!val-encode} with [`Await] until [`Ok] is returned.}
272272+ {- [`Ok] when the encoder is ready to encode a new [`Uchar] or [`End]}}
273273+274274+ For [`Manual] destination, encoding [`End] always returns
275275+ [`Partial], the client should continue as usual with [`Await]
276276+ until [`Ok] is returned at which point {!Manual.dst_rem} [e] is
277277+ guaranteed to be the size of the last provided buffer (i.e. nothing
278278+ was written).
279279+280280+ {b Raises.} [Invalid_argument] if an [`Uchar] or [`End] is encoded
281281+ after a [`Partial] encode. *)
282282+283283+val encoder_encoding : encoder -> encoding
284284+(** [encoder_encoding e] is [e]'s encoding. *)
285285+286286+val encoder_dst : encoder -> dst
287287+(** [encoder_dst e] is [e]'s output destination. *)
288288+289289+(** {1:manual Manual sources and destinations.} *)
290290+291291+(** Manual sources and destinations.
292292+293293+ {b Warning.} Use only with [`Manual] decoders and encoders. *)
294294+module Manual : sig
295295+ val src : decoder -> Bytes.t -> int -> int -> unit
296296+ (** [src d s j l] provides [d] with [l] bytes to read, starting at
297297+ [j] in [s]. This byte range is read by calls to {!val-decode} with [d]
298298+ until [`Await] is returned. To signal the end of input call the function
299299+ with [l = 0]. *)
300300+301301+ val dst : encoder -> Bytes.t -> int -> int -> unit
302302+ (** [dst e s j l] provides [e] with [l] bytes to write, starting
303303+ at [j] in [s]. This byte range is written by calls to
304304+ {!val-encode} with [e] until [`Partial] is returned. Use {!dst_rem} to
305305+ know the remaining number of non-written free bytes in [s]. *)
306306+307307+ val dst_rem : encoder -> int
308308+ (** [dst_rem e] is the remaining number of non-written, free bytes
309309+ in the last buffer provided with {!Manual.dst}. *)
310310+end
311311+312312+(** {1:strbuf String folders and Buffer encoders} *)
313313+314314+(** Fold over the characters of UTF encoded OCaml [string] values.
315315+316316+ {b Note.} Since OCaml 4.14, UTF decoders are available in
317317+ {!Stdlib.String}. You are encouraged to migrate to them. *)
318318+module String : sig
319319+320320+(** {1 Encoding guess} *)
321321+322322+ val encoding_guess : string -> [ `UTF_8 | `UTF_16BE | `UTF_16LE ] * bool
323323+ (** [encoding_guess s] is the encoding guessed for [s] coupled with
324324+ [true] iff there's an initial
325325+ {{:http://unicode.org/glossary/#byte_order_mark}BOM}. *)
326326+327327+(** {1 String folders}
328328+329329+ {b Note.} Initial {{:http://unicode.org/glossary/#byte_order_mark}BOM}s
330330+ are also folded over. *)
331331+332332+ type 'a folder = 'a -> int -> [ `Uchar of Uchar.t | `Malformed of string ] ->
333333+ 'a
334334+ (** The type for character folders. The integer is the index in the
335335+ string where the [`Uchar] or [`Malformed] starts. *)
336336+337337+ val fold_utf_8 : ?pos:int -> ?len:int -> 'a folder -> 'a -> string -> 'a
338338+ (** [fold_utf_8 ?pos ?len f a s] is
339339+ [f (] ... [(f (f a pos u]{_0}[) j]{_1}[ u]{_1}[)] ... [)] ... [)
340340+ j]{_n}[ u]{_n}
341341+ where [u]{_i}, [j]{_i} are characters and their start position
342342+ in the UTF-8 encoded substring [s] starting at [pos] and [len]
343343+ long. The default value for [pos] is [0] and [len] is
344344+ [String.length s - pos]. *)
345345+346346+ val fold_utf_16be : ?pos:int -> ?len:int -> 'a folder -> 'a -> string -> 'a
347347+ (** [fold_utf_16be ?pos ?len f a s] is
348348+ [f (] ... [(f (f a pos u]{_0}[) j]{_1}[ u]{_1}[)] ... [)] ... [)
349349+ j]{_n}[ u]{_n}
350350+ where [u]{_i}, [j]{_i} are characters and their start position
351351+ in the UTF-16BE encoded substring [s] starting at [pos] and [len]
352352+ long. The default value for [pos] is [0] and [len] is
353353+ [String.length s - pos]. *)
354354+355355+ val fold_utf_16le : ?pos:int -> ?len:int -> 'a folder -> 'a -> string -> 'a
356356+ (** [fold_utf_16le ?pos ?len f a s] is
357357+ [f (] ... [(f (f a pos u]{_0}[) j]{_1}[ u]{_1}[)] ... [)] ... [)
358358+ j]{_n}[ u]{_n}
359359+ where [u]{_i}, [j]{_i} are characters and their start position
360360+ in the UTF-16LE encoded substring [s] starting at [pos] and [len]
361361+ long. The default value for [pos] is [0] and [len] is
362362+ [String.length s - pos]. *)
363363+end
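The note on `String` suggests migrating to the standard library decoders. A minimal sketch of what a replacement for `fold_utf_8` might look like, assuming OCaml >= 4.14 (`String.get_utf_8_uchar` and the `Uchar.utf_decode_*` accessors). `fold_utf_8_stdlib` is a hypothetical name, and the length of the byte run reported as `` `Malformed `` is decided by the standard library decoder, so it may not match Uutf's in every case.

```ocaml
(* Sketch only: [fold_utf_8_stdlib] is a hypothetical name, not part of
   Uutf or the standard library. It folds over a UTF-8 encoded string
   with the OCaml >= 4.14 decoder and passes [f] the byte index of each
   decoded character or malformed byte run. *)
let fold_utf_8_stdlib f acc s =
  let len = String.length s in
  let rec loop acc i =
    if i >= len then acc else
    let d = String.get_utf_8_uchar s i in
    (* Number of bytes consumed by this decode, always at least 1. *)
    let n = Uchar.utf_decode_length d in
    let v =
      if Uchar.utf_decode_is_valid d
      then `Uchar (Uchar.utf_decode_uchar d)
      else `Malformed (String.sub s i n)
    in
    loop (f acc i v) (i + n)
  in
  loop acc 0
```

Because the decode length is always strictly positive, the loop makes progress even on malformed input.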
364364+365365+(** UTF encode characters in OCaml {!Buffer.t} values.
366366+367367+ {b Note.} Since OCaml 4.06, these encoders are available in
368368+ {!Stdlib.Buffer}. You are encouraged to migrate to them. *)
369369+module Buffer : sig
370370+371371+ (** {1 Buffer encoders} *)
372372+373373+ val add_utf_8 : Buffer.t -> Uchar.t -> unit
374374+ (** [add_utf_8 b u] adds the UTF-8 encoding of [u] to [b]. *)
375375+376376+ val add_utf_16be : Buffer.t -> Uchar.t -> unit
377377+ (** [add_utf_16be b u] adds the UTF-16BE encoding of [u] to [b]. *)
378378+379379+ val add_utf_16le : Buffer.t -> Uchar.t -> unit
380380+ (** [add_utf_16le b u] adds the UTF-16LE encoding of [u] to [b]. *)
381381+end
382382+383383+(** {1:examples Examples}
384384+385385+ {2:readlines Read lines}
386386+387387+ The value of [lines src] is the list of lines in [src] as UTF-8
388388+ encoded OCaml strings. Line breaks are determined according to the
389389+ recommendation R4 for a [readline] function in section 5.8 of
390390+ Unicode 9.0.0. If a decoding error occurs we silently replace the
391391+ malformed sequence by the replacement character {!u_rep} and continue.
392392+{[let lines ?encoding (src : [`Channel of in_channel | `String of string]) =
393393+ let rec loop d buf acc = match Uutf.decode d with
394394+ | `Uchar u ->
395395+ begin match Uchar.to_int u with
396396+ | 0x000A ->
397397+ let line = Buffer.contents buf in
398398+ Buffer.clear buf; loop d buf (line :: acc)
399399+ | _ ->
400400+ Uutf.Buffer.add_utf_8 buf u; loop d buf acc
401401+ end
402402+ | `End -> List.rev (Buffer.contents buf :: acc)
403403+ | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop d buf acc
404404+ | `Await -> assert false
405405+ in
406406+ let nln = `Readline (Uchar.of_int 0x000A) in
407407+ loop (Uutf.decoder ~nln ?encoding src) (Buffer.create 512) []
408408+]}
409409+ Using the [`Manual] interface, [lines_fd] does the same but on a Unix file
410410+ descriptor.
411411+{[let lines_fd ?encoding (fd : Unix.file_descr) =
412412+ let rec loop fd s d buf acc = match Uutf.decode d with
413413+ | `Uchar u ->
414414+ begin match Uchar.to_int u with
415415+ | 0x000A ->
416416+ let line = Buffer.contents buf in
417417+ Buffer.clear buf; loop fd s d buf (line :: acc)
418418+ | _ ->
419419+ Uutf.Buffer.add_utf_8 buf u; loop fd s d buf acc
420420+ end
421421+ | `End -> List.rev (Buffer.contents buf :: acc)
422422+ | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop fd s d buf acc
423423+ | `Await ->
424424+ let rec unix_read fd s j l = try Unix.read fd s j l with
425425+ | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
426426+ in
427427+ let rc = unix_read fd s 0 (Bytes.length s) in
428428+ Uutf.Manual.src d s 0 rc; loop fd s d buf acc
429429+ in
430430+ let s = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
431431+ let nln = `Readline (Uchar.of_int 0x000A) in
432432+ loop fd s (Uutf.decoder ~nln ?encoding `Manual) (Buffer.create 512) []
433433+]}
434434+435435+ {2:recode Recode}
436436+437437+ The result of [recode src out_encoding dst] has the characters of
438438+ [src] written on [dst] with encoding [out_encoding]. If a
439439+ decoding error occurs we silently replace the malformed sequence
440440+ by the replacement character {!u_rep} and continue. Note that we
441441+ don't add an initial
442442+ {{:http://unicode.org/glossary/#byte_order_mark}BOM} to [dst],
443443+ recoding will thus lose the initial BOM [src] may have. Whether
444444+ this is a problem or not depends on the context.
445445+{[let recode ?nln ?encoding out_encoding
446446+ (src : [`Channel of in_channel | `String of string])
447447+ (dst : [`Channel of out_channel | `Buffer of Buffer.t])
448448+ =
449449+ let rec loop d e = match Uutf.decode d with
450450+ | `Uchar _ as u -> ignore (Uutf.encode e u); loop d e
451451+ | `End -> ignore (Uutf.encode e `End)
452452+ | `Malformed _ -> ignore (Uutf.encode e (`Uchar Uutf.u_rep)); loop d e
453453+ | `Await -> assert false
454454+ in
455455+ let d = Uutf.decoder ?nln ?encoding src in
456456+ let e = Uutf.encoder out_encoding dst in
457457+ loop d e]}
458458+ Using the [`Manual] interface, [recode_fd] does the same but between
459459+ Unix file descriptors.
460460+{[let recode_fd ?nln ?encoding out_encoding
461461+ (fdi : Unix.file_descr)
462462+ (fdo : Unix.file_descr)
463463+ =
464464+ let rec encode fd s e v = match Uutf.encode e v with `Ok -> ()
465465+ | `Partial ->
466466+ let rec unix_write fd s j l =
467467+ let rec write fd s j l = try Unix.single_write fd s j l with
468468+ | Unix.Unix_error (Unix.EINTR, _, _) -> write fd s j l
469469+ in
470470+ let wc = write fd s j l in
471471+ if wc < l then unix_write fd s (j + wc) (l - wc) else ()
472472+ in
473473+ unix_write fd s 0 (Bytes.length s - Uutf.Manual.dst_rem e);
474474+ Uutf.Manual.dst e s 0 (Bytes.length s);
475475+ encode fd s e `Await
476476+ in
477477+ let rec loop fdi fdo ds es d e = match Uutf.decode d with
478478+ | `Uchar _ as u -> encode fdo es e u; loop fdi fdo ds es d e
479479+ | `End -> encode fdo es e `End
480480+ | `Malformed _ -> encode fdo es e (`Uchar Uutf.u_rep); loop fdi fdo ds es d e
481481+ | `Await ->
482482+ let rec unix_read fd s j l = try Unix.read fd s j l with
483483+ | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
484484+ in
485485+ let rc = unix_read fdi ds 0 (Bytes.length ds) in
486486+ Uutf.Manual.src d ds 0 rc; loop fdi fdo ds es d e
487487+ in
488488+ let ds = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
489489+ let es = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
490490+ let d = Uutf.decoder ?nln ?encoding `Manual in
491491+ let e = Uutf.encoder out_encoding `Manual in
492492+ Uutf.Manual.dst e es 0 (Bytes.length es);
493493+ loop fdi fdo ds es d e]}
494494+*)
···11+(* Examples from the documentation, this code is in the public domain. *)
22+33+(* Read lines *)
44+55+let lines ?encoding (src : [`Channel of in_channel | `String of string]) =
66+ let rec loop d buf acc = match Uutf.decode d with
77+ | `Uchar u ->
88+ begin match Uchar.to_int u with
99+ | 0x000A ->
1010+ let line = Buffer.contents buf in
1111+ Buffer.clear buf; loop d buf (line :: acc)
1212+ | _ ->
1313+ Uutf.Buffer.add_utf_8 buf u; loop d buf acc
1414+ end
1515+ | `End -> List.rev (Buffer.contents buf :: acc)
1616+ | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop d buf acc
1717+ | `Await -> assert false
1818+ in
1919+ let nln = `Readline (Uchar.of_int 0x000A) in
2020+ loop (Uutf.decoder ~nln ?encoding src) (Buffer.create 512) []
2121+2222+let lines_fd ?encoding (fd : Unix.file_descr) =
2323+ let rec loop fd s d buf acc = match Uutf.decode d with
2424+ | `Uchar u ->
2525+ begin match Uchar.to_int u with
2626+ | 0x000A ->
2727+ let line = Buffer.contents buf in
2828+ Buffer.clear buf; loop fd s d buf (line :: acc)
2929+ | _ ->
3030+ Uutf.Buffer.add_utf_8 buf u; loop fd s d buf acc
3131+ end
3232+ | `End -> List.rev (Buffer.contents buf :: acc)
3333+ | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop fd s d buf acc
3434+ | `Await ->
3535+ let rec unix_read fd s j l = try Unix.read fd s j l with
3636+ | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
3737+ in
3838+ let rc = unix_read fd s 0 (Bytes.length s) in
3939+ Uutf.Manual.src d s 0 rc; loop fd s d buf acc
4040+ in
4141+ let s = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
4242+ let nln = `Readline (Uchar.of_int 0x000A) in
4343+ loop fd s (Uutf.decoder ~nln ?encoding `Manual) (Buffer.create 512) []
4444+4545+(* Recode *)
4646+4747+let recode ?nln ?encoding out_encoding
4848+ (src : [`Channel of in_channel | `String of string])
4949+ (dst : [`Channel of out_channel | `Buffer of Buffer.t])
5050+ =
5151+ let rec loop d e = match Uutf.decode d with
5252+ | `Uchar _ as u -> ignore (Uutf.encode e u); loop d e
5353+ | `End -> ignore (Uutf.encode e `End)
5454+ | `Malformed _ -> ignore (Uutf.encode e (`Uchar Uutf.u_rep)); loop d e
5555+ | `Await -> assert false
5656+ in
5757+ let d = Uutf.decoder ?nln ?encoding src in
5858+ let e = Uutf.encoder out_encoding dst in
5959+ loop d e
6060+6161+let recode_fd ?nln ?encoding out_encoding
6262+ (fdi : Unix.file_descr)
6363+ (fdo : Unix.file_descr)
6464+ =
6565+ let rec encode fd s e v = match Uutf.encode e v with `Ok -> ()
6666+ | `Partial ->
6767+ let rec unix_write fd s j l =
6868+ let rec write fd s j l = try Unix.single_write fd s j l with
6969+ | Unix.Unix_error (Unix.EINTR, _, _) -> write fd s j l
7070+ in
7171+ let wc = write fd s j l in
7272+ if wc < l then unix_write fd s (j + wc) (l - wc) else ()
7373+ in
7474+ unix_write fd s 0 (Bytes.length s - Uutf.Manual.dst_rem e);
7575+ Uutf.Manual.dst e s 0 (Bytes.length s);
7676+ encode fd s e `Await
7777+ in
7878+ let rec loop fdi fdo ds es d e = match Uutf.decode d with
7979+ | `Uchar _ as u -> encode fdo es e u; loop fdi fdo ds es d e
8080+ | `End -> encode fdo es e `End
8181+ | `Malformed _ -> encode fdo es e (`Uchar Uutf.u_rep); loop fdi fdo ds es d e
8282+ | `Await ->
8383+ let rec unix_read fd s j l = try Unix.read fd s j l with
8484+ | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
8585+ in
8686+ let rc = unix_read fdi ds 0 (Bytes.length ds) in
8787+ Uutf.Manual.src d ds 0 rc; loop fdi fdo ds es d e
8888+ in
8989+ let ds = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
9090+ let es = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
9191+ let d = Uutf.decoder ?nln ?encoding `Manual in
9292+ let e = Uutf.encoder out_encoding `Manual in
9393+ Uutf.Manual.dst e es 0 (Bytes.length es);
9494+ loop fdi fdo ds es d e
+376
vendor/opam/uutf/test/test_uutf.ml
···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2012 The uutf programmers. All rights reserved.
33+ SPDX-License-Identifier: ISC
44+ ---------------------------------------------------------------------------*)
55+66+let u_nl = Uchar.of_int 0x000A
77+let log f = Format.printf (f ^^ "@?")
88+let fail fmt =
99+ let fail _ = failwith (Format.flush_str_formatter ()) in
1010+ Format.kfprintf fail Format.str_formatter fmt
1111+1212+let fail_decode e f =
1313+ fail "expected %a, decoded %a" Uutf.pp_decode e Uutf.pp_decode f
1414+1515+let uchar_succ u = if Uchar.equal u Uchar.max then u else Uchar.succ u
1616+let iter_uchars f =
1717+ for u = 0x0000 to 0xD7FF do f (Uchar.unsafe_of_int u) done;
1818+ for u = 0xE000 to 0x10FFFF do f (Uchar.unsafe_of_int u) done
1919+2020+let codec_test () =
2121+ let codec_uchars encoding s bsize =
2222+ log "Codec every unicode scalar value in %s with buffer size %d.\n%!"
2323+ (Uutf.encoding_to_string encoding) bsize;
2424+ let encode_uchars encoding s bsize =
2525+ let spos = ref 0 in
2626+ let e = Uutf.encoder encoding `Manual in
2727+ let rec encode e v = match Uutf.encode e v with `Ok -> ()
2828+ | `Partial ->
2929+ let brem = Bytes.length s - !spos in
3030+ let drem = Uutf.Manual.dst_rem e in
3131+ let bsize = min bsize brem in
3232+ Uutf.Manual.dst e s !spos bsize;
3333+ spos := !spos + bsize - drem;
3434+ encode e `Await
3535+ in
3636+ let encode_u u = encode e (`Uchar u) in
3737+ iter_uchars encode_u; encode e `End;
3838+ !spos - Uutf.Manual.dst_rem e (* encoded length. *)
3939+ in
4040+ let decode_uchars encoding s slen bsize =
4141+ let spos = ref 0 in
4242+ let bsize = min bsize slen in
4343+ let d = Uutf.decoder ~encoding `Manual in
4444+ let rec decode d = match Uutf.decode d with
4545+ | `Malformed _ | `Uchar _ | `End as v -> v
4646+ | `Await ->
4747+ let rem = slen - !spos in
4848+ let bsize = min bsize rem in
4949+ Uutf.Manual.src d s !spos bsize;
5050+ spos := !spos + bsize;
5151+ decode d
5252+ in
5353+ let decode_u u = match decode d with
5454+ | `Uchar u' when u = u' -> ()
5555+ | v -> fail_decode (`Uchar u) v
5656+ in
5757+ iter_uchars decode_u;
5858+ match decode d with
5959+ | `End -> () | v -> fail_decode `End v
6060+ in
6161+ let slen = encode_uchars encoding s bsize in
6262+ decode_uchars encoding s slen bsize
6363+ in
6464+ let full = 4 * 0x10FFFF in (* will hold everything in any encoding. *)
6565+ let s = Bytes.create full in
6666+ let test encoding =
6767+ (* Test with various sizes to increase condition coverage. *)
6868+ for i = 1 to 11 do codec_uchars encoding s i done;
6969+ codec_uchars encoding s full;
7070+ in
7171+ test `UTF_8; test `UTF_16BE; test `UTF_16LE
7272+7373+let buffer_string_codec_test () =
7474+ let codec_uchars encoding encode decode b =
7575+ log "Buffer/String codec every unicode scalar value in %s.\n%!"
7676+ (Uutf.encoding_to_string encoding);
7777+ Buffer.clear b;
7878+ iter_uchars (encode b);
7979+ let s = Buffer.contents b in
8080+ let check uchar _ = function
8181+ | `Uchar u when Uchar.equal u uchar -> uchar_succ uchar
8282+ | v -> fail_decode (`Uchar uchar) v
8383+ in
8484+ ignore (decode ?pos:None ?len:None check (Uchar.of_int 0x0000) s)
8585+ in
8686+ let b = Buffer.create (4 * 0x10FFFF) in
8787+ codec_uchars `UTF_8 Uutf.Buffer.add_utf_8 Uutf.String.fold_utf_8 b;
8888+ codec_uchars `UTF_16BE Uutf.Buffer.add_utf_16be Uutf.String.fold_utf_16be b;
8989+ codec_uchars `UTF_16LE Uutf.Buffer.add_utf_16le Uutf.String.fold_utf_16le b
9090+9191+let pos_test () =
9292+ let test encoding s =
9393+ log "Test position tracking in %s.\n%!" (Uutf.encoding_to_string encoding);
9494+ let pos d (l, c, k) =
9595+ match Uutf.decoder_line d, Uutf.decoder_col d, Uutf.decoder_count d with
9696+ | (l', c', k') when l = l' && c = c' && k = k' -> ignore (Uutf.decode d)
9797+ | (l', c', k') ->
9898+ fail "Expected position (%d,%d,%d) found (%d,%d,%d)." l c k l' c' k'
9999+ in
100100+ let e = Uutf.decoder ~encoding (`String s) in
101101+ pos e (1, 0, 0); pos e (1, 1, 1); pos e (1, 2, 2); pos e (2, 0, 3);
102102+ pos e (2, 1, 4); pos e (3, 0, 5); pos e (3, 0, 6); pos e (3, 1, 7);
103103+ pos e (3, 2, 8); pos e (4, 0, 9); pos e (4, 0, 10); pos e (5, 0, 11);
104104+ pos e (6, 0, 12); pos e (6, 0, 12); pos e (6, 0, 12);
105105+ let e = Uutf.decoder ~nln:(`ASCII u_nl) ~encoding (`String s) in
106106+ pos e (1, 0, 0); pos e (1, 1, 1); pos e (1, 2, 2); pos e (2, 0, 3);
107107+ pos e (2, 1, 4); pos e (3, 0, 5); pos e (3, 1, 6); pos e (3, 2, 7);
108108+ pos e (4, 0, 8); pos e (5, 0, 9); pos e (6, 0, 10); pos e (6, 0, 10);
109109+ pos e (6, 0, 10);
110110+ let e = Uutf.decoder ~nln:(`NLF u_nl) ~encoding (`String s) in
111111+ pos e (1, 0, 0); pos e (1, 1, 1); pos e (1, 2, 2); pos e (2, 0, 3);
112112+ pos e (2, 1, 4); pos e (3, 0, 5); pos e (3, 1, 6); pos e (3, 2, 7);
113113+ pos e (4, 0, 8); pos e (5, 0, 9); pos e (6, 0, 10); pos e (6, 0, 10);
114114+ pos e (6, 0, 10);
115115+ let e = Uutf.decoder ~nln:(`Readline u_nl) ~encoding (`String s) in
116116+ pos e (1, 0, 0); pos e (1, 1, 1); pos e (1, 2, 2); pos e (2, 0, 3);
117117+ pos e (2, 1, 4); pos e (3, 0, 5); pos e (3, 1, 6); pos e (3, 2, 7);
118118+ pos e (4, 0, 8); pos e (5, 0, 9); pos e (6, 0, 10); pos e (6, 0, 10);
119119+ pos e (6, 0, 10);
120120+ in
121121+ test `UTF_8 "LL\nL\r\nLL\r\n\n\x0C";
122122+ test `UTF_16BE
123123+ "\x00\x4C\x00\x4C\x00\x0A\x00\x4C\x00\x0D\x00\x0A\x00\x4C\x00\x4C\
124124+ \x00\x0D\x00\x0A\x00\x0A\x00\x0C";
125125+ test `UTF_16LE
126126+ "\x4C\x00\x4C\x00\x0A\x00\x4C\x00\x0D\x00\x0A\x00\x4C\x00\x4C\x00\
127127+ \x0D\x00\x0A\x00\x0A\x00\x0C\x00";
128128+ ()
129129+130130+let guess_test () =
131131+ log "Test encoding guessing.\n%!";
132132+ let test (s, enc, removed_bom, seq) =
133133+ let d = Uutf.decoder (`String s) in
134134+ let rec test_seq seq d = match seq, Uutf.decode d with
135135+ | `Uchar u :: vs, `Uchar u' when Uchar.equal u u' -> test_seq vs d
136136+ | `Malformed bs :: vs, `Malformed bs' when bs = bs' -> test_seq vs d
137137+ | [], `End -> ()
138138+ | v :: _, v' -> fail_decode v v'
139139+ | _ , _ -> assert false
140140+ in
141141+ test_seq seq d;
142142+ let guess = Uutf.decoder_encoding d in
143143+ if guess <> enc then fail "expected encoding: %s guessed: %s"
144144+ (Uutf.encoding_to_string enc) (Uutf.encoding_to_string guess);
145145+ let rem_bom = Uutf.decoder_removed_bom d in
146146+ if rem_bom <> removed_bom then
147147+ fail "expected removed bom: %b found: %b" removed_bom rem_bom
148148+ in
149149+ let uchar u = `Uchar (Uchar.unsafe_of_int u) in
150150+ (* UTF-8 guess *)
151151+ test ("", `UTF_8, false, []);
152152+ test ("\xEF", `UTF_8, false, [`Malformed "\xEF";]);
153153+ test ("\xEF\xBB", `UTF_8, false, [`Malformed "\xEF\xBB";]);
154154+ test ("\xEF\xBB\x00", `UTF_8, false, [`Malformed "\xEF\xBB\x00";]);
155155+ test ("\xEF\xBB\xBF\xEF\xBB\xBF", `UTF_8, true, [`Uchar Uutf.u_bom;]);
156156+ test ("\n\r\n", `UTF_8, false, [`Uchar u_nl; uchar 0x0D; `Uchar u_nl;]);
157157+ test ("\n\x80\xEF\xBB\xBF\n", `UTF_8, false,
158158+ [`Uchar u_nl; `Malformed "\x80"; `Uchar Uutf.u_bom; `Uchar u_nl]);
159159+ test ("\n\n\xEF\xBB\x00\n", `UTF_8, false,
160160+ [`Uchar u_nl; `Uchar u_nl; `Malformed "\xEF\xBB\x00"; `Uchar u_nl;]);
161161+ test ("\n\xC8\x99", `UTF_8, false, [`Uchar u_nl; uchar 0x0219;]);
162162+ test ("\xC8\x99\n", `UTF_8, false, [uchar 0x0219; `Uchar u_nl;]);
163163+ test ("\xC8\x99\n\n", `UTF_8, false,
164164+ [uchar 0x0219; `Uchar u_nl; `Uchar u_nl]);
165165+ test ("\xC8\x99\xC8\x99", `UTF_8, false, [uchar 0x0219; uchar 0x0219]);
166166+ test ("\xC8\x99\xF0\x9F\x90\xAB", `UTF_8, false,
167167+ [uchar 0x0219; uchar 0x1F42B]);
168168+ test ("\xF0\x9F\x90\xAB\n", `UTF_8, false, [uchar 0x1F42B; `Uchar u_nl ]);
169169+ (* UTF-16BE guess *)
170170+ test ("\xFE\xFF\xDB\xFF\xDF\xFF\x00\x0A", `UTF_16BE, true,
171171+ [uchar 0x10FFFF; `Uchar u_nl;]);
172172+ test ("\xFE\xFF\xDB\xFF\x00\x0A\x00\x0A", `UTF_16BE, true,
173173+ [`Malformed "\xDB\xFF\x00\x0A"; `Uchar u_nl;]);
174174+ test ("\xFE\xFF\xDB\xFF\xDF", `UTF_16BE, true,
175175+ [`Malformed "\xDB\xFF\xDF";]);
176176+ test ("\x80\x81\xDB\xFF\xDF\xFF\xFE\xFF\xDF\xFF\xDB\xFF", `UTF_16BE, false,
177177+ [uchar 0x8081; uchar 0x10FFFF; `Uchar Uutf.u_bom;
178178+ `Malformed "\xDF\xFF"; `Malformed "\xDB\xFF"]);
179179+ test ("\x80\x81\xDF\xFF\xDB\xFF\xFE", `UTF_16BE, false,
180180+ [uchar 0x8081; `Malformed "\xDF\xFF"; `Malformed "\xDB\xFF\xFE";]);
181181+ test ("\x00\x0A", `UTF_16BE, false, [`Uchar u_nl]);
182182+ test ("\x00\x0A\xDB", `UTF_16BE, false, [`Uchar u_nl; `Malformed "\xDB"]);
183183+ test ("\x00\x0A\xDB\xFF", `UTF_16BE, false,
184184+ [`Uchar u_nl; `Malformed "\xDB\xFF"]);
185185+ test ("\x00\x0A\xDB\xFF\xDF", `UTF_16BE, false,
186186+ [`Uchar u_nl; `Malformed "\xDB\xFF\xDF"]);
187187+ test ("\x00\x0A\xDB\xFF\xDF\xFF", `UTF_16BE, false,
188188+ [`Uchar u_nl; uchar 0x10FFFF]);
189189+ test ("\x00\x0A\x00\x0A", `UTF_16BE, false,
190190+ [`Uchar u_nl; `Uchar u_nl]);
191191+ (* UTF-16LE guess *)
192192+ test ("\xFF\xFE\xFF\xDB\xFF\xDF\x0A\x00", `UTF_16LE, true,
193193+ [uchar 0x10FFFF; `Uchar u_nl;]);
194194+ test ("\xFF\xFE\xFF\xDB\x0A\x00\x0A\x00", `UTF_16LE, true,
195195+ [`Malformed "\xFF\xDB\x0A\x00"; `Uchar u_nl;]);
196196+ test ("\xFF\xFE\xFF\xDB\xDF", `UTF_16LE, true,
197197+ [`Malformed "\xFF\xDB\xDF";]);
198198+ test ("\x0A\x00", `UTF_16LE, false, [`Uchar u_nl]);
199199+ test ("\x0A\x00\xDB", `UTF_16LE, false, [`Uchar u_nl; `Malformed "\xDB"]);
200200+ test ("\x0A\x00\xFF\xDB", `UTF_16LE, false,
201201+ [`Uchar u_nl; `Malformed "\xFF\xDB"]);
202202+ test ("\x0A\x00\xFF\xDB\xDF", `UTF_16LE, false,
203203+ [`Uchar u_nl; `Malformed "\xFF\xDB\xDF"]);
204204+ test ("\x0A\x00\xFF\xDB\xFF\xDF", `UTF_16LE, false,
205205+ [`Uchar u_nl; uchar 0x10FFFF]);
206206+ test ("\x0A\x00\x0A\x00", `UTF_16LE, false,
207207+ [`Uchar u_nl; `Uchar u_nl]);
208208+ ()
209209+210210+let test_sub () =
211211+ log "Test Uutf.String.fold_utf_8 substring.\n";
212212+ let trip fold ~pos ~len s =
213213+ let b = Buffer.create 100 in
214214+ let add _ _ = function
215215+ | `Uchar u -> Uutf.Buffer.add_utf_8 b u
216216+ | `Malformed _ -> assert false
217217+ in
218218+ fold ?pos:(Some pos) ?len:(Some len) add () s;
219219+ assert (String.sub s pos len = Buffer.contents b);
220220+ in
221221+ trip Uutf.String.fold_utf_8 ~pos:4 ~len:4 "hop hap mop";
222222+ trip Uutf.String.fold_utf_8 ~pos:0 ~len:1 "hop hap mop";
223223+ trip Uutf.String.fold_utf_8 ~pos:2 ~len:1 "hop";
224224+ ()
225225+226226+module Int = struct type t = int let compare : int -> int -> int = compare end
227227+module Umap = Map.Make (Uchar)
228228+module Bmap = Map.Make (Bytes)
229229+230230+(* Constructs from the specification, the map from uchars to their valid
231231+ UTF-8 byte sequence and the reverse map from valid UTF-8 byte sequences
232232+ to their uchar. *)
233233+let utf8_maps () =
234234+ log "Building UTF-8 codec maps from specification.\n";
235235+ let spec = [ (* UTF-8 byte sequences cf. table 3.7 p. 94 Unicode 6. *)
236236+ (0x0000,0x007F), [|(0x00,0x7F)|];
237237+ (0x0080,0x07FF), [|(0xC2,0xDF); (0x80,0xBF)|];
238238+ (0x0800,0x0FFF), [|(0xE0,0xE0); (0xA0,0xBF); (0x80,0xBF)|];
239239+ (0x1000,0xCFFF), [|(0xE1,0xEC); (0x80,0xBF); (0x80,0xBF)|];
240240+ (0xD000,0xD7FF), [|(0xED,0xED); (0x80,0x9F); (0x80,0xBF)|];
241241+ (0xE000,0xFFFF), [|(0xEE,0xEF); (0x80,0xBF); (0x80,0xBF)|];
242242+ (0x10000,0x3FFFF), [|(0xF0,0xF0); (0x90,0xBF); (0x80,0xBF); (0x80,0xBF)|];
243243+ (0x40000,0xFFFFF), [|(0xF1,0xF3); (0x80,0xBF); (0x80,0xBF); (0x80,0xBF)|];
244244+ (0x100000,0x10FFFF), [|(0xF4,0xF4); (0x80,0x8F); (0x80,0xBF); (0x80,0xBF)|]]
245245+ in
246246+ let add_range (umap, bmap) ((umin, umax), bytes) =
247247+ let len = Array.length bytes in
248248+ let bmin i = if i < len then fst bytes.(i) else max_int in
249249+ let bmax i = if i < len then snd bytes.(i) else min_int in
250250+ let umap = ref umap in
251251+ let bmap = ref bmap in
252252+ let uchar = ref umin in
253253+ let buf = Bytes.create len in
254254+ let add len' =
255255+ if len <> len' then () else
256256+ begin
257257+ let bytes = Bytes.copy buf in
258258+ let u = Uchar.of_int !uchar in
259259+ umap := Umap.add u bytes !umap;
260260+ bmap := Bmap.add bytes u !bmap;
261261+ incr uchar;
262262+ end
263263+ in
264264+ for b0 = bmin 0 to bmax 0 do
265265+ Bytes.unsafe_set buf 0 (Char.chr b0);
266266+ for b1 = bmin 1 to bmax 1 do
267267+ Bytes.unsafe_set buf 1 (Char.chr b1);
268268+ for b2 = bmin 2 to bmax 2 do
269269+ Bytes.unsafe_set buf 2 (Char.chr b2);
270270+ for b3 = bmin 3 to bmax 3 do
271271+ Bytes.unsafe_set buf 3 (Char.chr b3);
272272+ add 4;
273273+ done;
274274+ add 3;
275275+ done;
276276+ add 2;
277277+ done;
278278+ add 1;
279279+ done;
280280+ assert (!uchar - 1 = umax);
281281+ (!umap, !bmap)
282282+ in
283283+ List.fold_left add_range (Umap.empty, Bmap.empty) spec
284284+285285+let utf8_encode_test umap =
286286+ log "Testing UTF-8 encoding of every unicode scalar value against spec.\n";
287287+ let buf = Buffer.create 4 in
288288+ let test u =
289289+ let u = Uchar.unsafe_of_int u in
290290+ let bytes = try Umap.find u umap with Not_found -> assert false in
291291+ let bytes = Bytes.unsafe_to_string bytes in
292292+ Buffer.clear buf; Uutf.Buffer.add_utf_8 buf u;
293293+ if bytes = Buffer.contents buf then () else
294294+ fail "UTF-8 encoding error (U+%04X)" (Uchar.to_int u)
295295+ in
296296+ for i = 0x0000 to 0xD7FF do test i done;
297297+ for i = 0xE000 to 0x10FFFF do test i done
298298+299299+let utf8_decode_test bmap =
300300+ log "Testing the UTF-8 decoding of all <= 4 bytes sequences (be patient).\n";
301301+ let spec seq = try `Uchar (Bmap.find seq bmap) with
302302+ | Not_found -> `Malformed (Bytes.unsafe_to_string seq)
303303+ in
304304+ let test seq =
305305+ let sseq = Bytes.unsafe_to_string seq in
306306+ let dec = List.rev (Uutf.String.fold_utf_8 (fun a _ c -> c :: a) [] sseq) in
307307+ match spec seq, dec with
308308+ | `Uchar u, [ `Uchar u' ] when u = u' -> `Decoded
309309+ | `Malformed _, (`Malformed _) :: _ -> `Malformed
310310+ | v, v' :: _ -> fail_decode v v'
311311+ | _ -> fail "This should not have happened on specification '%S'." sseq
312312+ in
313313+ let s1 = Bytes.create 1
314314+ and s2 = Bytes.create 2
315315+ and s3 = Bytes.create 3
316316+ and s4 = Bytes.create 4
317317+ in
318318+ for b0 = 0x00 to 0xFF do
319319+ Bytes.unsafe_set s1 0 (Char.unsafe_chr b0);
320320+ if test s1 = `Decoded then ()
321321+ else begin
322322+ Bytes.unsafe_set s2 0 (Char.unsafe_chr b0);
323323+ for b1 = 0x00 to 0xFF do
324324+ Bytes.unsafe_set s2 1 (Char.unsafe_chr b1);
325325+ if test s2 = `Decoded then ()
326326+ else begin
327327+ Bytes.unsafe_set s3 0 (Char.unsafe_chr b0);
328328+ Bytes.unsafe_set s3 1 (Char.unsafe_chr b1);
329329+ for b2 = 0x00 to 0xFF do
330330+ Bytes.unsafe_set s3 2 (Char.unsafe_chr b2);
331331+ if test s3 = `Decoded then ()
332332+ else begin
333333+ Bytes.unsafe_set s4 0 (Char.unsafe_chr b0);
334334+ Bytes.unsafe_set s4 1 (Char.unsafe_chr b1);
335335+ Bytes.unsafe_set s4 2 (Char.unsafe_chr b2);
336336+ for b3 = 0x00 to 0xFF do
337337+ Bytes.unsafe_set s4 3 (Char.unsafe_chr b3);
338338+ ignore (test s4)
339339+ done;
340340+ end
341341+ done;
342342+ end
343343+ done;
344344+ end
345345+ done
346346+347347+let utf8_test () = (* Proof by exhaustiveness... *)
348348+ let umap, bmap = utf8_maps () in
349349+ utf8_encode_test umap;
350350+(* utf8_decode_test bmap; *) (* too long, commented. *)
351351+ ()
352352+353353+let is_uchar_test () =
354354+ log "Testing Uchar.is_valid.\n";
355355+ let test cp expected =
356356+ let is = Uchar.is_valid cp in
357357+ if is <> expected then
358358+ fail "Uchar.is_valid %04X = %b, expected %b" cp is expected
359359+ in
360360+ for cp = 0x0000 to 0xD7FF do test cp true done;
361361+ for cp = 0xD800 to 0xDFFF do test cp false done;
362362+ for cp = 0xE000 to 0x10FFFF do test cp true done;
363363+ for cp = 0x110000 to 0x120000 do test cp false done
364364+365365+let test () =
366366+ Printexc.record_backtrace true;
367367+ codec_test ();
368368+ buffer_string_codec_test ();
369369+ pos_test ();
370370+ guess_test ();
371371+ test_sub ();
372372+ utf8_test ();
373373+ is_uchar_test ();
374374+ log "All tests succeeded.\n"
375375+376376+let () = if not (!Sys.interactive) then test ()
+391
vendor/opam/uutf/test/utftrip.ml
···11+(*---------------------------------------------------------------------------
22+ Copyright (c) 2012 The uutf programmers. All rights reserved.
33+ SPDX-License-Identifier: ISC
44+ ---------------------------------------------------------------------------*)
55+66+let str = Printf.sprintf
77+let pp = Format.fprintf
88+let pp_pos ppf d = pp ppf "%d.%d:(%d,%06X) "
99+ (Uutf.decoder_line d) (Uutf.decoder_col d) (Uutf.decoder_count d)
1010+ (Uutf.decoder_byte_count d)
1111+1212+let pp_decode inf d ppf v =
1313+ pp ppf "@[<h>%s:%a%a@]@\n" inf pp_pos d Uutf.pp_decode v
1414+1515+let exec = Filename.basename Sys.executable_name
1616+let log f = Format.eprintf ("%s: " ^^ f ^^ "@?") exec
1717+1818+let input_malformed = ref false
1919+let log_malformed inf d v =
2020+ input_malformed := true; log "%a" (pp_decode inf d) v
2121+2222+(* IO tools *)
2323+2424+let io_buffer_size = 65536 (* IO_BUFFER_SIZE 4.0.0 *)
2525+let unix_buffer_size = 65536 (* UNIX_BUFFER_SIZE 4.0.0 *)
2626+2727+let rec unix_read fd s j l = try Unix.read fd s j l with
2828+| Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
2929+3030+let rec unix_write fd s j l =
3131+ let rec write fd s j l = try Unix.single_write fd s j l with
3232+ | Unix.Unix_error (Unix.EINTR, _, _) -> write fd s j l
3333+ in
3434+ let wc = write fd s j l in
3535+ if wc < l then unix_write fd s (j + wc) (l - wc) else ()
3636+3737+let string_of_channel use_unix ic =
3838+ let b = Buffer.create unix_buffer_size in
3939+ let input, s =
4040+ if use_unix
4141+ then unix_read (Unix.descr_of_in_channel ic), Bytes.create unix_buffer_size
4242+ else input ic, Bytes.create io_buffer_size
4343+ in
4444+ let rec loop b input s =
4545+ let rc = input s 0 (Bytes.length s) in
4646+ if rc = 0 then Buffer.contents b else
4747+ (Buffer.add_substring b (Bytes.unsafe_to_string s) 0 rc; loop b input s)
4848+ in
4949+ loop b input s
5050+5151+let string_to_channel use_unix oc s =
5252+ if not use_unix then output_string oc s else
5353+ let s = Bytes.unsafe_of_string s in
5454+ unix_write (Unix.descr_of_out_channel oc) s 0 (Bytes.length s)
5555+5656+let dst_for sout = if sout then `Buffer (Buffer.create 512) else `Channel stdout
5757+let src_for inf sin use_unix =
5858+ try
5959+ let ic = if inf = "-" then stdin else open_in inf in
6060+ if sin then `String (string_of_channel use_unix ic) else `Channel ic
6161+ with Sys_error e -> log "%s\n" e; exit 1
6262+6363+let close_src src =
6464+ try match src with `Channel ic when ic <> stdin -> close_in ic | _ -> () with
6565+ | Sys_error e -> log "%s\n" e; exit 1
6666+6767+let src_for_unix inf =
6868+ try if inf = "-" then Unix.stdin else Unix.(openfile inf [O_RDONLY] 0) with
6969+ | Unix.Unix_error (e, _, v) -> log "%s: %s\n" (Unix.error_message e) v; exit 1
7070+7171+let close_src_unix fd = try if fd <> Unix.stdin then Unix.close fd with
7272+| Unix.Unix_error (e, _, v) -> log "%s: %s\n" (Unix.error_message e) v; exit 1
7373+7474+let rec encode_unix fd s e v = match Uutf.encode e v with `Ok -> ()
7575+| `Partial ->
7676+ unix_write fd s 0 (Bytes.length s - Uutf.Manual.dst_rem e);
7777+ Uutf.Manual.dst e s 0 (Bytes.length s);
7878+ encode_unix fd s e `Await
7979+8080+(* Dump *)
8181+8282+let dump_decode inf d v =
8383+ (match v with `Malformed _ -> input_malformed := true | _ -> ());
8484+ (pp_decode inf d) Format.std_formatter v
8585+8686+let dump_ inf encoding nln src =
8787+ let rec loop inf d = match Uutf.decode d with `Await -> assert false
8888+ | v ->
8989+ dump_decode inf d v;
9090+ if v <> `End then loop inf d
9191+ in
9292+ loop inf (Uutf.decoder ?nln ?encoding src)
9393+9494+let dump_unix inf encoding nln usize fd =
9595+ let rec loop fd s d = match Uutf.decode d with
9696+ | `Await ->
9797+ let rc = unix_read fd s 0 (Bytes.length s) in
9898+ Uutf.Manual.src d s 0 rc; loop fd s d
9999+ | v -> dump_decode inf d v; if v <> `End then loop fd s d
100100+ in
101101+ loop fd (Bytes.create usize) (Uutf.decoder ?nln ?encoding `Manual)
102102+103103+let dump inf sin use_unix usize ie nln =
104104+ if sin || not use_unix then dump_ inf ie nln (src_for inf sin use_unix) else
105105+ dump_unix inf ie nln usize (src_for_unix inf)
106106+107107+(* Guess only *)
108108+109109+let guess inf =
110110+ let d = Uutf.decoder (src_for inf false false) in
111111+ ignore (Uutf.decode d);
112112+ Format.printf "%s@." (Uutf.encoding_to_string (Uutf.decoder_encoding d))
113113+114114+(* Decode only *)
115115+116116+let decode_ inf encoding nln src =
117117+ let malformed = log_malformed inf in
118118+ let rec loop d = match Uutf.decode d with `Await -> assert false
119119+ | `Uchar _ -> loop d
120120+ | `End -> ()
121121+ | `Malformed _ as v -> malformed d v; loop d
122122+ in
123123+ loop (Uutf.decoder ?nln ?encoding src); close_src src
124124+125125+let decode_unix inf encoding nln usize fd =
126126+ let malformed = log_malformed inf in
127127+ let rec loop fd s d = match Uutf.decode d with
128128+ | `Uchar _ -> loop fd s d
129129+ | `End -> ()
130130+ | `Malformed _ as v -> malformed d v; loop fd s d
131131+ | `Await ->
132132+ let rc = unix_read fd s 0 (Bytes.length s) in
133133+ Uutf.Manual.src d s 0 rc; loop fd s d
134134+ in
135135+ loop fd (Bytes.create usize) (Uutf.decoder ?nln ?encoding `Manual);
136136+ close_src_unix fd
137137+138138+let decode inf sin use_unix usize ie nln =
139139+ if sin || not use_unix then decode_ inf ie nln (src_for inf sin use_unix) else
140140+ decode_unix inf ie nln usize (src_for_unix inf)
141141+142142+(* Random encode only *)
143143+144144+let u_surrogate_count = 0xDFFF - 0xD800 + 1
145145+let uchar_count = (0x10FFFF + 1) - u_surrogate_count
146146+let r_uchar () =
147147+ let n = Random.int uchar_count in
148148+ Uchar.of_int (if n > 0xD7FF then n + u_surrogate_count else n)
149149+150150+let r_text encoding encode_f rcount =
151151+ encode_f (`Uchar Uutf.u_bom);
152152+ for i = 1 to rcount do encode_f (`Uchar (r_uchar ())) done;
153153+ encode_f `End
154154+155155+let encode_f encoding dst =
156156+ let e = Uutf.encoder encoding dst in
157157+ fun v -> match Uutf.encode e v with `Ok -> () | `Partial -> assert false
158158+159159+let encode_f_unix usize encoding fd =
160160+ let e, s = Uutf.encoder encoding `Manual, Bytes.create usize in
161161+ Uutf.Manual.dst e s 0 (Bytes.length s);
162162+ encode_unix fd s e
163163+164164+let r_encode sout use_unix usize rseed rcount oe =
165165+ let rseed = match rseed with
166166+ | None -> Random.self_init (); Random.int (1 lsl 30 - 1)
167167+ | Some rseed -> rseed
168168+ in
169169+ let dst = dst_for sout in
170170+ let oe = match oe with None -> `UTF_8 | Some enc -> enc in
171171+ let encode_f =
172172+ if sout || not use_unix then encode_f oe dst else
173173+ encode_f_unix usize oe Unix.stdout
174174+ in
175175+ log "Encoding %d random characters with seed %d\n" rcount rseed;
176176+ Random.init rseed; r_text oe encode_f rcount;
177177+ match dst with `Channel _ -> ()
178178+ | `Buffer b -> string_to_channel use_unix stdout (Buffer.contents b)
179179+180180+(* Trip *)
181181+182182+let trip_ inf nln ie oe src dst =
183183+ let malformed d v e =
184184+ log_malformed inf d v; ignore (Uutf.encode e (`Uchar Uutf.u_rep))
185185+ in
186186+ let rec loop d e = function `Await -> assert false
187187+ | `Uchar _ as v -> ignore (Uutf.encode e v); loop d e (Uutf.decode d)
188188+ | `End -> ignore (Uutf.encode e `End)
189189+ | `Malformed _ as v -> malformed d v e; loop d e (Uutf.decode d)
190190+ in
191191+ let d = Uutf.decoder ?nln ?encoding:ie src in
192192+ let e, first = match oe with
193193+ | Some enc -> Uutf.encoder enc dst, (Uutf.decode d)
194194+ | None ->
195195+ let v = Uutf.decode d in (* get the encoding. *)
196196+ let enc = match Uutf.decoder_encoding d with
197197+ | #Uutf.encoding as enc -> enc | `ISO_8859_1 | `US_ASCII -> `UTF_8
198198+ in
199199+ Uutf.encoder enc dst, v
200200+ in
201201+ if (Uutf.encoder_encoding e = `UTF_16 || Uutf.decoder_removed_bom d)
202202+ then ignore (Uutf.encode e (`Uchar Uutf.u_bom));
203203+ loop d e first; close_src src
204204+205205+let trip_unix inf usize nln ie oe fdi fdo =
206206+ let malformed d v e =
207207+ log_malformed inf d v; ignore (Uutf.encode e (`Uchar Uutf.u_rep))
208208+ in
209209+ let rec loop fdi fdo ds es d e = function
210210+ | `Uchar _ as v ->
211211+ encode_unix fdo es e v; loop fdi fdo ds es d e (Uutf.decode d)
212212+ | `End -> encode_unix fdo es e `End
213213+ | `Malformed _ as v -> malformed d v e; loop fdi fdo ds es d e (Uutf.decode d)
214214+ | `Await ->
215215+ let rc = unix_read fdi ds 0 (Bytes.length ds) in
216216+ Uutf.Manual.src d ds 0 rc; loop fdi fdo ds es d e (Uutf.decode d)
217217+ in
218218+ let d, ds = Uutf.decoder ?nln ?encoding:ie `Manual, Bytes.create usize in
219219+ let e, es, first = match oe with
220220+ | Some enc -> Uutf.encoder enc `Manual, Bytes.create usize, (Uutf.decode d)
221221+ | None ->
222222+ let rec decode_past_await d = match Uutf.decode d with
223223+ | `Await ->
224224+ let rc = unix_read fdi ds 0 (Bytes.length ds) in
225225+ Uutf.Manual.src d ds 0 rc; decode_past_await d
226226+ | v -> v
227227+ in
228228+ let v = decode_past_await d in (* get encoding. *)
229229+ let enc = match Uutf.decoder_encoding d with
230230+ | #Uutf.encoding as enc -> enc | `ISO_8859_1 | `US_ASCII -> `UTF_8
231231+ in
232232+ Uutf.encoder enc `Manual, Bytes.create usize, v
233233+ in
234234+ Uutf.Manual.dst e es 0 (Bytes.length es);
235235+ if (Uutf.encoder_encoding e = `UTF_16 || Uutf.decoder_removed_bom d)
236236+ then encode_unix fdo es e (`Uchar Uutf.u_bom);
237237+ loop fdi fdo ds es d e first; close_src_unix fdi
238238+239239+let trip inf sin sout use_unix usize ie oe nln =
240240+ let src = src_for inf sin use_unix in
241241+ let dst = dst_for sout in
242242+ if sin || sout || not use_unix then trip_ inf nln ie oe src dst else
243243+ trip_unix inf usize nln ie oe (src_for_unix inf) Unix.stdout;
244244+ match dst with `Channel _ -> ()
245245+ | `Buffer b -> string_to_channel use_unix stdout (Buffer.contents b)
246246+247247+(* Cmd *)
248248+249249+let do_cmd cmd inf sin sout use_unix usize ie oe nln rseed rcount =
250250+ match cmd with
251251+ | `Ascii -> dump inf sin use_unix usize ie nln
252252+ | `Guess -> guess inf
253253+ | `Decode -> decode inf sin use_unix usize ie nln
254254+ | `Encode -> r_encode sout use_unix usize rseed rcount oe
255255+ | `Trip -> trip inf sin sout use_unix usize ie oe nln
256256+257257+(* Cmdline interface *)
258258+259259+open Cmdliner
260260+261261+let enc_enum =
262262+ [ "UTF-8", `UTF_8; "UTF-16", `UTF_16; "UTF-16LE", `UTF_16LE;
263263+ "UTF-16BE", `UTF_16BE; ]
264264+265265+let decode_enc_enum =
266266+ ("ASCII", `US_ASCII) :: ("latin1", `ISO_8859_1) :: enc_enum
267267+268268+let ienc =
269269+ let doc = str "Decoded (input) encoding, must %s. If unspecified the
270270+ encoding is guessed."
271271+ (Arg.doc_alts_enum decode_enc_enum)
272272+ in
273273+ Arg.(value & opt (some (enum decode_enc_enum)) None &
274274+ info ["d"; "input-encoding"] ~doc)
275275+276276+let oenc =
277277+ let doc = str "Encoded (output) encoding, must %s. If unspecified the output
278278+ encoding is the same as the input encoding except for ASCII
279279+ and latin1 where UTF-8 is output." (Arg.doc_alts_enum enc_enum)
280280+ in
281281+ Arg.(value & opt (some (enum enc_enum)) None &
282282+ info ["e"; "output-encoding"] ~doc)
283283+284284+let nln =
285285+ let lf = Uchar.of_int 0x000A in
286286+ let nln_enum = ["ascii", `ASCII lf; "nlf", `NLF lf; "readline", `Readline lf]
287287+ in
288288+ let doc = str "New line normalization to U+000A, must %s. ascii
289289+ normalizes CR (U+000D) and CRLF (<U+000D, U+000A>). nlf
290290+ normalizes like ascii plus NEL (U+0085). readline
291291+ normalizes like nlf plus FF (U+000C), LS (U+2028) and
292292+ PS (U+2029)."
293293+ (Arg.doc_alts_enum nln_enum)
294294+ in
295295+ let vopt = Some (`Readline lf) in
296296+ Arg.(value & opt ~vopt (some (enum nln_enum)) None & info ["nln"] ~doc)
297297+298298+let sin =
299299+ let doc = "Input everything in a string and decode the string." in
300300+ Arg.(value & flag & info [ "input-string" ] ~doc)
301301+302302+let sout =
303303+ let doc = "Encode everything in a string and output the string." in
304304+ Arg.(value & flag & info [ "output-string" ] ~doc)
305305+306306+let use_unix =
307307+ let doc = "Use Unix IO." in
308308+ Arg.(value & flag & info [ "use-unix" ] ~doc)
309309+310310+let usize =
311311+ let doc = "Unix IO buffer sizes in bytes." in
312312+ Arg.(value & opt int unix_buffer_size & info ["unix-size"] ~doc)
313313+314314+let nat =
315315+ let parse s =
316316+ try
317317+ let v = int_of_string s in
318318+ if v > 0 then Ok v else failwith (str "%s must be > 0" s)
319319+ with Failure e -> Error e
320320+ in
321321+ Arg.conv' ~docv:"NAT" (parse, Format.pp_print_int)
322322+323323+let rseed =
324324+ let doc = "Random seed." in
325325+ Arg.(value & opt (some nat) None & info ["rseed"] ~doc)
326326+327327+let rcount =
328328+ let doc = "Number of random characters to generate." in
329329+ Arg.(value & opt nat 1_000_000 & info ["rcount"] ~doc)
330330+331331+let file =
332332+ let doc = "The input file. Reads from stdin if unspecified." in
333333+ Arg.(value & pos 0 string "-" & info [] ~doc ~docv:"FILE")
334334+335335+let cmd =
336336+ let doc = "Output the input text as Unicode scalar values or malformed
337337+ sequences, one per line, in the US-ASCII charset with their
338338+ position (see POSITION INFORMATION for more details)."
339339+ in
340340+ let ascii = `Ascii, Arg.info ["a"; "ascii"] ~doc in
341341+ let doc = "Only guess a UTF encoding. The result of a guess can only be
342342+ UTF-8 or UTF-16{LE,BE}."
343343+ in
344344+ let guess = `Guess, Arg.info ["g"; "guess"] ~doc in
345345+ let doc = "Decode only, no encoding." in
346346+ let dec = `Decode, Arg.info ["decode"] ~doc in
347347+ let doc = "Encode only (random), no decoding. See option $(b,--rcount)." in
348348+ let enc = `Encode, Arg.info ["encode"] ~doc in
349349+ Arg.(value & vflag `Trip [ascii; guess; dec; enc])
350350+351351+let cmd =
352352+ let doc = "Recode UTF-{8,16,16LE,16BE} and latin1 from stdin to stdout." in
353353+ let man = [
354354+ `S "DESCRIPTION";
355355+ `P "$(tname) inputs Unicode text from stdin and rewrites it
356356+ to stdout in various ways. If no input encoding is specified,
357357+ it is guessed. If no output encoding is specified, the input
358358+ encoding is used.";
359359+ `P "Invalid byte sequences in the input are reported on stderr and
360360+ replaced by the Unicode replacement character (U+FFFD) in the output.";
361361+ `S "POSITION INFORMATION";
362362+ `P "The format for position information is:";
363363+ `P "filename:line.col:(count,byte)";
364364+ `I ("line", "one-based line number that increments with each newline.
365365+ A newline is always determined as being anything that would be
366366+ normalized by the option `$(b,--nln)=readline`.");
367367+ `I ("col", "zero-based column number that increments with each new
368368+ decoded character and zeroes after a newline
369369+ is decoded. Note that the column number may not correspond to
370370+ user-perceived columns, as any Unicode scalar value, including
371371+ combining characters, is deemed to have a width of 1.");
372372+ `I ("count", "the one-based Unicode scalar value count.");
373373+ `I ("byte", "the zero-based end byte offset of the scalar value
374374+ in the input stream in hexadecimal.");
375375+ `S "EXIT STATUS";
376376+ `P "$(tname) exits with one of the following values:";
378378+ `I ("0", "no error occurred");
379379+ `I ("1", "a command line parsing error occurred");
379379+ `I ("2", "the input text was malformed");
380380+ `S "BUGS";
381381+ `P "This program is distributed with the Uutf OCaml library.
382382+ See http://erratique.ch/software/uutf for contact
383383+ information."; ]
384384+ in
385385+ Cmd.v (Cmd.info "utftrip" ~version:"%%VERSION%%" ~doc ~man)
386386+ Term.(const do_cmd $ cmd $ file $ sin $ sout $ use_unix $ usize $
387387+ ienc $ oenc $ nln $ rseed $ rcount)
388388+389389+let () = match Cmd.eval_value cmd with
390390+| Error _ -> exit 1
391391+| _ -> if !input_malformed then exit 2 else exit 0
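The three `--nln` modes described in the option documentation above can be sketched on a plain list of code points. This is an illustrative sketch only: the `mode` type and the `normalize` function are hypothetical names that do not exist in Uutf, which applies the normalization inside its streaming decoder rather than on lists.

```ocaml
(* Hypothetical sketch, not Uutf code: newline normalization to U+000A
   over a list of code points, following the --nln documentation. *)
type mode = Ascii | Nlf | Readline

let normalize mode cps =
  let lf = 0x000A in
  (* Single code points that count as a newline in the given mode. *)
  let is_single_nl cp = match mode, cp with
    | (Nlf | Readline), 0x0085 -> true                      (* NEL *)
    | Readline, (0x000C | 0x2028 | 0x2029) -> true          (* FF, LS, PS *)
    | _ -> false
  in
  let rec loop acc = function
    | 0x000D :: 0x000A :: rest -> loop (lf :: acc) rest     (* CRLF *)
    | 0x000D :: rest -> loop (lf :: acc) rest               (* lone CR *)
    | cp :: rest when is_single_nl cp -> loop (lf :: acc) rest
    | cp :: rest -> loop (cp :: acc) rest
    | [] -> List.rev acc
  in
  loop [] cps
```

Each mode is a superset of the previous one, which is why `is_single_nl` only has to list the code points added by `Nlf` and `Readline`. CR and CRLF are handled in every mode.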
+38
vendor/opam/uutf/uutf.opam
···11+opam-version: "2.0"
22+name: "uutf"
33+synopsis: "Non-blocking streaming Unicode codec for OCaml"
44+description: """\
55+**Warning.** You are encouraged not to use this library.
66+77+- As of OCaml 4.14, both UTF encoding and decoding are available
88+ in the standard library, see the `String` and `Buffer` modules.
99+- If you are looking for a stream abstraction compatible with
1010+ effect based concurrency look into the [`bytesrw`] package."""
1111+maintainer: "Daniel Bünzli <daniel.buenzl i@erratique.ch>"
1212+authors: "The uutf programmers"
1313+license: "ISC"
1414+tags: ["unicode" "text" "utf-8" "utf-16" "codec" "org:erratique"]
1515+homepage: "https://erratique.ch/software/uutf"
1616+doc: "https://erratique.ch/software/uutf/doc/"
1717+bug-reports: "https://github.com/dbuenzli/uutf/issues"
1818+depends: [
1919+ "ocaml" {>= "4.08.0"}
2020+ "ocamlfind" {build}
2121+ "ocamlbuild" {build}
2222+ "topkg" {build & >= "1.1.0"}
2323+]
2424+depopts: ["cmdliner"]
2525+conflicts: [
2626+ "cmdliner" {< "1.3.0"}
2727+]
2828+build: [
2929+ "ocaml"
3030+ "pkg/pkg.ml"
3131+ "build"
3232+ "--dev-pkg"
3333+ "%{dev}%"
3434+ "--with-cmdliner"
3535+ "%{cmdliner:installed}%"
3636+]
3737+dev-repo: "git+https://erratique.ch/repos/uutf.git"
3838+x-maintenance-intent: ["(latest)"]
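As the package description notes, OCaml 4.14 and later can decode and encode UTF with the standard library alone. The following is a minimal sketch of that route, assuming OCaml >= 4.14. The function name `utf_8_to_utf_16be` is hypothetical and the code is not taken from Uutf or `utftrip`. It recodes a UTF-8 string to UTF-16BE and, like `utftrip`, substitutes U+FFFD for invalid byte sequences.

```ocaml
(* Hypothetical sketch, assuming OCaml >= 4.14; stdlib only. *)
let utf_8_to_utf_16be s =
  let b = Buffer.create (String.length s * 2) in
  let rec loop i =
    if i >= String.length s then Buffer.contents b else
    let d = String.get_utf_8_uchar s i in
    (* On an invalid decode, utf_decode_uchar returns Uchar.rep (U+FFFD). *)
    Buffer.add_utf_16be_uchar b (Uchar.utf_decode_uchar d);
    (* utf_decode_length is always strictly positive, so the loop advances. *)
    loop (i + Uchar.utf_decode_length d)
  in
  loop 0
```

Unlike Uutf's decoders, this version needs the whole input in memory as a string. It does no encoding guessing, newline normalization or position tracking.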