ISC License

Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>

Permission to use, copy, modify, and distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
-112
ocaml-punycode/README.md
# puny - RFC 3492 Punycode and IDNA for OCaml

An implementation of RFC 3492 (Punycode) with IDNA (Internationalized Domain Names in Applications) support for OCaml. It encodes and decodes internationalized domain names and applies Unicode NFC normalization as part of IDNA processing.

## Key Features

- **RFC 3492 Punycode**: Complete implementation of the Bootstring algorithm for encoding Unicode in ASCII-compatible form
- **IDNA Support**: ToASCII and ToUnicode operations per RFC 5891 (IDNA 2008) for internationalized domain names
- **Unicode Normalization**: Automatic NFC normalization using `uunf` for proper IDNA compliance
- **Mixed-Case Annotation**: Optional case preservation through Punycode encoding round-trips
- **Domain Integration**: Native support for the `domain-name` library
- **Error Handling**: Errors carry the byte offset and character index where they occurred, with pretty-printers for reporting

## Usage

### Basic Punycode Encoding/Decoding

```ocaml
(* Encode a UTF-8 string to Punycode *)
let encoded = Punycode.encode_utf8 "münchen"
(* = Ok "mnchen-3ya" *)

(* Decode Punycode back to UTF-8 *)
let decoded = Punycode.decode_utf8 "mnchen-3ya"
(* = Ok "münchen" *)
```
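
To see where the `3ya` suffix comes from, the sketch below decodes a single delta by hand. It is a minimal, self-contained illustration of the RFC 3492 variable-length integer step and is not this library's code: `digit_value` and `decode_delta` are hypothetical names, and overflow checks are left out for brevity.

```ocaml
(* Bootstring parameters for Punycode (RFC 3492, Section 5). *)
let base = 36
let tmin = 1
let tmax = 26

(* Map a Punycode digit to its value: a-z / A-Z -> 0-25, 0-9 -> 26-35. *)
let digit_value c =
  match c with
  | 'a' .. 'z' -> Some (Char.code c - Char.code 'a')
  | 'A' .. 'Z' -> Some (Char.code c - Char.code 'A')
  | '0' .. '9' -> Some (Char.code c - Char.code '0' + 26)
  | _ -> None

(* Decode one generalized variable-length integer from [s] under [bias].
   Returns [None] on a bad digit or if [s] ends before a terminating digit.
   Hypothetical helper: no overflow detection, sketch only. *)
let decode_delta ~bias s =
  let rec go pos i w k =
    if pos >= String.length s then None
    else
      match digit_value s.[pos] with
      | None -> None
      | Some d ->
          let i = i + (d * w) in
          (* Threshold for this digit position, clamped to [tmin, tmax]. *)
          let t =
            if k <= bias then tmin
            else if k >= bias + tmax then tmax
            else k - bias
          in
          if d < t then Some i else go (pos + 1) i (w * (base - t)) (k + base)
  in
  go 0 0 1 base

let () =
  match decode_delta ~bias:72 "3ya" with
  | Some delta ->
      (* "mnchen" supplies 6 basic code points, so the output will hold 7. *)
      let n = 0x80 + (delta / 7) and i = delta mod 7 in
      Printf.printf "delta=%d code point=U+%04X insert at %d\n" delta n i
  | None -> print_endline "malformed delta"
```

Run as a script, this prints `delta=869 code point=U+00FC insert at 1`: the decoder inserts `ü` (U+00FC) at index 1 of `mnchen`, which gives `münchen`.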

### Domain Label Operations

```ocaml
(* Encode a domain label with ACE prefix *)
let label = Punycode.encode_label "münchen"
(* = Ok "xn--mnchen-3ya" *)

(* Decode an ACE-prefixed label *)
let original = Punycode.decode_label "xn--mnchen-3ya"
(* = Ok "münchen" *)
```
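
Before choosing between encoding and decoding, a caller may want to know what kind of label it holds. The following is a small standard-library sketch of that decision; `label_kind`, `classify`, and `fits_dns_label` are hypothetical names for illustration and are not part of the library's API.

```ocaml
let ace_prefix = "xn--"
let max_label_length = 63

(* Case-insensitive check for the ACE prefix. *)
let has_ace_prefix s =
  String.length s >= 4
  && String.lowercase_ascii (String.sub s 0 4) = ace_prefix

(* Hypothetical classification of a single label. *)
type label_kind =
  | Ace (* starts with "xn--": candidate for decoding *)
  | Plain_ascii (* can be used in DNS as-is *)
  | Needs_encoding (* contains bytes >= 0x80 *)

let classify label =
  if has_ace_prefix label then Ace
  else if String.for_all (fun c -> Char.code c < 0x80) label then Plain_ascii
  else Needs_encoding

(* DNS labels are limited to 63 bytes; check the encoded form, not the input. *)
let fits_dns_label encoded = String.length encoded <= max_label_length
```

Note that the length limit applies to the ASCII form that goes on the wire, so it has to be checked after encoding.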

### IDNA Domain Name Conversion

```ocaml
(* Convert internationalized domain to ASCII for DNS lookup *)
let ascii_domain = Punycode_idna.to_ascii "münchen.example.com"
(* = Ok "xn--mnchen-3ya.example.com" *)

(* Convert ASCII domain back to Unicode for display *)
let unicode_domain = Punycode_idna.to_unicode "xn--mnchen-3ya.example.com"
(* = Ok "münchen.example.com" *)
```
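
Conceptually, a domain conversion of this kind works label by label. The sketch below shows only that shape, under the assumption that labels are separated by `.`; `map_labels` and `toy_encode` are hypothetical, and a real IDNA conversion also has to handle normalization and the overall domain length limit.

```ocaml
(* Apply [f] to every label of a dot-separated domain, stopping at the first
   error. [f] stands in for a per-label encoder or decoder. *)
let map_labels (f : string -> (string, 'e) result) (domain : string) :
    (string, 'e) result =
  let rec go acc = function
    | [] -> Ok (String.concat "." (List.rev acc))
    | label :: rest -> (
        match f label with
        | Ok l -> go (l :: acc) rest
        | Error _ as e -> e)
  in
  go [] (String.split_on_char '.' domain)

(* Toy per-label function used only to exercise [map_labels]: it rejects
   empty labels and uppercases everything else. *)
let toy_encode label =
  if label = "" then Error "empty label"
  else Ok (String.uppercase_ascii label)

let () =
  match map_labels toy_encode "www.example.com" with
  | Ok d -> print_endline d
  | Error msg -> prerr_endline msg
```

With this shape, a trailing dot or a doubled dot produces an empty label, so the per-label function decides whether that is an error.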

### Working with Unicode Code Points

```ocaml
(* Encode an array of Unicode code points *)
let codepoints = [| Uchar.of_int 0x4ED6; Uchar.of_int 0x4EEC |]
let encoded = Punycode.encode codepoints
(* : (string, Punycode.error) result, holding the Punycode string *)

(* Decode to code points *)
let decoded = Punycode.decode "ihqwcrb4cv8a8dqg056pqjye"
(* : (Uchar.t array, Punycode.error) result *)
```
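
If the input starts out as a UTF-8 string, it first has to be split into code points. A minimal sketch using only the standard library (OCaml 4.14 or later) follows; the function name and the plain `int` error carrying the byte offset are choices made for this example.

```ocaml
(* Decode a UTF-8 string into code points. On the first malformed sequence,
   return [Error byte_offset]. Requires OCaml >= 4.14. *)
let utf8_to_codepoints (s : string) : (Uchar.t array, int) result =
  let rec go off acc =
    if off >= String.length s then Ok (Array.of_list (List.rev acc))
    else
      let d = String.get_utf_8_uchar s off in
      if Uchar.utf_decode_is_valid d then
        go (off + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
      else Error off
  in
  go 0 []

let () =
  (* "m\xC3\xBCnchen" is the UTF-8 byte sequence for "münchen". *)
  match utf8_to_codepoints "m\xC3\xBCnchen" with
  | Ok cps ->
      Array.iter (fun u -> Printf.printf "U+%04X " (Uchar.to_int u)) cps;
      print_newline ()
  | Error off -> Printf.printf "invalid UTF-8 at byte %d\n" off
```

The resulting array has one entry per code point, so `ü` occupies two bytes in the string but a single slot in the array.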

### Integration with domain-name Library

```ocaml
(* Convert a Domain_name.t to ASCII *)
let domain = Domain_name.of_string_exn "münchen.example.com"
let ascii = Punycode_idna.domain_to_ascii domain
(* = Ok (Domain_name for "xn--mnchen-3ya.example.com") *)

(* Convert back to Unicode; [ascii] is a result, so chain it with Result.bind
   (this assumes both functions share the same error type) *)
let unicode = Result.bind ascii Punycode_idna.domain_to_unicode
(* = Ok (original domain) *)
```

## Installation

```
opam install puny
```

## Documentation

API documentation is available at https://tangled.org/@anil.recoil.org/ocaml-punycode or via:

```
opam install puny
odig doc puny
```

## Limitations

The following IDNA 2008 features are not yet implemented:

- **Bidi rules** (RFC 5893): Bidirectional text validation for right-to-left scripts
- **Contextual joiners** (RFC 5892 Appendix A.1): Zero-width joiner/non-joiner validation

These checks are disabled by default in the API. Most common use cases (European languages, CJK) work correctly without them.

## References

- [RFC 3492](https://datatracker.ietf.org/doc/html/rfc3492) - Punycode: A Bootstring encoding of Unicode for IDNA
- [RFC 5891](https://datatracker.ietf.org/doc/html/rfc5891) - Internationalized Domain Names in Applications (IDNA): Protocol
- [RFC 5892](https://datatracker.ietf.org/doc/html/rfc5892) - Unicode Code Points and IDNA
- [RFC 5893](https://datatracker.ietf.org/doc/html/rfc5893) - Right-to-Left Scripts for IDNA
- [RFC 1035](https://datatracker.ietf.org/doc/html/rfc1035) - Domain Names Implementation and Specification

## License

ISC
(*---------------------------------------------------------------------------
  Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
  SPDX-License-Identifier: ISC
 ---------------------------------------------------------------------------*)

(* RFC 3492 Punycode Implementation *)

(* {1 Bootstring Parameters for Punycode (RFC 3492 Section 5)} *)

let base = 36
let tmin = 1
let tmax = 26
let skew = 38
let damp = 700
let initial_bias = 72
let initial_n = 0x80 (* 128 *)
let delimiter = '-'
let ace_prefix = "xn--"
let max_label_length = 63

(* {1 Position Tracking} *)

type position = { byte_offset : int; char_index : int }

let position_byte_offset pos = pos.byte_offset
let position_char_index pos = pos.char_index

let pp_position fmt pos =
  Format.fprintf fmt "byte %d, char %d" pos.byte_offset pos.char_index

(* {1 Error Types} *)

type error =
  | Overflow of position
  | Invalid_character of position * Uchar.t
  | Invalid_digit of position * char
  | Unexpected_end of position
  | Invalid_utf8 of position
  | Label_too_long of int
  | Empty_label

let pp_error fmt = function
  | Overflow pos ->
      Format.fprintf fmt "arithmetic overflow at %a" pp_position pos
  | Invalid_character (pos, u) ->
      Format.fprintf fmt "invalid character U+%04X at %a" (Uchar.to_int u)
        pp_position pos
  | Invalid_digit (pos, c) ->
      Format.fprintf fmt "invalid Punycode digit '%c' (0x%02X) at %a" c
        (Char.code c) pp_position pos
  | Unexpected_end pos ->
      Format.fprintf fmt "unexpected end of input at %a" pp_position pos
  | Invalid_utf8 pos ->
      Format.fprintf fmt "invalid UTF-8 sequence at %a" pp_position pos
  | Label_too_long len ->
      Format.fprintf fmt "label too long: %d bytes (max %d)" len
        max_label_length
  | Empty_label -> Format.fprintf fmt "empty label"

let error_to_string err = Format.asprintf "%a" pp_error err

(* {1 Error Constructors} *)

let overflow pos = Error (Overflow pos)
let invalid_character pos u = Error (Invalid_character (pos, u))
let invalid_digit pos c = Error (Invalid_digit (pos, c))
let unexpected_end pos = Error (Unexpected_end pos)
let _invalid_utf8 pos = Error (Invalid_utf8 pos)
let label_too_long len = Error (Label_too_long len)
let empty_label = Error Empty_label

(* {1 Case Flags} *)

type case_flag = Uppercase | Lowercase

(* {1 Basic Predicates} *)

let is_basic u = Uchar.to_int u < 0x80
let is_ascii_string s = String.for_all (fun c -> Char.code c < 0x80) s

let has_ace_prefix s =
  let len = String.length s in
  len >= 4
  && (s.[0] = 'x' || s.[0] = 'X')
  && (s.[1] = 'n' || s.[1] = 'N')
  && s.[2] = '-'
  && s.[3] = '-'

(* {1 Digit Encoding/Decoding (RFC 3492 Section 5)}

   Digit values:
   - 0-25: a-z (or A-Z)
   - 26-35: 0-9
*)

let encode_digit d case_flag =
  if d < 26 then Char.chr (d + if case_flag = Uppercase then 0x41 else 0x61)
  else Char.chr (d - 26 + 0x30)

let decode_digit c =
  let code = Char.code c in
  if code >= 0x30 && code <= 0x39 then Some (code - 0x30 + 26)
    (* '0'-'9' -> 26-35 *)
  else if code >= 0x41 && code <= 0x5A then Some (code - 0x41)
    (* 'A'-'Z' -> 0-25 *)
  else if code >= 0x61 && code <= 0x7A then Some (code - 0x61)
    (* 'a'-'z' -> 0-25 *)
  else None

(* Check if a character is "flagged" (uppercase) for case annotation *)
let is_flagged c =
  let code = Char.code c in
  code >= 0x41 && code <= 0x5A (* 'A'-'Z' *)

(* {1 Bias Adaptation (RFC 3492 Section 6.1)} *)

let adapt ~delta ~numpoints ~firsttime =
  let delta = if firsttime then delta / damp else delta / 2 in
  let delta = delta + (delta / numpoints) in
  let threshold = (base - tmin) * tmax / 2 in
  let rec loop delta k =
    if delta > threshold then loop (delta / (base - tmin)) (k + base)
    else k + ((base - tmin + 1) * delta / (delta + skew))
  in
  loop delta 0

(* {1 Overflow-Safe Arithmetic}

   RFC 3492 Section 6.4: Use detection to avoid overflow.
   A + B overflows iff B > maxint - A
   A + B*C overflows iff B > (maxint - A) / C
*)

let max_int_value = max_int

let safe_mul_add a b c pos =
  if c = 0 then Ok a
  else if b > (max_int_value - a) / c then overflow pos
  else Ok (a + (b * c))

(* {1 UTF-8 to Code Points Conversion} *)

let utf8_to_codepoints s =
  let len = String.length s in
  let acc = ref [] in
  let byte_offset = ref 0 in
  let char_index = ref 0 in
  let error = ref None in
  while !byte_offset < len && !error = None do
    let pos = { byte_offset = !byte_offset; char_index = !char_index } in
    let dec = String.get_utf_8_uchar s !byte_offset in
    if Uchar.utf_decode_is_valid dec then begin
      acc := Uchar.utf_decode_uchar dec :: !acc;
      byte_offset := !byte_offset + Uchar.utf_decode_length dec;
      incr char_index
    end
    else begin
      error := Some (Invalid_utf8 pos)
    end
  done;
  match !error with
  | Some e -> Error e
  | None -> Ok (Array.of_list (List.rev !acc))

(* {1 Code Points to UTF-8 Conversion} *)

let codepoints_to_utf8 codepoints =
  let buf = Buffer.create (Array.length codepoints * 2) in
  Array.iter (Buffer.add_utf_8_uchar buf) codepoints;
  Buffer.contents buf

(* {1 Punycode Encoding (RFC 3492 Section 6.3)} *)

let encode_impl codepoints case_flags =
  let input_length = Array.length codepoints in
  if input_length = 0 then Ok ""
  else begin
    let output = Buffer.create (input_length * 2) in

    (* Copy basic code points to output *)
    let basic_count = ref 0 in
    for j = 0 to input_length - 1 do
      let cp = codepoints.(j) in
      if is_basic cp then begin
        let c = Uchar.to_int cp in
        (* Without case flags, every ASCII letter is treated as Lowercase *)
        let case =
          match case_flags with Some flags -> flags.(j) | None -> Lowercase
        in
        (* Preserve or apply case for ASCII letters *)
        let c' =
          if c >= 0x41 && c <= 0x5A then (* 'A'-'Z' *)
            if case = Lowercase then c + 0x20 else c
          else if c >= 0x61 && c <= 0x7A then (* 'a'-'z' *)
            if case = Uppercase then c - 0x20 else c
          else c
        in
        Buffer.add_char output (Char.chr c');
        incr basic_count
      end
    done;

    let b = !basic_count in
    let h = ref b in

    (* Add delimiter if there were basic code points *)
    if b > 0 then Buffer.add_char output delimiter;

    (* Main encoding loop *)
    let n = ref initial_n in
    let delta = ref 0 in
    let bias = ref initial_bias in

    let result = ref (Ok ()) in

    while !h < input_length && !result = Ok () do
      (* Find minimum code point >= n *)
      let m =
        Array.fold_left
          (fun acc cp ->
            let cp_val = Uchar.to_int cp in
            if cp_val >= !n && cp_val < acc then cp_val else acc)
          max_int_value codepoints
      in

      (* Increase delta to advance state to <m, 0> *)
      let pos = { byte_offset = 0; char_index = !h } in
      match safe_mul_add !delta (m - !n) (!h + 1) pos with
      | Error e -> result := Error e
      | Ok new_delta ->
          delta := new_delta;
          n := m;

          (* Process each code point *)
          let j = ref 0 in
          while !j < input_length && !result = Ok () do
            let cp = Uchar.to_int codepoints.(!j) in
            let pos = { byte_offset = 0; char_index = !j } in

            if cp < !n then begin
              (* OCaml ints wrap to a negative value rather than to 0, so
                 detect overflow before incrementing *)
              if !delta = max_int_value then result := overflow pos
              else incr delta
            end
            else if cp = !n then begin
              (* Encode delta as variable-length integer *)
              let q = ref !delta in
              let k = ref base in
              let done_encoding = ref false in

              while not !done_encoding do
                let t =
                  if !k <= !bias then tmin
                  else if !k >= !bias + tmax then tmax
                  else !k - !bias
                in
                if !q < t then begin
                  (* Output final digit *)
                  let case =
                    match case_flags with
                    | Some flags -> flags.(!j)
                    | None -> Lowercase
                  in
                  Buffer.add_char output (encode_digit !q case);
                  done_encoding := true
                end
                else begin
                  (* Output intermediate digit and continue *)
                  let digit = t + ((!q - t) mod (base - t)) in
                  Buffer.add_char output (encode_digit digit Lowercase);
                  q := (!q - t) / (base - t);
                  k := !k + base
                end
              done;

              bias := adapt ~delta:!delta ~numpoints:(!h + 1) ~firsttime:(!h = b);
              delta := 0;
              incr h
            end;
            incr j
          done;

          incr delta;
          incr n
    done;

    match !result with
    | Error e -> Error e
    | Ok () -> Ok (Buffer.contents output)
  end

let encode codepoints = encode_impl codepoints None

let encode_with_case codepoints case_flags =
  if Array.length codepoints <> Array.length case_flags then
    invalid_arg "encode_with_case: array lengths must match";
  encode_impl codepoints (Some case_flags)

(* {1 Punycode Decoding (RFC 3492 Section 6.2)} *)

let decode_impl input =
  let input_length = String.length input in
  if input_length = 0 then Ok ([||], [||])
  else begin
    (* Find last delimiter *)
    let b = Option.value ~default:0 (String.rindex_opt input delimiter) in

    (* Copy basic code points and extract case flags *)
    let output = ref [] in
    let case_output = ref [] in
    let error = ref None in

    for j = 0 to b - 1 do
      if !error = None then begin
        let c = input.[j] in
        let pos = { byte_offset = j; char_index = j } in
        let code = Char.code c in
        if code >= 0x80 then
          error := Some (Invalid_character (pos, Uchar.of_int code))
        else begin
          output := Uchar.of_int code :: !output;
          case_output :=
            (if is_flagged c then Uppercase else Lowercase) :: !case_output
        end
      end
    done;

    match !error with
    | Some e -> Error e
    | None -> (
        let output = ref (Array.of_list (List.rev !output)) in
        let case_output = ref (Array.of_list (List.rev !case_output)) in

        (* Main decoding loop *)
        let n = ref initial_n in
        let i = ref 0 in
        let bias = ref initial_bias in
        let in_pos = ref (if b > 0 then b + 1 else 0) in
        let result = ref (Ok ()) in

        while !in_pos < input_length && !result = Ok () do
          let oldi = !i in
          let w = ref 1 in
          let k = ref base in
          let done_decoding = ref false in

          while (not !done_decoding) && !result = Ok () do
            let pos =
              { byte_offset = !in_pos; char_index = Array.length !output }
            in

            if !in_pos >= input_length then begin
              result := unexpected_end pos;
              done_decoding := true
            end
            else begin
              let c = input.[!in_pos] in
              incr in_pos;

              match decode_digit c with
              | None ->
                  result := invalid_digit pos c;
                  done_decoding := true
              | Some digit -> (
                  (* i = i + digit * w, with overflow check *)
                  match safe_mul_add !i digit !w pos with
                  | Error e ->
                      result := Error e;
                      done_decoding := true
                  | Ok new_i ->
                      i := new_i;

                      let t =
                        if !k <= !bias then tmin
                        else if !k >= !bias + tmax then tmax
                        else !k - !bias
                      in

                      if digit < t then begin
                        (* Final digit of this delta; its case is read as the
                           case flag when the code point is inserted below *)
                        done_decoding := true
                      end
                      else begin
                        (* w = w * (base - t), with overflow check *)
                        let base_minus_t = base - t in
                        if !w > max_int_value / base_minus_t then begin
                          result := overflow pos;
                          done_decoding := true
                        end
                        else begin
                          w := !w * base_minus_t;
                          k := !k + base
                        end
                      end)
            end
          done;

          if !result = Ok () then begin
            let out_len = Array.length !output in
            bias :=
              adapt ~delta:(!i - oldi) ~numpoints:(out_len + 1)
                ~firsttime:(oldi = 0);

            let pos = { byte_offset = !in_pos - 1; char_index = out_len } in

            (* n = n + i / (out_len + 1), with overflow check *)
            let increment = !i / (out_len + 1) in
            if increment > max_int_value - !n then result := overflow pos
            else begin
              n := !n + increment;
              i := !i mod (out_len + 1);

              (* Validate that n is a valid Unicode scalar value *)
              if not (Uchar.is_valid !n) then
                result := invalid_character pos Uchar.rep
              else begin
                (* Insert n at position i *)
                let new_output = Array.make (out_len + 1) (Uchar.of_int 0) in
                let new_case = Array.make (out_len + 1) Lowercase in

                for j = 0 to !i - 1 do
                  new_output.(j) <- !output.(j);
                  new_case.(j) <- !case_output.(j)
                done;
                new_output.(!i) <- Uchar.of_int !n;
                (* Case flag from final digit of this delta *)
                new_case.(!i) <-
                  (if !in_pos > 0 && is_flagged input.[!in_pos - 1] then
                     Uppercase
                   else Lowercase);
                for j = !i to out_len - 1 do
                  new_output.(j + 1) <- !output.(j);
                  new_case.(j + 1) <- !case_output.(j)
                done;

                output := new_output;
                case_output := new_case;
                incr i
              end
            end
          end
        done;

        match !result with
        | Error e -> Error e
        | Ok () -> Ok (!output, !case_output))
  end

let decode input = Result.map fst (decode_impl input)
let decode_with_case input = decode_impl input

(* {1 UTF-8 String Operations} *)

let encode_utf8 s =
  let open Result.Syntax in
  let* codepoints = utf8_to_codepoints s in
  encode codepoints

let decode_utf8 punycode =
  let open Result.Syntax in
  let+ codepoints = decode punycode in
  codepoints_to_utf8 codepoints

(* {1 Domain Label Operations} *)

let encode_label label =
  if String.length label = 0 then empty_label
  else if is_ascii_string label then begin
    (* All ASCII - return as-is, but check length *)
    let len = String.length label in
    if len > max_label_length then label_too_long len else Ok label
  end
  else
    (* Has non-ASCII - encode with Punycode *)
    let open Result.Syntax in
    let* encoded = encode_utf8 label in
    let result = ace_prefix ^ encoded in
    let len = String.length result in
    if len > max_label_length then label_too_long len else Ok result

let decode_label label =
  if String.length label = 0 then empty_label
  else if has_ace_prefix label then begin
    (* Remove ACE prefix and decode *)
    let punycode = String.sub label 4 (String.length label - 4) in
    decode_utf8 punycode
  end
  else
    (* No ACE prefix - return as-is, whether or not the label is ASCII *)
    Ok label
-267
ocaml-punycode/lib/punycode.mli
(*---------------------------------------------------------------------------
  Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
  SPDX-License-Identifier: ISC
 ---------------------------------------------------------------------------*)

(** RFC 3492 Punycode: A Bootstring encoding of Unicode for IDNA.

    This module implements the Punycode algorithm as specified in
    {{:https://datatracker.ietf.org/doc/html/rfc3492}RFC 3492}, providing
    encoding and decoding of Unicode strings to/from ASCII-compatible encoding
    suitable for use in internationalized domain names.

    Punycode is an instance of Bootstring that uses particular parameter values
    appropriate for IDNA. See
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5}RFC 3492 Section
    5} for the specific parameter values.

    {2 References}
    - {{:https://datatracker.ietf.org/doc/html/rfc3492}RFC 3492} - Punycode: A
      Bootstring encoding of Unicode for IDNA
    - {{:https://datatracker.ietf.org/doc/html/rfc5891}RFC 5891} - IDNA Protocol
*)

(** {1 Position Tracking} *)

type position
(** Abstract type representing a position in input for error reporting.
    Positions track both byte offset and Unicode character index. *)

val position_byte_offset : position -> int
(** [position_byte_offset pos] returns the byte offset in the input. *)

val position_char_index : position -> int
(** [position_char_index pos] returns the Unicode character index (0-based). *)

val pp_position : Format.formatter -> position -> unit
(** [pp_position fmt pos] pretty-prints a position as "byte N, char M". *)

(** {1 Error Types} *)

type error =
  | Overflow of position
      (** Arithmetic overflow during encode/decode. This can occur with very
          long strings or extreme Unicode code point values. See
          {{:https://datatracker.ietf.org/doc/html/rfc3492#section-6.4} RFC 3492
          Section 6.4} for overflow handling requirements. *)
  | Invalid_character of position * Uchar.t
      (** A non-basic code point appeared where only basic code points (ASCII <
          128) are allowed. Per
          {{:https://datatracker.ietf.org/doc/html/rfc3492#section-3.1} RFC 3492
          Section 3.1}, basic code points must be segregated at the beginning
          of the encoded string. *)
  | Invalid_digit of position * char
      (** An invalid Punycode digit was encountered during decoding. Valid
          digits are a-z, A-Z (values 0-25) and 0-9 (values 26-35). See
          {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5} RFC 3492
          Section 5} for digit-value mappings. *)
  | Unexpected_end of position
      (** The input ended prematurely during decoding of a delta value. See
          {{:https://datatracker.ietf.org/doc/html/rfc3492#section-6.2} RFC 3492
          Section 6.2} decoding procedure. *)
  | Invalid_utf8 of position  (** Malformed UTF-8 sequence in input string. *)
  | Label_too_long of int
      (** Encoded label exceeds 63 bytes (DNS limit per
          {{:https://datatracker.ietf.org/doc/html/rfc1035}RFC 1035}). The int
          is the actual length. *)
  | Empty_label  (** Empty label is not valid for encoding. *)

val pp_error : Format.formatter -> error -> unit
(** [pp_error fmt e] pretty-prints an error with position information. *)

val error_to_string : error -> string
(** [error_to_string e] converts an error to a human-readable string. *)

(** {1 Constants}

    Punycode parameters as specified in
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5}RFC 3492 Section
    5}. *)

val ace_prefix : string
(** The ACE prefix ["xn--"] used for Punycode-encoded domain labels. See
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5} RFC 3492 Section
    5} which notes that IDNA prepends this prefix. *)

val max_label_length : int
(** Maximum length of a domain label in bytes (63), per
    {{:https://datatracker.ietf.org/doc/html/rfc1035}RFC 1035}. *)

(** {1 Case Flags for Mixed-Case Annotation}

    {{:https://datatracker.ietf.org/doc/html/rfc3492#appendix-A}RFC 3492
    Appendix A} describes an optional mechanism for preserving case information
    through the encoding/decoding round-trip. This is useful when the original
    string's case should be recoverable.

    Note: Mixed-case annotation is not used by the ToASCII and ToUnicode
    operations of IDNA. *)

type case_flag =
  | Uppercase
  | Lowercase  (** Case annotation for a character. *)

(** {1 Core Punycode Operations}

    These functions implement the Bootstring algorithms from
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-6}RFC 3492 Section
    6}. They operate on arrays of Unicode code points ([Uchar.t array]). The
    encoded output is a plain ASCII string without the ACE prefix. *)

val encode : Uchar.t array -> (string, error) result
(** [encode codepoints] encodes an array of Unicode code points to a Punycode
    ASCII string.

    Implements the encoding procedure from
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-6.3}RFC 3492
    Section 6.3}:

    + Basic code points (ASCII < 128) are copied to the beginning of the
      output per
      {{:https://datatracker.ietf.org/doc/html/rfc3492#section-3.1} Section 3.1
      (Basic code point segregation)}. ASCII letters are lowercased unless
      case flags are supplied through {!encode_with_case}.
    + A delimiter ('-') is appended if there are any basic code points.
    + Non-basic code points are encoded as deltas using the generalized
      variable-length integer representation from
      {{:https://datatracker.ietf.org/doc/html/rfc3492#section-3.3}Section 3.3}.

    Example:
    {[
      encode [| Uchar.of_int 0x4ED6; Uchar.of_int 0x4EEC; ... |]
      (* = Ok "ihqwcrb4cv8a8dqg056pqjye" *)
    ]} *)

val decode : string -> (Uchar.t array, error) result
(** [decode punycode] decodes a Punycode ASCII string to an array of Unicode
    code points.

    Implements the decoding procedure from
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-6.2}RFC 3492
    Section 6.2}.

    The input should be the Punycode portion only, without the ACE prefix. The
    decoder is case-insensitive for the encoded portion, as required by
    {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5}RFC 3492 Section
    5}: "A decoder MUST recognize the letters in both uppercase and lowercase
    forms".

    Example:
    {[
      decode "ihqwcrb4cv8a8dqg056pqjye"
      (* = Ok [| U+4ED6; U+4EEC; U+4E3A; ... |] (Chinese simplified) *)
    ]} *)

(** {1 Mixed-Case Annotation}

    These functions support round-trip case preservation as described in
    {{:https://datatracker.ietf.org/doc/html/rfc3492#appendix-A}RFC 3492
    Appendix A}. *)

val encode_with_case :
  Uchar.t array -> case_flag array -> (string, error) result
(** [encode_with_case codepoints case_flags] encodes with case annotation.

    Per
    {{:https://datatracker.ietf.org/doc/html/rfc3492#appendix-A}RFC 3492
    Appendix A}:
    - For basic (ASCII) letters, the output preserves the case flag directly
    - For non-ASCII characters, the case of the final digit in each delta
      encoding indicates the flag (uppercase = suggested uppercase)

    The [case_flags] array must have the same length as [codepoints].

    @raise Invalid_argument if array lengths don't match. *)

val decode_with_case : string -> (Uchar.t array * case_flag array, error) result
(** [decode_with_case punycode] decodes and extracts case annotations.

    Per
    {{:https://datatracker.ietf.org/doc/html/rfc3492#appendix-A}RFC 3492
    Appendix A}, returns both the decoded code points and an array of case
    flags indicating the suggested case for each character based on the
    uppercase/lowercase form of the encoding digits. *)

(** {1 UTF-8 String Operations}

    Convenience functions that work directly with UTF-8 encoded OCaml strings.
    These combine UTF-8 decoding/encoding with the core Punycode operations. *)

val encode_utf8 : string -> (string, error) result
(** [encode_utf8 s] encodes a UTF-8 string to Punycode (no ACE prefix).

    This is equivalent to decoding [s] from UTF-8 to code points, then calling
    {!encode}.

    Example:
    {[
      encode_utf8 "münchen"
      (* = Ok "mnchen-3ya" *)
    ]} *)

val decode_utf8 : string -> (string, error) result
(** [decode_utf8 punycode] decodes Punycode to a UTF-8 string (no ACE prefix).

    This is equivalent to calling {!decode} then encoding the result as UTF-8.

    Example:
    {[
      decode_utf8 "mnchen-3ya"
      (* = Ok "münchen" *)
    ]} *)
210210-211211-(** {1 Domain Label Operations}
212212-213213- These functions handle the ACE prefix automatically and enforce DNS label
214214- length limits per
215215- {{:https://datatracker.ietf.org/doc/html/rfc1035}RFC 1035}. *)
216216-217217-val encode_label : string -> (string, error) result
218218-(** [encode_label label] encodes a domain label for use in DNS.
219219-220220- If the label contains only ASCII characters, it is returned unchanged.
221221- Otherwise, it is Punycode-encoded with the ACE prefix ("xn--") prepended, as
222222- specified in
223223- {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5} RFC 3492 Section
224224- 5}.
225225-226226- Returns {!Error} {!Label_too_long} if the result exceeds 63 bytes.
227227-228228- Example:
229229- {[
230230- encode_label "münchen"
231231- (* = Ok "xn--mnchen-3ya" *)
232232- encode_label "example"
233233- (* = Ok "example" *)
234234- ]} *)
235235-236236-val decode_label : string -> (string, error) result
237237-(** [decode_label label] decodes a domain label.
238238-239239- If the label starts with the ACE prefix ("xn--", case-insensitive), it is
240240- Punycode-decoded. Otherwise, it is returned unchanged.
241241-242242- Example:
243243- {[
244244- decode_label "xn--mnchen-3ya"
245245- (* = Ok "münchen" *)
246246- decode_label "example"
247247- (* = Ok "example" *)
248248- ]} *)
249249-250250-(** {1 Validation}
251251-252252- Predicate functions for checking code point and string properties. *)
253253-254254-val is_basic : Uchar.t -> bool
255255-(** [is_basic u] is [true] if [u] is a basic code point (ASCII, < 128).
256256-257257- Per
258258- {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5}RFC 3492 Section
259259- 5}, basic code points for Punycode are the ASCII code points (0..7F). *)
260260-261261-val is_ascii_string : string -> bool
262262-(** [is_ascii_string s] is [true] if [s] contains only ASCII characters (all
263263- bytes < 128). *)
264264-265265-val has_ace_prefix : string -> bool
266266-(** [has_ace_prefix s] is [true] if [s] starts with the ACE prefix "xn--"
267267- (case-insensitive comparison). *)
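A minimal sketch of how the two string predicates documented above could be written using only the standard library. The function bodies and the `ace_prefix` binding are assumptions for illustration; this interface file does not show the library's actual definitions.

```ocaml
(* Illustrative stand-ins only; not copied from the library. *)
let ace_prefix = "xn--"

(* True when every byte is below 128. String.for_all needs OCaml >= 4.13. *)
let is_ascii_string s = String.for_all (fun c -> Char.code c < 128) s

(* Case-insensitive check for the "xn--" prefix. *)
let has_ace_prefix s =
  let n = String.length ace_prefix in
  String.length s >= n
  && String.lowercase_ascii (String.sub s 0 n) = ace_prefix

let () =
  assert (is_ascii_string "example");
  (* "münchen" spelled with explicit UTF-8 bytes for U+00FC *)
  assert (not (is_ascii_string "m\xc3\xbcnchen"));
  assert (has_ace_prefix "XN--mnchen-3ya");
  assert (not (has_ace_prefix "xn-"))
```

Comparing a lowercased four-byte slice keeps the check allocation-light and avoids scanning the whole label.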
-183
ocaml-punycode/lib/punycode_idna.ml
···11-(*---------------------------------------------------------------------------
22- Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
33- SPDX-License-Identifier: ISC
44- ---------------------------------------------------------------------------*)
55-66-(* IDNA (Internationalized Domain Names in Applications) Implementation *)
77-88-let max_domain_length = 253
99-1010-(* {1 Error Types} *)
1111-1212-type error =
1313- | Punycode_error of Punycode.error
1414- | Invalid_label of string
1515- | Domain_too_long of int
1616- | Normalization_failed
1717- | Verification_failed
1818-1919-let pp_error fmt = function
2020- | Punycode_error e ->
2121- Format.fprintf fmt "Punycode error: %a" Punycode.pp_error e
2222- | Invalid_label msg -> Format.fprintf fmt "invalid label: %s" msg
2323- | Domain_too_long len ->
2424- Format.fprintf fmt "domain too long: %d bytes (max %d)" len
2525- max_domain_length
2626- | Normalization_failed -> Format.fprintf fmt "Unicode normalization failed"
2727- | Verification_failed ->
2828- Format.fprintf fmt "IDNA verification failed (round-trip mismatch)"
2929-3030-let error_to_string err = Format.asprintf "%a" pp_error err
3131-3232-(* {1 Error Constructors} *)
3333-3434-let punycode_error e = Error (Punycode_error e)
3535-let invalid_label msg = Error (Invalid_label msg)
3636-let domain_too_long len = Error (Domain_too_long len)
3737-let _normalization_failed = Error Normalization_failed
3838-let verification_failed = Error Verification_failed
3939-4040-(* {1 Unicode Normalization} *)
4141-4242-let normalize_nfc s = Uunf_string.normalize_utf_8 `NFC s
4343-4444-(* {1 Validation Helpers} *)
4545-4646-let is_ace_label label = Punycode.has_ace_prefix label
4747-4848-(* Check if a label follows STD3 rules (hostname restrictions):
4949- - Only LDH (letters, digits, hyphens)
5050- - Cannot start or end with hyphen *)
5151-let is_std3_valid label =
5252- let len = String.length label in
5353- let is_ldh c =
5454- (c >= 'a' && c <= 'z')
5555- || (c >= 'A' && c <= 'Z')
5656- || (c >= '0' && c <= '9')
5757- || c = '-'
5858- in
5959- len > 0
6060- && label.[0] <> '-'
6161- && label.[len - 1] <> '-'
6262- && String.for_all is_ldh label
6363-6464-(* Check hyphen placement: hyphens not in positions 3 and 4 (except for ACE) *)
6565-let check_hyphen_rules label =
6666- let len = String.length label in
6767- if len >= 4 && label.[2] = '-' && label.[3] = '-' then
6868- (* Hyphens in positions 3 and 4 - only valid for ACE prefix *)
6969- is_ace_label label
7070- else true
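The hyphen rule is easier to read with concrete labels. The sketch below restates `check_hyphen_rules` on its own; `has_ace_prefix` here is a hypothetical stand-in for `Punycode.has_ace_prefix`, assumed to be a case-insensitive prefix test as its documentation describes.

```ocaml
(* Stand-in for Punycode.has_ace_prefix (assumed case-insensitive). *)
let has_ace_prefix s =
  String.length s >= 4 && String.lowercase_ascii (String.sub s 0 4) = "xn--"

(* Same logic as check_hyphen_rules: "--" at byte offsets 2 and 3
   (positions 3 and 4 when counting from 1) is accepted only on ACE labels. *)
let check_hyphen_rules label =
  let len = String.length label in
  if len >= 4 && label.[2] = '-' && label.[3] = '-' then has_ace_prefix label
  else true

let () =
  assert (check_hyphen_rules "xn--mnchen-3ya"); (* ACE prefix: allowed *)
  assert (not (check_hyphen_rules "ab--cd"));   (* reserved "--" slot *)
  assert (check_hyphen_rules "a--b");           (* "--" at offsets 1 and 2 *)
  assert (check_hyphen_rules "abc")             (* too short to match *)
```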
7171-7272-(* {1 Label Operations} *)
7373-7474-let label_to_ascii_impl ~check_hyphens ~use_std3_rules label =
7575- let len = String.length label in
7676- if len = 0 then invalid_label "empty label"
7777- else if len > Punycode.max_label_length then
7878- punycode_error (Punycode.Label_too_long len)
7979- else if Punycode.is_ascii_string label then begin
8080- (* All ASCII - validate and pass through *)
8181- if use_std3_rules && not (is_std3_valid label) then
8282- invalid_label "STD3 rules violation"
8383- else if check_hyphens && not (check_hyphen_rules label) then
8484- invalid_label "invalid hyphen placement"
8585- else Ok label
8686- end
8787- else begin
8888- (* Has non-ASCII - normalize and encode *)
8989- let normalized = normalize_nfc label in
9090-9191- (* Encode to Punycode *)
9292- match Punycode.encode_utf8 normalized with
9393- | Error e -> punycode_error e
9494- | Ok encoded -> (
9595- let result = Punycode.ace_prefix ^ encoded in
9696- let result_len = String.length result in
9797- if result_len > Punycode.max_label_length then
9898- punycode_error (Punycode.Label_too_long result_len)
9999- else if check_hyphens && not (check_hyphen_rules result) then
100100- invalid_label "invalid hyphen placement in encoded label"
101101- else
102102- (* Verification: decode and compare to original normalized form *)
103103- match Punycode.decode_utf8 encoded with
104104- | Error _ -> verification_failed
105105- | Ok decoded ->
106106- if decoded <> normalized then verification_failed else Ok result)
107107- end
108108-109109-let label_to_ascii ?(check_hyphens = true) ?(use_std3_rules = false) label =
110110- label_to_ascii_impl ~check_hyphens ~use_std3_rules label
111111-112112-let label_to_unicode label =
113113- if is_ace_label label then begin
114114- let encoded = String.sub label 4 (String.length label - 4) in
115115- match Punycode.decode_utf8 encoded with
116116- | Error e -> punycode_error e
117117- | Ok decoded -> Ok decoded
118118- end
119119- else Ok label
120120-121121-(* {1 Domain Operations} *)
122122-123123-(* Split domain into labels *)
124124-let split_domain domain = String.split_on_char '.' domain
125125-126126-(* Join labels into domain *)
127127-let join_labels labels = String.concat "." labels
129129-(* Map a Result-returning function over a list; yields the leftmost Error, but f is still applied to every element (no early exit) *)
130130-let map_result f lst =
131131- List.fold_right
132132- (fun x acc ->
133133- let open Result.Syntax in
134134- let* y = f x in
135135- let+ ys = acc in
136136- y :: ys)
137137- lst (Ok [])
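Because `List.fold_right` visits every element, `map_result` calls `f` on the whole list even when an earlier label has already failed. If early exit matters, one possible alternative is an explicit left-to-right recursion. The name `map_result_short` is hypothetical and not part of the library.

```ocaml
(* Left-to-right map over a list that stops at the first Error. *)
let map_result_short f lst =
  let rec go acc = function
    | [] -> Ok (List.rev acc)
    | x :: rest -> (
        match f x with
        | Ok y -> go (y :: acc) rest
        | Error e -> Error e)
  in
  go [] lst

let () =
  let calls = ref 0 in
  let f x =
    incr calls;
    if x < 0 then Error x else Ok (x * 2)
  in
  (* Stops at -2: f runs twice and 3 and -4 are never examined. *)
  assert (map_result_short f [ 1; -2; 3; -4 ] = Error (-2));
  assert (!calls = 2);
  assert (map_result_short f [ 1; 2 ] = Ok [ 2; 4 ])
```

Both versions report the leftmost error; they differ only in how many times `f` runs, which is observable when `f` is expensive or has side effects.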
138138-139139-let to_ascii ?(check_hyphens = true) ?(check_bidi = false)
140140- ?(check_joiners = false) ?(use_std3_rules = false) ?(transitional = false)
141141- domain =
142142- (* Note: check_bidi, check_joiners, and transitional are accepted but
143143- not fully implemented - they would require additional Unicode data *)
144144- let _ = check_bidi in
145145- let _ = check_joiners in
146146- let _ = transitional in
147147-148148- let open Result.Syntax in
149149- let labels = split_domain domain in
150150- let* encoded_labels =
151151- map_result (label_to_ascii_impl ~check_hyphens ~use_std3_rules) labels
152152- in
153153- let result = join_labels encoded_labels in
154154- let len = String.length result in
155155- if len > max_domain_length then domain_too_long len else Ok result
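One consequence of splitting on `'.'` is worth spelling out. The sketch below uses only the standard library and shows which label lists `split_domain` produces. Since `label_to_ascii_impl` rejects zero-length labels with "empty label", inputs with a trailing dot or consecutive dots appear to be rejected by `to_ascii` as written; this is a reading of the code above, not a tested claim about the library.

```ocaml
(* Same definition as split_domain above. *)
let split_domain domain = String.split_on_char '.' domain

let () =
  assert (split_domain "example.com" = [ "example"; "com" ]);
  (* A trailing dot (fully qualified form) yields a final empty label. *)
  assert (split_domain "example.com." = [ "example"; "com"; "" ]);
  (* Consecutive dots yield an empty label in the middle. *)
  assert (split_domain "a..b" = [ "a"; ""; "b" ]);
  (* The empty string is a single empty label, not an empty list. *)
  assert (split_domain "" = [ "" ])
```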
156156-157157-let to_unicode domain =
158158- let open Result.Syntax in
159159- let labels = split_domain domain in
160160- let+ decoded_labels = map_result label_to_unicode labels in
161161- join_labels decoded_labels
162162-163163-(* {1 Domain Name Library Integration} *)
164164-165165-let domain_to_ascii ?(check_hyphens = true) ?(use_std3_rules = false) domain =
166166- let open Result.Syntax in
167167- let s = Domain_name.to_string domain in
168168- let* ascii = to_ascii ~check_hyphens ~use_std3_rules s in
169169- match Domain_name.of_string ascii with
170170- | Error (`Msg msg) -> invalid_label msg
171171- | Ok d -> Ok d
172172-173173-let domain_to_unicode domain =
174174- let open Result.Syntax in
175175- let s = Domain_name.to_string domain in
176176- let* unicode = to_unicode s in
177177- match Domain_name.of_string unicode with
178178- | Error (`Msg msg) -> invalid_label msg
179179- | Ok d -> Ok d
180180-181181-(* {1 Validation} *)
182182-183183-let is_idna_valid domain = Result.is_ok (to_ascii domain)
-215
ocaml-punycode/lib/punycode_idna.mli
···11-(*---------------------------------------------------------------------------
22- Copyright (c) 2025 Anil Madhavapeddy <anil@recoil.org>. All rights reserved.
33- SPDX-License-Identifier: ISC
44- ---------------------------------------------------------------------------*)
55-66-(** IDNA (Internationalized Domain Names in Applications) support.
77-88- This module provides ToASCII and ToUnicode operations as specified in
99- {{:https://datatracker.ietf.org/doc/html/rfc5891}RFC 5891} (IDNA 2008),
1010- using Punycode ({{:https://datatracker.ietf.org/doc/html/rfc3492}RFC 3492})
1111- for encoding.
1212-1313- IDNA allows domain names to contain non-ASCII Unicode characters by encoding
1414- them using Punycode with an ACE prefix. This module handles the conversion
1515- between Unicode domain names and their ASCII-compatible encoding (ACE) form.
1616-1717- {2 References}
1818- - {{:https://datatracker.ietf.org/doc/html/rfc5891}RFC 5891} -
1919- Internationalized Domain Names in Applications (IDNA): Protocol
2020- - {{:https://datatracker.ietf.org/doc/html/rfc5892}RFC 5892} - The Unicode
2121- Code Points and Internationalized Domain Names for Applications (IDNA)
2222- - {{:https://datatracker.ietf.org/doc/html/rfc5893}RFC 5893} - Right-to-Left
2323- Scripts for Internationalized Domain Names for Applications (IDNA)
2424- - {{:https://datatracker.ietf.org/doc/html/rfc3492}RFC 3492} - Punycode: A
2525- Bootstring encoding of Unicode for IDNA *)
2626-2727-(** {1 Error Types} *)
2828-2929-type error =
3030- | Punycode_error of Punycode.error
3131- (** Error during Punycode encoding/decoding. See {!Punycode.error} for
3232- details. *)
3333- | Invalid_label of string
3434- (** Label violates IDNA constraints. The string describes the violation.
3535- See
3636- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4} RFC 5891
3737- Section 4} for label validation requirements. *)
3838- | Domain_too_long of int
3939- (** Domain name exceeds 253 bytes, per
4040- {{:https://datatracker.ietf.org/doc/html/rfc1035}RFC 1035}. The int is
4141- the actual length. *)
4242- | Normalization_failed
4343- (** Unicode normalization (NFC) failed. Per
4444- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.1} RFC
4545- 5891 Section 4.2.1}, labels must be in NFC form. *)
4646- | Verification_failed
4747- (** ToASCII/ToUnicode verification step failed (round-trip check). Per
4848- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2} RFC 5891
4949- Section 4.2}, the result of encoding must decode back to the original
5050- input. *)
5151-5252-val pp_error : Format.formatter -> error -> unit
5353-(** [pp_error fmt e] pretty-prints an error. *)
5454-5555-val error_to_string : error -> string
5656-(** [error_to_string e] converts an error to a human-readable string. *)
5757-5858-(** {1 Constants} *)
5959-6060-val max_domain_length : int
6161-(** Maximum length of a domain name in bytes (253), per
6262- {{:https://datatracker.ietf.org/doc/html/rfc1035}RFC 1035}. *)
6363-6464-(** {1 ToASCII Operation}
6565-6666- Converts an internationalized domain name to its ASCII-compatible encoding
6767- (ACE) form suitable for DNS lookup.
6868-6969- See
7070- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4} RFC 5891 Section
7171- 4} for the complete ToASCII specification. *)
7272-7373-val to_ascii :
7474- ?check_hyphens:bool ->
7575- ?check_bidi:bool ->
7676- ?check_joiners:bool ->
7777- ?use_std3_rules:bool ->
7878- ?transitional:bool ->
7979- string ->
8080- (string, error) result
8181-(** [to_ascii domain] converts an internationalized domain name to ASCII.
8282-8383- Implements the ToASCII operation from
8484- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.1}RFC 5891
8585- Section 4.1}.
8686-8787- For each label in the domain: 1. If all ASCII, pass through (with optional
8888- STD3 validation) 2. Otherwise, normalize to NFC per
8989- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.1}Section
9090- 4.2.1} and Punycode-encode with ACE prefix
9191-9292- Optional parameters (per
9393- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4} RFC 5891 Section
9494- 4} processing options):
9595- - [check_hyphens]: Validate hyphen placement per
9696- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.1}Section
9797- 4.2.3.1} (default: true)
9898- - [check_bidi]: Check bidirectional text rules per
9999- {{:https://datatracker.ietf.org/doc/html/rfc5893}RFC 5893} (default:
100100- false, not implemented)
101101- - [check_joiners]: Check contextual joiner rules per
102102- {{:https://datatracker.ietf.org/doc/html/rfc5892#appendix-A.1}RFC 5892
103103- Appendix A.1} (default: false, not implemented)
104104- - [use_std3_rules]: Apply STD3 hostname rules per
105105- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.2}Section
106106- 4.2.3.2} (default: false)
107107- - [transitional]: Use IDNA 2003 transitional processing (default: false, not implemented)
108108-109109- Example:
110110- {[
111111- to_ascii "münchen.example.com"
112112- (* = Ok "xn--mnchen-3ya.example.com" *)
113113- ]} *)
114114-115115-val label_to_ascii :
116116- ?check_hyphens:bool ->
117117- ?use_std3_rules:bool ->
118118- string ->
119119- (string, error) result
120120-(** [label_to_ascii label] converts a single label to ASCII.
121121-122122- This implements the core ToASCII operation for one label, as described in
123123- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.1}RFC 5891
124124- Section 4.1}. *)
125125-126126-(** {1 ToUnicode Operation}
127127-128128- Converts an ASCII-compatible encoded domain name back to Unicode.
129129-130130- See
131131- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2} RFC 5891
132132- Section 4.2} for the complete ToUnicode specification. *)
133133-134134-val to_unicode : string -> (string, error) result
135135-(** [to_unicode domain] converts an ACE domain name to Unicode.
136136-137137- Implements the ToUnicode operation from
138138- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2}RFC 5891
139139- Section 4.2}.
140140-141141- For each label in the domain: 1. If it has the ACE prefix ("xn--"),
142142- Punycode-decode it per
143143- {{:https://datatracker.ietf.org/doc/html/rfc3492#section-6.2}RFC 3492
144144- Section 6.2} 2. Otherwise, pass through unchanged
145145-146146- Example:
147147- {[
148148- to_unicode "xn--mnchen-3ya.example.com"
149149- (* = Ok "münchen.example.com" *)
150150- ]} *)
151151-152152-val label_to_unicode : string -> (string, error) result
153153-(** [label_to_unicode label] converts a single ACE label to Unicode.
154154-155155- This implements the core ToUnicode operation for one label, as described in
156156- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2}RFC 5891
157157- Section 4.2}. *)
158158-159159-(** {1 Domain Name Integration}
160160-161161- Functions that work with the
162162- {{:https://github.com/hannesm/domain-name}domain-name} library types.
163163-164164- These provide integration with the [Domain_name] module for applications
165165- that use that library for domain name handling. *)
166166-167167-val domain_to_ascii :
168168- ?check_hyphens:bool ->
169169- ?use_std3_rules:bool ->
170170- [ `raw ] Domain_name.t ->
171171- ([ `raw ] Domain_name.t, error) result
172172-(** [domain_to_ascii domain] converts a domain name to ASCII form.
173173-174174- Applies {!to_ascii} to the string representation and returns the result as a
175175- [Domain_name.t].
176176-177177- Example:
178178- {[
179179- let d = Domain_name.of_string_exn "münchen.example.com" in
180180- domain_to_ascii d
181181- (* = Ok (Domain_name.of_string_exn "xn--mnchen-3ya.example.com") *)
182182- ]} *)
183183-184184-val domain_to_unicode :
185185- [ `raw ] Domain_name.t -> ([ `raw ] Domain_name.t, error) result
186186-(** [domain_to_unicode domain] converts a domain name to Unicode form.
187187-188188- Applies {!to_unicode} to the string representation and returns the result as
189189- a [Domain_name.t]. *)
190190-191191-(** {1 Validation} *)
192192-193193-val is_idna_valid : string -> bool
194194-(** [is_idna_valid domain] checks if a domain name is valid for IDNA processing.
195195-196196- Returns [true] if {!to_ascii} would succeed on the domain. *)
197197-198198-val is_ace_label : string -> bool
199199-(** [is_ace_label label] is [true] if the label has the ACE prefix "xn--"
200200- (case-insensitive). This indicates the label is Punycode-encoded per
201201- {{:https://datatracker.ietf.org/doc/html/rfc3492#section-5}RFC 3492 Section
202202- 5}. *)
203203-204204-(** {1 Normalization} *)
205205-206206-val normalize_nfc : string -> string
207207-(** [normalize_nfc s] returns the NFC-normalized form of UTF-8 string [s].
208208-209209- Per
210210- {{:https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.1} RFC 5891
211211- Section 4.2.1}, domain labels must be normalized to NFC (Unicode
212212- Normalization Form C) before encoding.
213213-214214- See {{:http://www.unicode.org/reports/tr15/}Unicode Standard Annex #15} for
215215- details on Unicode normalization forms. *)
-36
ocaml-punycode/punycode.opam
···11-# This file is generated by dune, edit dune-project instead
22-opam-version: "2.0"
33-synopsis: "RFC 3492 Punycode and IDNA implementation for OCaml"
44-description: """
55-A high-quality implementation of RFC 3492 (Punycode) with IDNA support.
66- Provides encoding and decoding of internationalized domain names,
77- with proper Unicode normalization and mixed-case annotation support."""
88-maintainer: ["Anil Madhavapeddy <anil@recoil.org>"]
99-authors: ["Anil Madhavapeddy"]
1010-license: "ISC"
1111-homepage: "https://tangled.org/anil.recoil.org/ocaml-punycode"
1212-bug-reports: "https://tangled.org/anil.recoil.org/ocaml-punycode/issues"
1313-depends: [
1414- "ocaml" {>= "5.4.0"}
1515- "dune" {>= "3.20" & >= "3.0"}
1616- "uutf" {>= "1.0.0"}
1717- "uunf" {>= "15.0.0"}
1818- "domain-name" {>= "0.4.0"}
1919- "odoc" {with-doc}
2020- "alcotest" {with-test}
2121-]
2222-build: [
2323- ["dune" "subst"] {dev}
2424- [
2525- "dune"
2626- "build"
2727- "-p"
2828- name
2929- "-j"
3030- jobs
3131- "@install"
3232- "@runtest" {with-test}
3333- "@doc" {with-doc}
3434- ]
3535-]
3636-x-maintenance-intent: ["(latest)"]
-3077
ocaml-punycode/spec/rfc1035.txt
···11-Network Working Group P. Mockapetris
22-Request for Comments: 1035 ISI
33- November 1987
44-Obsoletes: RFCs 882, 883, 973
55-66- DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION
77-88-99-1. STATUS OF THIS MEMO
1010-1111-This RFC describes the details of the domain system and protocol, and
1212-assumes that the reader is familiar with the concepts discussed in a
1313-companion RFC, "Domain Names - Concepts and Facilities" [RFC-1034].
1414-1515-The domain system is a mixture of functions and data types which are an
1616-official protocol and functions and data types which are still
1717-experimental. Since the domain system is intentionally extensible, new
1818-data types and experimental behavior should always be expected in parts
1919-of the system beyond the official protocol. The official protocol parts
2020-include standard queries, responses and the Internet class RR data
2121-formats (e.g., host addresses). Since the previous RFC set, several
2222-definitions have changed, so some previous definitions are obsolete.
2323-2424-Experimental or obsolete features are clearly marked in these RFCs, and
2525-such information should be used with caution.
2626-2727-The reader is especially cautioned not to depend on the values which
2828-appear in examples to be current or complete, since their purpose is
2929-primarily pedagogical. Distribution of this memo is unlimited.
3030-3131- Table of Contents
3232-3333- 1. STATUS OF THIS MEMO 1
3434- 2. INTRODUCTION 3
3535- 2.1. Overview 3
3636- 2.2. Common configurations 4
3737- 2.3. Conventions 7
3838- 2.3.1. Preferred name syntax 7
3939- 2.3.2. Data Transmission Order 8
4040- 2.3.3. Character Case 9
4141- 2.3.4. Size limits 10
4242- 3. DOMAIN NAME SPACE AND RR DEFINITIONS 10
4343- 3.1. Name space definitions 10
4444- 3.2. RR definitions 11
4545- 3.2.1. Format 11
4646- 3.2.2. TYPE values 12
4747- 3.2.3. QTYPE values 12
4848- 3.2.4. CLASS values 13
4949-5050-5151-5252-Mockapetris [Page 1]
5353-5454-RFC 1035 Domain Implementation and Specification November 1987
5555-5656-5757- 3.2.5. QCLASS values 13
5858- 3.3. Standard RRs 13
5959- 3.3.1. CNAME RDATA format 14
6060- 3.3.2. HINFO RDATA format 14
6161- 3.3.3. MB RDATA format (EXPERIMENTAL) 14
6262- 3.3.4. MD RDATA format (Obsolete) 15
6363- 3.3.5. MF RDATA format (Obsolete) 15
6464- 3.3.6. MG RDATA format (EXPERIMENTAL) 16
6565- 3.3.7. MINFO RDATA format (EXPERIMENTAL) 16
6666- 3.3.8. MR RDATA format (EXPERIMENTAL) 17
6767- 3.3.9. MX RDATA format 17
6868- 3.3.10. NULL RDATA format (EXPERIMENTAL) 17
6969- 3.3.11. NS RDATA format 18
7070- 3.3.12. PTR RDATA format 18
7171- 3.3.13. SOA RDATA format 19
7272- 3.3.14. TXT RDATA format 20
7373- 3.4. ARPA Internet specific RRs 20
7474- 3.4.1. A RDATA format 20
7575- 3.4.2. WKS RDATA format 21
7676- 3.5. IN-ADDR.ARPA domain 22
7777- 3.6. Defining new types, classes, and special namespaces 24
7878- 4. MESSAGES 25
7979- 4.1. Format 25
8080- 4.1.1. Header section format 26
8181- 4.1.2. Question section format 28
8282- 4.1.3. Resource record format 29
8383- 4.1.4. Message compression 30
8484- 4.2. Transport 32
8585- 4.2.1. UDP usage 32
8686- 4.2.2. TCP usage 32
8787- 5. MASTER FILES 33
8888- 5.1. Format 33
8989- 5.2. Use of master files to define zones 35
9090- 5.3. Master file example 36
9191- 6. NAME SERVER IMPLEMENTATION 37
9292- 6.1. Architecture 37
9393- 6.1.1. Control 37
9494- 6.1.2. Database 37
9595- 6.1.3. Time 39
9696- 6.2. Standard query processing 39
9797- 6.3. Zone refresh and reload processing 39
9898- 6.4. Inverse queries (Optional) 40
9999- 6.4.1. The contents of inverse queries and responses 40
100100- 6.4.2. Inverse query and response example 41
101101- 6.4.3. Inverse query processing 42
111111-112112-113113- 6.5. Completion queries and responses 42
114114- 7. RESOLVER IMPLEMENTATION 43
115115- 7.1. Transforming a user request into a query 43
116116- 7.2. Sending the queries 44
117117- 7.3. Processing responses 46
118118- 7.4. Using the cache 47
119119- 8. MAIL SUPPORT 47
120120- 8.1. Mail exchange binding 48
121121- 8.2. Mailbox binding (Experimental) 48
122122- 9. REFERENCES and BIBLIOGRAPHY 50
123123- Index 54
124124-125125-2. INTRODUCTION
126126-127127-2.1. Overview
128128-129129-The goal of domain names is to provide a mechanism for naming resources
130130-in such a way that the names are usable in different hosts, networks,
131131-protocol families, internets, and administrative organizations.
132132-133133-From the user's point of view, domain names are useful as arguments to a
134134-local agent, called a resolver, which retrieves information associated
135135-with the domain name. Thus a user might ask for the host address or
136136-mail information associated with a particular domain name. To enable
137137-the user to request a particular type of information, an appropriate
138138-query type is passed to the resolver with the domain name. To the user,
139139-the domain tree is a single information space; the resolver is
140140-responsible for hiding the distribution of data among name servers from
141141-the user.
142142-143143-From the resolver's point of view, the database that makes up the domain
144144-space is distributed among various name servers. Different parts of the
145145-domain space are stored in different name servers, although a particular
146146-data item will be stored redundantly in two or more name servers. The
147147-resolver starts with knowledge of at least one name server. When the
148148-resolver processes a user query it asks a known name server for the
149149-information; in return, the resolver either receives the desired
150150-information or a referral to another name server. Using these
151151-referrals, resolvers learn the identities and contents of other name
152152-servers. Resolvers are responsible for dealing with the distribution of
153153-the domain space and dealing with the effects of name server failure by
154154-consulting redundant databases in other servers.
155155-156156-Name servers manage two kinds of data. The first kind of data held in
157157-sets called zones; each zone is the complete database for a particular
158158-"pruned" subtree of the domain space. This data is called
159159-authoritative. A name server periodically checks to make sure that its
160160-zones are up to date, and if not, obtains a new copy of updated zones
167167-168168-169169-from master files stored locally or in another name server. The second
170170-kind of data is cached data which was acquired by a local resolver.
171171-This data may be incomplete, but improves the performance of the
172172-retrieval process when non-local data is repeatedly accessed. Cached
173173-data is eventually discarded by a timeout mechanism.
174174-175175-This functional structure isolates the problems of user interface,
176176-failure recovery, and distribution in the resolvers and isolates the
177177-database update and refresh problems in the name servers.
178178-179179-2.2. Common configurations
180180-181181-A host can participate in the domain name system in a number of ways,
182182-depending on whether the host runs programs that retrieve information
183183-from the domain system, name servers that answer queries from other
184184-hosts, or various combinations of both functions. The simplest, and
185185-perhaps most typical, configuration is shown below:
186186-187187- Local Host | Foreign
188188- |
189189- +---------+ +----------+ | +--------+
190190- | | user queries | |queries | | |
191191- | User |-------------->| |---------|->|Foreign |
192192- | Program | | Resolver | | | Name |
193193- | |<--------------| |<--------|--| Server |
194194- | | user responses| |responses| | |
195195- +---------+ +----------+ | +--------+
196196- | A |
197197- cache additions | | references |
198198- V | |
199199- +----------+ |
200200- | cache | |
201201- +----------+ |
202202-203203-User programs interact with the domain name space through resolvers; the
204204-format of user queries and user responses is specific to the host and
205205-its operating system. User queries will typically be operating system
206206-calls, and the resolver and its cache will be part of the host operating
207207-system. Less capable hosts may choose to implement the resolver as a
208208-subroutine to be linked in with every program that needs its services.
209209-Resolvers answer user queries with information they acquire via queries
210210-to foreign name servers and the local cache.
211211-212212-Note that the resolver may have to make several queries to several
213213-different foreign name servers to answer a particular user query, and
214214-hence the resolution of a user query may involve several network
215215-accesses and an arbitrary amount of time. The queries to foreign name
216216-servers and the corresponding responses have a standard format described
223223-224224-225225-in this memo, and may be datagrams.
226226-227227-Depending on its capabilities, a name server could be a stand alone
228228-program on a dedicated machine or a process or processes on a large
229229-timeshared host. A simple configuration might be:
230230-231231- Local Host | Foreign
232232- |
233233- +---------+ |
234234- / /| |
235235- +---------+ | +----------+ | +--------+
236236- | | | | |responses| | |
237237- | | | | Name |---------|->|Foreign |
238238- | Master |-------------->| Server | | |Resolver|
239239- | files | | | |<--------|--| |
240240- | |/ | | queries | +--------+
241241- +---------+ +----------+ |
242242-243243-Here a primary name server acquires information about one or more zones
244244-by reading master files from its local file system, and answers queries
245245-about those zones that arrive from foreign resolvers.
246246-247247-The DNS requires that all zones be redundantly supported by more than
248248-one name server. Designated secondary servers can acquire zones and
249249-check for updates from the primary server using the zone transfer
250250-protocol of the DNS. This configuration is shown below:
251251-252252- Local Host | Foreign
253253- |
254254- +---------+ |
255255- / /| |
256256- +---------+ | +----------+ | +--------+
257257- | | | | |responses| | |
258258- | | | | Name |---------|->|Foreign |
259259- | Master |-------------->| Server | | |Resolver|
260260- | files | | | |<--------|--| |
261261- | |/ | | queries | +--------+
262262- +---------+ +----------+ |
263263- A |maintenance | +--------+
264264- | +------------|->| |
265265- | queries | |Foreign |
266266- | | | Name |
267267- +------------------|--| Server |
268268- maintenance responses | +--------+
269269-270270-In this configuration, the name server periodically establishes a
271271-virtual circuit to a foreign name server to acquire a copy of a zone or
272272-to check that an existing copy has not changed. The messages sent for
279279-280280-281281-these maintenance activities follow the same form as queries and
282282-responses, but the message sequences are somewhat different.
283283-284284-The information flow in a host that supports all aspects of the domain
285285-name system is shown below:
286286-287287- Local Host | Foreign
288288- |
289289- +---------+ +----------+ | +--------+
290290- | | user queries | |queries | | |
291291- | User |-------------->| |---------|->|Foreign |
292292- | Program | | Resolver | | | Name |
293293- | |<--------------| |<--------|--| Server |
294294- | | user responses| |responses| | |
295295- +---------+ +----------+ | +--------+
296296- | A |
297297- cache additions | | references |
298298- V | |
299299- +----------+ |
300300- | Shared | |
301301- | database | |
302302- +----------+ |
303303- A | |
304304- +---------+ refreshes | | references |
305305- / /| | V |
306306- +---------+ | +----------+ | +--------+
307307- | | | | |responses| | |
308308- | | | | Name |---------|->|Foreign |
309309- | Master |-------------->| Server | | |Resolver|
310310- | files | | | |<--------|--| |
311311- | |/ | | queries | +--------+
312312- +---------+ +----------+ |
313313- A |maintenance | +--------+
314314- | +------------|->| |
315315- | queries | |Foreign |
316316- | | | Name |
317317- +------------------|--| Server |
318318- maintenance responses | +--------+
319319-320320-The shared database holds domain space data for the local name server
321321-and resolver. The contents of the shared database will typically be a
322322-mixture of authoritative data maintained by the periodic refresh
323323-operations of the name server and cached data from previous resolver
324324-requests. The structure of the domain data and the necessity for
325325-synchronization between name servers and resolvers imply the general
326326-characteristics of this database, but the actual format is up to the
327327-local implementor.
335335-336336-337337-Information flow can also be tailored so that a group of hosts act
338338-together to optimize activities. Sometimes this is done to offload less
339339-capable hosts so that they do not have to implement a full resolver.
340340-This can be appropriate for PCs or hosts which want to minimize the
341341-amount of new network code which is required. This scheme can also
342342-allow a group of hosts can share a small number of caches rather than
343343-maintaining a large number of separate caches, on the premise that the
344344-centralized caches will have a higher hit ratio. In either case,
345345-resolvers are replaced with stub resolvers which act as front ends to
346346-resolvers located in a recursive server in one or more name servers
347347-known to perform that service:
348348-349349- Local Hosts | Foreign
350350- |
351351- +---------+ |
352352- | | responses |
353353- | Stub |<--------------------+ |
354354- | Resolver| | |
355355- | |----------------+ | |
356356- +---------+ recursive | | |
357357- queries | | |
358358- V | |
359359- +---------+ recursive +----------+ | +--------+
360360- | | queries | |queries | | |
361361- | Stub |-------------->| Recursive|---------|->|Foreign |
362362- | Resolver| | Server | | | Name |
363363- | |<--------------| |<--------|--| Server |
364364- +---------+ responses | |responses| | |
365365- +----------+ | +--------+
366366- | Central | |
367367- | cache | |
368368- +----------+ |
369369-370370-In any case, note that domain components are always replicated for
371371-reliability whenever possible.
372372-373373-2.3. Conventions
374374-375375-The domain system has several conventions dealing with low-level, but
376376-fundamental, issues. While the implementor is free to violate these
377377-conventions WITHIN HIS OWN SYSTEM, he must observe these conventions in
378378-ALL behavior observed from other hosts.
379379-380380-2.3.1. Preferred name syntax
381381-382382-The DNS specifications attempt to be as general as possible in the rules
383383-for constructing domain names. The idea is that the name of any
384384-existing object can be expressed as a domain name with minimal changes.
391391-392392-393393-However, when assigning a domain name for an object, the prudent user
394394-will select a name which satisfies both the rules of the domain system
395395-and any existing rules for the object, whether these rules are published
396396-or implied by existing programs.
397397-398398-For example, when naming a mail domain, the user should satisfy both the
399399-rules of this memo and those in RFC-822. When creating a new host name,
400400-the old rules for HOSTS.TXT should be followed. This avoids problems
401401-when old software is converted to use domain names.
402402-403403-The following syntax will result in fewer problems with many
405405-applications that use domain names (e.g., mail, TELNET).
406406-407407-<domain> ::= <subdomain> | " "
408408-409409-<subdomain> ::= <label> | <subdomain> "." <label>
410410-411411-<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
412412-413413-<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
414414-415415-<let-dig-hyp> ::= <let-dig> | "-"
416416-417417-<let-dig> ::= <letter> | <digit>
418418-419419-<letter> ::= any one of the 52 alphabetic characters A through Z in
420420-upper case and a through z in lower case
421421-422422-<digit> ::= any one of the ten digits 0 through 9
423423-424424-Note that while upper and lower case letters are allowed in domain
425425-names, no significance is attached to the case. That is, two names with
426426-the same spelling but different case are to be treated as if identical.
427427-428428-The labels must follow the rules for ARPANET host names. They must
429429-start with a letter, end with a letter or digit, and have as interior
430430-characters only letters, digits, and hyphen. There are also some
431431-restrictions on the length. Labels must be 63 characters or less.
432432-433433-For example, the following strings identify hosts in the Internet:
434434-435435-A.ISI.EDU XX.LCS.MIT.EDU SRI-NIC.ARPA
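
The grammar above can be checked mechanically. The following is a
minimal sketch in Python, not part of the specification; the names
is_preferred_label and is_preferred_name are hypothetical. It applies
the <label> production together with the 63 character limit, and
accepts the single space that <domain> allows for the root.

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
# Starts with a letter, ends with a letter or digit, and has only
# letters, digits, and hyphens in between.
LABEL_RE = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_preferred_label(label: str) -> bool:
    """Return True if label matches <label> and is 63 characters or less."""
    return len(label) <= 63 and LABEL_RE.fullmatch(label) is not None


def is_preferred_name(name: str) -> bool:
    """Return True if name matches <domain> ::= <subdomain> | " "."""
    if name == " ":
        return True
    return all(is_preferred_label(label) for label in name.split("."))
```

Under these assumptions the three example names validate, while a
name whose label starts with a digit or ends with a hyphen does not.
The grammar has no trailing dot, so this sketch rejects one.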
436436-437437-2.3.2. Data Transmission Order
438438-439439-The order of transmission of the header and data described in this
440440-document is resolved to the octet level. Whenever a diagram shows a
447447-448448-449449-group of octets, the order of transmission of those octets is the normal
450450-order in which they are read in English. For example, in the following
451451-diagram, the octets are transmitted in the order they are numbered.
452452-453453- 0 1
454454- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
455455- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
456456- | 1 | 2 |
457457- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
458458- | 3 | 4 |
459459- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
460460- | 5 | 6 |
461461- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
462462-463463-Whenever an octet represents a numeric quantity, the left most bit in
464464-the diagram is the high order or most significant bit. That is, the bit
465465-labeled 0 is the most significant bit. For example, the following
466466-diagram represents the value 170 (decimal).
467467-468468- 0 1 2 3 4 5 6 7
469469- +-+-+-+-+-+-+-+-+
470470- |1 0 1 0 1 0 1 0|
471471- +-+-+-+-+-+-+-+-+
472472-473473-Similarly, whenever a multi-octet field represents a numeric quantity
474474-the left most bit of the whole field is the most significant bit. When
475475-a multi-octet quantity is transmitted the most significant octet is
476476-transmitted first.
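
As an illustration of the ordering rules above, the sketch below
(Python, with hypothetical helper names) prints an octet with bit 0 as
the most significant bit and serializes a multi-octet quantity with
the most significant octet first.

```python
def bits_msb_first(octet: int) -> str:
    """Render one octet as eight bits, bit 0 (leftmost) most significant."""
    if not 0 <= octet <= 255:
        raise ValueError("an octet must be in the range 0..255")
    return format(octet, "08b")


def to_network_order(value: int, length: int) -> bytes:
    """Serialize an unsigned quantity, most significant octet first."""
    return value.to_bytes(length, "big")


def from_network_order(data: bytes) -> int:
    """Inverse of to_network_order."""
    return int.from_bytes(data, "big")
```

With these definitions, bits_msb_first(170) yields the bit pattern
shown in the diagram, and to_network_order(0x0102, 2) yields octet
0x01 followed by octet 0x02.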
477477-478478-2.3.3. Character Case
479479-480480-For all parts of the DNS that are part of the official protocol, all
481481-comparisons between character strings (e.g., labels, domain names, etc.)
482482-are done in a case-insensitive manner. At present, this rule is in
483483-force throughout the domain system without exception. However, future
484484-additions beyond current usage may need to use the full binary octet
485485-capabilities in names, so attempts to store domain names in 7-bit ASCII
486486-or use of special bytes to terminate labels, etc., should be avoided.
487487-488488-When data enters the domain system, its original case should be
489489-preserved whenever possible. In certain circumstances this cannot be
490490-done. For example, if two RRs are stored in a database, one at x.y and
491491-one at X.Y, they are actually stored at the same place in the database,
492492-and hence only one casing would be preserved. The basic rule is that
493493-case can be discarded only when data is used to define structure in a
494494-database, and two names are identical when compared in a case
495495-insensitive manner.
503503-504504-505505-Loss of case sensitive data must be minimized. Thus while data for x.y
506506-and X.Y may both be stored under a single location x.y or X.Y, data for
507507-a.x and B.X would never be stored under A.x, A.X, b.x, or b.X. In
508508-general, this preserves the case of the first label of a domain name,
509509-but forces standardization of interior node labels.
510510-511511-Systems administrators who enter data into the domain database should
512512-take care to represent the data they supply to the domain system in a
513513-case-consistent manner if their system is case-sensitive. The data
514514-distribution system in the domain system will ensure that consistent
515515-representations are preserved.
516516-517517-2.3.4. Size limits
518518-519519-Various objects and parameters in the DNS have size limits. They are
520520-listed below. Some could be easily changed, others are more
521521-fundamental.
522522-523523-labels 63 octets or less
524524-525525-names 255 octets or less
526526-527527-TTL positive values of a signed 32 bit number.
528528-529529-UDP messages 512 octets or less
530530-531531-3. DOMAIN NAME SPACE AND RR DEFINITIONS
532532-533533-3.1. Name space definitions
534534-535535-Domain names in messages are expressed in terms of a sequence of labels.
536536-Each label is represented as a one octet length field followed by that
537537-number of octets. Since every domain name ends with the null label of
538538-the root, a domain name is terminated by a length byte of zero. The
539539-high order two bits of every length octet must be zero, and the
540540-remaining six bits of the length field limit the label to 63 octets or
541541-less.
542542-543543-To simplify implementations, the total length of a domain name (i.e.,
544544-label octets and label length octets) is restricted to 255 octets or
545545-less.
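
The encoding just described can be sketched as follows. This is an
illustrative Python fragment rather than a reference implementation:
encode_name and decode_name are hypothetical names, the input is
assumed to be a dotted ASCII string without escapes, and message
compression is deliberately not handled.

```python
MAX_LABEL = 63   # limit imposed by the six usable bits of the length octet
MAX_NAME = 255   # label octets plus label length octets


def encode_name(name: str) -> bytes:
    """Encode a dotted name as length-prefixed labels ending in a zero octet."""
    if name.endswith("."):
        name = name[:-1]
    labels = name.split(".") if name else []
    out = bytearray()
    for label in labels:
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= MAX_LABEL:
            raise ValueError(f"label length {len(raw)} outside 1..{MAX_LABEL}")
        out.append(len(raw))
        out += raw
    out.append(0)  # null label of the root terminates the name
    if len(out) > MAX_NAME:
        raise ValueError(f"name is {len(out)} octets, limit is {MAX_NAME}")
    return bytes(out)


def decode_name(data: bytes, offset: int = 0):
    """Return (labels, next_offset) for an uncompressed name at offset."""
    labels = []
    while True:
        length = data[offset]
        if length & 0xC0:
            raise ValueError("high order two bits of a length octet must be zero")
        offset += 1
        if length == 0:
            return labels, offset
        labels.append(data[offset:offset + length])
        offset += length
```

For example, encode_name("A.ISI.EDU") produces the octets 1 "A" 3 "ISI"
3 "EDU" 0, eleven octets in total.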
546546-547547-Although labels can contain any 8 bit values in octets that make up a
548548-label, it is strongly recommended that labels follow the preferred
549549-syntax described elsewhere in this memo, which is compatible with
550550-existing host naming conventions. Name servers and resolvers must
551551-compare labels in a case-insensitive manner (i.e., A=a), assuming ASCII
552552-with zero parity. Non-alphabetic codes must match exactly.
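
A minimal sketch of the comparison rule, assuming labels are held as
raw octets; fold_ascii and labels_equal are hypothetical names. Only
the ASCII letters A through Z are folded, so every other octet,
including those with the high bit set, must match exactly.

```python
def fold_ascii(label: bytes) -> bytes:
    """Map ASCII A-Z to a-z and leave every other octet unchanged."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in label)


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels case-insensitively (A=a), other octets exactly."""
    return fold_ascii(a) == fold_ascii(b)
```

Note that the stored label is never modified, which keeps the original
case available as recommended in section 2.3.3.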
559559-560560-561561-3.2. RR definitions
562562-563563-3.2.1. Format
564564-565565-All RRs have the same top level format shown below:
566566-567567- 1 1 1 1 1 1
568568- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
569569- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
570570- | |
571571- / /
572572- / NAME /
573573- | |
574574- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
575575- | TYPE |
576576- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
577577- | CLASS |
578578- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
579579- | TTL |
580580- | |
581581- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
582582- | RDLENGTH |
583583- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
584584- / RDATA /
585585- / /
586586- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
587587-588588-589589-where:
590590-591591-NAME an owner name, i.e., the name of the node to which this
592592- resource record pertains.
593593-594594-TYPE two octets containing one of the RR TYPE codes.
595595-596596-CLASS two octets containing one of the RR CLASS codes.
597597-598598-TTL a 32 bit signed integer that specifies the time interval
599599- that the resource record may be cached before the source
600600- of the information should again be consulted. Zero
601601- values are interpreted to mean that the RR can only be
602602- used for the transaction in progress, and should not be
603603- cached. For example, SOA records are always distributed
604604- with a zero TTL to prohibit caching. Zero values can
605605- also be used for extremely volatile data.
606606-607607-RDLENGTH an unsigned 16 bit integer that specifies the length in
608608- octets of the RDATA field.
615615-616616-617617-RDATA a variable length string of octets that describes the
618618- resource. The format of this information varies
619619- according to the TYPE and CLASS of the resource record.
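
The fixed-size part of the format can be read with a single unpack
call. The sketch below is illustrative Python; RRFixed and
parse_rr_after_name are hypothetical names, and the caller is assumed
to have already consumed the NAME field so that offset points at TYPE.

```python
import struct
from typing import NamedTuple, Tuple


class RRFixed(NamedTuple):
    rtype: int     # TYPE, two octets
    rclass: int    # CLASS, two octets
    ttl: int       # TTL, 32 bit signed integer
    rdata: bytes   # RDLENGTH octets of RDATA


# Big-endian: TYPE (16 bit), CLASS (16 bit), TTL (signed 32 bit),
# RDLENGTH (unsigned 16 bit); ten octets in total.
_FIXED = struct.Struct(">HHiH")


def parse_rr_after_name(data: bytes, offset: int) -> Tuple[RRFixed, int]:
    """Parse TYPE..RDATA at offset; return the record and the next offset."""
    rtype, rclass, ttl, rdlength = _FIXED.unpack_from(data, offset)
    start = offset + _FIXED.size
    end = start + rdlength
    if end > len(data):
        raise ValueError("RDLENGTH runs past the end of the message")
    return RRFixed(rtype, rclass, ttl, data[start:end]), end
```

Interpreting RDATA is left to the caller because, as stated above, its
format depends on TYPE and CLASS.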
620620-621621-3.2.2. TYPE values
622622-623623-TYPE fields are used in resource records. Note that these types are a
624624-subset of QTYPEs.
625625-626626-TYPE value and meaning
627627-628628-A 1 a host address
629629-630630-NS 2 an authoritative name server
631631-632632-MD 3 a mail destination (Obsolete - use MX)
633633-634634-MF 4 a mail forwarder (Obsolete - use MX)
635635-636636-CNAME 5 the canonical name for an alias
637637-638638-SOA 6 marks the start of a zone of authority
639639-640640-MB 7 a mailbox domain name (EXPERIMENTAL)
641641-642642-MG 8 a mail group member (EXPERIMENTAL)
643643-644644-MR 9 a mail rename domain name (EXPERIMENTAL)
645645-646646-NULL 10 a null RR (EXPERIMENTAL)
647647-648648-WKS 11 a well known service description
649649-650650-PTR 12 a domain name pointer
651651-652652-HINFO 13 host information
653653-654654-MINFO 14 mailbox or mail list information
655655-656656-MX 15 mail exchange
657657-658658-TXT 16 text strings
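
An implementation will typically carry this table as an enumeration.
One possible rendering in Python follows; the names RRType, OBSOLETE,
EXPERIMENTAL, and type_name are hypothetical, and the values are
copied from the table above.

```python
from enum import IntEnum


class RRType(IntEnum):
    A = 1
    NS = 2
    MD = 3
    MF = 4
    CNAME = 5
    SOA = 6
    MB = 7
    MG = 8
    MR = 9
    NULL = 10
    WKS = 11
    PTR = 12
    HINFO = 13
    MINFO = 14
    MX = 15
    TXT = 16


# Status markers taken from the table above.
OBSOLETE = frozenset({RRType.MD, RRType.MF})
EXPERIMENTAL = frozenset({RRType.MB, RRType.MG, RRType.MR, RRType.NULL})


def type_name(code: int) -> str:
    """Return the mnemonic for a TYPE code, or a placeholder if unknown."""
    try:
        return RRType(code).name
    except ValueError:
        return f"unknown({code})"
```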
659659-660660-3.2.3. QTYPE values
661661-662662-QTYPE fields appear in the question part of a query. QTYPES are a
663663-superset of TYPEs, hence all TYPEs are valid QTYPEs. In addition, the
664664-following QTYPEs are defined:
671671-672672-673673-AXFR 252 A request for a transfer of an entire zone
674674-675675-MAILB 253 A request for mailbox-related records (MB, MG or MR)
676676-677677-MAILA 254 A request for mail agent RRs (Obsolete - see MX)
678678-679679-* 255 A request for all records
680680-681681-3.2.4. CLASS values
682682-683683-CLASS fields appear in resource records. The following CLASS mnemonics
684684-and values are defined:
685685-686686-IN 1 the Internet
687687-688688-CS 2 the CSNET class (Obsolete - used only for examples in
689689- some obsolete RFCs)
690690-691691-CH 3 the CHAOS class
692692-693693-HS 4 Hesiod [Dyer 87]
694694-695695-3.2.5. QCLASS values
696696-697697-QCLASS fields appear in the question section of a query. QCLASS values
698698-are a superset of CLASS values; every CLASS is a valid QCLASS. In
699699-addition to CLASS values, the following QCLASSes are defined:
700700-701701-* 255 any class
702702-703703-3.3. Standard RRs
704704-705705-The following RR definitions are expected to occur, at least
706706-potentially, in all classes. In particular, NS, SOA, CNAME, and PTR
707707-will be used in all classes, and have the same format in all classes.
708708-Because their RDATA format is known, all domain names in the RDATA
709709-section of these RRs may be compressed.
710710-711711-<domain-name> is a domain name represented as a series of labels, and
712712-terminated by a label with zero length. <character-string> is a single
713713-length octet followed by that number of characters. <character-string>
714714-is treated as binary information, and can be up to 256 characters in
715715-length (including the length octet).
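
A <character-string> can be sketched as below. This is illustrative
Python with hypothetical function names; the content is treated as
opaque octets, and the 255 octet content limit follows from the single
length octet.

```python
from typing import Tuple


def encode_character_string(content: bytes) -> bytes:
    """Prefix content with its one-octet length."""
    if len(content) > 255:
        raise ValueError("content exceeds 255 octets")
    return bytes([len(content)]) + content


def decode_character_string(data: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Return (content, next_offset) for the string starting at offset."""
    length = data[offset]
    end = offset + 1 + length
    if end > len(data):
        raise ValueError("character-string runs past the end of the data")
    return data[offset + 1:end], end
```

The same two helpers would serve for the CPU and OS fields of HINFO
and for each string in TXT-DATA, described later in this section.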
727727-728728-729729-3.3.1. CNAME RDATA format
730730-731731- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
732732- / CNAME /
733733- / /
734734- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
735735-736736-where:
737737-738738-CNAME A <domain-name> which specifies the canonical or primary
739739- name for the owner. The owner name is an alias.
740740-741741-CNAME RRs cause no additional section processing, but name servers may
742742-choose to restart the query at the canonical name in certain cases. See
743743-the description of name server logic in [RFC-1034] for details.
744744-745745-3.3.2. HINFO RDATA format
746746-747747- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
748748- / CPU /
749749- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
750750- / OS /
751751- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
752752-753753-where:
754754-755755-CPU A <character-string> which specifies the CPU type.
756756-757757-OS A <character-string> which specifies the operating
758758- system type.
759759-760760-Standard values for CPU and OS can be found in [RFC-1010].
761761-762762-HINFO records are used to acquire general information about a host. The
763763-main use is for protocols such as FTP that can use special procedures
764764-when talking between machines or operating systems of the same type.
765765-766766-3.3.3. MB RDATA format (EXPERIMENTAL)
767767-768768- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
769769- / MADNAME /
770770- / /
771771- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
772772-773773-where:
774774-775775-MADNAME A <domain-name> which specifies a host which has the
776776- specified mailbox.
783783-784784-785785-MB records cause additional section processing which looks up an A type
786786-record corresponding to MADNAME.
787787-788788-3.3.4. MD RDATA format (Obsolete)
789789-790790- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
791791- / MADNAME /
792792- / /
793793- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
794794-795795-where:
796796-797797-MADNAME A <domain-name> which specifies a host which has a mail
798798- agent for the domain which should be able to deliver
799799- mail for the domain.
800800-801801-MD records cause additional section processing which looks up an A type
802802-record corresponding to MADNAME.
803803-804804-MD is obsolete. See the definition of MX and [RFC-974] for details of
805805-the new scheme. The recommended policy for dealing with MD RRs found in
806806-a master file is to reject them, or to convert them to MX RRs with a
807807-preference of 0.
808808-809809-3.3.5. MF RDATA format (Obsolete)
810810-811811- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
812812- / MADNAME /
813813- / /
814814- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
815815-816816-where:
817817-818818-MADNAME A <domain-name> which specifies a host which has a mail
819819- agent for the domain which will accept mail for
820820- forwarding to the domain.
821821-822822-MF records cause additional section processing which looks up an A type
823823-record corresponding to MADNAME.
824824-825825-MF is obsolete. See the definition of MX and [RFC-974] for details of
826826-the new scheme. The recommended policy for dealing with MF RRs found in
827827-a master file is to reject them, or to convert them to MX RRs with a
828828-preference of 10.
839839-840840-841841-3.3.6. MG RDATA format (EXPERIMENTAL)
842842-843843- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
844844- / MGMNAME /
845845- / /
846846- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
847847-848848-where:
849849-850850-MGMNAME A <domain-name> which specifies a mailbox which is a
851851- member of the mail group specified by the domain name.
852852-853853-MG records cause no additional section processing.
854854-855855-3.3.7. MINFO RDATA format (EXPERIMENTAL)
856856-857857- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
858858- / RMAILBX /
859859- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
860860- / EMAILBX /
861861- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
862862-863863-where:
864864-865865-RMAILBX A <domain-name> which specifies a mailbox which is
866866- responsible for the mailing list or mailbox. If this
867867- domain name names the root, the owner of the MINFO RR is
868868- responsible for itself. Note that many existing mailing
869869- lists use a mailbox X-request for the RMAILBX field of
870870- mailing list X, e.g., Msgroup-request for Msgroup. This
871871- field provides a more general mechanism.
872872-873873-874874-EMAILBX A <domain-name> which specifies a mailbox which is to
875875- receive error messages related to the mailing list or
876876- mailbox specified by the owner of the MINFO RR (similar
877877- to the ERRORS-TO: field which has been proposed). If
878878- this domain name names the root, errors should be
879879- returned to the sender of the message.
880880-881881-MINFO records cause no additional section processing. Although these
882882-records can be associated with a simple mailbox, they are usually used
883883-with a mailing list.
895895-896896-897897-3.3.8. MR RDATA format (EXPERIMENTAL)
898898-899899- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
900900- / NEWNAME /
901901- / /
902902- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
903903-904904-where:
905905-906906-NEWNAME A <domain-name> which specifies a mailbox which is the
907907- proper rename of the specified mailbox.
908908-909909-MR records cause no additional section processing. The main use for MR
910910-is as a forwarding entry for a user who has moved to a different
911911-mailbox.
912912-913913-3.3.9. MX RDATA format
914914-915915- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
916916- | PREFERENCE |
917917- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
918918- / EXCHANGE /
919919- / /
920920- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
921921-922922-where:
923923-924924-PREFERENCE A 16 bit integer which specifies the preference given to
925925- this RR among others at the same owner. Lower values
926926- are preferred.
927927-928928-EXCHANGE A <domain-name> which specifies a host willing to act as
929929- a mail exchange for the owner name.
930930-931931-MX records cause type A additional section processing for the host
932932-specified by EXCHANGE. The use of MX RRs is explained in detail in
933933-[RFC-974].
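
As a sketch of how the PREFERENCE field might be consumed, the Python
fragment below splits MX RDATA into its two parts and orders a set of
exchanges. The function names are hypothetical, and treating
PREFERENCE as unsigned is an assumption, since the text above says
only "16 bit integer". EXCHANGE is left as undecoded octets.

```python
import struct
from typing import Iterable, List, Tuple


def split_mx_rdata(rdata: bytes) -> Tuple[int, bytes]:
    """Return (preference, encoded_exchange) from MX RDATA."""
    if len(rdata) < 3:
        raise ValueError("MX RDATA needs a preference and a domain name")
    (preference,) = struct.unpack_from(">H", rdata, 0)
    return preference, rdata[2:]


def order_exchanges(
    records: Iterable[Tuple[int, bytes]]
) -> List[Tuple[int, bytes]]:
    """Sort (preference, exchange) pairs so lower values come first."""
    return sorted(records, key=lambda record: record[0])
```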
934934-935935-3.3.10. NULL RDATA format (EXPERIMENTAL)
936936-937937- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
938938- / <anything> /
939939- / /
940940- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
941941-942942-Anything at all may be in the RDATA field so long as it is 65535 octets
943943-or less.
951951-952952-953953-NULL records cause no additional section processing. NULL RRs are not
954954-allowed in master files. NULLs are used as placeholders in some
955955-experimental extensions of the DNS.
956956-957957-3.3.11. NS RDATA format
958958-959959- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
960960- / NSDNAME /
961961- / /
962962- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
963963-964964-where:
965965-966966-NSDNAME A <domain-name> which specifies a host which should be
967967- authoritative for the specified class and domain.
968968-969969-NS records cause both the usual additional section processing to locate
970970-a type A record, and, when used in a referral, a special search of the
971971-zone in which they reside for glue information.
972972-973973-The NS RR states that the named host should be expected to have a zone
974974-starting at owner name of the specified class. Note that the class may
975975-not indicate the protocol family which should be used to communicate
976976-with the host, although it is typically a strong hint. For example,
977977-hosts which are name servers for either Internet (IN) or Hesiod (HS)
978978-class information are normally queried using IN class protocols.
979979-980980-3.3.12. PTR RDATA format
981981-982982- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
983983- / PTRDNAME /
984984- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
985985-986986-where:
987987-988988-PTRDNAME A <domain-name> which points to some location in the
989989- domain name space.
990990-991991-PTR records cause no additional section processing. These RRs are used
992992-in special domains to point to some other location in the domain space.
993993-These records are simple data, and don't imply any special processing
994994-similar to that performed by CNAME, which identifies aliases. See the
995995-description of the IN-ADDR.ARPA domain for an example.
10071007-10081008-10091009-3.3.13. SOA RDATA format
10101010-10111011- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10121012- / MNAME /
10131013- / /
10141014- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10151015- / RNAME /
10161016- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10171017- | SERIAL |
10181018- | |
10191019- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10201020- | REFRESH |
10211021- | |
10221022- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10231023- | RETRY |
10241024- | |
10251025- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10261026- | EXPIRE |
10271027- | |
10281028- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10291029- | MINIMUM |
10301030- | |
10311031- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10321032-10331033-where:
10341034-10351035-MNAME The <domain-name> of the name server that was the
10361036- original or primary source of data for this zone.
10371037-10381038-RNAME A <domain-name> which specifies the mailbox of the
10391039- person responsible for this zone.
10401040-10411041-SERIAL The unsigned 32 bit version number of the original copy
10421042- of the zone. Zone transfers preserve this value. This
10431043- value wraps and should be compared using sequence space
10441044- arithmetic.
10451045-10461046-REFRESH A 32 bit time interval before the zone should be
10471047- refreshed.
10481048-10491049-RETRY A 32 bit time interval that should elapse before a
10501050- failed refresh should be retried.
10511051-10521052-EXPIRE A 32 bit time value that specifies the upper limit on
10531053- the time interval that can elapse before the zone is no
10541054- longer authoritative.
10631063-10641064-10651065-MINIMUM The unsigned 32 bit minimum TTL field that should be
10661066- exported with any RR from this zone.
10671067-10681068-SOA records cause no additional section processing.
10691069-10701070-All times are in units of seconds.
10711071-10721072-Most of these fields are pertinent only for name server maintenance
10731073-operations. However, MINIMUM is used in all query operations that
10741074-retrieve RRs from a zone. Whenever a RR is sent in a response to a
10751075-query, the TTL field is set to the maximum of the TTL field from the RR
10761076-and the MINIMUM field in the appropriate SOA. Thus MINIMUM is a lower
10771077-bound on the TTL field for all RRs in a zone. Note that this use of
10781078-MINIMUM should occur when the RRs are copied into the response and not
10791079-when the zone is loaded from a master file or via a zone transfer. The
10801080-reason for this provision is to allow future dynamic update facilities to
10811081-change the SOA RR with known semantics.
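
Two of the SOA rules lend themselves to a short sketch. The Python
below is illustrative only and the function names are hypothetical.
This memo does not define "sequence space arithmetic" in detail; the
comparison shown is one common reading, along the lines later written
down in RFC 1982, and should be taken as an assumption. The TTL rule
follows the paragraph above directly.

```python
def serial_is_newer(candidate: int, current: int) -> bool:
    """Return True if candidate is ahead of current, allowing for wrap.

    Assumed rule: candidate is newer when the forward distance from
    current to candidate, modulo 2**32, is non-zero and below 2**31.
    """
    distance = (candidate - current) & 0xFFFFFFFF
    return 0 < distance < 0x80000000


def response_ttl(rr_ttl: int, soa_minimum: int) -> int:
    """TTL placed in a response: the maximum of the RR TTL and MINIMUM."""
    return max(rr_ttl, soa_minimum)
```

Because response_ttl is applied when RRs are copied into a response,
the TTL values held in the zone data are left unchanged.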
10821082-10831083-10841084-3.3.14. TXT RDATA format
10851085-10861086- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10871087- / TXT-DATA /
10881088- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
10891089-10901090-where:
10911091-10921092-TXT-DATA One or more <character-string>s.
10931093-10941094-TXT RRs are used to hold descriptive text. The semantics of the text
10951095-depends on the domain where it is found.
10961096-10971097-3.4. Internet specific RRs
10981098-10991099-3.4.1. A RDATA format
11001100-11011101- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
11021102- | ADDRESS |
11031103- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
11041104-11051105-where:
11061106-11071107-ADDRESS A 32 bit Internet address.
11081108-11091109-Hosts that have multiple Internet addresses will have multiple A
11101110-records.
11191119-11201120-11211121-A records cause no additional section processing. The RDATA section of
11221122-an A line in a master file is an Internet address expressed as four
11231123-decimal numbers separated by dots without any imbedded spaces (e.g.,
11241124-"10.2.0.52" or "192.0.5.6").
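
The conversion between the four-octet ADDRESS field and its master
file form can be sketched as below (Python, hypothetical function
names). The parser rejects embedded spaces, as the text above
requires.

```python
def a_rdata_to_text(rdata: bytes) -> str:
    """Render four ADDRESS octets as dotted decimal."""
    if len(rdata) != 4:
        raise ValueError("A RDATA must be exactly four octets")
    return ".".join(str(octet) for octet in rdata)


def text_to_a_rdata(text: str) -> bytes:
    """Parse four dot-separated decimal numbers into ADDRESS octets."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four decimal numbers separated by dots")
    octets = []
    for part in parts:
        # isdigit() is False for empty strings and for embedded spaces.
        if not (part.isascii() and part.isdigit()) or int(part) > 255:
            raise ValueError(f"invalid octet {part!r}")
        octets.append(int(part))
    return bytes(octets)
```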
11251125-11261126-3.4.2. WKS RDATA format
11271127-11281128- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
11291129- | ADDRESS |
11301130- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
11311131- | PROTOCOL | |
11321132- +--+--+--+--+--+--+--+--+ |
11331133- | |
11341134- / <BIT MAP> /
11351135- / /
11361136- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
11371137-11381138-where:
11391139-11401140-ADDRESS A 32 bit Internet address
11411141-11421142-PROTOCOL An 8 bit IP protocol number
11431143-11441144-<BIT MAP> A variable length bit map. The bit map must be a
11451145- multiple of 8 bits long.
11461146-11471147-The WKS record is used to describe the well known services supported by
11481148-a particular protocol on a particular internet address. The PROTOCOL
11491149-field specifies an IP protocol number, and the bit map has one bit per
11501150-port of the specified protocol. The first bit corresponds to port 0,
11511151-the second to port 1, etc. If the bit map does not include a bit for a
11521152-protocol of interest, that bit is assumed zero. The appropriate values
11531153-and mnemonics for ports and protocols are specified in [RFC-1010].
11541154-11551155-For example, if PROTOCOL=TCP (6), the 26th bit corresponds to TCP port
11561156-25 (SMTP). If this bit is set, a SMTP server should be listening on TCP
11571157-port 25; if zero, SMTP service is not supported on the specified
11581158-address.
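
The bit map lookup can be sketched as follows. This is illustrative
Python with hypothetical names; it assumes, in line with section
2.3.2, that the "first bit" is the most significant bit of the first
octet of the map.

```python
from typing import Iterable


def port_is_advertised(bitmap: bytes, port: int) -> bool:
    """Return True if the bit for port is set; missing bits count as zero."""
    index, bit = divmod(port, 8)
    if index >= len(bitmap):
        return False
    return bool(bitmap[index] & (0x80 >> bit))


def build_bitmap(ports: Iterable[int]) -> bytes:
    """Build the shortest whole-octet bit map covering the given ports."""
    ports = list(ports)
    if not ports:
        return b""
    bitmap = bytearray(max(ports) // 8 + 1)
    for port in ports:
        bitmap[port // 8] |= 0x80 >> (port % 8)
    return bytes(bitmap)
```

Port 25 falls in the fourth octet (index 3) at bit position 1, which
is the 26th bit of the map counting from one.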
11591159-11601160-The purpose of WKS RRs is to provide availability information for
11611161-servers for TCP and UDP. If a server supports both TCP and UDP, or has
11621162-multiple Internet addresses, then multiple WKS RRs are used.
11631163-11641164-WKS RRs cause no additional section processing.
11651165-11661166-In master files, both ports and protocols are expressed using mnemonics
11671167-or decimal numbers.
11751175-11761176-11771177-3.5. IN-ADDR.ARPA domain
11781178-11791179-The Internet uses a special domain to support gateway location and
11801180-Internet address to host mapping. Other classes may employ a similar
11811181-strategy in other domains. The intent of this domain is to provide a
11821182-guaranteed method to perform host address to host name mapping, and to
11831183-facilitate queries to locate all gateways on a particular network in the
11841184-Internet.
11851185-11861186-Note that both of these services are similar to functions that could be
11871187-performed by inverse queries; the difference is that this part of the
11881188-domain name space is structured according to address, and hence can
11891189-guarantee that the appropriate data can be located without an exhaustive
11901190-search of the domain space.
11911191-11921192-The domain begins at IN-ADDR.ARPA and has a substructure which follows
11931193-the Internet addressing structure.
11941194-11951195-Domain names in the IN-ADDR.ARPA domain are defined to have up to four
11961196-labels in addition to the IN-ADDR.ARPA suffix. Each label represents
11971197-one octet of an Internet address, and is expressed as a character string
11981198-for a decimal value in the range 0-255 (with leading zeros omitted
11991199-except in the case of a zero octet which is represented by a single
12001200-zero).
12011201-12021202-Host addresses are represented by domain names that have all four labels
12031203-specified. Thus data for Internet address 10.2.0.52 is located at
12041204-domain name 52.0.2.10.IN-ADDR.ARPA. The reversal, though awkward to
12051205-read, allows zones to be delegated which are exactly one network of
12061206-address space. For example, 10.IN-ADDR.ARPA can be a zone containing
12071207-data for the ARPANET, while 26.IN-ADDR.ARPA can be a separate zone for
12081208-MILNET. Address nodes are used to hold pointers to primary host names
12091209-in the normal domain space.
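
The mapping from an address, or from a network number of one to
three octets, to its IN-ADDR.ARPA name can be sketched as follows. The
Python function name in_addr_name is hypothetical, and the trailing
dot marks the result as an absolute name.

```python
def in_addr_name(address: str) -> str:
    """Map dotted decimal octets to the matching IN-ADDR.ARPA name.

    Accepts one to four octets, so network numbers and full host
    addresses are both handled. Leading zeros are dropped.
    """
    parts = address.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected one to four octets")
    labels = []
    for part in parts:
        if not (part.isascii() and part.isdigit()) or int(part) > 255:
            raise ValueError(f"invalid octet {part!r}")
        labels.append(str(int(part)))
    # Labels appear in the reverse of the order used in the address.
    return ".".join(reversed(labels)) + ".IN-ADDR.ARPA."
```

For instance, in_addr_name("10.2.0.52") gives 52.0.2.10.IN-ADDR.ARPA.
and in_addr_name("10") gives 10.IN-ADDR.ARPA., the node for the
network as a whole.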
12101210-12111211-Network numbers correspond to some non-terminal nodes at various depths
12121212-in the IN-ADDR.ARPA domain, since Internet network numbers are either 1,
12131213-2, or 3 octets. Network nodes are used to hold pointers to the primary
12141214-host names of gateways attached to that network. Since a gateway is, by
12151215-definition, on more than one network, it will typically have two or more
12161216-network nodes which point at it. Gateways will also have host level
12171217-pointers at their fully qualified addresses.
12181218-12191219-Both the gateway pointers at network nodes and the normal host pointers
12201220-at full address nodes use the PTR RR to point back to the primary domain
12211221-names of the corresponding hosts.
12221222-12231223-For example, the IN-ADDR.ARPA domain will contain information about the
12241224-ISI gateway between net 10 and 26, an MIT gateway from net 10 to MIT's
12251225-12261226-12271227-12281228-Mockapetris [Page 22]
12291229-12301230-RFC 1035 Domain Implementation and Specification November 1987
12311231-12321232-12331233-net 18, and hosts A.ISI.EDU and MULTICS.MIT.EDU. Assuming that the ISI
12341234-gateway has addresses 10.2.0.22 and 26.0.0.103, and a name MILNET-
12351235-GW.ISI.EDU, and the MIT gateway has addresses 10.0.0.77 and 18.10.0.4
12361236-and a name GW.LCS.MIT.EDU, the domain database would contain:
12371237-12381238- 10.IN-ADDR.ARPA. PTR MILNET-GW.ISI.EDU.
12391239- 10.IN-ADDR.ARPA. PTR GW.LCS.MIT.EDU.
12401240- 18.IN-ADDR.ARPA. PTR GW.LCS.MIT.EDU.
12411241- 26.IN-ADDR.ARPA. PTR MILNET-GW.ISI.EDU.
12421242- 22.0.2.10.IN-ADDR.ARPA. PTR MILNET-GW.ISI.EDU.
12431243- 103.0.0.26.IN-ADDR.ARPA. PTR MILNET-GW.ISI.EDU.
12441244- 77.0.0.10.IN-ADDR.ARPA. PTR GW.LCS.MIT.EDU.
12451245- 4.0.10.18.IN-ADDR.ARPA. PTR GW.LCS.MIT.EDU.
12461246- 103.0.3.26.IN-ADDR.ARPA. PTR A.ISI.EDU.
12471247- 6.0.0.10.IN-ADDR.ARPA. PTR MULTICS.MIT.EDU.
12481248-12491249-Thus a program which wanted to locate gateways on net 10 would originate
12501250-a query of the form QTYPE=PTR, QCLASS=IN, QNAME=10.IN-ADDR.ARPA. It
12511251-would receive two RRs in response:
12521252-12531253- 10.IN-ADDR.ARPA. PTR MILNET-GW.ISI.EDU.
12541254- 10.IN-ADDR.ARPA. PTR GW.LCS.MIT.EDU.
12551255-12561256-The program could then originate QTYPE=A, QCLASS=IN queries for MILNET-
12571257-GW.ISI.EDU. and GW.LCS.MIT.EDU. to discover the Internet addresses of
12581258-these gateways.
12591259-12601260-A resolver which wanted to find the host name corresponding to Internet
12611261-host address 10.0.0.6 would pursue a query of the form QTYPE=PTR,
12621262-QCLASS=IN, QNAME=6.0.0.10.IN-ADDR.ARPA, and would receive:
12631263-12641264- 6.0.0.10.IN-ADDR.ARPA. PTR MULTICS.MIT.EDU.
12651265-12661266-Several cautions apply to the use of these services:
12671267- - Since the IN-ADDR.ARPA special domain and the normal domain
12681268- for a particular host or gateway will be in different zones,
12691269- the possibility exists that the data may be inconsistent.
12701270-12711271- - Gateways will often have two names in separate domains, only
12721272- one of which can be primary.
12731273-12741274- - Systems that use the domain database to initialize their
12751275- routing tables must start with enough gateway information to
12761276- guarantee that they can access the appropriate name server.
12771277-12781278- - The gateway data only reflects the existence of a gateway in a
12791279- manner equivalent to the current HOSTS.TXT file. It doesn't
12801280- replace the dynamic availability information from GGP or EGP.
12871287-12881288-12891289-3.6. Defining new types, classes, and special namespaces
12901290-12911291-The previously defined types and classes are the ones in use as of the
12921292-date of this memo. New definitions should be expected. This section
12931293-makes some recommendations to designers considering additions to the
12941294-existing facilities. The mailing list NAMEDROPPERS@SRI-NIC.ARPA is the
12951295-forum where general discussion of design issues takes place.
12961296-12971297-In general, a new type is appropriate when new information is to be
12981298-added to the database about an existing object, or we need new data
12991299-formats for some totally new object. Designers should attempt to define
13001300-types and their RDATA formats that are generally applicable to all
13011301-classes, and which avoid duplication of information. New classes are
13021302-appropriate when the DNS is to be used for a new protocol, etc., which
13031303-requires new class-specific data formats, or when a copy of the existing
13041304-name space is desired, but a separate management domain is necessary.
13051305-13061306-New types and classes need mnemonics for master files; the format of the
13071307-master files requires that the mnemonics for type and class be disjoint.
13081308-13091309-TYPE and CLASS values must be a proper subset of QTYPEs and QCLASSes
13101310-respectively.
13111311-13121312-The present system uses multiple RRs to represent multiple values of a
13131313-type rather than storing multiple values in the RDATA section of a
13141314-single RR. This is less efficient for most applications, but does keep
13151315-RRs shorter. The multiple RRs assumption is incorporated in some
13161316-experimental work on dynamic update methods.
13171317-13181318-The present system attempts to minimize the duplication of data in the
13191319-database in order to ensure consistency. Thus, in order to find the
13201320-address of the host for a mail exchange, you map the mail domain name to
13211321-a host name, then the host name to addresses, rather than a direct
13221322-mapping to host address. This approach is preferred because it avoids
13231323-the opportunity for inconsistency.
13241324-13251325-In defining a new type of data, multiple RR types should not be used to
13261326-create an ordering between entries or express different formats for
13271327-equivalent bindings; instead, this information should be carried in the
13281328-body of the RR and a single type used. This policy avoids problems with
13291329-caching multiple types and defining QTYPEs to match multiple types.
13301330-13311331-For example, the original form of mail exchange binding used two RR
13321332-types, one to represent a "closer" exchange (MD) and one to represent a
13331333-"less close" exchange (MF). The difficulty is that the presence of one
13341334-RR type in a cache doesn't convey any information about the other
13351335-because the query which acquired the cached information might have used
13361336-a QTYPE of MF, MD, or MAILA (which matched both). The redesigned
13431343-13441344-13451345-service used a single type (MX) with a "preference" value in the RDATA
13461346-section which can order different RRs. However, if any MX RRs are found
13471347-in the cache, then all should be there.
13481348-13491349-4. MESSAGES
13501350-13511351-4.1. Format
13521352-13531353-All communications inside of the domain protocol are carried in a single
13541354-format called a message. The top level format of a message is divided
13551355-into 5 sections (some of which are empty in certain cases) shown below:
13561356-13571357- +---------------------+
13581358- | Header |
13591359- +---------------------+
13601360- | Question | the question for the name server
13611361- +---------------------+
13621362- | Answer | RRs answering the question
13631363- +---------------------+
13641364- | Authority | RRs pointing toward an authority
13651365- +---------------------+
13661366- | Additional | RRs holding additional information
13671367- +---------------------+
13681368-13691369-The header section is always present. The header includes fields that
13701370-specify which of the remaining sections are present, and also specify
13711371-whether the message is a query or a response, a standard query or some
13721372-other opcode, etc.
13731373-13741374-The names of the sections after the header are derived from their use in
13751375-standard queries. The question section contains fields that describe a
13761376-question to a name server. These fields are a query type (QTYPE), a
13771377-query class (QCLASS), and a query domain name (QNAME). The last three
13781378-sections have the same format: a possibly empty list of concatenated
13791379-resource records (RRs). The answer section contains RRs that answer the
13801380-question; the authority section contains RRs that point toward an
13811381-authoritative name server; the additional records section contains RRs
13821382-which relate to the query, but are not strictly answers for the
13831383-question.
13991399-14001400-14011401-4.1.1. Header section format
14021402-14031403-The header contains the following fields:
14041404-14051405- 1 1 1 1 1 1
14061406- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
14071407- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14081408- | ID |
14091409- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14101410- |QR| Opcode |AA|TC|RD|RA| Z | RCODE |
14111411- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14121412- | QDCOUNT |
14131413- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14141414- | ANCOUNT |
14151415- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14161416- | NSCOUNT |
14171417- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14181418- | ARCOUNT |
14191419- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
14201420-14211421-where:
14221422-14231423-ID A 16 bit identifier assigned by the program that
14241424- generates any kind of query. This identifier is copied
14251425- to the corresponding reply and can be used by the requester
14261426- to match up replies to outstanding queries.
14271427-14281428-QR A one bit field that specifies whether this message is a
14291429- query (0), or a response (1).
14301430-14311431-OPCODE A four bit field that specifies the kind of query in this
14321432- message. This value is set by the originator of a query
14331433- and copied into the response. The values are:
14341434-14351435- 0 a standard query (QUERY)
14361436-14371437- 1 an inverse query (IQUERY)
14381438-14391439- 2 a server status request (STATUS)
14401440-14411441- 3-15 reserved for future use
14421442-14431443-AA Authoritative Answer - this bit is valid in responses,
14441444- and specifies that the responding name server is an
14451445- authority for the domain name in the question section.
14461446-14471447- Note that the contents of the answer section may have
14481448- multiple owner names because of aliases. The AA bit
14551455-14561456-14571457- corresponds to the name which matches the query name, or
14581458- the first owner name in the answer section.
14591459-14601460-TC TrunCation - specifies that this message was truncated
14611461- due to length greater than that permitted on the
14621462- transmission channel.
14631463-14641464-RD Recursion Desired - this bit may be set in a query and
14651465- is copied into the response. If RD is set, it directs
14661466- the name server to pursue the query recursively.
14671467- Recursive query support is optional.
14681468-14691469-RA Recursion Available - this bit is set or cleared in a
14701470- response, and denotes whether recursive query support is
14711471- available in the name server.
14721472-14731473-Z Reserved for future use. Must be zero in all queries
14741474- and responses.
14751475-14761476-RCODE Response code - this 4 bit field is set as part of
14771477- responses. The values have the following
14781478- interpretation:
14791479-14801480- 0 No error condition
14811481-14821482- 1 Format error - The name server was
14831483- unable to interpret the query.
14841484-14851485- 2 Server failure - The name server was
14861486- unable to process this query due to a
14871487- problem with the name server.
14881488-14891489- 3 Name Error - Meaningful only for
14901490- responses from an authoritative name
14911491- server, this code signifies that the
14921492- domain name referenced in the query does
14931493- not exist.
14941494-14951495- 4 Not Implemented - The name server does
14961496- not support the requested kind of query.
14971497-14981498- 5 Refused - The name server refuses to
14991499- perform the specified operation for
15001500- policy reasons. For example, a name
15011501- server may not wish to provide the
15021502- information to the particular requester,
15031503- or a name server may not wish to perform
15041504- a particular operation (e.g., zone
15111511-15121512-15131513- transfer) for particular data.
15141514-15151515- 6-15 Reserved for future use.
15161516-15171517-QDCOUNT an unsigned 16 bit integer specifying the number of
15181518- entries in the question section.
15191519-15201520-ANCOUNT an unsigned 16 bit integer specifying the number of
15211521- resource records in the answer section.
15221522-15231523-NSCOUNT an unsigned 16 bit integer specifying the number of name
15241524- server resource records in the authority records
15251525- section.
15261526-15271527-ARCOUNT an unsigned 16 bit integer specifying the number of
15281528- resource records in the additional records section.
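
The header layout above can be decoded along the following lines. This is a sketch only: the `header` record, `u16`, and `parse_header` are hypothetical names introduced for illustration. It treats bit 0 of the diagram as the most significant bit of each 16 bit word and reads multi-octet fields most significant octet first.

```ocaml
(* Hypothetical record mirroring the header fields described above. *)
type header = {
  id : int; qr : bool; opcode : int; aa : bool; tc : bool;
  rd : bool; ra : bool; z : int; rcode : int;
  qdcount : int; ancount : int; nscount : int; arcount : int;
}

(* Read a big-endian 16 bit value at byte offset [off]. *)
let u16 (s : string) (off : int) : int =
  (Char.code s.[off] lsl 8) lor Char.code s.[off + 1]

(* Decode the fixed 12 octet header; None if the message is too short. *)
let parse_header (msg : string) : header option =
  if String.length msg < 12 then None
  else
    let flags = u16 msg 2 in
    let bit n = flags land (1 lsl n) <> 0 in
    Some {
      id = u16 msg 0;
      qr = bit 15;                        (* diagram bit 0 *)
      opcode = (flags lsr 11) land 0xF;   (* diagram bits 1-4 *)
      aa = bit 10; tc = bit 9; rd = bit 8; ra = bit 7;
      z = (flags lsr 4) land 0x7;         (* diagram bits 9-11 *)
      rcode = flags land 0xF;             (* diagram bits 12-15 *)
      qdcount = u16 msg 4; ancount = u16 msg 6;
      nscount = u16 msg 8; arcount = u16 msg 10;
    }
```

For instance, a flags word of 0x8180 decodes as a response (QR set) with RD and RA set, opcode 0, and RCODE 0.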
15291529-15301530-4.1.2. Question section format
15311531-15321532-The question section is used to carry the "question" in most queries,
15331533-i.e., the parameters that define what is being asked. The section
15341534-contains QDCOUNT (usually 1) entries, each of the following format:
15351535-15361536- 1 1 1 1 1 1
15371537- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
15381538- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15391539- | |
15401540- / QNAME /
15411541- / /
15421542- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15431543- | QTYPE |
15441544- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15451545- | QCLASS |
15461546- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15471547-15481548-where:
15491549-15501550-QNAME a domain name represented as a sequence of labels, where
15511551- each label consists of a length octet followed by that
15521552- number of octets. The domain name terminates with the
15531553- zero length octet for the null label of the root. Note
15541554- that this field may be an odd number of octets; no
15551555- padding is used.
15561556-15571557-QTYPE a two octet code which specifies the type of the query.
15581558- The values for this field include all codes valid for a
15591559- TYPE field, together with some more general codes which
15601560- can match more than one type of RR.
15671567-15681568-15691569-QCLASS a two octet code that specifies the class of the query.
15701570- For example, the QCLASS field is IN for the Internet.
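
The QNAME encoding described above (a length octet, then that many octets, ending with the zero length octet) might be produced as follows. This is a sketch under stated assumptions: `encode_name` is a hypothetical helper, it takes a plain dotted name without master-file escapes, and it checks only the 63 octet label limit, not the overall name length.

```ocaml
(* Encode a dotted name as length-prefixed labels ending in the zero octet.
   Master-file escapes (\X, \DDD) are not handled in this sketch. *)
let encode_name (name : string) : (string, string) result =
  let labels =
    if name = "" || name = "." then []    (* the root has no labels *)
    else
      match List.rev (String.split_on_char '.' name) with
      | "" :: rest -> List.rev rest       (* trailing dot of an absolute name *)
      | all -> List.rev all
  in
  let buf = Buffer.create 64 in
  let rec go = function
    | [] -> Buffer.add_char buf '\000'; Ok (Buffer.contents buf)
    | l :: rest ->
      let n = String.length l in
      if n = 0 then Error "empty label"
      else if n > 63 then Error "label longer than 63 octets"
      else begin
        Buffer.add_char buf (Char.chr n);
        Buffer.add_string buf l;
        go rest
      end
  in
  go labels
```

Under these assumptions, `encode_name "F.ISI.ARPA"` gives the 12 octets `"\001F\003ISI\004ARPA\000"`, and the root name encodes as a single zero octet.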
15711571-15721572-4.1.3. Resource record format
15731573-15741574-The answer, authority, and additional sections all share the same
15751575-format: a variable number of resource records, where the number of
15761576-records is specified in the corresponding count field in the header.
15771577-Each resource record has the following format:
15781578- 1 1 1 1 1 1
15791579- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
15801580- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15811581- | |
15821582- / /
15831583- / NAME /
15841584- | |
15851585- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15861586- | TYPE |
15871587- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15881588- | CLASS |
15891589- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15901590- | TTL |
15911591- | |
15921592- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15931593- | RDLENGTH |
15941594- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15951595- / RDATA /
15961596- / /
15971597- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
15981598-15991599-where:
16001600-16011601-NAME a domain name to which this resource record pertains.
16021602-16031603-TYPE two octets containing one of the RR type codes. This
16041604- field specifies the meaning of the data in the RDATA
16051605- field.
16061606-16071607-CLASS two octets which specify the class of the data in the
16081608- RDATA field.
16091609-16101610-TTL a 32 bit unsigned integer that specifies the time
16111611- interval (in seconds) that the resource record may be
16121612- cached before it should be discarded. Zero values are
16131613- interpreted to mean that the RR can only be used for the
16141614- transaction in progress, and should not be cached.
16231623-16241624-16251625-RDLENGTH an unsigned 16 bit integer that specifies the length in
16261626- octets of the RDATA field.
16271627-16281628-RDATA a variable length string of octets that describes the
16291629- resource. The format of this information varies
16301630- according to the TYPE and CLASS of the resource record.
16311631- For example, if the TYPE is A and the CLASS is IN,
16321632- the RDATA field is a 4 octet ARPA Internet address.
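
The fixed-size fields that follow NAME in a resource record could be read as sketched below. The names `rr_fixed` and `read_rr_fixed` are hypothetical, `off` is assumed to be the offset of the first octet after NAME, and the TTL is held in a native `int`, which assumes a 64 bit OCaml platform.

```ocaml
(* Hypothetical record for the fields that follow NAME in an RR. *)
type rr_fixed = {
  typ : int; cls : int; ttl : int; rdlength : int; rdata : string;
}

(* Read TYPE, CLASS, TTL, RDLENGTH and RDATA starting at [off].
   Returns None if the message is too short for the declared lengths. *)
let read_rr_fixed (msg : string) (off : int) : rr_fixed option =
  let len = String.length msg in
  let byte i = Char.code msg.[i] in
  if off < 0 || off + 10 > len then None
  else
    let u16 i = (byte i lsl 8) lor byte (i + 1) in
    let rdlength = u16 (off + 8) in
    if off + 10 + rdlength > len then None
    else
      Some {
        typ = u16 off;
        cls = u16 (off + 2);
        ttl = (u16 (off + 4) lsl 16) lor u16 (off + 6);
        rdlength;
        rdata = String.sub msg (off + 10) rdlength;
      }
```

For an A record in class IN, `rdata` would hold the 4 octets of the address, for example `"\010\002\000\052"` for 10.2.0.52.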
16331633-16341634-4.1.4. Message compression
16351635-16361636-In order to reduce the size of messages, the domain system utilizes a
16371637-compression scheme which eliminates the repetition of domain names in a
16381638-message. In this scheme, an entire domain name or a list of labels at
16391639-the end of a domain name is replaced with a pointer to a prior occurrence
16401640-of the same name.
16411641-16421642-The pointer takes the form of a two octet sequence:
16431643-16441644- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
16451645- | 1 1| OFFSET |
16461646- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
16471647-16481648-The first two bits are ones. This allows a pointer to be distinguished
16491649-from a label, since the label must begin with two zero bits because
16501650-labels are restricted to 63 octets or less. (The 10 and 01 combinations
16511651-are reserved for future use.) The OFFSET field specifies an offset from
16521652-the start of the message (i.e., the first octet of the ID field in the
16531653-domain header). A zero offset specifies the first byte of the ID field,
16541654-etc.
16551655-16561656-The compression scheme allows a domain name in a message to be
16571657-represented as either:
16581658-16591659- - a sequence of labels ending in a zero octet
16601660-16611661- - a pointer
16621662-16631663- - a sequence of labels ending with a pointer
16641664-16651665-Pointers can only be used for occurrences of a domain name where the
16661666-format is not class specific. If this were not the case, a name server
16671667-or resolver would be required to know the format of all RRs it handled.
16681668-As yet, there are no such cases, but they may occur in future RDATA
16691669-formats.
16701670-16711671-If a domain name is contained in a part of the message subject to a
16721672-length field (such as the RDATA section of an RR), and compression is
16791679-16801680-16811681-used, the length of the compressed name is used in the length
16821682-calculation, rather than the length of the expanded name.
16831683-16841684-Programs are free to avoid using pointers in messages they generate,
16851685-although this will reduce datagram capacity, and may cause truncation.
16861686-However, all programs are required to understand arriving messages that
16871687-contain pointers.
16881688-16891689-For example, a datagram might need to use the domain names F.ISI.ARPA,
16901690-FOO.F.ISI.ARPA, ARPA, and the root. Ignoring the other fields of the
16911691-message, these domain names might be represented as:
16921692-16931693- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
16941694- 20 | 1 | F |
16951695- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
16961696- 22 | 3 | I |
16971697- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
16981698- 24 | S | I |
16991699- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17001700- 26 | 4 | A |
17011701- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17021702- 28 | R | P |
17031703- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17041704- 30 | A | 0 |
17051705- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17061706-17071707- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17081708- 40 | 3 | F |
17091709- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17101710- 42 | O | O |
17111711- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17121712- 44 | 1 1| 20 |
17131713- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17141714-17151715- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17161716- 64 | 1 1| 26 |
17171717- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17181718-17191719- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17201720- 92 | 0 | |
17211721- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
17221722-17231723-The domain name for F.ISI.ARPA is shown at offset 20. The domain name
17241724-FOO.F.ISI.ARPA is shown at offset 40; this definition uses a pointer to
17251725-concatenate a label for FOO to the previously defined F.ISI.ARPA. The
17261726-domain name ARPA is defined at offset 64 using a pointer to the ARPA
17271727-component of the name F.ISI.ARPA at 20; note that this pointer relies on
17281728-ARPA being the last label in the string at 20. The root domain name is
17351735-17361736-17371737-defined by a single octet of zeros at 92; the root domain name has no
17381738-labels.
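
The example above can be reproduced with a small decoder. This is a sketch, not a definitive implementation: `read_name` and `example` are hypothetical names, and the limit of 16 pointer hops is an assumption added here to guard against pointer loops, not something the text specifies.

```ocaml
(* Read the domain name starting at [off] in [msg], following pointers.
   Returns the labels in order; the empty list is the root. *)
let read_name (msg : string) (off : int) : (string list, string) result =
  let len = String.length msg in
  (* [hops] bounds pointer chasing; the limit is an assumption of this sketch. *)
  let rec go off hops acc =
    if off < 0 || off >= len then Error "offset out of range"
    else
      let b = Char.code msg.[off] in
      match b lsr 6 with
      | 0 ->
        if b = 0 then Ok (List.rev acc)
        else if off + 1 + b > len then Error "label runs past end of message"
        else go (off + 1 + b) hops (String.sub msg (off + 1) b :: acc)
      | 3 ->
        if off + 1 >= len then Error "truncated pointer"
        else if hops = 0 then Error "too many pointers"
        else
          let target = ((b land 0x3F) lsl 8) lor Char.code msg.[off + 1] in
          go target (hops - 1) acc
      | _ -> Error "reserved label type"
  in
  go off 16 []

(* The layout from the example: names at offsets 20, 40, 64 and 92. *)
let example : string =
  let b = Bytes.make 93 '\000' in
  Bytes.blit_string "\001F\003ISI\004ARPA\000" 0 b 20 12;
  Bytes.blit_string "\003FOO\192\020" 0 b 40 6;   (* FOO + pointer to 20 *)
  Bytes.blit_string "\192\026" 0 b 64 2;          (* pointer to 26 *)
  Bytes.to_string b
```

Reading at offset 40 yields the labels FOO, F, ISI, ARPA; reading at 64 yields ARPA alone; reading at 92 yields the empty list for the root.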
17391739-17401740-4.2. Transport
17411741-17421742-The DNS assumes that messages will be transmitted as datagrams or in a
17431743-byte stream carried by a virtual circuit. While virtual circuits can be
17441744-used for any DNS activity, datagrams are preferred for queries due to
17451745-their lower overhead and better performance. Zone refresh activities
17461746-must use virtual circuits because of the need for reliable transfer.
17471747-17481748-The Internet supports name server access using TCP [RFC-793] on server
17491749-port 53 (decimal) as well as datagram access using UDP [RFC-768] on UDP
17501750-port 53 (decimal).
17511751-17521752-4.2.1. UDP usage
17531753-17541754-Messages sent using UDP use server port 53 (decimal).
17551755-17561756-Messages carried by UDP are restricted to 512 bytes (not counting the IP
17571757-or UDP headers). Longer messages are truncated and the TC bit is set in
17581758-the header.
17591759-17601760-UDP is not acceptable for zone transfers, but is the recommended method
17611761-for standard queries in the Internet. Queries sent using UDP may be
17621762-lost, and hence a retransmission strategy is required. Queries or their
17631763-responses may be reordered by the network, or by processing in name
17641764-servers, so resolvers should not depend on them being returned in order.
17651765-17661766-The optimal UDP retransmission policy will vary with performance of the
17671767-Internet and the needs of the client, but the following are recommended:
17681768-17691769- - The client should try other servers and server addresses
17701770- before repeating a query to a specific address of a server.
17711771-17721772- - The retransmission interval should be based on prior
17731773- statistics if possible. Too aggressive retransmission can
17741774- easily slow responses for the community at large. Depending
17751775- on how well connected the client is to its expected servers,
17761776- the minimum retransmission interval should be 2-5 seconds.
17771777-17781778-More suggestions on server selection and retransmission policy can be
17791779-found in the resolver section of this memo.
17801780-17811781-4.2.2. TCP usage
17821782-17831783-Messages sent over TCP connections use server port 53 (decimal). The
17841784-message is prefixed with a two byte length field which gives the message
17911791-17921792-17931793-length, excluding the two byte length field. This length field allows
17941794-the low-level processing to assemble a complete message before beginning
17951795-to parse it.
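
The two byte length prefix can be illustrated with a pair of helpers. This is a sketch: `frame` and `unframe` are hypothetical names, and the length is written most significant octet first, as for the other multi-octet fields in a message.

```ocaml
(* Prefix a DNS message with its two byte big-endian length. *)
let frame (msg : string) : string option =
  let n = String.length msg in
  if n > 0xFFFF then None
  else
    let b = Bytes.create (n + 2) in
    Bytes.set b 0 (Char.chr (n lsr 8));
    Bytes.set b 1 (Char.chr (n land 0xFF));
    Bytes.blit_string msg 0 b 2 n;
    Some (Bytes.to_string b)

(* Split one complete message off the front of a receive buffer.
   Returns the message and the remaining bytes, or None if more
   data must arrive before a whole message is available. *)
let unframe (buf : string) : (string * string) option =
  let have = String.length buf in
  if have < 2 then None
  else
    let n = (Char.code buf.[0] lsl 8) lor Char.code buf.[1] in
    if have < n + 2 then None
    else Some (String.sub buf 2 n, String.sub buf (n + 2) (have - n - 2))
```

A receiver can call `unframe` on its accumulated bytes after each read and begin parsing only when it returns a complete message.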
17961796-17971797-Several connection management policies are recommended:
17981798-17991799- - The server should not block other activities waiting for TCP
18001800- data.
18011801-18021802- - The server should support multiple connections.
18031803-18041804- - The server should assume that the client will initiate
18051805- connection closing, and should delay closing its end of the
18061806- connection until all outstanding client requests have been
18071807- satisfied.
18081808-18091809- - If the server needs to close a dormant connection to reclaim
18101810- resources, it should wait until the connection has been idle
18111811- for a period on the order of two minutes. In particular, the
18121812- server should allow the SOA and AXFR request sequence (which
18131813- begins a refresh operation) to be made on a single connection.
18141814- Since the server would be unable to answer queries anyway, a
18151815- unilateral close or reset may be used instead of a graceful
18161816- close.
18171817-18181818-5. MASTER FILES
18191819-18201820-Master files are text files that contain RRs in text form. Since the
18211821-contents of a zone can be expressed in the form of a list of RRs a
18221822-master file is most often used to define a zone, though it can be used
18231823-to list a cache's contents. Hence, this section first discusses the
18241824-format of RRs in a master file, and then the special considerations when
18251825-a master file is used to create a zone in some name server.
18261826-18271827-5.1. Format
18281828-18291829-The format of these files is a sequence of entries. Entries are
18301830-predominantly line-oriented, though parentheses can be used to continue
18311831-a list of items across a line boundary, and text literals can contain
18321832-CRLF within the text. Any combination of tabs and spaces acts as a
18331833-delimiter between the separate items that make up an entry. Any line
18341834-in the master file can end with a comment. The comment starts
18351835-with a ";" (semicolon).
18361836-18371837-The following entries are defined:
18381838-18391839- <blank>[<comment>]
18471847-18481848-18491849- $ORIGIN <domain-name> [<comment>]
18501850-18511851- $INCLUDE <file-name> [<domain-name>] [<comment>]
18521852-18531853- <domain-name><rr> [<comment>]
18541854-18551855- <blank><rr> [<comment>]
18561856-18571857-Blank lines, with or without comments, are allowed anywhere in the file.
18581858-18591859-Two control entries are defined: $ORIGIN and $INCLUDE. $ORIGIN is
18601860-followed by a domain name, and resets the current origin for relative
18611861-domain names to the stated name. $INCLUDE inserts the named file into
18621862-the current file, and may optionally specify a domain name that sets the
18631863-relative domain name origin for the included file. $INCLUDE may also
18641864-have a comment. Note that a $INCLUDE entry never changes the relative
18651865-origin of the parent file, regardless of changes to the relative origin
18661866-made within the included file.
18671867-18681868-The last two forms represent RRs. If an entry for an RR begins with a
18691869-blank, then the RR is assumed to be owned by the last stated owner. If
18701870-an RR entry begins with a <domain-name>, then the owner name is reset.
18711871-18721872-<rr> contents take one of the following forms:
18731873-18741874- [<TTL>] [<class>] <type> <RDATA>
18751875-18761876- [<class>] [<TTL>] <type> <RDATA>
18771877-18781878-The RR begins with optional TTL and class fields, followed by a type and
18791879-RDATA field appropriate to the type and class. Class and type use the
18801880-standard mnemonics; TTL is a decimal integer. Omitted class and TTL
18811881-values default to the last explicitly stated values. Since type and
18821882-class mnemonics are disjoint, the parse is unique. (Note that this
18831883-order is different from the order used in examples and the order used in
18841884-the actual RRs; the given order allows easier parsing and defaulting.)
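
Because class and type mnemonics are disjoint and a TTL is purely numeric, the optional leading fields can be recognized token by token. The sketch below assumes the entry has already been split into tokens; `rr_fields`, `is_class`, `ttl_of`, and `parse_rr` are hypothetical names, the class list (IN, CS, CH, HS) is the set defined elsewhere in this memo, and mnemonics are compared case-insensitively as an assumption of this example.

```ocaml
type rr_fields = {
  ttl : int option;      (* None: default to the last stated TTL *)
  cls : string option;   (* None: default to the last stated class *)
  typ : string;
  rdata : string list;
}

let is_class (t : string) : bool =
  List.mem (String.uppercase_ascii t) [ "IN"; "CS"; "CH"; "HS" ]

(* A TTL token is a non-empty run of decimal digits. *)
let ttl_of (t : string) : int option =
  let digit c = c >= '0' && c <= '9' in
  let chars = List.init (String.length t) (String.get t) in
  if t <> "" && List.for_all digit chars then int_of_string_opt t else None

let is_ttl (t : string) : bool = ttl_of t <> None

(* Accept [<TTL>] [<class>] <type> <RDATA> or [<class>] [<TTL>] <type> <RDATA>. *)
let parse_rr (tokens : string list) : rr_fields option =
  let finish ttl cls = function
    | typ :: rdata when not (is_class typ) && not (is_ttl typ) ->
      Some { ttl; cls; typ = String.uppercase_ascii typ; rdata }
    | _ -> None
  in
  match tokens with
  | a :: b :: rest when is_ttl a && is_class b ->
    finish (ttl_of a) (Some (String.uppercase_ascii b)) rest
  | a :: b :: rest when is_class a && is_ttl b ->
    finish (ttl_of b) (Some (String.uppercase_ascii a)) rest
  | a :: rest when is_ttl a -> finish (ttl_of a) None rest
  | a :: rest when is_class a ->
    finish None (Some (String.uppercase_ascii a)) rest
  | rest -> finish None None rest
```

A caller would then substitute the last explicitly stated TTL and class wherever the corresponding field comes back as `None`.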
18851885-18861886-<domain-name>s make up a large share of the data in the master file.
18871887-The labels in the domain name are expressed as character strings and
18881888-separated by dots. Quoting conventions allow arbitrary characters to be
18891889-stored in domain names. Domain names that end in a dot are called
18901890-absolute, and are taken as complete. Domain names which do not end in a
18911891-dot are called relative; the actual domain name is the concatenation of
18921892-the relative part with an origin specified in a $ORIGIN, $INCLUDE, or as
18931893-an argument to the master file loading routine. A relative name is an
18941894-error when no origin is available.
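
The absolute and relative name rule can be sketched as follows. `qualify` is a hypothetical helper; it assumes the origin, when given, is itself absolute, and it does not account for an escaped trailing dot ("\.").

```ocaml
(* Resolve a master-file name against an optional origin.
   [origin] is expected to be absolute (ending in a dot). *)
let qualify ?origin (name : string) : (string, string) result =
  let n = String.length name in
  if n = 0 then Error "empty name"
  else if name.[n - 1] = '.' then Ok name          (* absolute: taken as is *)
  else
    match origin with
    | None -> Error "relative name with no origin"
    | Some "." -> Ok (name ^ ".")                   (* origin is the root *)
    | Some o -> Ok (name ^ "." ^ o)
```

For example, `qualify ~origin:"ISI.EDU." "VENERA"` gives `Ok "VENERA.ISI.EDU."`, while `qualify "VENERA"` with no origin is an error.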
19031903-19041904-19051905-<character-string> is expressed in one of two ways: as a contiguous set
19061906-of characters without interior spaces, or as a string beginning with a "
19071907-and ending with a ". Inside a " delimited string any character can
19081908-occur, except for a " itself, which must be quoted using \ (back slash).
19091909-19101910-Because these files are text files several special encodings are
19111911-necessary to allow arbitrary data to be loaded. In particular:
19121912-19131913- of the root.
19141914-19151915-@ A free standing @ is used to denote the current origin.
19161916-19171917-\X where X is any character other than a digit (0-9), is
19181918- used to quote that character so that its special meaning
19191919- does not apply. For example, "\." can be used to place
19201920- a dot character in a label.
19211921-19221922-\DDD where each D is a digit is the octet corresponding to
19231923- the decimal number described by DDD. The resulting
19241924- octet is assumed to be text and is not checked for
19251925- special meaning.
19261926-19271927-( ) Parentheses are used to group data that crosses a line
19281928- boundary. In effect, line terminations are not
19291929- recognized within parentheses.
19301930-19311931-; Semicolon is used to start a comment; the remainder of
19321932- the line is ignored.
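
The \X and \DDD escapes can be applied while splitting a domain name into labels, so that a quoted dot stays inside its label. This is a minimal sketch: `decode_ddd` and `split_labels` are hypothetical names, and "@", parentheses, and comments are assumed to have been handled before this step.

```ocaml
(* Decode the three digits of a \DDD escape starting at [i], if valid. *)
let decode_ddd (s : string) (i : int) : char option =
  let is_digit c = c >= '0' && c <= '9' in
  if i + 2 < String.length s
     && is_digit s.[i] && is_digit s.[i + 1] && is_digit s.[i + 2]
  then
    (match int_of_string_opt (String.sub s i 3) with
     | Some v when v <= 255 -> Some (Char.chr v)
     | _ -> None)
  else None

(* Split a master-file name on unescaped dots, decoding escapes.
   An absolute name yields a final empty label for the root. *)
let split_labels (s : string) : (string list, string) result =
  let n = String.length s in
  let buf = Buffer.create 16 in
  let is_digit c = c >= '0' && c <= '9' in
  let flush acc =
    let l = Buffer.contents buf in
    Buffer.clear buf;
    l :: acc
  in
  let rec go i acc =
    if i >= n then Ok (List.rev (flush acc))
    else
      match s.[i] with
      | '.' -> go (i + 1) (flush acc)
      | '\\' when i + 1 >= n -> Error "dangling backslash"
      | '\\' when is_digit s.[i + 1] ->
        (match decode_ddd s (i + 1) with
         | Some c -> Buffer.add_char buf c; go (i + 4) acc
         | None -> Error "bad decimal escape")
      | '\\' -> Buffer.add_char buf s.[i + 1]; go (i + 2) acc
      | c -> Buffer.add_char buf c; go (i + 1) acc
  in
  go 0 []
```

Applied to the text `Action\.domains.ISI.EDU.`, this yields the labels `Action.domains`, `ISI`, `EDU`, and a final empty label marking the root.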
19331933-19341934-5.2. Use of master files to define zones
19351935-19361936-When a master file is used to load a zone, the operation should be
19371937-suppressed if any errors are encountered in the master file. The
19381938-rationale for this is that a single error can have widespread
19391939-consequences. For example, suppose that the RRs defining a delegation
19401940-have syntax errors; then the server will return authoritative name
19411941-errors for all names in the subzone (except in the case where the
19421942-subzone is also present on the server).
19431943-19441944-Several other validity checks should be performed in addition to
19451945-ensuring that the file is syntactically correct:
19461946-19471947- 1. All RRs in the file should have the same class.
19481948-19491949- 2. Exactly one SOA RR should be present at the top of the zone.
19501950-19511951- 3. If delegations are present and glue information is required,
19521952- it should be present.
19591959-19601960-19611961- 4. Information present outside of the authoritative nodes in the
19621962- zone should be glue information, rather than the result of an
19631963- origin or similar error.
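
A rough sketch of checks 1, 2 and 4 follows. The RR tuple and the
check_zone function are hypothetical names, owner names are assumed to
be absolute (ending in a dot), and out-of-zone names are only reported
for review because they may be legitimate glue. Check 3 needs
knowledge of delegations and is left out.

```python
from collections import namedtuple

# Hypothetical, simplified record: owner name, type, class.
RR = namedtuple("RR", "owner rtype rclass")


def check_zone(origin, rrs):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    top = origin.lower()

    classes = sorted({rr.rclass for rr in rrs})
    if len(classes) > 1:
        problems.append("mixed classes: " + ", ".join(classes))

    soas = [rr for rr in rrs if rr.rtype == "SOA"]
    if len(soas) != 1:
        problems.append(f"expected exactly one SOA RR, found {len(soas)}")
    elif soas[0].owner.lower() != top:
        problems.append("SOA RR is not at the top of the zone")

    for rr in rrs:
        name = rr.owner.lower()
        if name != top and not name.endswith("." + top):
            problems.append(f"{rr.owner} is outside {origin}: glue or origin error?")
    return problems
```

A loader following the advice above would refuse to install the zone
whenever the returned list is non-empty.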

5.3. Master file example

The following is an example file which might be used to define the
ISI.EDU zone and is loaded with an origin of ISI.EDU:

@   IN  SOA     VENERA      Action\.domains (
                                 20     ; SERIAL
                                 7200   ; REFRESH
                                 600    ; RETRY
                                 3600000; EXPIRE
                                 60)    ; MINIMUM

        NS      A.ISI.EDU.
        NS      VENERA
        NS      VAXA
        MX      10      VENERA
        MX      20      VAXA

A       A       26.3.0.103

VENERA  A       10.1.0.52
        A       128.9.0.32

VAXA    A       10.2.0.27
        A       128.9.0.33


$INCLUDE <SUBSYS>ISI-MAILBOXES.TXT

Where the file <SUBSYS>ISI-MAILBOXES.TXT is:

    MOE     MB      A.ISI.EDU.
    LARRY   MB      A.ISI.EDU.
    CURLEY  MB      A.ISI.EDU.
    STOOGES MG      MOE
            MG      LARRY
            MG      CURLEY

Note the use of the \ character in the SOA RR to specify the responsible
person mailbox "Action.domains@ISI.EDU".

6. NAME SERVER IMPLEMENTATION

6.1. Architecture

The optimal structure for the name server will depend on the host
operating system and whether the name server is integrated with resolver
operations, either by supporting recursive service, or by sharing its
database with a resolver. This section discusses implementation
considerations for a name server which shares a database with a
resolver, but most of these concerns are present in any name server.

6.1.1. Control

A name server must employ multiple concurrent activities, whether they
are implemented as separate tasks in the host's OS or multiplexing
inside a single name server program. It is simply not acceptable for a
name server to block the service of UDP requests while it waits for TCP
data for refreshing or query activities. Similarly, a name server
should not attempt to provide recursive service without processing such
requests in parallel, though it may choose to serialize requests from a
single client, or to regard identical requests from the same client as
duplicates. A name server should not substantially delay requests while
it reloads a zone from master files or while it incorporates a newly
refreshed zone into its database.

6.1.2. Database

While name server implementations are free to use any internal data
structures they choose, the suggested structure consists of three major
parts:

   - A "catalog" data structure which lists the zones available to
     this server, and a "pointer" to the zone data structure. The
     main purpose of this structure is to find the nearest ancestor
     zone, if any, for arriving standard queries.

   - Separate data structures for each of the zones held by the
     name server.

   - A data structure for cached data (or perhaps separate caches
     for different classes).

All of these data structures can be implemented in an identical tree
structure format, with different data chained off the nodes in different
parts: in the catalog the data is pointers to zones, while in the zone
and cache data structures, the data will be RRs. In designing the tree
framework the designer should recognize that query processing will need
to traverse the tree using case-insensitive label comparisons; and that
in real data, a few nodes have a very high branching factor (100-1000 or
more), but the vast majority have a very low branching factor (0-1).

One way to solve the case problem is to store the labels for each node
in two pieces: a standardized-case representation of the label where all
ASCII characters are in a single case, together with a bit mask that
denotes which characters are actually of a different case. The
branching factor diversity can be handled using a simple linked list for
a node until the branching factor exceeds some threshold, and
transitioning to a hash structure after the threshold is exceeded. In
any case, hash structures used to store tree sections must ensure that
hash functions and procedures preserve the casing conventions of the
DNS.
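
The two-piece label storage can be illustrated as follows. This is a
sketch only: the choice of lower case as the standardized case, the
integer bit mask, and the names split_case and restore_case are all
assumptions made for the example.

```python
def split_case(label: str):
    """Return (folded label, mask); bit i is set if character i was upper case."""
    folded = []
    mask = 0
    for i, ch in enumerate(label):
        if "A" <= ch <= "Z":
            mask |= 1 << i
            folded.append(chr(ord(ch) + 32))  # fold ASCII letters only
        else:
            folded.append(ch)
    return "".join(folded), mask


def restore_case(folded: str, mask: int) -> str:
    """Rebuild the original spelling from the folded label and its mask."""
    return "".join(
        chr(ord(ch) - 32) if ((mask >> i) & 1) and "a" <= ch <= "z" else ch
        for i, ch in enumerate(folded)
    )
```

The folded label is what a tree node would compare or hash on, so
"Venera" and "VENERA" reach the same node, while the mask lets the
server return the spelling it was originally given.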

The use of separate structures for the different parts of the database
is motivated by several factors:

   - The catalog structure can be an almost static structure that
     need change only when the system administrator changes the
     zones supported by the server. This structure can also be
     used to store parameters used to control refreshing
     activities.

   - The individual data structures for zones allow a zone to be
     replaced simply by changing a pointer in the catalog. Zone
     refresh operations can build a new structure and, when
     complete, splice it into the database via a simple pointer
     replacement. It is very important that when a zone is
     refreshed, queries should not use old and new data
     simultaneously.

   - With the proper search procedures, authoritative data in zones
     will always "hide", and hence take precedence over, cached
     data.

   - Errors in zone definitions that cause overlapping zones, etc.,
     may cause erroneous responses to queries, but problem
     determination is simplified, and the contents of one "bad"
     zone can't corrupt another.

   - Since the cache is most frequently updated, it is most
     vulnerable to corruption during system restarts. It can also
     become full of expired RR data. In either case, it can easily
     be discarded without disturbing zone data.
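
The catalog and the pointer-replacement idea might be sketched as
below. The Catalog class, its method names, and the use of a read-only
dictionary as the "zone data structure" are assumptions for
illustration; origins are assumed to be absolute names ending in a
dot.

```python
from types import MappingProxyType


class Catalog:
    """Maps zone origins to complete, read-only zone snapshots."""

    def __init__(self):
        self._zones = {}

    def install(self, origin, records):
        # Build the new zone completely, then splice it in with a single
        # reference replacement; the old snapshot is never modified.
        snapshot = MappingProxyType(dict(records))
        self._zones[origin.lower()] = snapshot

    def nearest_zone(self, qname):
        """Return (origin, snapshot) for the nearest ancestor zone, or None."""
        labels = qname.lower().rstrip(".").split(".")
        for i in range(len(labels)):
            origin = ".".join(labels[i:]) + "."
            if origin in self._zones:
                return origin, self._zones[origin]
        return None
```

A query that fetches its snapshot once and uses only that reference
sees either the old zone or the new one, never a mixture of the two.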

A major aspect of database design is selecting a structure which allows
the name server to deal with crashes of the name server's host. State
information which a name server should save across system crashes
includes the catalog structure (including the state of refreshing for
each zone) and the zone data itself.

6.1.3. Time

Both the TTL data for RRs and the timing data for refreshing activities
depend on 32-bit timers in units of seconds. Inside the database,
refresh timers and TTLs for cached data conceptually "count down", while
data in the zone stays with constant TTLs.

A recommended implementation strategy is to store time in two ways: as
a relative increment and as an absolute time. One way to do this is to
use positive 32-bit numbers for one type and negative numbers for the
other. The RRs in zones use relative times; the refresh timers and
cache data use absolute times. Absolute numbers are taken with respect
to some known origin and converted to relative values when placed in the
response to a query. When an absolute TTL is negative after conversion
to relative, then the data is expired and should be ignored.
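
One possible reading of the sign convention is sketched here. Which
sign carries which meaning is an assumption (non-negative for relative
zone TTLs, negative for absolute cache expiry times), as are the
function names; times are plain integer seconds since some origin and
32-bit range checks are omitted.

```python
def store_cached(ttl: int, now: int) -> int:
    """Store a cached RR's TTL as a negated absolute expiry time."""
    return -(now + ttl)


def ttl_for_response(stored: int, now: int):
    """Convert a stored value to the relative TTL placed in a response.

    Returns None when cached data has expired and should be ignored.
    """
    if stored >= 0:
        return stored  # zone data: relative and constant
    remaining = -stored - now  # cache data: absolute, so it counts down
    return remaining if remaining >= 0 else None
```

With this encoding a zone RR stored as 3600 is always sent with a TTL
of 3600, while a cached RR stored at time 1000 with a TTL of 60 is
sent with 50 at time 1010 and is ignored after time 1060.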

6.2. Standard query processing

The major algorithm for standard query processing is presented in
[RFC-1034].

When processing queries with QCLASS=*, or some other QCLASS which
matches multiple classes, the response should never be authoritative
unless the server can guarantee that the response covers all classes.

When composing a response, RRs which are to be inserted in the
additional section, but duplicate RRs in the answer or authority
sections, may be omitted from the additional section.

When a response is so long that truncation is required, the truncation
should start at the end of the response and work forward in the
datagram. Thus if there is any data for the authority section, the
answer section is guaranteed to be unique.

The MINIMUM value in the SOA should be used to set a floor on the TTL of
data distributed from a zone. This floor function should be done when
the data is copied into a response. This will allow future dynamic
update protocols to change the SOA MINIMUM field without ambiguous
semantics.
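
The MINIMUM floor described above amounts to taking a maximum at the
moment an RR is copied into a response. A small sketch, assuming RRs
are represented as (name, type, ttl, rdata) tuples and using the
hypothetical name copy_into_response:

```python
def copy_into_response(zone_rrs, soa_minimum):
    """Copy zone RRs into a response, raising each TTL to the SOA MINIMUM."""
    return [
        (name, rtype, max(ttl, soa_minimum), rdata)
        for name, rtype, ttl, rdata in zone_rrs
    ]
```

Because the stored zone data is left untouched, a later change to the
SOA MINIMUM takes effect on the next response without rewriting the
zone.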

6.3. Zone refresh and reload processing

In spite of a server's best efforts, it may be unable to load zone data
from a master file due to syntax errors, etc., or be unable to refresh a
zone within its expiration parameter. In this case, the name server
should answer queries as if it were not supposed to possess the zone.

If a master is sending a zone out via AXFR, and a new version is created
during the transfer, the master should continue to send the old version
if possible. In any case, it should never send part of one version and
part of another. If completion is not possible, the master should reset
the connection on which the zone transfer is taking place.

6.4. Inverse queries (Optional)

Inverse queries are an optional part of the DNS. Name servers are not
required to support any form of inverse queries. If a name server
receives an inverse query that it does not support, it returns an error
response with the "Not Implemented" error set in the header. While
inverse query support is optional, all name servers must be at least
able to return the error response.
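
A minimal sketch of producing that error response is given below. It
relies on the header layout defined in the message format section of
this memo (ID, a 16-bit flags word, and four 16-bit counts), with
RCODE 4 meaning "Not Implemented". Returning a header-only message
that echoes the ID, OPCODE and RD bit is an assumption of this
example, and not_implemented_response is a hypothetical name.

```python
import struct

NOT_IMPLEMENTED = 4  # RCODE value for "Not Implemented"


def not_implemented_response(query: bytes) -> bytes:
    """Build a header-only error response for an unsupported query."""
    if len(query) < 12:
        raise ValueError("message shorter than a DNS header")
    ident, flags = struct.unpack("!HH", query[:4])
    keep = flags & 0x7900  # OPCODE (0x7800) and RD (0x0100) are copied
    new_flags = 0x8000 | keep | NOT_IMPLEMENTED  # QR=1 marks a response
    # All four section counts are zero in this header-only reply.
    return struct.pack("!HHHHHH", ident, new_flags, 0, 0, 0, 0)
```

A server that supports no inverse queries could call this for any
message whose OPCODE field is IQUERY (value 1).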

6.4.1. The contents of inverse queries and responses

Inverse queries reverse the mappings performed by standard query
operations; while a standard query maps a domain name to a resource, an
inverse query maps a resource to a domain name. For example, a standard
query might bind a domain name to a host address; the corresponding
inverse query binds the host address to a domain name.

Inverse queries take the form of a single RR in the answer section of
the message, with an empty question section. The owner name of the
query RR and its TTL are not significant. The response carries
questions in the question section which identify all names possessing
the query RR WHICH THE NAME SERVER KNOWS. Since no name server knows
about all of the domain name space, the response can never be assumed to
be complete. Thus inverse queries are primarily useful for database
management and debugging activities. Inverse queries are NOT an
acceptable method of mapping host addresses to host names; use the
IN-ADDR.ARPA domain instead.

Where possible, name servers should provide case-insensitive comparisons
for inverse queries. Thus an inverse query asking for an MX RR of
"Venera.isi.edu" should get the same response as a query for
"VENERA.ISI.EDU"; an inverse query for HINFO RR "IBM-PC UNIX" should
produce the same result as an inverse query for "IBM-pc unix". However,
this cannot be guaranteed because name servers may possess RRs that
contain character strings but the name server does not know that the
data is character.

When a name server processes an inverse query, it returns either:

   1. zero, one, or multiple domain names for the specified
      resource as QNAMEs in the question section

   2. an error code indicating that the name server doesn't support
      inverse mapping of the specified resource type.

When the response to an inverse query contains one or more QNAMEs, the
owner name and TTL of the RR in the answer section which defines the
inverse query are modified to exactly match an RR found at the first
QNAME.

RRs returned in the inverse queries cannot be cached using the same
mechanism as is used for the replies to standard queries. One reason
for this is that a name might have multiple RRs of the same type, and
only one would appear. For example, an inverse query for a single
address of a multiply homed host might create the impression that only
one address existed.

6.4.2. Inverse query and response example

The overall structure of an inverse query for retrieving the domain
name that corresponds to Internet address 10.1.0.52 is shown below:

              +-----------------------------------------+
   Header     |          OPCODE=IQUERY, ID=997          |
              +-----------------------------------------+
   Question   |                 <empty>                 |
              +-----------------------------------------+
   Answer     |        <anyname> A IN 10.1.0.52         |
              +-----------------------------------------+
   Authority  |                 <empty>                 |
              +-----------------------------------------+
   Additional |                 <empty>                 |
              +-----------------------------------------+

This query asks for a question whose answer is the Internet style
address 10.1.0.52. Since the owner name is not known, any domain name
can be used as a placeholder (and is ignored). A single octet of zero,
signifying the root, is usually used because it minimizes the length of
the message. The TTL of the RR is not significant. The response to
this query might be:

              +-----------------------------------------+
   Header     |         OPCODE=RESPONSE, ID=997         |
              +-----------------------------------------+
   Question   |QTYPE=A, QCLASS=IN, QNAME=VENERA.ISI.EDU |
              +-----------------------------------------+
   Answer     |      VENERA.ISI.EDU A IN 10.1.0.52      |
              +-----------------------------------------+
   Authority  |                 <empty>                 |
              +-----------------------------------------+
   Additional |                 <empty>                 |
              +-----------------------------------------+

Note that the QTYPE in a response to an inverse query is the same as the
TYPE field in the answer section of the inverse query. Responses to
inverse queries may contain multiple questions when the inverse is not
unique. If the question section in the response is not empty, then the
RR in the answer section is modified to be an exact copy of an RR at the
first QNAME.

6.4.3. Inverse query processing

Name servers that support inverse queries can support these operations
through exhaustive searches of their databases, but this becomes
impractical as the size of the database increases. An alternative
approach is to invert the database according to the search key.

For name servers that support multiple zones and a large amount of data,
the recommended approach is separate inversions for each zone. When a
particular zone is changed during a refresh, only its inversions need to
be redone.
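
The per-zone inversion can be pictured as one reverse index per zone,
keyed by the search key. The sketch below is illustrative only: the
InvertedCatalog class, its methods, and the (owner, type, rdata) tuple
representation are assumptions, and rdata is compared as an exact
string.

```python
from collections import defaultdict


class InvertedCatalog:
    """Keeps one (type, rdata) -> owner names index per zone."""

    def __init__(self):
        self._by_zone = {}

    def rebuild_zone(self, origin, rrs):
        # Only the refreshed zone's inversion is redone.
        index = defaultdict(list)
        for owner, rtype, rdata in rrs:
            index[(rtype, rdata)].append(owner)
        self._by_zone[origin] = dict(index)

    def inverse_lookup(self, rtype, rdata):
        """Return every known owner name holding the given RR data."""
        names = []
        for index in self._by_zone.values():
            names.extend(index.get((rtype, rdata), []))
        return names
```

A lookup consults each zone's index in turn, so refreshing one zone
never requires touching the inversions built for the others.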

Support for transfer of this type of inversion may be included in future
versions of the domain system, but is not supported in this version.

6.5. Completion queries and responses

The optional completion services described in RFC-882 and RFC-883 have
been deleted. Redesigned services may become available in the future.

7. RESOLVER IMPLEMENTATION

The top levels of the recommended resolver algorithm are discussed in
[RFC-1034]. This section discusses implementation details assuming the
database structure suggested in the name server implementation section
of this memo.

7.1. Transforming a user request into a query

The first step a resolver takes is to transform the client's request,
stated in a format suitable to the local OS, into a search specification
for RRs at a specific name which match a specific QTYPE and QCLASS.
Where possible, the QTYPE and QCLASS should correspond to a single type
and a single class, because this makes the use of cached data much
simpler. The reason for this is that the presence of data of one type
in a cache doesn't confirm the existence or non-existence of data of
other types, hence the only way to be sure is to consult an
authoritative source. If QCLASS=* is used, then authoritative answers
won't be available.

Since a resolver must be able to multiplex multiple requests if it is to
perform its function efficiently, each pending request is usually
represented in some block of state information. This state block will
typically contain:

   - A timestamp indicating the time the request began.
     The timestamp is used to decide whether RRs in the database
     can be used or are out of date. This timestamp uses the
     absolute time format previously discussed for RR storage in
     zones and caches. Note that when an RR's TTL indicates a
     relative time, the RR must be timely, since it is part of a
     zone. When the RR has an absolute time, it is part of a
     cache, and the TTL of the RR is compared against the timestamp
     for the start of the request.

     Note that using the timestamp is superior to using a current
     time, since it allows RRs with TTLs of zero to be entered in
     the cache in the usual manner, but still used by the current
     request, even after intervals of many seconds due to system
     load, query retransmission timeouts, etc.

   - Some sort of parameters to limit the amount of work which will
     be performed for this request.

     The amount of work which a resolver will do in response to a
     client request must be limited to guard against errors in the
     database, such as circular CNAME references, and operational
     problems, such as network partition which prevents the
     resolver from accessing the name servers it needs. While
     local limits on the number of times a resolver will retransmit
     a particular query to a particular name server address are
     essential, the resolver should have a global per-request
     counter to limit work on a single request. The counter should
     be set to some initial value and decremented whenever the
     resolver performs any action (retransmission timeout,
     retransmission, etc.). If the counter passes zero, the request
     is terminated with a temporary error.

     Note that if the resolver structure allows one request to
     start others in parallel, such as when the need to access a
     name server for one request causes a parallel resolve for the
     name server's addresses, the spawned request should be started
     with a lower counter. This prevents circular references in
     the database from starting a chain reaction of resolver
     activity.

   - The SLIST data structure discussed in [RFC-1034].

     This structure keeps track of the state of a request if it
     must wait for answers from foreign name servers.
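
The per-request work counter might look like the following sketch.
The initial budget of 20, the halving rule for spawned requests, and
the names Request, charge, spawn and TemporaryError are assumptions
chosen for illustration; the text above only requires that a spawned
request start with a lower counter.

```python
class TemporaryError(Exception):
    """Raised when a request has used up its work budget."""


class Request:
    def __init__(self, name, budget=20):
        self.name = name
        self.budget = budget

    def charge(self, cost=1):
        """Account for one action (transmission, timeout, ...)."""
        self.budget -= cost
        if self.budget < 0:  # the counter has passed zero
            raise TemporaryError(f"work limit reached for {self.name}")

    def spawn(self, name):
        """Start a dependent request with a strictly smaller budget."""
        return Request(name, budget=self.budget // 2)
```

Because each spawned request inherits a smaller budget, a circular
reference in the database runs out of budget after a bounded number of
spawns rather than triggering unbounded activity.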

7.2. Sending the queries

As described in [RFC-1034], the basic task of the resolver is to
formulate a query which will answer the client's request and direct that
query to name servers which can provide the information. The resolver
will usually only have very strong hints about which servers to ask, in
the form of NS RRs, and may have to revise the query, in response to
CNAMEs, or revise the set of name servers the resolver is asking, in
response to delegation responses which point the resolver to name
servers closer to the desired information. In addition to the
information requested by the client, the resolver may have to call upon
its own services to determine the address of name servers it wishes to
contact.

In any case, the model used in this memo assumes that the resolver is
multiplexing attention between multiple requests, some from the client,
and some internally generated. Each request is represented by some
state information, and the desired behavior is that the resolver
transmit queries to name servers in a way that maximizes the probability
that the request is answered, minimizes the time that the request takes,
and avoids excessive transmissions. The key algorithm uses the state
information of the request to select the next name server address to
query, and also computes a timeout which will cause the next action
should a response not arrive. The next action will usually be a
transmission to some other server, but may be a temporary error to the
client.

The resolver always starts with a list of server names to query (SLIST).
This list will be all NS RRs which correspond to the nearest ancestor
zone that the resolver knows about. To avoid startup problems, the
resolver should have a set of default servers which it will ask should
it have no current NS RRs which are appropriate. The resolver then adds
to SLIST all of the known addresses for the name servers, and may start
parallel requests to acquire the addresses of the servers when the
resolver has the name, but no addresses, for the name servers.

To complete initialization of SLIST, the resolver attaches whatever
history information it has to each address in SLIST. This will
usually consist of some sort of weighted averages for the response time
of the address, and the batting average of the address (i.e., how often
the address responded at all to the request). Note that this
information should be kept on a per address basis, rather than on a per
name server basis, because the response time and batting average of a
particular server may vary considerably from address to address. Note
also that this information is actually specific to a resolver address /
server address pair, so a resolver with multiple addresses may wish to
keep separate histories for each of its addresses. Part of this step
must deal with addresses which have no such history; in this case an
expected round trip time of 5-10 seconds should be the worst case, with
lower estimates for the same local network, etc.

Note that whenever a delegation is followed, the resolver algorithm
reinitializes SLIST.

The information establishes a partial ranking of the available name
server addresses. Each time an address is chosen, the state should
be altered to prevent its selection again until all other addresses have
been tried. The timeout for each transmission should be 50-100% greater
than the average predicted value to allow for variance in response.
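
The selection rule above can be sketched as follows. The scoring
formula (average round trip time divided by batting average), the
5 second default for addresses with no history, the 1.5 timeout
factor, and the names AddressState and next_address are assumptions
made for this example; the text only fixes the 50-100% margin and the
rule that an address is not reused until the others have been tried.

```python
from dataclasses import dataclass


@dataclass
class AddressState:
    address: str
    avg_rtt: float = 5.0  # seconds; pessimistic default with no history
    batting: float = 1.0  # fraction of queries that were answered
    tried: bool = False


def next_address(slist):
    """Pick the best untried address and a timeout for the transmission."""
    if not slist:
        raise ValueError("SLIST is empty")
    untried = [s for s in slist if not s.tried]
    if not untried:  # every address was tried: start a new round
        for s in slist:
            s.tried = False
        untried = list(slist)
    best = min(untried, key=lambda s: s.avg_rtt / max(s.batting, 0.01))
    best.tried = True
    timeout = best.avg_rtt * 1.5  # 50% above the predicted round trip
    return best, timeout
```

Keeping one AddressState per address, rather than per server name,
follows the per-address history recommended in the preceding
paragraphs.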

Some fine points:

   - The resolver may encounter a situation where no addresses are
     available for any of the name servers named in SLIST, and
     where the servers in the list are precisely those which would
     normally be used to look up their own addresses. This
     situation typically occurs when the glue address RRs have a
     smaller TTL than the NS RRs marking delegation, or when the
     resolver caches the result of an NS search. The resolver
     should detect this condition and restart the search at the
     next ancestor zone, or alternatively at the root.

   - If a resolver gets a server error or other bizarre response
     from a name server, it should remove it from SLIST, and may
     wish to schedule an immediate transmission to the next
     candidate server address.

7.3. Processing responses

The first step in processing arriving response datagrams is to parse the
response. This procedure should include:

   - Check the header for reasonableness. Discard datagrams which
     are queries when responses are expected.

   - Parse the sections of the message, and ensure that all RRs are
     correctly formatted.

   - As an optional step, check the TTLs of arriving data looking
     for RRs with excessively long TTLs. If an RR has an
     excessively long TTL, say greater than 1 week, either discard
     the whole response, or limit all TTLs in the response to 1
     week.
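
The optional TTL screening step could be written as below. RRs are
assumed to be (name, type, ttl, rdata) tuples, and screen_ttls and its
discard flag are hypothetical names that select between the two
options described above.

```python
ONE_WEEK = 7 * 24 * 60 * 60  # seconds


def screen_ttls(rrs, limit=ONE_WEEK, discard=False):
    """Apply the TTL sanity check to the RRs of one response.

    With discard=True a response containing any excessive TTL is
    rejected (None is returned); otherwise every TTL is limited.
    """
    if discard and any(ttl > limit for _, _, ttl, _ in rrs):
        return None
    return [(name, rtype, min(ttl, limit), rdata) for name, rtype, ttl, rdata in rrs]
```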

The next step is to match the response to a current resolver request.
The recommended strategy is to do a preliminary matching using the ID
field in the domain header, and then to verify that the question section
corresponds to the information currently desired. This requires that
the transmission algorithm devote several bits of the domain ID field to
a request identifier of some sort. This step has several fine points:

   - Some name servers send their responses from different
     addresses than the one used to receive the query. That is, a
     resolver cannot rely on a response coming from the same
     address to which it sent the corresponding query. This name
     server bug is typically encountered in UNIX systems.

   - If the resolver retransmits a particular request to a name
     server it should be able to use a response from any of the
     transmissions. However, if it is using the response to sample
     the round trip time to access the name server, it must be able
     to determine which transmission matches the response (and keep
     transmission times for each outgoing message), or only
     calculate round trip times based on initial transmissions.

   - A name server will occasionally not have a current copy of a
     zone which it should have according to some NS RRs. The
     resolver should simply remove the name server from the current
     SLIST, and continue.
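
A sketch of the two-stage match (ID first, then question section)
follows, together with the simpler of the two round trip time options,
sampling only when there was a single transmission. The names
PendingQuery, PendingTable and rtt_sample are hypothetical, and the
source address of the response is deliberately not checked, in line
with the first fine point above.

```python
from dataclasses import dataclass


@dataclass
class PendingQuery:
    question: tuple  # (lower-cased QNAME, QTYPE, QCLASS)
    first_sent: float  # time of the initial transmission
    transmissions: int = 1


class PendingTable:
    def __init__(self):
        self._by_id = {}

    def register(self, ident, qname, qtype, qclass, now):
        self._by_id[ident] = PendingQuery((qname.lower(), qtype, qclass), now)

    def retransmit(self, ident):
        self._by_id[ident].transmissions += 1

    def match(self, ident, qname, qtype, qclass):
        """Return the pending query for a response, or None if it is stray."""
        pending = self._by_id.get(ident)  # preliminary match on ID
        if pending is None or pending.question != (qname.lower(), qtype, qclass):
            return None  # the question section must also agree
        return pending


def rtt_sample(pending, now):
    """Return a round trip sample only if the query was sent exactly once."""
    if pending.transmissions != 1:
        return None
    return now - pending.first_sent
```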

7.4. Using the cache

In general, we expect a resolver to cache all data which it receives in
responses since it may be useful in answering future client requests.
However, there are several types of data which should not be cached:

   - When several RRs of the same type are available for a
     particular owner name, the resolver should either cache them
     all or none at all. When a response is truncated, and a
     resolver doesn't know whether it has a complete set, it should
     not cache a possibly partial set of RRs.

   - Cached data should never be used in preference to
     authoritative data, so if caching would cause this to happen
     the data should not be cached.

   - The results of an inverse query should not be cached.

   - The results of standard queries where the QNAME contains "*"
     labels if the data might be used to construct wildcards. The
     reason is that the cache does not necessarily contain existing
     RRs or zone boundary information which is necessary to
     restrict the application of the wildcard RRs.

   - RR data in responses of dubious reliability. When a resolver
     receives unsolicited responses or RR data other than that
     requested, it should discard it without caching it. The basic
     implication is that all sanity checks on a packet should be
     performed before any of it is cached.

In a similar vein, when a resolver has a set of RRs for some name in a
response, and wants to cache the RRs, it should check its cache for
already existing RRs. Depending on the circumstances, either the data
in the response or the cache is preferred, but the two should never be
combined. If the data in the response is from authoritative data in the
answer section, it is always preferred.
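
Part of these rules can be expressed as a single cache update
function. This sketch covers only the truncation rule and the "never
combine" rule; keeping the cached set when the response is not
authoritative is one possible policy, chosen here as an assumption,
and update_cache and its parameters are hypothetical names.

```python
def update_cache(cache, key, new_rrs, *, truncated, authoritative_answer):
    """Decide whether a complete RR set from a response replaces the cache.

    cache maps (owner name, type) to a tuple of RR data. Returns True
    when the cache was changed.
    """
    if truncated:
        return False  # the set may be partial, so cache none of it
    if key in cache and not authoritative_answer:
        return False  # assumed policy: keep the existing cached set
    cache[key] = tuple(new_rrs)  # replace the whole set, never merge
    return True
```

The important property is that the stored set always comes entirely
from one source, either the earlier cache entry or the new response.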

8. MAIL SUPPORT

The domain system defines a standard for mapping mailboxes into domain
names, and two methods for using the mailbox information to derive mail
routing information. The first method is called mail exchange binding
and the other method is mailbox binding. The mailbox encoding standard
and mail exchange binding are part of the DNS official protocol, and are
the recommended method for mail routing in the Internet. Mailbox
binding is an experimental feature which is still under development and
subject to change.

The mailbox encoding standard assumes a mailbox name of the form
"<local-part>@<mail-domain>". While the syntax allowed in each of these
sections varies substantially between the various mail internets, the
preferred syntax for the ARPA Internet is given in [RFC-822].
26372637-26382638-The DNS encodes the <local-part> as a single label, and encodes the
26392639-<mail-domain> as a domain name. The single label from the <local-part>
26402640-is prefaced to the domain name from <mail-domain> to form the domain
26412641-name corresponding to the mailbox. Thus the mailbox HOSTMASTER@SRI-
26422642-NIC.ARPA is mapped into the domain name HOSTMASTER.SRI-NIC.ARPA. If the
26432643-<local-part> contains dots or other special characters, its
26442644-representation in a master file will require the use of backslash
26452645-quoting to ensure that the domain name is properly encoded. For
26462646-example, the mailbox Action.domains@ISI.EDU would be represented as
26472647-Action\.domains.ISI.EDU.
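The mapping above can be sketched in a few lines. This is a minimal illustration, not part of the specification: the function name `mailbox_to_domain_name` is hypothetical, and the sketch escapes only dots and backslashes in the <local-part>, not every special character a master file might require.

```python
def mailbox_to_domain_name(mailbox: str) -> str:
    """Map <local-part>@<mail-domain> to its master-file domain name.

    Hypothetical helper: the <local-part> becomes a single label, so any
    dot (or backslash) inside it is backslash-quoted.
    """
    local_part, at_sign, mail_domain = mailbox.rpartition("@")
    if not at_sign or not local_part or not mail_domain:
        raise ValueError("expected <local-part>@<mail-domain>")
    # Quote backslashes first so the quoting of dots is not re-quoted.
    escaped = local_part.replace("\\", "\\\\").replace(".", "\\.")
    return f"{escaped}.{mail_domain}"


print(mailbox_to_domain_name("HOSTMASTER@SRI-NIC.ARPA"))
print(mailbox_to_domain_name("Action.domains@ISI.EDU"))
```

The two calls reproduce the HOSTMASTER and Action.domains examples given in the text.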
26482648-26492649-8.1. Mail exchange binding
26502650-26512651-Mail exchange binding uses the <mail-domain> part of a mailbox
26522652-specification to determine where mail should be sent. The <local-part>
26532653-is not even consulted. [RFC-974] specifies this method in detail, and
26542654-should be consulted before attempting to use mail exchange support.
26552655-26562656-One of the advantages of this method is that it decouples mail
26572657-destination naming from the hosts used to support mail service, at the
26582658-cost of another layer of indirection in the lookup function. However,
26592659-the additional layer should eliminate the need for complicated "%", "!",
26602660-etc. encodings in <local-part>.
26612661-26622662-The essence of the method is that the <mail-domain> is used as a domain
26632663-name to locate type MX RRs which list hosts willing to accept mail for
26642664-<mail-domain>, together with preference values which rank the hosts
26652665-according to an order specified by the administrators for <mail-domain>.
26662666-26672667-In this memo, the <mail-domain> ISI.EDU is used in examples, together
26682668-with the hosts VENERA.ISI.EDU and VAXA.ISI.EDU as mail exchanges for
26692669-ISI.EDU. If a mailer had a message for Mockapetris@ISI.EDU, it would
26702670-route it by looking up MX RRs for ISI.EDU. The MX RRs at ISI.EDU name
26712671-VENERA.ISI.EDU and VAXA.ISI.EDU, and type A queries can find the host
26722672-addresses.
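As a rough sketch of the ranking step, a mailer could order the MX RRs by preference before resolving addresses. The preference values 10 and 20 are invented for illustration, the function name `order_mail_exchanges` is hypothetical, and the sketch assumes (per the MX RR definition elsewhere in this memo) that lower preference values are tried first.

```python
def order_mail_exchanges(mx_rrs):
    """Return exchange host names, most preferred first.

    mx_rrs is a list of (preference, exchange) pairs taken from the
    RDATA of the MX RRs found at <mail-domain>.
    """
    # sorted() is stable, so equal preferences keep their response order.
    return [exchange for _pref, exchange in sorted(mx_rrs, key=lambda rr: rr[0])]


# Hypothetical MX RRs for ISI.EDU; the preference values are assumptions.
answers = [(20, "VAXA.ISI.EDU"), (10, "VENERA.ISI.EDU")]
print(order_mail_exchanges(answers))
```

Each returned host name would then be looked up with a type A query to obtain its address.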
26732673-26742674-8.2. Mailbox binding (Experimental)
26752675-26762676-In mailbox binding, the mailer uses the entire mail destination
26772677-specification to construct a domain name. The encoded domain name for
26782678-the mailbox is used as the QNAME field in a QTYPE=MAILB query.
26792679-26802680-Several outcomes are possible for this query:
26872687-26882688-26892689- 1. The query can return a name error indicating that the mailbox
26902690- does not exist as a domain name.
26912691-26922692- In the long term, this would indicate that the specified
26932693- mailbox doesn't exist. However, until the use of mailbox
26942694- binding is universal, this error condition should be
26952695- interpreted to mean that the organization identified by the
26962696- global part does not support mailbox binding. The
26972697- appropriate procedure is to revert to exchange binding at
26982698- this point.
26992699-27002700- 2. The query can return a Mail Rename (MR) RR.
27022702- The MR RR carries a new mailbox specification in its RDATA
27032703- field. The mailer should replace the old mailbox with the
27042704- new one and retry the operation.
27052705-27062706- 3. The query can return a MB RR.
27072707-27082708- The MB RR carries a domain name for a host in its RDATA
27092709- field. The mailer should deliver the message to that host
27102710- via whatever protocol is applicable, e.g., SMTP.
27112711-27122712- 4. The query can return one or more Mail Group (MG) RRs.
27132713-27142714- This condition means that the mailbox was actually a mailing
27152715- list or mail group, rather than a single mailbox. Each MG RR
27162716- has a RDATA field that identifies a mailbox that is a member
27172717- of the group. The mailer should deliver a copy of the
27182718- message to each member.
27192719-27202720- 5. The query can return a MB RR as well as one or more MG RRs.
27222722- This condition means that the mailbox was actually a mailing
27232723- list. The mailer can either deliver the message to the host
27242724- specified by the MB RR, which will in turn do the delivery to
27252725- all members, or the mailer can use the MG RRs to do the
27262726- expansion itself.
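The five outcomes can be read as a small decision procedure. The sketch below is only an illustration under stated assumptions: the function name `mailb_next_step`, the action labels, and the (type, rdata) pair representation are all hypothetical, and no real resolver API is implied.

```python
def mailb_next_step(name_error, rrs):
    """Pick the mailer's next action from a QTYPE=MAILB response.

    name_error: True if the response was a name error.
    rrs: list of (rr_type, rdata) pairs from the answer.
    """
    if name_error:
        return ("use_exchange_binding",)             # outcome 1
    renames = [rdata for rr_type, rdata in rrs if rr_type == "MR"]
    hosts = [rdata for rr_type, rdata in rrs if rr_type == "MB"]
    members = [rdata for rr_type, rdata in rrs if rr_type == "MG"]
    if renames:
        return ("retry_with", renames[0])            # outcome 2
    if hosts and members:
        # Outcome 5: either hand the message to the MB host or expand here.
        return ("deliver_to_host_or_expand", hosts[0], members)
    if hosts:
        return ("deliver_to_host", hosts[0])         # outcome 3
    if members:
        return ("deliver_to_members", members)       # outcome 4
    return ("no_mailbox_data",)


print(mailb_next_step(False, [("MB", "VENERA.ISI.EDU")]))
```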
27272727-27282728-In any of these cases, the response may include a Mail Information
27292729-(MINFO) RR. This RR is usually associated with a mail group, but is
27302730-legal with a MB. The MINFO RR identifies two mailboxes. One of these
27312731-identifies a responsible person for the original mailbox name. This
27322732-mailbox should be used for requests to be added to a mail group, etc.
27332733-The second mailbox name in the MINFO RR identifies a mailbox that should
27342734-receive error messages for mail failures. This is particularly
27352735-appropriate for mailing lists when errors in member names should be
27362736-reported to a person other than the one who sends a message to the list.
27432743-27442744-27452745-New fields may be added to this RR in the future.
27462746-27472747-27482748-9. REFERENCES and BIBLIOGRAPHY
27492749-27502750-[Dyer 87] S. Dyer, F. Hsu, "Hesiod", Project Athena
27512751- Technical Plan - Name Service, April 1987, version 1.9.
27522752-27532753- Describes the fundamentals of the Hesiod name service.
27542754-27552755-[IEN-116] J. Postel, "Internet Name Server", IEN-116,
27562756- USC/Information Sciences Institute, August 1979.
27572757-27582758- A name service obsoleted by the Domain Name System, but
27592759- still in use.
27602760-27612761-[Quarterman 86] J. Quarterman, and J. Hoskins, "Notable Computer Networks",
27622762- Communications of the ACM, October 1986, volume 29, number
27632763- 10.
27642764-27652765-[RFC-742] K. Harrenstien, "NAME/FINGER", RFC-742, Network
27662766- Information Center, SRI International, December 1977.
27672767-27682768-[RFC-768] J. Postel, "User Datagram Protocol", RFC-768,
27692769- USC/Information Sciences Institute, August 1980.
27702770-27712771-[RFC-793] J. Postel, "Transmission Control Protocol", RFC-793,
27722772- USC/Information Sciences Institute, September 1981.
27732773-27742774-[RFC-799] D. Mills, "Internet Name Domains", RFC-799, COMSAT,
27752775- September 1981.
27762776-27772777- Suggests introduction of a hierarchy in place of a flat
27782778- name space for the Internet.
27792779-27802780-[RFC-805] J. Postel, "Computer Mail Meeting Notes", RFC-805,
27812781- USC/Information Sciences Institute, February 1982.
27822782-27832783-[RFC-810] E. Feinler, K. Harrenstien, Z. Su, and V. White, "DOD
27842784- Internet Host Table Specification", RFC-810, Network
27852785- Information Center, SRI International, March 1982.
27862786-27872787- Obsolete. See RFC-952.
27882788-27892789-[RFC-811] K. Harrenstien, V. White, and E. Feinler, "Hostnames
27902790- Server", RFC-811, Network Information Center, SRI
27912791- International, March 1982.
27992799-28002800-28012801- Obsolete. See RFC-953.
28022802-28032803-[RFC-812] K. Harrenstien, and V. White, "NICNAME/WHOIS", RFC-812,
28042804- Network Information Center, SRI International, March
28052805- 1982.
28062806-28072807-[RFC-819] Z. Su, and J. Postel, "The Domain Naming Convention for
28082808- Internet User Applications", RFC-819, Network
28092809- Information Center, SRI International, August 1982.
28102810-28112811- Early thoughts on the design of the domain system.
28122812- Current implementation is completely different.
28132813-28142814-[RFC-821] J. Postel, "Simple Mail Transfer Protocol", RFC-821,
28162816- USC/Information Sciences Institute, August 1982.
28162816-28172817-[RFC-830] Z. Su, "A Distributed System for Internet Name Service",
28182818- RFC-830, Network Information Center, SRI International,
28192819- October 1982.
28202820-28212821- Early thoughts on the design of the domain system.
28222822- Current implementation is completely different.
28232823-28242824-[RFC-882] P. Mockapetris, "Domain names - Concepts and
28252825- Facilities," RFC-882, USC/Information Sciences
28262826- Institute, November 1983.
28282828- Superseded by this memo.
28292829-28302830-[RFC-883] P. Mockapetris, "Domain names - Implementation and
28312831- Specification," RFC-883, USC/Information Sciences
28322832- Institute, November 1983.
28342834- Superseded by this memo.
28352835-28362836-[RFC-920] J. Postel and J. Reynolds, "Domain Requirements",
28372837- RFC-920, USC/Information Sciences Institute,
28382838- October 1984.
28392839-28402840- Explains the naming scheme for top level domains.
28412841-28422842-[RFC-952] K. Harrenstien, M. Stahl, E. Feinler, "DoD Internet Host
28432843- Table Specification", RFC-952, SRI, October 1985.
28442844-28452845- Specifies the format of HOSTS.TXT, the host/address
28462846- table replaced by the DNS.
28552855-28562856-28572857-[RFC-953] K. Harrenstien, M. Stahl, E. Feinler, "HOSTNAME Server",
28582858- RFC-953, SRI, October 1985.
28592859-28602860- This RFC contains the official specification of the
28612861- hostname server protocol, which is obsoleted by the DNS.
28622862- This TCP based protocol accesses information stored in
28632863- the RFC-952 format, and is used to obtain copies of the
28642864- host table.
28652865-28662866-[RFC-973] P. Mockapetris, "Domain System Changes and
28672867- Observations", RFC-973, USC/Information Sciences
28682868- Institute, January 1986.
28692869-28702870- Describes changes to RFC-882 and RFC-883 and reasons for
28712871- them.
28722872-28732873-[RFC-974] C. Partridge, "Mail routing and the domain system",
28742874- RFC-974, CSNET CIC BBN Labs, January 1986.
28752875-28762876- Describes the transition from HOSTS.TXT based mail
28772877- addressing to the more powerful MX system used with the
28782878- domain system.
28792879-28802880-[RFC-1001] NetBIOS Working Group, "Protocol standard for a NetBIOS
28812881- service on a TCP/UDP transport: Concepts and Methods",
28822882- RFC-1001, March 1987.
28832883-28842884- This RFC and RFC-1002 are a preliminary design for
28852885- NETBIOS on top of TCP/IP which proposes to base NetBIOS
28862886- name service on top of the DNS.
28872887-28882888-[RFC-1002] NetBIOS Working Group, "Protocol standard for a NetBIOS
28892889- service on a TCP/UDP transport: Detailed
28902890- Specifications", RFC-1002, March 1987.
28912891-28922892-[RFC-1010] J. Reynolds, and J. Postel, "Assigned Numbers", RFC-1010,
28932893- USC/Information Sciences Institute, May 1987.
28942894-28952895- Contains socket numbers and mnemonics for host names,
28962896- operating systems, etc.
28972897-28982898-[RFC-1031] W. Lazear, "MILNET Name Domain Transition", RFC-1031,
28992899- November 1987.
29002900-29012901- Describes a plan for converting the MILNET to the DNS.
29022902-29032903-[RFC-1032] M. Stahl, "Establishing a Domain - Guidelines for
29042904- Administrators", RFC-1032, November 1987.
29112911-29122912-29132913- Describes the registration policies used by the NIC to
29142914- administer the top level domains and delegate subzones.
29152915-29162916-[RFC-1033] M. Lottor, "Domain Administrators Operations Guide",
29172917- RFC-1033, November 1987.
29182918-29192919- A cookbook for domain administrators.
29202920-29212921-[Solomon 82] M. Solomon, L. Landweber, and D. Neuhengen, "The CSNET
29222922- Name Server", Computer Networks, vol 6, nr 3, July 1982.
29232923-29242924- Describes a name service for CSNET which is independent
29252925- from the DNS and DNS use in the CSNET.
29672967-29682968-29692969-Index
29702970-29712971- * 13
29722972-29732973- ; 33, 35
29742974-29752975- <character-string> 35
29762976- <domain-name> 34
29772977-29782978- @ 35
29792979-29802980- \ 35
29812981-29822982- A 12
29832983-29842984- Byte order 8
29852985-29862986- CH 13
29872987- Character case 9
29882988- CLASS 11
29892989- CNAME 12
29902990- Completion 42
29912991- CS 13
29922992-29932993- Hesiod 13
29942994- HINFO 12
29952995- HS 13
29962996-29972997- IN 13
29982998- IN-ADDR.ARPA domain 22
29992999- Inverse queries 40
30003000-30013001- Mailbox names 47
30023002- MB 12
30033003- MD 12
30043004- MF 12
30053005- MG 12
30063006- MINFO 12
30073007- MINIMUM 20
30083008- MR 12
30093009- MX 12
30103010-30113011- NS 12
30123012- NULL 12
30133013-30143014- Port numbers 32
30153015- Primary server 5
30163016- PTR 12, 18
30233023-30243024-30253025- QCLASS 13
30263026- QTYPE 12
30273027-30283028- RDATA 12
30293029- RDLENGTH 11
30303030-30313031- Secondary server 5
30323032- SOA 12
30333033- Stub resolvers 7
30343034-30353035- TCP 32
30363036- TXT 12
30373037- TYPE 11
30383038-30393039- UDP 32
30403040-30413041- WKS 12
30773077-
-1963
ocaml-punycode/spec/rfc3492.txt
···11-22-33-44-55-66-77-Network Working Group A. Costello
88-Request for Comments: 3492 Univ. of California, Berkeley
99-Category: Standards Track March 2003
1010-1111-1212- Punycode: A Bootstring encoding of Unicode
1313- for Internationalized Domain Names in Applications (IDNA)
1414-1515-Status of this Memo
1616-1717- This document specifies an Internet standards track protocol for the
1818- Internet community, and requests discussion and suggestions for
1919- improvements. Please refer to the current edition of the "Internet
2020- Official Protocol Standards" (STD 1) for the standardization state
2121- and status of this protocol. Distribution of this memo is unlimited.
2222-2323-Copyright Notice
2424-2525- Copyright (C) The Internet Society (2003). All Rights Reserved.
2626-2727-Abstract
2828-2929- Punycode is a simple and efficient transfer encoding syntax designed
3030- for use with Internationalized Domain Names in Applications (IDNA).
3131- It uniquely and reversibly transforms a Unicode string into an ASCII
3232- string. ASCII characters in the Unicode string are represented
3333- literally, and non-ASCII characters are represented by ASCII
3434- characters that are allowed in host name labels (letters, digits, and
3535- hyphens). This document defines a general algorithm called
3636- Bootstring that allows a string of basic code points to uniquely
3737- represent any string of code points drawn from a larger set.
3838- Punycode is an instance of Bootstring that uses particular parameter
3939- values specified by this document, appropriate for IDNA.
4040-4141-Table of Contents
4242-4343- 1. Introduction...............................................2
4444- 1.1 Features..............................................2
4545- 1.2 Interaction of protocol parts.........................3
4646- 2. Terminology................................................3
4747- 3. Bootstring description.....................................4
4848- 3.1 Basic code point segregation..........................4
4949- 3.2 Insertion unsort coding...............................4
5050- 3.3 Generalized variable-length integers..................5
5151- 3.4 Bias adaptation.......................................7
5252- 4. Bootstring parameters......................................8
5353- 5. Parameter values for Punycode..............................8
5454- 6. Bootstring algorithms......................................9
5555-5656-5757-5858-Costello Standards Track [Page 1]
5959-6060-RFC 3492 IDNA Punycode March 2003
6161-6262-6363- 6.1 Bias adaptation function.............................10
6464- 6.2 Decoding procedure...................................11
6565- 6.3 Encoding procedure...................................12
6666- 6.4 Overflow handling....................................13
6767- 7. Punycode examples.........................................14
6868- 7.1 Sample strings.......................................14
6969- 7.2 Decoding traces......................................17
7070- 7.3 Encoding traces......................................19
7171- 8. Security Considerations...................................20
7272- 9. References................................................21
7373- 9.1 Normative References.................................21
7474- 9.2 Informative References...............................21
7575- A. Mixed-case annotation.....................................22
7676- B. Disclaimer and license....................................22
7777- C. Punycode sample implementation............................23
7878- Author's Address.............................................34
7979- Full Copyright Statement.....................................35
8080-8181-1. Introduction
8282-8383- [IDNA] describes an architecture for supporting internationalized
8484- domain names. Labels containing non-ASCII characters can be
8585- represented by ACE labels, which begin with a special ACE prefix and
8686- contain only ASCII characters. The remainder of the label after the
8787- prefix is a Punycode encoding of a Unicode string satisfying certain
8888- constraints. For the details of the prefix and constraints, see
8989- [IDNA] and [NAMEPREP].
9090-9191- Punycode is an instance of a more general algorithm called
9292- Bootstring, which allows strings composed from a small set of "basic"
9393- code points to uniquely represent any string of code points drawn
9494- from a larger set. Punycode is Bootstring with particular parameter
9595- values appropriate for IDNA.
9696-9797-1.1 Features
9898-9999- Bootstring has been designed to have the following features:
100100-101101- * Completeness: Every extended string (sequence of arbitrary code
102102- points) can be represented by a basic string (sequence of basic
103103- code points). Restrictions on what strings are allowed, and on
104104- length, can be imposed by higher layers.
105105-106106- * Uniqueness: There is at most one basic string that represents a
107107- given extended string.
108108-109109- * Reversibility: Any extended string mapped to a basic string can
110110- be recovered from that basic string.
117117-118118-119119- * Efficient encoding: The ratio of basic string length to extended
120120- string length is small. This is important in the context of
121121- domain names because RFC 1034 [RFC1034] restricts the length of a
122122- domain label to 63 characters.
123123-124124- * Simplicity: The encoding and decoding algorithms are reasonably
125125- simple to implement. The goals of efficiency and simplicity are
126126- at odds; Bootstring aims at a good balance between them.
127127-128128- * Readability: Basic code points appearing in the extended string
129129- are represented as themselves in the basic string (although the
130130- main purpose is to improve efficiency, not readability).
131131-132132- Punycode can also support an additional feature that is not used by
133133- the ToASCII and ToUnicode operations of [IDNA]. When extended
134134- strings are case-folded prior to encoding, the basic string can use
135135- mixed case to tell how to convert the folded string into a mixed-case
136136- string. See appendix A "Mixed-case annotation".
137137-138138-1.2 Interaction of protocol parts
139139-140140- Punycode is used by the IDNA protocol [IDNA] for converting domain
141141- labels into ASCII; it is not designed for any other purpose. It is
142142- explicitly not designed for processing arbitrary free text.
143143-144144-2. Terminology
145145-146146- The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
147147- "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
148148- document are to be interpreted as described in BCP 14, RFC 2119
149149- [RFC2119].
150150-151151- A code point is an integral value associated with a character in a
152152- coded character set.
153153-154154- As in the Unicode Standard [UNICODE], Unicode code points are denoted
155155- by "U+" followed by four to six hexadecimal digits, while a range of
156156- code points is denoted by two hexadecimal numbers separated by "..",
157157- with no prefixes.
158158-159159- The operators div and mod perform integer division; (x div y) is the
160160- quotient of x divided by y, discarding the remainder, and (x mod y)
161161- is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses
162162- these operators only with nonnegative operands, so the quotient and
163163- remainder are always nonnegative.
164164-165165- The break statement jumps out of the innermost loop (as in C).
173173-174174-175175- An overflow is an attempt to compute a value that exceeds the maximum
176176- value of an integer variable.
177177-178178-3. Bootstring description
179179-180180- Bootstring represents an arbitrary sequence of code points (the
181181- "extended string") as a sequence of basic code points (the "basic
182182- string"). This section describes the representation. Section 6
183183- "Bootstring algorithms" presents the algorithms as pseudocode.
184184- Sections 7.2 "Decoding traces" and 7.3 "Encoding traces" trace the
185185- algorithms for sample inputs.
186186-187187- The following sections describe the four techniques used in
188188- Bootstring. "Basic code point segregation" is a very simple and
189189- efficient encoding for basic code points occurring in the extended
190190- string: they are simply copied all at once. "Insertion unsort
191191- coding" encodes the non-basic code points as deltas, and processes
192192- the code points in numerical order rather than in order of
193193- appearance, which typically results in smaller deltas. The deltas
194194- are represented as "generalized variable-length integers", which use
195195- basic code points to represent nonnegative integers. The parameters
196196- of this integer representation are dynamically adjusted using "bias
197197- adaptation", to improve efficiency when consecutive deltas have
198198- similar magnitudes.
199199-200200-3.1 Basic code point segregation
201201-202202- All basic code points appearing in the extended string are
203203- represented literally at the beginning of the basic string, in their
204204- original order, followed by a delimiter if (and only if) the number
205205- of basic code points is nonzero. The delimiter is a particular basic
206206- code point, which never appears in the remainder of the basic string.
207207- The decoder can therefore find the end of the literal portion (if
208208- there is one) by scanning for the last delimiter.
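A decoder's first step, finding the literal portion, might look like the sketch below. The function name `split_basic_string` is hypothetical, and the delimiter "-" is the Punycode value from section 5 rather than something fixed by Bootstring itself.

```python
def split_basic_string(basic: str, delimiter: str = "-"):
    """Split a basic string at its last delimiter.

    Returns (literal, encoded): the literal basic code points and the
    remainder that holds the encoded deltas. With no delimiter present
    the literal portion is empty.
    """
    # rpartition scans from the right, so it finds the LAST delimiter.
    literal, _sep, encoded = basic.rpartition(delimiter)
    return literal, encoded


print(split_basic_string("bcher-kva"))
print(split_basic_string("a-b-c"))
```

Scanning from the right matters because the literal portion may itself contain the delimiter, while the encoded remainder never does.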
209209-210210-3.2 Insertion unsort coding
211211-212212- The remainder of the basic string (after the last delimiter if there
213213- is one) represents a sequence of nonnegative integral deltas as
214214- generalized variable-length integers, described in section 3.3. The
215215- meaning of the deltas is best understood in terms of the decoder.
216216-217217- The decoder builds the extended string incrementally. Initially, the
218218- extended string is a copy of the literal portion of the basic string
219219- (excluding the last delimiter). The decoder inserts non-basic code
220220- points, one for each delta, into the extended string, ultimately
221221- arriving at the final decoded string.
229229-230230-231231- At the heart of this process is a state machine with two state
232232- variables: an index i and a counter n. The index i refers to a
233233- position in the extended string; it ranges from 0 (the first
234234- position) to the current length of the extended string (which refers
235235- to a potential position beyond the current end). If the current
236236- state is <n,i>, the next state is <n,i+1> if i is less than the
237237- length of the extended string, or <n+1,0> if i equals the length of
238238- the extended string. In other words, each state change causes i to
239239- increment, wrapping around to zero if necessary, and n counts the
240240- number of wrap-arounds.
241241-242242- Notice that the state always advances monotonically (there is no way
243243- for the decoder to return to an earlier state). At each state, an
244244- insertion is either performed or not performed. At most one
245245- insertion is performed in a given state. An insertion inserts the
246246- value of n at position i in the extended string. The deltas are a
247247- run-length encoding of this sequence of events: they are the lengths
248248- of the runs of non-insertion states preceding the insertion states.
249249- Hence, for each delta, the decoder performs delta state changes, then
250250- an insertion, and then one more state change. (An implementation
251251- need not perform each state change individually, but can instead use
252252- division and remainder calculations to compute the next insertion
253253- state directly.) It is an error if the inserted code point is a
254254- basic code point (because basic code points were supposed to be
255255- segregated as described in section 3.1).
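The state machine can be sketched using the division-and-remainder shortcut mentioned above. This is an illustration only: the function name `apply_deltas` is hypothetical, the deltas are assumed to be already decoded into integers, and the starting value n = 128 is Punycode's initial_n from section 5.

```python
def apply_deltas(literal: str, deltas, initial_n: int = 128) -> str:
    """Insert one non-basic code point per delta into the literal portion."""
    output = [ord(c) for c in literal]
    n, i = initial_n, 0
    for delta in deltas:
        # Perform `delta` state changes at once: i advances, and each
        # wrap past the end of the string increments n.
        i += delta
        n += i // (len(output) + 1)
        i %= len(output) + 1
        # n starts at initial_n and never decreases, so with Punycode's
        # parameters the inserted code point cannot be a basic one.
        output.insert(i, n)
        i += 1  # the one extra state change after each insertion
    return "".join(chr(cp) for cp in output)


print(apply_deltas("bcher", [745]))
```

With the literal portion "bcher", the single delta 745 wraps 124 times (n = 252, U+00FC) and lands on position 1, producing "bücher".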
256256-257257- The encoder's main task is to derive the sequence of deltas that will
258258- cause the decoder to construct the desired string. It can do this by
259259- repeatedly scanning the extended string for the next code point that
260260- the decoder would need to insert, and counting the number of state
261261- changes the decoder would need to perform, mindful of the fact that
262262- the decoder's extended string will include only those code points
263263- that have already been inserted. Section 6.3 "Encoding procedure"
264264- gives a precise algorithm.
265265-266266-3.3 Generalized variable-length integers
267267-268268- In a conventional integer representation the base is the number of
269269- distinct symbols for digits, whose values are 0 through base-1. Let
270270- digit_0 denote the least significant digit, digit_1 the next least
271271- significant, and so on. The value represented is the sum over j of
272272- digit_j * w(j), where w(j) = base^j is the weight (scale factor) for
273273- position j. For example, in the base 8 integer 437, the digits are
274274- 7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 +
275275- 3*8 + 4*64 = 287. This representation has two disadvantages: First,
276276- there are multiple encodings of each value (because there can be
277277- extra zeros in the most significant positions), which is inconvenient
285285-286286-287287- when unique encodings are needed. Second, the integer is not self-
288288- delimiting, so if multiple integers are concatenated the boundaries
289289- between them are lost.
290290-291291- The generalized variable-length representation solves these two
292292- problems. The digit values are still 0 through base-1, but now the
293293- integer is self-delimiting by means of thresholds t(j), each of which
294294- is in the range 0 through base-1. Exactly one digit, the most
295295- significant, satisfies digit_j < t(j). Therefore, if several
296296- integers are concatenated, it is easy to separate them, starting with
297297- the first if they are little-endian (least significant digit first),
298298- or starting with the last if they are big-endian (most significant
299299- digit first). As before, the value is the sum over j of digit_j *
300300- w(j), but the weights are different:
301301-302302- w(0) = 1
303303- w(j) = w(j-1) * (base - t(j-1)) for j > 0
304304-305305- For example, consider the little-endian sequence of base 8 digits
306306- 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
307307- implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) =
308308- 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not
309309- less than 3, but 4 is less than 5, so 4 is the last digit. The value
310310- of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with
311311- value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very
312312- similar to decoding a conventional integer: Start with a current
313313- value of N = 0 and a weight w = 1. Fetch the next digit d and
314314- increase N by d * w. If d is less than the current threshold (t)
315315- then stop, otherwise increase w by a factor of (base - t), update t
316316- for the next position, and repeat.
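The decoding loop just described can be sketched as follows. The function name `decode_integer` and the callable `t` (which returns the threshold for position j) are illustrative choices, not names from this document.

```python
def decode_integer(digits, start, base, t):
    """Decode one little-endian generalized variable-length integer.

    digits: sequence of digit values; start: index of the first digit;
    t: function mapping position j to its threshold t(j).
    Returns (value, index of the first digit after this integer).
    """
    value, weight, j, pos = 0, 1, 0, start
    while True:
        if pos >= len(digits):
            raise ValueError("input ended before the final digit")
        d = digits[pos]
        pos += 1
        value += d * weight
        if d < t(j):            # the most significant digit ends the integer
            return value, pos
        weight *= base - t(j)
        j += 1


# The worked example: base 8, thresholds 2, 3, 5, 5, 5, ...
thresholds = lambda j: (2, 3, 5)[min(j, 2)]
digits = [7, 3, 4, 2, 5, 1]
first, nxt = decode_integer(digits, 0, 8, thresholds)
second, _ = decode_integer(digits, nxt, 8, thresholds)
print(first, second)
```

Run on the digit sequence 734251, the sketch separates the two integers and yields the values 145 and 62 computed in the text.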
317317-318318- Encoding this representation is similar to encoding a conventional
319319- integer: If N < t then output one digit for N and stop, otherwise
320320- output the digit for t + ((N - t) mod (base - t)), then replace N
321321- with (N - t) div (base - t), update t for the next position, and
322322- repeat.
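A minimal sketch of this encoding rule is given here. The function name `encode_integer` and the callable `t` (returning the threshold for position j) are illustrative assumptions.

```python
def encode_integer(n, base, t):
    """Encode a nonnegative integer as little-endian generalized
    variable-length digits, using thresholds t(j)."""
    digits, j = [], 0
    while n >= t(j):
        tj = t(j)
        # A non-final digit always lies in the range tj .. base-1.
        digits.append(tj + (n - tj) % (base - tj))
        n = (n - tj) // (base - tj)
        j += 1
    digits.append(n)  # the final digit is the only one below its threshold
    return digits


# Base 8 with thresholds 2, 3, 5, 5, 5, ... (the running example).
thresholds = lambda j: (2, 3, 5)[min(j, 2)]
print(encode_integer(145, 8, thresholds))
print(encode_integer(62, 8, thresholds))
```

Encoding 145 and 62 with the base 8 thresholds 2, 3, 5, ... gives back the digit groups 7 3 4 and 2 5 1 from the decoding example.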
323323-324324- For any particular set of values of t(j), there is exactly one
325325- generalized variable-length representation of each nonnegative
326326- integral value.
327327-328328- Bootstring uses little-endian ordering so that the deltas can be
329329- separated starting with the first. The t(j) values are defined in
330330- terms of the constants base, tmin, and tmax, and a state variable
331331- called bias:
332332-333333- t(j) = base * (j + 1) - bias,
334334- clamped to the range tmin through tmax
341341-342342-343343- The clamping means that if the formula yields a value less than tmin
344344- or greater than tmax, then t(j) = tmin or tmax, respectively. (In
345345- the pseudocode in section 6 "Bootstring algorithms", the expression
346346- base * (j + 1) is denoted by k for performance reasons.) These t(j)
347347- values cause the representation to favor integers within a particular
348348- range determined by the bias.
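The clamped threshold formula is small enough to state directly. In this sketch the constants base = 36, tmin = 1, and tmax = 26 are the Punycode parameter values from section 5, and the function name `threshold` is an illustrative choice.

```python
BASE, TMIN, TMAX = 36, 1, 26  # Punycode parameter values (section 5)


def threshold(j: int, bias: int) -> int:
    """t(j) = base * (j + 1) - bias, clamped to the range tmin..tmax."""
    return max(TMIN, min(TMAX, BASE * (j + 1) - bias))


# With Punycode's initial bias of 72, the first two positions clamp to tmin.
print([threshold(j, 72) for j in range(4)])
```

A larger bias keeps more of the low-order positions at tmin, which is how the representation comes to favor larger integers.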
349349-350350-3.4 Bias adaptation
351351-352352- After each delta is encoded or decoded, bias is set for the next
353353- delta as follows:
354354-355355- 1. Delta is scaled in order to avoid overflow in the next step:
356356-357357- let delta = delta div 2
358358-359359- But when this is the very first delta, the divisor is not 2, but
360360- instead a constant called damp. This compensates for the fact
361361- that the second delta is usually much smaller than the first.
362362-363363- 2. Delta is increased to compensate for the fact that the next delta
364364- will be inserting into a longer string:
365365-366366- let delta = delta + (delta div numpoints)
367367-368368- numpoints is the total number of code points encoded/decoded so
369369- far (including the one corresponding to this delta itself, and
370370- including the basic code points).
371371-372372- 3. Delta is repeatedly divided until it falls within a threshold, to
373373- predict the minimum number of digits needed to represent the next
374374- delta:
375375-376376- while delta > ((base - tmin) * tmax) div 2
377377- do let delta = delta div (base - tmin)
378378-379379- 4. The bias is set:
380380-381381- let bias =
382382- (base * the number of divisions performed in step 3) +
383383- (((base - tmin + 1) * delta) div (delta + skew))
384384-385385- The motivation for this procedure is that the current delta
386386- provides a hint about the likely size of the next delta, and so
387387- t(j) is set to tmax for the more significant digits starting with
388388- the one expected to be last, tmin for the less significant digits
389389- up through the one expected to be third-last, and somewhere
390390- between tmin and tmax for the digit expected to be second-last
397397-398398-399399- (balancing the hope of the expected-last digit being unnecessary
400400- against the danger of it being insufficient).
401401-402402-4. Bootstring parameters
403403-404404- Given a set of basic code points, one needs to be designated as the
405405- delimiter. The base cannot be greater than the number of
406406- distinguishable basic code points remaining. The digit-values in the
407407- range 0 through base-1 need to be associated with distinct non-
408408- delimiter basic code points. In some cases multiple code points need
409409- to have the same digit-value; for example, uppercase and lowercase
410410- versions of the same letter need to be equivalent if basic strings
411411- are case-insensitive.
412412-413413- The initial value of n cannot be greater than the minimum non-basic
414414- code point that could appear in extended strings.
415415-416416- The remaining five parameters (tmin, tmax, skew, damp, and the
417417- initial value of bias) need to satisfy the following constraints:
418418-419419- 0 <= tmin <= tmax <= base-1
420420- skew >= 1
421421- damp >= 2
422422- initial_bias mod base <= base - tmin
423423-424424- Provided the constraints are satisfied, these five parameters affect
425425- efficiency but not correctness. They are best chosen empirically.
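
The constraints above are easy to check mechanically. The following sketch is one way to do so; the function name params_ok is an assumption for this example, and it covers only the five tunable parameters plus base, not the choice of delimiter or digit code points.

```c
/* Returns 1 if the parameters satisfy the Bootstring constraints, else 0.
   "0 <= tmin" holds automatically because the arguments are unsigned. */
int params_ok(unsigned base, unsigned tmin, unsigned tmax,
              unsigned skew, unsigned damp, unsigned initial_bias)
{
    return base >= 1
        && tmin <= tmax && tmax <= base - 1
        && skew >= 1
        && damp >= 2
        && initial_bias % base <= base - tmin;
}
```

The Punycode values listed in section 5 (36, 1, 26, 38, 700, 72) pass this check, since 72 mod 36 is 0.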
426426-427427- If support for mixed-case annotation is desired (see appendix A),
428428- make sure that the code points corresponding to 0 through tmax-1 all
429429- have both uppercase and lowercase forms.
430430-431431-5. Parameter values for Punycode
432432-433433- Punycode uses the following Bootstring parameter values:
434434-435435- base = 36
436436- tmin = 1
437437- tmax = 26
438438- skew = 38
439439- damp = 700
440440- initial_bias = 72
441441- initial_n = 128 = 0x80
442442-443443- Although the only restriction Punycode imposes on the input integers
444444- is that they be nonnegative, these parameters are especially designed
445445- to work well with Unicode [UNICODE] code points, which are integers
446446- in the range 0..10FFFF (but not D800..DFFF, which are reserved for
453453-454454-455455- use by the UTF-16 encoding of Unicode). The basic code points are
456456- the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the
457457- delimiter, and some of the others have digit-values as follows:
458458-459459- code points digit-values
460460- ------------ ----------------------
461461- 41..5A (A-Z) = 0 to 25, respectively
462462- 61..7A (a-z) = 0 to 25, respectively
463463- 30..39 (0-9) = 26 to 35, respectively
464464-465465- Using hyphen-minus as the delimiter implies that the encoded string
466466- can end with a hyphen-minus only if the Unicode string consists
467467- entirely of basic code points, but IDNA forbids such strings from
468468- being encoded. The encoded string can begin with a hyphen-minus, but
469469- IDNA prepends a prefix. Therefore IDNA using Punycode conforms to
470470- the RFC 952 rule that host name labels neither begin nor end with a
471471- hyphen-minus [RFC952].
472472-473473- A decoder MUST recognize the letters in both uppercase and lowercase
474474- forms (including mixtures of both forms). An encoder SHOULD output
475475- only uppercase forms or only lowercase forms, unless it uses mixed-
476476- case annotation (see appendix A).
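
The digit-value table and the case rules can be sketched as a pair of C helpers. The names decode_digit and encode_digit are chosen for this example, the code assumes an ASCII execution character set, and the encoder shown emits lowercase only (no mixed-case annotation).

```c
/* Returns the digit-value (0..35) of a basic code point, or -1 if the code
   point has no digit-value. Uppercase and lowercase letters are equivalent. */
int decode_digit(int cp)
{
    if (cp >= '0' && cp <= '9') return cp - '0' + 26;
    if (cp >= 'A' && cp <= 'Z') return cp - 'A';
    if (cp >= 'a' && cp <= 'z') return cp - 'a';
    return -1;
}

/* Returns the lowercase basic code point for digit-value d (0..35). */
int encode_digit(int d)
{
    return d < 26 ? 'a' + d : '0' + (d - 26);
}
```

Note that the delimiter '-' has no digit-value, so decode_digit returns -1 for it.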
477477-478478- Presumably most users will not manually write or type encoded strings
479479- (as opposed to cutting and pasting them), but those who do will need
480480- to be alert to the potential visual ambiguity between the following
481481- sets of characters:
482482-483483- G 6
484484- I l 1
485485- O 0
486486- S 5
487487- U V
488488- Z 2
489489-490490- Such ambiguities are usually resolved by context, but in a Punycode
491491- encoded string there is no context apparent to humans.
492492-493493-6. Bootstring algorithms
494494-495495- Some parts of the pseudocode can be omitted if the parameters satisfy
496496- certain conditions (for which Punycode qualifies). These parts are
497497- enclosed in {braces}, and notes immediately following the pseudocode
498498- explain the conditions under which they can be omitted.
509509-510510-511511- Formally, code points are integers, and hence the pseudocode assumes
512512- that arithmetic operations can be performed directly on code points.
513513- In some programming languages, explicit conversion between code
514514- points and integers might be necessary.
515515-516516-6.1 Bias adaptation function
517517-518518- function adapt(delta,numpoints,firsttime):
519519- if firsttime then let delta = delta div damp
520520- else let delta = delta div 2
521521- let delta = delta + (delta div numpoints)
522522- let k = 0
523523- while delta > ((base - tmin) * tmax) div 2 do begin
524524- let delta = delta div (base - tmin)
525525- let k = k + base
526526- end
527527- return k + (((base - tmin + 1) * delta) div (delta + skew))
528528-529529- It does not matter whether the modifications to delta and k inside
530530- adapt() affect variables of the same name inside the
531531- encoding/decoding procedures, because after calling adapt() the
532532- caller does not read those variables before overwriting them.
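
A direct C transcription of the adapt() pseudocode might look like the sketch below, using the Punycode parameter values from section 5. It is an illustration rather than the reference implementation; numpoints is assumed to be nonzero.

```c
/* Punycode parameter values from section 5. */
enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700 };

/* Bias adaptation: returns the bias to use for the next delta. */
unsigned adapt(unsigned delta, unsigned numpoints, int firsttime)
{
    unsigned k = 0;

    delta = firsttime ? delta / damp : delta / 2;
    delta += delta / numpoints;
    while (delta > (unsigned)((base - tmin) * tmax) / 2) {
        delta /= base - tmin;
        k += base;
    }
    return k + ((base - tmin + 1) * delta) / (delta + skew);
}
```

As a check against the traces in section 7.2, adapt(19853, 1, 1) should return 21 and adapt(64, 2, 0) should return 20, the first two bias values in the trace of example B.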
565565-566566-567567-6.2 Decoding procedure
568568-569569- let n = initial_n
570570- let i = 0
571571- let bias = initial_bias
572572- let output = an empty string indexed from 0
573573- consume all code points before the last delimiter (if there is one)
574574- and copy them to output, fail on any non-basic code point
575575- if more than zero code points were consumed then consume one more
576576- (which will be the last delimiter)
577577- while the input is not exhausted do begin
578578- let oldi = i
579579- let w = 1
580580- for k = base to infinity in steps of base do begin
581581- consume a code point, or fail if there was none to consume
582582- let digit = the code point's digit-value, fail if it has none
583583- let i = i + digit * w, fail on overflow
584584- let t = tmin if k <= bias {+ tmin}, or
585585- tmax if k >= bias + tmax, or k - bias otherwise
586586- if digit < t then break
587587- let w = w * (base - t), fail on overflow
588588- end
589589- let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
590590- let n = n + i div (length(output) + 1), fail on overflow
591591- let i = i mod (length(output) + 1)
592592- {if n is a basic code point then fail}
593593- insert n into output at position i
594594- increment i
595595- end
596596-597597- The full statement enclosed in braces (checking whether n is a basic
598598- code point) can be omitted if initial_n exceeds all basic code points
599599- (which is true for Punycode), because n is never less than initial_n.
600600-601601- In the assignment of t, where t is clamped to the range tmin through
602602- tmax, "+ tmin" can always be omitted. This makes the clamping
603603- calculation incorrect when bias < k < bias + tmin, but that cannot
604604- happen because of the way bias is computed and because of the
605605- constraints on the parameters.
606606-607607- Because the decoder state can only advance monotonically, and there
608608- is only one representation of any delta, there is therefore only one
609609- encoded string that can represent a given sequence of integers. The
610610- only error conditions are invalid code points, unexpected end-of-
611611- input, overflow, and basic code points encoded using deltas instead
612612- of appearing literally. If the decoder fails on these errors as
613613- shown above, then it cannot produce the same output for two distinct
614614- inputs. Without this property it would have been necessary to re-
621621-622622-623623- encode the output and verify that it matches the input in order to
624624- guarantee the uniqueness of the encoding.
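
The decoding procedure can be sketched in C as shown below. This is an illustrative transcription of the pseudocode with the Punycode parameters, not the sample implementation from appendix C: the name puny_decode, the NUL-terminated input, the unsigned long code point array, and the -1 failure convention are all assumptions of this example. It does not check that decoded values are valid Unicode code points.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
       initial_bias = 72, initial_n = 128 };

static unsigned long adapt(unsigned long delta, unsigned long numpoints,
                           int firsttime)
{
    unsigned long k = 0;

    delta = firsttime ? delta / damp : delta / 2;
    delta += delta / numpoints;
    while (delta > (unsigned long)((base - tmin) * tmax) / 2) {
        delta /= base - tmin;
        k += base;
    }
    return k + (base - tmin + 1) * delta / (delta + skew);
}

static int decode_digit(int cp)
{
    if (cp >= '0' && cp <= '9') return cp - '0' + 26;
    if (cp >= 'A' && cp <= 'Z') return cp - 'A';
    if (cp >= 'a' && cp <= 'z') return cp - 'a';
    return -1;
}

/* Decodes the NUL-terminated string in[] into code points in out[], which
   has room for max_out entries. Returns the number of code points written,
   or -1 on bad input, overflow, or insufficient space. */
long puny_decode(const char *in, unsigned long *out, size_t max_out)
{
    size_t in_len = strlen(in), len = 0, pos = 0, j;
    unsigned long n = initial_n, i = 0, bias = initial_bias;
    const char *last = strrchr(in, '-');

    /* Copy the code points before the last delimiter, if there is one. */
    if (last != NULL) {
        size_t b = (size_t)(last - in);
        for (j = 0; j < b; ++j) {
            if ((unsigned char)in[j] >= 0x80 || len >= max_out) return -1;
            out[len++] = (unsigned char)in[j];
        }
        pos = b > 0 ? b + 1 : 0;   /* skip the delimiter only if b > 0 */
    }

    while (pos < in_len) {
        unsigned long oldi = i, w = 1, k, t;

        for (k = base; ; k += base) {
            int digit;
            if (pos >= in_len) return -1;            /* ran out of input */
            digit = decode_digit((unsigned char)in[pos++]);
            if (digit < 0) return -1;                /* no digit-value */
            if ((unsigned long)digit > (ULONG_MAX - i) / w) return -1;
            i += (unsigned long)digit * w;
            t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
            if ((unsigned long)digit < t) break;
            if (w > ULONG_MAX / (base - t)) return -1;
            w *= base - t;
        }
        bias = adapt(i - oldi, len + 1, oldi == 0);
        if (i / (len + 1) > ULONG_MAX - n) return -1;
        n += i / (len + 1);
        i %= len + 1;
        if (len >= max_out) return -1;
        /* Insert n at position i, shifting later code points right. */
        memmove(out + i + 1, out + i, (len - i) * sizeof *out);
        out[i++] = n;
        ++len;
    }
    return (long)len;
}
```

Applied to the encoded strings of examples B and L in section 7.1, this sketch should reproduce the code point sequences listed there.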
625625-626626-6.3 Encoding procedure
627627-628628- let n = initial_n
629629- let delta = 0
630630- let bias = initial_bias
631631- let h = b = the number of basic code points in the input
632632- copy them to the output in order, followed by a delimiter if b > 0
633633- {if the input contains a non-basic code point < n then fail}
634634- while h < length(input) do begin
635635- let m = the minimum {non-basic} code point >= n in the input
636636- let delta = delta + (m - n) * (h + 1), fail on overflow
637637- let n = m
638638- for each code point c in the input (in order) do begin
639639- if c < n {or c is basic} then increment delta, fail on overflow
640640- if c == n then begin
641641- let q = delta
642642- for k = base to infinity in steps of base do begin
643643- let t = tmin if k <= bias {+ tmin}, or
644644- tmax if k >= bias + tmax, or k - bias otherwise
645645- if q < t then break
646646- output the code point for digit t + ((q - t) mod (base - t))
647647- let q = (q - t) div (base - t)
648648- end
649649- output the code point for digit q
650650- let bias = adapt(delta, h + 1, test h equals b?)
651651- let delta = 0
652652- increment h
653653- end
654654- end
655655- increment delta and n
656656- end
657657-658658- The full statement enclosed in braces (checking whether the input
659659- contains a non-basic code point less than n) can be omitted if all
660660- code points less than initial_n are basic code points (which is true
661661- for Punycode if code points are unsigned).
662662-663663- The brace-enclosed conditions "non-basic" and "or c is basic" can be
664664- omitted if initial_n exceeds all basic code points (which is true for
665665- Punycode), because the code point being tested is never less than
666666- initial_n.
667667-668668- In the assignment of t, where t is clamped to the range tmin through
669669- tmax, "+ tmin" can always be omitted. This makes the clamping
670670- calculation incorrect when bias < k < bias + tmin, but that cannot
677677-678678-679679- happen because of the way bias is computed and because of the
680680- constraints on the parameters.
681681-682682- The checks for overflow are necessary to avoid producing invalid
683683- output when the input contains very large values or is very long.
684684-685685- The increment of delta at the bottom of the outer loop cannot
686686- overflow because delta < length(input) before the increment, and
687687- length(input) is already assumed to be representable. The increment
688688- of n could overflow, but only if h == length(input), in which case
689689- the procedure is finished anyway.
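
The encoding procedure can be sketched in C in the same spirit. As before, this is an illustration under stated assumptions rather than the appendix C sample code: the name puny_encode, the unsigned long input array, the NUL-terminated output, the lowercase-only digits, and the -1 failure convention are choices made for this example. The brace-enclosed parts of the pseudocode are omitted, as permitted for Punycode.

```c
#include <limits.h>
#include <stddef.h>

enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
       initial_bias = 72, initial_n = 128 };

static unsigned long adapt(unsigned long delta, unsigned long numpoints,
                           int firsttime)
{
    unsigned long k = 0;

    delta = firsttime ? delta / damp : delta / 2;
    delta += delta / numpoints;
    while (delta > (unsigned long)((base - tmin) * tmax) / 2) {
        delta /= base - tmin;
        k += base;
    }
    return k + (base - tmin + 1) * delta / (delta + skew);
}

static char encode_digit(unsigned long d)
{
    return (char)(d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Append one character, always leaving room for the terminating NUL. */
#define PUT(c) do { if (o + 1 >= cap) return -1; out[o++] = (c); } while (0)

/* Encodes in[0..len) into the NUL-terminated string out[] of capacity cap.
   Returns the encoded length, or -1 on overflow or insufficient space. */
long puny_encode(const unsigned long *in, size_t len, char *out, size_t cap)
{
    unsigned long n = initial_n, delta = 0, bias = initial_bias, m, q, k, t;
    size_t h, b, j, o = 0;

    if (cap == 0) return -1;
    for (j = 0; j < len; ++j)              /* copy basic code points */
        if (in[j] < 0x80) PUT((char)in[j]);
    h = b = o;
    if (b > 0) PUT('-');

    while (h < len) {
        /* m = the minimum code point >= n in the input */
        m = ULONG_MAX;
        for (j = 0; j < len; ++j)
            if (in[j] >= n && in[j] < m) m = in[j];
        if (m - n > (ULONG_MAX - delta) / (h + 1)) return -1;
        delta += (m - n) * (h + 1);
        n = m;

        for (j = 0; j < len; ++j) {
            if (in[j] < n && ++delta == 0) return -1;
            if (in[j] == n) {
                /* Emit delta as a generalized variable-length integer. */
                q = delta;
                for (k = base; ; k += base) {
                    t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
                    if (q < t) break;
                    PUT(encode_digit(t + (q - t) % (base - t)));
                    q = (q - t) / (base - t);
                }
                PUT(encode_digit(q));
                bias = adapt(delta, h + 1, h == b);
                delta = 0;
                ++h;
            }
        }
        ++delta;
        ++n;
    }
    out[o] = '\0';
    return (long)o;
}
```

Given the code points of example L from section 7.1, this sketch should produce "3B-ww4c5e180e575a65lsy2b"; a pure ASCII input such as "abc" comes out as "abc-".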
690690-691691-6.4 Overflow handling
692692-693693- For IDNA, 26-bit unsigned integers are sufficient to handle all valid
694694- IDNA labels without overflow, because any string that needed a 27-bit
695695- delta would have to exceed either the code point limit (0..10FFFF) or
696696- the label length limit (63 characters). However, overflow handling
697697- is necessary because the inputs are not necessarily valid IDNA
698698- labels.
699699-700700- If the programming language does not provide overflow detection, the
701701- following technique can be used. Suppose A, B, and C are
702702- representable nonnegative integers and C is nonzero. Then A + B
703703- overflows if and only if B > maxint - A, and A + (B * C) overflows if
704704- and only if B > (maxint - A) div C, where maxint is the greatest
705705- integer for which maxint + 1 cannot be represented. Refer to
706706- appendix C "Punycode sample implementation" for demonstrations of
707707- this technique in the C language.
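
The two overflow tests described above can be written as small C predicates. In this sketch UINT_MAX plays the role of maxint, and the function names are assumptions for the example.

```c
#include <limits.h>

/* Returns 1 if a + b cannot be represented in an unsigned int. */
int add_overflows(unsigned a, unsigned b)
{
    return b > UINT_MAX - a;
}

/* Returns 1 if a + (b * c) cannot be represented in an unsigned int.
   c must be nonzero. */
int muladd_overflows(unsigned a, unsigned b, unsigned c)
{
    return b > (UINT_MAX - a) / c;
}
```

Both predicates evaluate only expressions that are themselves representable, so the check never triggers the overflow it is guarding against.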
708708-709709- The decoding and encoding algorithms shown in sections 6.2 and 6.3
710710- handle overflow by detecting it whenever it happens. Another
711711- approach is to enforce limits on the inputs that prevent overflow
712712- from happening. For example, if the encoder were to verify that no
713713- input code points exceed M and that the input length does not exceed
714714- L, then no delta could ever exceed (M - initial_n) * (L + 1), and
715715- hence no overflow could occur if integer variables were capable of
716716- representing values that large. This prevention approach would
717717- impose more restrictions on the input than the detection approach
718718- does, but might be considered simpler in some programming languages.
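
To make the prevention bound concrete, the sketch below evaluates (M - initial_n) * (L + 1) for caller-chosen limits. The function name max_delta is hypothetical, and the limits used in the note that follows (M = 0x10FFFF, L = 63) are just one plausible choice.

```c
/* Upper bound on any delta when no input code point exceeds max_cp and the
   input length does not exceed max_len. Assumes max_cp >= 128 (initial_n). */
unsigned long max_delta(unsigned long max_cp, unsigned long max_len)
{
    return (max_cp - 128) * (max_len + 1);
}
```

With M = 0x10FFFF and L = 63 the bound is 71294912, which fits comfortably in a 32-bit unsigned integer.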
719719-720720- In theory, the decoder could use an analogous approach, limiting the
721721- number of digits in a variable-length integer (that is, limiting the
722722- number of iterations in the innermost loop). However, the number of
723723- digits that suffice to represent a given delta can sometimes
724724- represent much larger deltas (because of the adaptation), and hence
725725- this approach would probably need integers wider than 32 bits.
733733-734734-735735- Yet another approach for the decoder is to allow overflow to occur,
736736- but to check the final output string by re-encoding it and comparing
737737- to the decoder input. If and only if they do not match (using a
738738- case-insensitive ASCII comparison) overflow has occurred. This
739739- delayed-detection approach would not impose any more restrictions on
740740- the input than the immediate-detection approach does, and might be
741741- considered simpler in some programming languages.
742742-743743- In fact, if the decoder is used only inside the IDNA ToUnicode
744744- operation [IDNA], then it need not check for overflow at all, because
745745- ToUnicode performs a higher level re-encoding and comparison, and a
746746- mismatch has the same consequence as if the Punycode decoder had
747747- failed.
748748-749749-7. Punycode examples
750750-751751-7.1 Sample strings
752752-753753- In the Punycode encodings below, the ACE prefix is not shown.
754754- Backslashes show where line breaks have been inserted in strings too
755755- long for one line.
756756-757757- The first several examples are all translations of the sentence "Why
758758- can't they just speak in <language>?" (courtesy of Michael Kaplan's
759759- "provincial" page [PROVINCIAL]). Word breaks and punctuation have
760760- been removed, as is often done in domain names.
761761-762762- (A) Arabic (Egyptian):
763763- u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
764764- u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
765765- Punycode: egbpdaj6bu4bxfgehfvwxn
766766-767767- (B) Chinese (simplified):
768768- u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
769769- Punycode: ihqwcrb4cv8a8dqg056pqjye
770770-771771- (C) Chinese (traditional):
772772- u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
773773- Punycode: ihqwctvzc91f659drss3x8bo0yb
774774-775775- (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
776776- U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
777777- u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
778778- u+0065 u+0073 u+006B u+0079
779779- Punycode: Proprostnemluvesky-uyb24dma41a
789789-790790-791791- (E) Hebrew:
792792- u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
793793- u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
794794- u+05D1 u+05E8 u+05D9 u+05EA
795795- Punycode: 4dbcagdahymbxekheh6e0a7fei0b
796796-797797- (F) Hindi (Devanagari):
798798- u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
799799- u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
800800- u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
801801- u+0939 u+0948 u+0902
802802- Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
803803-804804- (G) Japanese (kanji and hiragana):
805805- u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
806806- u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
807807- Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
808808-809809- (H) Korean (Hangul syllables):
810810- u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
811811- u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
812812- u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
813813- Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
814814- psd879ccm6fea98c
815815-816816- (I) Russian (Cyrillic):
817817- U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
818818- u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
819819- u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
820820- u+0438
821821- Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
822822-823823- (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
824824- U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
825825- u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
826826- u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
827827- u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
828828- u+0061 u+00F1 u+006F u+006C
829829- Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
830830-831831- (K) Vietnamese:
832832- T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
833833- <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
834834- U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
835835- u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
836836- u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
837837- U+0056 u+0069 u+1EC7 u+0074
838838- Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
845845-846846-847847- The next several examples are all names of Japanese music artists,
848848- song titles, and TV programs, just because the author happens to have
849849- them handy (but Japanese is useful for providing examples of single-
850850- row text, two-row text, ideographic text, and various mixtures
851851- thereof).
852852-853853- (L) 3<nen>B<gumi><kinpachi><sensei>
854854- u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
855855- Punycode: 3B-ww4c5e180e575a65lsy2b
856856-857857- (M) <amuro><namie>-with-SUPER-MONKEYS
858858- u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
859859- u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
860860- U+004F U+004E U+004B U+0045 U+0059 U+0053
861861- Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
862862-863863- (N) Hello-Another-Way-<sorezore><no><basho>
864864- U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
865865- u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
866866- u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
867867- Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
868868-869869- (O) <hitotsu><yane><no><shita>2
870870- u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
871871- Punycode: 2-u9tlzr9756bt3uc0v
872872-873873- (P) Maji<de>Koi<suru>5<byou><mae>
874874- U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
875875- u+308B u+0035 u+79D2 u+524D
876876- Punycode: MajiKoi5-783gue6qz075azm5e
877877-878878- (Q) <pafii>de<runba>
879879- u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
880880- Punycode: de-jg4avhby1noc0d
881881-882882- (R) <sono><supiido><de>
883883- u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
884884- Punycode: d9juau41awczczp
885885-886886- The last example is an ASCII string that breaks the existing rules
887887- for host name labels. (It is not a realistic example for IDNA,
888888- because IDNA never encodes pure ASCII labels.)
889889-890890- (S) -> $1.00 <-
891891- u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
892892- u+003C u+002D
893893- Punycode: -> $1.00 <--
901901-902902-903903-7.2 Decoding traces
904904-905905- In the following traces, the evolving state of the decoder is shown
906906- as a sequence of hexadecimal values, representing the code points in
907907- the extended string. An asterisk appears just after the most
908908- recently inserted code point, indicating both n (the value preceding
909909- the asterisk) and i (the position of the value just after the
910910- asterisk). Other numerical values are decimal.
911911-912912- Decoding trace of example B from section 7.1:
913913-914914- n is 128, i is 0, bias is 72
915915- input is "ihqwcrb4cv8a8dqg056pqjye"
916916- there is no delimiter, so extended string starts empty
917917- delta "ihq" decodes to 19853
918918- bias becomes 21
919919- 4E0D *
920920- delta "wc" decodes to 64
921921- bias becomes 20
922922- 4E0D 4E2D *
923923- delta "rb" decodes to 37
924924- bias becomes 13
925925- 4E3A * 4E0D 4E2D
926926- delta "4c" decodes to 56
927927- bias becomes 17
928928- 4E3A 4E48 * 4E0D 4E2D
929929- delta "v8a" decodes to 599
930930- bias becomes 32
931931- 4E3A 4EC0 * 4E48 4E0D 4E2D
932932- delta "8d" decodes to 130
933933- bias becomes 23
934934- 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
935935- delta "qg" decodes to 154
936936- bias becomes 25
937937- 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
938938- delta "056p" decodes to 46301
939939- bias becomes 84
940940- 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
941941- delta "qjye" decodes to 88531
942942- bias becomes 90
943943- 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
957957-958958-959959- Decoding trace of example L from section 7.1:
960960-961961- n is 128, i is 0, bias is 72
962962- input is "3B-ww4c5e180e575a65lsy2b"
963963- literal portion is "3B-", so extended string starts as:
964964- 0033 0042
965965- delta "ww4c" decodes to 62042
966966- bias becomes 27
967967- 0033 0042 5148 *
968968- delta "5e" decodes to 139
969969- bias becomes 24
970970- 0033 0042 516B * 5148
971971- delta "180e" decodes to 16683
972972- bias becomes 67
973973- 0033 5E74 * 0042 516B 5148
974974- delta "575a" decodes to 34821
975975- bias becomes 82
976976- 0033 5E74 0042 516B 5148 751F *
977977- delta "65l" decodes to 14592
978978- bias becomes 67
979979- 0033 5E74 0042 7D44 * 516B 5148 751F
980980- delta "sy2b" decodes to 42088
981981- bias becomes 84
982982- 0033 5E74 0042 7D44 91D1 * 516B 5148 751F
10131013-10141014-10151015-7.3 Encoding traces
10161016-10171017- In the following traces, code point values are hexadecimal, while
10181018- other numerical values are decimal.
10191019-10201020- Encoding trace of example B from section 7.1:
10211021-10221022- bias is 72
10231023- input is:
10241024- 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
10251025- there are no basic code points, so no literal portion
10261026- next code point to insert is 4E0D
10271027- needed delta is 19853, encodes as "ihq"
10281028- bias becomes 21
10291029- next code point to insert is 4E2D
10301030- needed delta is 64, encodes as "wc"
10311031- bias becomes 20
10321032- next code point to insert is 4E3A
10331033- needed delta is 37, encodes as "rb"
10341034- bias becomes 13
10351035- next code point to insert is 4E48
10361036- needed delta is 56, encodes as "4c"
10371037- bias becomes 17
10381038- next code point to insert is 4EC0
10391039- needed delta is 599, encodes as "v8a"
10401040- bias becomes 32
10411041- next code point to insert is 4ED6
10421042- needed delta is 130, encodes as "8d"
10431043- bias becomes 23
10441044- next code point to insert is 4EEC
10451045- needed delta is 154, encodes as "qg"
10461046- bias becomes 25
10471047- next code point to insert is 6587
10481048- needed delta is 46301, encodes as "056p"
10491049- bias becomes 84
10501050- next code point to insert is 8BF4
10511051- needed delta is 88531, encodes as "qjye"
10521052- bias becomes 90
10531053- output is "ihqwcrb4cv8a8dqg056pqjye"
10691069-10701070-10711071- Encoding trace of example L from section 7.1:
10721072-10731073- bias is 72
10741074- input is:
10751075- 0033 5E74 0042 7D44 91D1 516B 5148 751F
10761076- basic code points (0033, 0042) are copied to literal portion: "3B-"
10771077- next code point to insert is 5148
10781078- needed delta is 62042, encodes as "ww4c"
10791079- bias becomes 27
10801080- next code point to insert is 516B
10811081- needed delta is 139, encodes as "5e"
10821082- bias becomes 24
10831083- next code point to insert is 5E74
10841084- needed delta is 16683, encodes as "180e"
10851085- bias becomes 67
10861086- next code point to insert is 751F
10871087- needed delta is 34821, encodes as "575a"
10881088- bias becomes 82
10891089- next code point to insert is 7D44
10901090- needed delta is 14592, encodes as "65l"
10911091- bias becomes 67
10921092- next code point to insert is 91D1
10931093- needed delta is 42088, encodes as "sy2b"
10941094- bias becomes 84
10951095- output is "3B-ww4c5e180e575a65lsy2b"
10961096-10971097-8. Security Considerations
10981098-10991099- Users expect each domain name in DNS to be controlled by a single
11001100- authority. If a Unicode string intended for use as a domain label
11011101- could map to multiple ACE labels, then an internationalized domain
11021102- name could map to multiple ASCII domain names, each controlled by a
11031103- different authority, some of which could be spoofs that hijack
11041104- service requests intended for another. Therefore Punycode is
11051105- designed so that each Unicode string has a unique encoding.
11061106-11071107- However, there can still be multiple Unicode representations of the
11081108- "same" text, for various definitions of "same". This problem is
11091109- addressed to some extent by the Unicode standard under the topic of
11101110- canonicalization, and this work is leveraged for domain names by
11111111- Nameprep [NAMEPREP].
11251125-11261126-11271127-9. References
11281128-11291129-9.1 Normative References
11301130-11311131- [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
11321132- Requirement Levels", BCP 14, RFC 2119, March 1997.
11331133-11341134-9.2 Informative References
11351135-11361136- [RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet
11371137- Host Table Specification", RFC 952, October 1985.
11381138-11391139- [RFC1034] Mockapetris, P., "Domain Names - Concepts and
11401140- Facilities", STD 13, RFC 1034, November 1987.
11411141-11421142- [IDNA] Faltstrom, P., Hoffman, P. and A. Costello,
11431143- "Internationalizing Domain Names in Applications
11441144- (IDNA)", RFC 3490, March 2003.
11451145-11461146- [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
11471147- Profile for Internationalized Domain Names (IDN)", RFC
11481148- 3491, March 2003.
11491149-11501150- [ASCII] Cerf, V., "ASCII format for Network Interchange", RFC
11511151- 20, October 1969.
11521152-11531153- [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",
11541154- http://www.trigeminal.com/samples/provincial.html.
11551155-11561156- [UNICODE] The Unicode Consortium, "The Unicode Standard",
11571157- http://www.unicode.org/unicode/standard/standard.html.
11811181-11821182-11831183-A. Mixed-case annotation
11841184-11851185- In order to use Punycode to represent case-insensitive strings,
11861186- higher layers need to case-fold the strings prior to Punycode
11871187- encoding. The encoded string can use mixed case as an annotation
11881188- telling how to convert the folded string into a mixed-case string for
11891189- display purposes. Note, however, that mixed-case annotation is not
11901190- used by the ToASCII and ToUnicode operations specified in [IDNA], and
11911191- therefore implementors of IDNA can disregard this appendix.
11921192-11931193- Basic code points can use mixed case directly, because the decoder
11941194- copies them verbatim, leaving lowercase code points lowercase, and
11951195- leaving uppercase code points uppercase. Each non-basic code point
11961196- is represented by a delta, which is represented by a sequence of
11971197- basic code points, the last of which provides the annotation. If it
11981198- is uppercase, it is a suggestion to map the non-basic code point to
11991199- uppercase (if possible); if it is lowercase, it is a suggestion to
12001200- map the non-basic code point to lowercase (if possible).
12011201-12021202- These annotations do not alter the code points returned by decoders;
12031203- the annotations are returned separately, for the caller to use or
12041204- ignore. Encoders can accept annotations in addition to code points,
12051205- but the annotations do not alter the output, except to influence the
12061206- uppercase/lowercase form of ASCII letters.
12071207-12081208- Punycode encoders and decoders need not support these annotations,
12091209- and higher layers need not use them.
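As an illustration of the annotation rule described in this appendix, the following C sketch shows how a decoder might read the case suggestion from the final digit of a delta, and how a caller might apply that suggestion to ASCII letters. The helper names (`is_flagged`, `annotation_of_delta`, `apply_case_ascii`) are hypothetical and are not part of the RFC's sample code. Mapping non-ASCII code points would need a real Unicode case-mapping table, which this sketch deliberately leaves out.

```c
#include <stddef.h>

/* Nonzero when the basic code point bcp is an uppercase ASCII letter.
   This mirrors the idea of the flagged() macro in the sample code. */
static int is_flagged(unsigned char bcp)
{
    return (unsigned)bcp - 'A' < 26u;
}

/* Hypothetical helper: a delta is written as a run of basic code points,
   and only the last one carries the annotation. Digits 0..9 have no
   uppercase form, so they always read as "lowercase". */
static int annotation_of_delta(const char *digits, size_t len)
{
    return len > 0 && is_flagged((unsigned char)digits[len - 1]);
}

/* Hypothetical helper: apply a case suggestion to one code point.
   Only ASCII letters are handled; anything else is returned unchanged,
   which is where a caller would plug in Unicode case mapping. */
static unsigned long apply_case_ascii(unsigned long cp, int flag)
{
    if (flag && cp - 'a' < 26u) return cp - 32u;
    if (!flag && cp - 'A' < 26u) return cp + 32u;
    return cp;
}
```

Under these assumptions, `annotation_of_delta("3yA", 3)` reports an uppercase suggestion while `annotation_of_delta("3Ya", 3)` does not, because only the final digit is consulted.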
12101210-12111211-B. Disclaimer and license
12121212-12131213- Regarding this entire document or any portion of it (including the
12141214- pseudocode and C code), the author makes no guarantees and is not
12151215- responsible for any damage resulting from its use. The author grants
12161216- irrevocable permission to anyone to use, modify, and distribute it in
12171217- any way that does not diminish the rights of anyone else to use,
12181218- modify, and distribute it, provided that redistributed derivative
12191219- works do not contain misleading author or version information.
12201220- Derivative works need not be licensed under similar terms.
12371237-12381238-12391239-C. Punycode sample implementation
12401240-12411241-/*
12421242-punycode.c from RFC 3492
12431243-http://www.nicemice.net/idn/
12441244-Adam M. Costello
12451245-http://www.nicemice.net/amc/
12461246-12471247-This is ANSI C code (C89) implementing Punycode (RFC 3492).
12481248-12491249-*/
12501250-12511251-12521252-/************************************************************/
12531253-/* Public interface (would normally go in its own .h file): */
12541254-12551255-#include <limits.h>
12561256-12571257-enum punycode_status {
12581258- punycode_success,
12591259- punycode_bad_input, /* Input is invalid. */
12601260- punycode_big_output, /* Output would exceed the space provided. */
12611261- punycode_overflow /* Input needs wider integers to process. */
12621262-};
12631263-12641264-#if UINT_MAX >= (1 << 26) - 1
12651265-typedef unsigned int punycode_uint;
12661266-#else
12671267-typedef unsigned long punycode_uint;
12681268-#endif
12691269-12701270-enum punycode_status punycode_encode(
12711271- punycode_uint input_length,
12721272- const punycode_uint input[],
12731273- const unsigned char case_flags[],
12741274- punycode_uint *output_length,
12751275- char output[] );
12761276-12771277- /* punycode_encode() converts Unicode to Punycode. The input */
12781278- /* is represented as an array of Unicode code points (not code */
12791279- /* units; surrogate pairs are not allowed), and the output */
12801280- /* will be represented as an array of ASCII code points. The */
12811281- /* output string is *not* null-terminated; it will contain */
12821282- /* zeros if and only if the input contains zeros. (Of course */
12831283- /* the caller can leave room for a terminator and add one if */
12841284- /* needed.) The input_length is the number of code points in */
12851285- /* the input. The output_length is an in/out argument: the */
12861286- /* caller passes in the maximum number of code points that it */
12931293-12941294-12951295- /* can receive, and on successful return it will contain the */
12961296- /* number of code points actually output. The case_flags array */
12971297- /* holds input_length boolean values, where nonzero suggests that */
12981298- /* the corresponding Unicode character be forced to uppercase */
12991299- /* after being decoded (if possible), and zero suggests that */
13001300- /* it be forced to lowercase (if possible). ASCII code points */
13011301- /* are encoded literally, except that ASCII letters are forced */
13021302- /* to uppercase or lowercase according to the corresponding */
13031303- /* uppercase flags. If case_flags is a null pointer then ASCII */
13041304- /* letters are left as they are, and other code points are */
13051305- /* treated as if their uppercase flags were zero. The return */
13061306- /* value can be any of the punycode_status values defined above */
13071307- /* except punycode_bad_input; if not punycode_success, then */
13081308- /* output_length and output might contain garbage. */
13091309-13101310-enum punycode_status punycode_decode(
13111311- punycode_uint input_length,
13121312- const char input[],
13131313- punycode_uint *output_length,
13141314- punycode_uint output[],
13151315- unsigned char case_flags[] );
13161316-13171317- /* punycode_decode() converts Punycode to Unicode. The input is */
13181318- /* represented as an array of ASCII code points, and the output */
13191319- /* will be represented as an array of Unicode code points. The */
13201320- /* input_length is the number of code points in the input. The */
13211321- /* output_length is an in/out argument: the caller passes in */
13221322- /* the maximum number of code points that it can receive, and */
13231323- /* on successful return it will contain the actual number of */
13241324- /* code points output. The case_flags array needs room for at */
13251325- /* least output_length values, or it can be a null pointer if the */
13261326- /* case information is not needed. A nonzero flag suggests that */
13271327- /* the corresponding Unicode character be forced to uppercase */
13281328- /* by the caller (if possible), while zero suggests that it be */
13291329- /* forced to lowercase (if possible). ASCII code points are */
13301330- /* output already in the proper case, but their flags will be set */
13311331- /* appropriately so that applying the flags would be harmless. */
13321332- /* The return value can be any of the punycode_status values */
13331333- /* defined above; if not punycode_success, then output_length, */
13341334- /* output, and case_flags might contain garbage. On success, the */
13351335- /* decoder will never need to write an output_length greater than */
13361336- /* input_length, because of how the encoding is defined. */
13371337-13381338-/**********************************************************/
13391339-/* Implementation (would normally go in its own .c file): */
13401340-13411341-#include <string.h>
13491349-13501350-13511351-/*** Bootstring parameters for Punycode ***/
13521352-13531353-enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
13541354- initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
13551355-13561356-/* basic(cp) tests whether cp is a basic code point: */
13571357-#define basic(cp) ((punycode_uint)(cp) < 0x80)
13581358-13591359-/* delim(cp) tests whether cp is a delimiter: */
13601360-#define delim(cp) ((cp) == delimiter)
13611361-13621362-/* decode_digit(cp) returns the numeric value of a basic code */
13631363-/* point (for use in representing integers) in the range 0 to */
13641364-/* base-1, or base if cp does not represent a value. */
13651365-13661366-static punycode_uint decode_digit(punycode_uint cp)
13671367-{
13681368- return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
13691369- cp - 97 < 26 ? cp - 97 : base;
13701370-}
13711371-13721372-/* encode_digit(d,flag) returns the basic code point whose value */
13731373-/* (when used for representing integers) is d, which needs to be in */
13741374-/* the range 0 to base-1. The lowercase form is used unless flag is */
13751375-/* nonzero, in which case the uppercase form is used. The behavior */
13761376-/* is undefined if flag is nonzero and digit d has no uppercase form. */
13771377-13781378-static char encode_digit(punycode_uint d, int flag)
13791379-{
13801380- return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
13811381- /* 0..25 map to ASCII a..z or A..Z */
13821382- /* 26..35 map to ASCII 0..9 */
13831383-}
13841384-13851385-/* flagged(bcp) tests whether a basic code point is flagged */
13861386-/* (uppercase). The behavior is undefined if bcp is not a */
13871387-/* basic code point. */
13881388-13891389-#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
13901390-13911391-/* encode_basic(bcp,flag) forces a basic code point to lowercase */
13921392-/* if flag is zero, uppercase if flag is nonzero, and returns */
13931393-/* the resulting code point. The code point is unchanged if it */
13941394-/* is caseless. The behavior is undefined if bcp is not a basic */
13951395-/* code point. */
13961396-13971397-static char encode_basic(punycode_uint bcp, int flag)
13981398-{
14051405-14061406-14071407- bcp -= (bcp - 97 < 26) << 5;
14081408- return bcp + ((!flag && (bcp - 65 < 26)) << 5);
14091409-}
14101410-14111411-/*** Platform-specific constants ***/
14121412-14131413-/* maxint is the maximum value of a punycode_uint variable: */
14141414-static const punycode_uint maxint = -1;
14151415-/* Because maxint is unsigned, -1 becomes the maximum value. */
14161416-14171417-/*** Bias adaptation function ***/
14181418-14191419-static punycode_uint adapt(
14201420- punycode_uint delta, punycode_uint numpoints, int firsttime )
14211421-{
14221422- punycode_uint k;
14231423-14241424- delta = firsttime ? delta / damp : delta >> 1;
14251425- /* delta >> 1 is a faster way of doing delta / 2 */
14261426- delta += delta / numpoints;
14271427-14281428- for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) {
14291429- delta /= base - tmin;
14301430- }
14311431-14321432- return k + (base - tmin + 1) * delta / (delta + skew);
14331433-}
14341434-14351435-/*** Main encode function ***/
14361436-14371437-enum punycode_status punycode_encode(
14381438- punycode_uint input_length,
14391439- const punycode_uint input[],
14401440- const unsigned char case_flags[],
14411441- punycode_uint *output_length,
14421442- char output[] )
14431443-{
14441444- punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
14451445-14461446- /* Initialize the state: */
14471447-14481448- n = initial_n;
14491449- delta = out = 0;
14501450- max_out = *output_length;
14511451- bias = initial_bias;
14521452-14531453- /* Handle the basic code points: */
14541454-14551455-14561456-14571457-14581458-Costello Standards Track [Page 26]
14591459-14601460-RFC 3492 IDNA Punycode March 2003
14611461-14621462-14631463- for (j = 0; j < input_length; ++j) {
14641464- if (basic(input[j])) {
14651465- if (max_out - out < 2) return punycode_big_output;
14661466- output[out++] =
14671467- case_flags ? encode_basic(input[j], case_flags[j]) : input[j];
14681468- }
14691469- /* else if (input[j] < n) return punycode_bad_input; */
14701470- /* (not needed for Punycode with unsigned code points) */
14711471- }
14721472-14731473- h = b = out;
14741474-14751475- /* h is the number of code points that have been handled, b is the */
14761476- /* number of basic code points, and out is the number of characters */
14771477- /* that have been output. */
14781478-14791479- if (b > 0) output[out++] = delimiter;
14801480-14811481- /* Main encoding loop: */
14821482-14831483- while (h < input_length) {
14841484- /* All non-basic code points < n have been */
14851485- /* handled already. Find the next larger one: */
14861486-14871487- for (m = maxint, j = 0; j < input_length; ++j) {
14881488- /* if (basic(input[j])) continue; */
14891489- /* (not needed for Punycode) */
14901490- if (input[j] >= n && input[j] < m) m = input[j];
14911491- }
14921492-14931493- /* Increase delta enough to advance the decoder's */
14941494- /* <n,i> state to <m,0>, but guard against overflow: */
14951495-14961496- if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
14971497- delta += (m - n) * (h + 1);
14981498- n = m;
14991499-15001500- for (j = 0; j < input_length; ++j) {
15011501- /* Punycode does not need to check whether input[j] is basic: */
15021502- if (input[j] < n /* || basic(input[j]) */ ) {
15031503- if (++delta == 0) return punycode_overflow;
15041504- }
15051505-15061506- if (input[j] == n) {
15071507- /* Represent delta as a generalized variable-length integer: */
15081508-15091509- for (q = delta, k = base; ; k += base) {
15101510- if (out >= max_out) return punycode_big_output;
15171517-15181518-15191519- t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
15201520- k >= bias + tmax ? tmax : k - bias;
15211521- if (q < t) break;
15221522- output[out++] = encode_digit(t + (q - t) % (base - t), 0);
15231523- q = (q - t) / (base - t);
15241524- }
15251525-15261526- output[out++] = encode_digit(q, case_flags && case_flags[j]);
15271527- bias = adapt(delta, h + 1, h == b);
15281528- delta = 0;
15291529- ++h;
15301530- }
15311531- }
15321532-15331533- ++delta, ++n;
15341534- }
15351535-15361536- *output_length = out;
15371537- return punycode_success;
15381538-}
15391539-15401540-/*** Main decode function ***/
15411541-15421542-enum punycode_status punycode_decode(
15431543- punycode_uint input_length,
15441544- const char input[],
15451545- punycode_uint *output_length,
15461546- punycode_uint output[],
15471547- unsigned char case_flags[] )
15481548-{
15491549- punycode_uint n, out, i, max_out, bias,
15501550- b, j, in, oldi, w, k, digit, t;
15511551-15521552- /* Initialize the state: */
15531553-15541554- n = initial_n;
15551555- out = i = 0;
15561556- max_out = *output_length;
15571557- bias = initial_bias;
15581558-15591559- /* Handle the basic code points: Let b be the number of input code */
15601560- /* points before the last delimiter, or 0 if there is none, then */
15611561- /* copy the first b code points to the output. */
15621562-15631563- for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j;
15641564- if (b > max_out) return punycode_big_output;
15651565-15661566- for (j = 0; j < b; ++j) {
15731573-15741574-15751575- if (case_flags) case_flags[out] = flagged(input[j]);
15761576- if (!basic(input[j])) return punycode_bad_input;
15771577- output[out++] = input[j];
15781578- }
15791579-15801580- /* Main decoding loop: Start just after the last delimiter if any */
15811581- /* basic code points were copied; start at the beginning otherwise. */
15821582-15831583- for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {
15841584-15851585- /* in is the index of the next character to be consumed, and */
15861586- /* out is the number of code points in the output array. */
15871587-15881588- /* Decode a generalized variable-length integer into delta, */
15891589- /* which gets added to i. The overflow checking is easier */
15901590- /* if we increase i as we go, then subtract off its starting */
15911591- /* value at the end to obtain delta. */
15921592-15931593- for (oldi = i, w = 1, k = base; ; k += base) {
15941594- if (in >= input_length) return punycode_bad_input;
15951595- digit = decode_digit(input[in++]);
15961596- if (digit >= base) return punycode_bad_input;
15971597- if (digit > (maxint - i) / w) return punycode_overflow;
15981598- i += digit * w;
15991599- t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
16001600- k >= bias + tmax ? tmax : k - bias;
16011601- if (digit < t) break;
16021602- if (w > maxint / (base - t)) return punycode_overflow;
16031603- w *= (base - t);
16041604- }
16051605-16061606- bias = adapt(i - oldi, out + 1, oldi == 0);
16071607-16081608- /* i was supposed to wrap around from out+1 to 0, */
16091609- /* incrementing n each time, so we'll fix that now: */
16101610-16111611- if (i / (out + 1) > maxint - n) return punycode_overflow;
16121612- n += i / (out + 1);
16131613- i %= (out + 1);
16141614-16151615- /* Insert n at position i of the output: */
16161616-16171617- /* not needed for Punycode: */
16181618- /* if (decode_digit(n) <= base) return punycode_bad_input; */
16191619- if (out >= max_out) return punycode_big_output;
16201620-16211621- if (case_flags) {
16221622- memmove(case_flags + i + 1, case_flags + i, out - i);
16291629-16301630-16311631- /* Case of last character determines uppercase flag: */
16321632- case_flags[i] = flagged(input[in - 1]);
16331633- }
16341634-16351635- memmove(output + i + 1, output + i, (out - i) * sizeof *output);
16361636- output[i++] = n;
16371637- }
16381638-16391639- *output_length = out;
16401640- return punycode_success;
16411641-}
16421642-16431643-/******************************************************************/
16441644-/* Wrapper for testing (would normally go in a separate .c file): */
16451645-16461646-#include <assert.h>
16471647-#include <stdio.h>
16481648-#include <stdlib.h>
16491649-#include <string.h>
16501650-16511651-/* For testing, we'll just set some compile-time limits rather than */
16521652-/* use malloc(), and set a compile-time option rather than using a */
16531653-/* command-line option. */
16541654-16551655-enum {
16561656- unicode_max_length = 256,
16571657- ace_max_length = 256
16581658-};
16591659-16601660-static void usage(char **argv)
16611661-{
16621662- fprintf(stderr,
16631663- "\n"
16641664- "%s -e reads code points and writes a Punycode string.\n"
16651665- "%s -d reads a Punycode string and writes code points.\n"
16661666- "\n"
16671667- "Input and output are plain text in the native character set.\n"
16681668- "Code points are in the form u+hex separated by whitespace.\n"
16691669- "Although the specification allows Punycode strings to contain\n"
16701670- "any characters from the ASCII repertoire, this test code\n"
16711671- "supports only the printable characters, and needs the Punycode\n"
16721672- "string to be followed by a newline.\n"
16731673- "The case of the u in u+hex is the force-to-uppercase flag.\n"
16741674- , argv[0], argv[0]);
16751675- exit(EXIT_FAILURE);
16761676-}
16771677-16781678-static void fail(const char *msg)
16851685-16861686-16871687-{
16881688- fputs(msg,stderr);
16891689- exit(EXIT_FAILURE);
16901690-}
16911691-16921692-static const char too_big[] =
16931693- "input or output is too large, recompile with larger limits\n";
16941694-static const char invalid_input[] = "invalid input\n";
16951695-static const char overflow[] = "arithmetic overflow\n";
16961696-static const char io_error[] = "I/O error\n";
16971697-16981698-/* The following string is used to convert printable */
16991699-/* characters between ASCII and the native charset: */
17001700-17011701-static const char print_ascii[] =
17021702- "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
17031703- "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
17041704- " !\"#$%&'()*+,-./"
17051705- "0123456789:;<=>?"
17061706- "@ABCDEFGHIJKLMNO"
17071707- "PQRSTUVWXYZ[\\]^_"
17081708- "`abcdefghijklmno"
17091709- "pqrstuvwxyz{|}~\n";
17101710-17111711-int main(int argc, char **argv)
17121712-{
17131713- enum punycode_status status;
17141714- int r;
17151715- unsigned int input_length, output_length, j;
17161716- unsigned char case_flags[unicode_max_length];
17171717-17181718- if (argc != 2) usage(argv);
17191719- if (argv[1][0] != '-') usage(argv);
17201720- if (argv[1][2] != 0) usage(argv);
17211721-17221722- if (argv[1][1] == 'e') {
17231723- punycode_uint input[unicode_max_length];
17241724- unsigned long codept;
17251725- char output[ace_max_length+1], uplus[3];
17261726- int c;
17271727-17281728- /* Read the input code points: */
17291729-17301730- input_length = 0;
17311731-17321732- for (;;) {
17331733- r = scanf("%2s%lx", uplus, &codept);
17341734- if (ferror(stdin)) fail(io_error);
17411741-17421742-17431743- if (r == EOF || r == 0) break;
17441744-17451745- if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
17461746- fail(invalid_input);
17471747- }
17481748-17491749- if (input_length == unicode_max_length) fail(too_big);
17501750-17511751- if (uplus[0] == 'u') case_flags[input_length] = 0;
17521752- else if (uplus[0] == 'U') case_flags[input_length] = 1;
17531753- else fail(invalid_input);
17541754-17551755- input[input_length++] = codept;
17561756- }
17571757-17581758- /* Encode: */
17591759-17601760- output_length = ace_max_length;
17611761- status = punycode_encode(input_length, input, case_flags,
17621762- &output_length, output);
17631763- if (status == punycode_bad_input) fail(invalid_input);
17641764- if (status == punycode_big_output) fail(too_big);
17651765- if (status == punycode_overflow) fail(overflow);
17661766- assert(status == punycode_success);
17671767-17681768- /* Convert to native charset and output: */
17691769-17701770- for (j = 0; j < output_length; ++j) {
17711771- c = output[j];
17721772- assert(c >= 0 && c <= 127);
17731773- if (print_ascii[c] == 0) fail(invalid_input);
17741774- output[j] = print_ascii[c];
17751775- }
17761776-17771777- output[j] = 0;
17781778- r = puts(output);
17791779- if (r == EOF) fail(io_error);
17801780- return EXIT_SUCCESS;
17811781- }
17821782-17831783- if (argv[1][1] == 'd') {
17841784- char input[ace_max_length+2], *p, *pp;
17851785- punycode_uint output[unicode_max_length];
17861786-17871787- /* Read the Punycode input string and convert to ASCII: */
17881788-17891789- fgets(input, ace_max_length+2, stdin);
17901790- if (ferror(stdin)) fail(io_error);
17971797-17981798-17991799- if (feof(stdin)) fail(invalid_input);
18001800- input_length = strlen(input) - 1;
18011801- if (input[input_length] != '\n') fail(too_big);
18021802- input[input_length] = 0;
18031803-18041804- for (p = input; *p != 0; ++p) {
18051805- pp = strchr(print_ascii, *p);
18061806- if (pp == 0) fail(invalid_input);
18071807- *p = pp - print_ascii;
18081808- }
18091809-18101810- /* Decode: */
18111811-18121812- output_length = unicode_max_length;
18131813- status = punycode_decode(input_length, input, &output_length,
18141814- output, case_flags);
18151815- if (status == punycode_bad_input) fail(invalid_input);
18161816- if (status == punycode_big_output) fail(too_big);
18171817- if (status == punycode_overflow) fail(overflow);
18181818- assert(status == punycode_success);
18191819-18201820- /* Output the result: */
18211821-18221822- for (j = 0; j < output_length; ++j) {
18231823- r = printf("%s+%04lX\n",
18241824- case_flags[j] ? "U" : "u",
18251825- (unsigned long) output[j] );
18261826- if (r < 0) fail(io_error);
18271827- }
18281828-18291829- return EXIT_SUCCESS;
18301830- }
18311831-18321832- usage(argv);
18331833- return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
18341834-}
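The two core steps of the encoder in the listing above, writing one delta as a generalized variable-length integer and then adapting the bias, can be sketched in isolation. The block below is a minimal, self-contained restatement under stated assumptions: the names `digit_char`, `threshold`, `encode_delta`, `adapt_bias`, and the `PUNY_*` macros are hypothetical, and the NUL-terminated output buffer is a convenience of this sketch, not something the RFC code does.

```c
#include <stddef.h>

/* Bootstring parameters for Punycode, same values as the sample code. */
#define PUNY_BASE 36u
#define PUNY_TMIN 1u
#define PUNY_TMAX 26u
#define PUNY_SKEW 38u
#define PUNY_DAMP 700u

/* Digit values 0..25 map to 'a'..'z', and 26..35 map to '0'..'9'. */
static char digit_char(unsigned d)
{
    return (char)(d < 26u ? 'a' + d : '0' + (d - 26u));
}

/* Threshold t for digit position k under the current bias. */
static unsigned threshold(unsigned k, unsigned bias)
{
    return k <= bias ? PUNY_TMIN
         : k >= bias + PUNY_TMAX ? PUNY_TMAX
         : k - bias;
}

/* Writes the digits of delta into out, followed by a NUL.
   Returns the number of digits, or 0 if cap is too small. */
static size_t encode_delta(unsigned delta, unsigned bias,
                           char *out, size_t cap)
{
    size_t n = 0;
    unsigned q = delta, k, t;

    for (k = PUNY_BASE; ; k += PUNY_BASE) {
        t = threshold(k, bias);
        if (q < t) break;                 /* q fits in the final digit */
        if (n + 1 >= cap) return 0;
        out[n++] = digit_char(t + (q - t) % (PUNY_BASE - t));
        q = (q - t) / (PUNY_BASE - t);
    }

    if (n + 1 >= cap) return 0;
    out[n++] = digit_char(q);             /* final digit ends the integer */
    out[n] = '\0';
    return n;
}

/* Bias adaptation, following the same arithmetic as adapt() above. */
static unsigned adapt_bias(unsigned delta, unsigned numpoints, int firsttime)
{
    unsigned k = 0;

    delta = firsttime ? delta / PUNY_DAMP : delta / 2u;
    delta += delta / numpoints;
    while (delta > ((PUNY_BASE - PUNY_TMIN) * PUNY_TMAX) / 2u) {
        delta /= PUNY_BASE - PUNY_TMIN;
        k += PUNY_BASE;
    }
    return k + (PUNY_BASE - PUNY_TMIN + 1u) * delta / (delta + PUNY_SKEW);
}
```

As a worked check, encoding "münchen" leaves a single non-basic code point, U+00FC, whose delta works out to (0xFC - 0x80) * 7 + 1 = 869. With the initial bias of 72, `encode_delta(869, 72, buf, sizeof buf)` produces the digits `3ya`, which matches the suffix of `mnchen-3ya`, and `adapt_bias(869, 7, 1)` then gives a new bias of 0.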
18531853-18541854-18551855-Author's Address
18561856-18571857- Adam M. Costello
18581858- University of California, Berkeley
18591859- http://www.nicemice.net/amc/
18601860-18611861-18621862-18631863-18641864-18651865-18661866-18671867-18681868-18691869-18701870-18711871-18721872-18731873-18741874-18751875-18761876-18771877-18781878-18791879-18801880-18811881-18821882-18831883-18841884-18851885-18861886-18871887-18881888-18891889-18901890-18911891-18921892-18931893-18941894-18951895-18961896-18971897-18981898-18991899-19001900-19011901-19021902-19031903-19041904-19051905-19061906-Costello Standards Track [Page 34]
19071907-19081908-RFC 3492 IDNA Punycode March 2003
19091909-19101910-19111911-Full Copyright Statement
19121912-19131913- Copyright (C) The Internet Society (2003). All Rights Reserved.
19141914-19151915- This document and translations of it may be copied and furnished to
19161916- others, and derivative works that comment on or otherwise explain it
19171917- or assist in its implementation may be prepared, copied, published
19181918- and distributed, in whole or in part, without restriction of any
19191919- kind, provided that the above copyright notice and this paragraph are
19201920- included on all such copies and derivative works. However, this
19211921- document itself may not be modified in any way, such as by removing
19221922- the copyright notice or references to the Internet Society or other
19231923- Internet organizations, except as needed for the purpose of
19241924- developing Internet standards in which case the procedures for
19251925- copyrights defined in the Internet Standards process must be
19261926- followed, or as required to translate it into languages other than
19271927- English.
19281928-19291929- The limited permissions granted above are perpetual and will not be
19301930- revoked by the Internet Society or its successors or assigns.
19311931-19321932- This document and the information contained herein is provided on an
19331933- "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
19341934- TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
19351935- BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
19361936- HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
19371937- MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
19381938-19391939-Acknowledgement
19401940-19411941- Funding for the RFC Editor function is currently provided by the
19421942- Internet Society.
19631963-
-955
ocaml-punycode/spec/rfc5891.txt
···11-22-33-44-55-66-77-Internet Engineering Task Force (IETF) J. Klensin
88-Request for Comments: 5891 August 2010
99-Obsoletes: 3490, 3491
1010-Updates: 3492
1111-Category: Standards Track
1212-ISSN: 2070-1721
1313-1414-1515- Internationalized Domain Names in Applications (IDNA): Protocol
1616-1717-Abstract
1818-1919- This document is the revised protocol definition for
2020- Internationalized Domain Names (IDNs). The rationale for changes,
2121- the relationship to the older specification, and important
2222- terminology are provided in other documents. This document specifies
2323- the protocol mechanism, called Internationalized Domain Names in
2424- Applications (IDNA), for registering and looking up IDNs in a way
2525- that does not require changes to the DNS itself. IDNA is only meant
2626- for processing domain names, not free text.
2727-2828-Status of This Memo
2929-3030- This is an Internet Standards Track document.
3131-3232- This document is a product of the Internet Engineering Task Force
3333- (IETF). It represents the consensus of the IETF community. It has
3434- received public review and has been approved for publication by the
3535- Internet Engineering Steering Group (IESG). Further information on
3636- Internet Standards is available in Section 2 of RFC 5741.
3737-3838- Information about the current status of this document, any errata,
3939- and how to provide feedback on it may be obtained at
4040- http://www.rfc-editor.org/info/rfc5891.
4141-4242-4343-4444-4545-4646-4747-4848-4949-5050-5151-5252-5353-5454-5555-5656-5757-5858-Klensin Standards Track [Page 1]
5959-6060-RFC 5891 IDNA2008 Protocol August 2010
6161-6262-6363-Copyright Notice
6464-6565- Copyright (c) 2010 IETF Trust and the persons identified as the
6666- document authors. All rights reserved.
6767-6868- This document is subject to BCP 78 and the IETF Trust's Legal
6969- Provisions Relating to IETF Documents
7070- (http://trustee.ietf.org/license-info) in effect on the date of
7171- publication of this document. Please review these documents
7272- carefully, as they describe your rights and restrictions with respect
7373- to this document. Code Components extracted from this document must
7474- include Simplified BSD License text as described in Section 4.e of
7575- the Trust Legal Provisions and are provided without warranty as
7676- described in the Simplified BSD License.
7777-7878- This document may contain material from IETF Documents or IETF
7979- Contributions published or made publicly available before November
8080- 10, 2008. The person(s) controlling the copyright in some of this
8181- material may not have granted the IETF Trust the right to allow
8282- modifications of such material outside the IETF Standards Process.
8383- Without obtaining an adequate license from the person(s) controlling
8484- the copyright in such materials, this document may not be modified
8585- outside the IETF Standards Process, and derivative works of it may
8686- not be created outside the IETF Standards Process, except to format
8787- it for publication as an RFC or to translate it into languages other
8888- than English.
117117-118118-119119-Table of Contents
120120-121121- 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
122122- 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4
123123- 3. Requirements and Applicability . . . . . . . . . . . . . . . . 5
124124- 3.1. Requirements . . . . . . . . . . . . . . . . . . . . . . . 5
125125- 3.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 5
126126- 3.2.1. DNS Resource Records . . . . . . . . . . . . . . . . . 6
127127- 3.2.2. Non-Domain-Name Data Types Stored in the DNS . . . . . 6
128128- 4. Registration Protocol . . . . . . . . . . . . . . . . . . . . 6
129129- 4.1. Input to IDNA Registration . . . . . . . . . . . . . . . . 7
130130- 4.2. Permitted Character and Label Validation . . . . . . . . . 7
131131- 4.2.1. Input Format . . . . . . . . . . . . . . . . . . . . . 7
132132- 4.2.2. Rejection of Characters That Are Not Permitted . . . . 8
133133- 4.2.3. Label Validation . . . . . . . . . . . . . . . . . . . 8
134134- 4.2.4. Registration Validation Requirements . . . . . . . . . 9
135135- 4.3. Registry Restrictions . . . . . . . . . . . . . . . . . . 9
136136- 4.4. Punycode Conversion . . . . . . . . . . . . . . . . . . . 9
137137- 4.5. Insertion in the Zone . . . . . . . . . . . . . . . . . . 10
138138- 5. Domain Name Lookup Protocol . . . . . . . . . . . . . . . . . 10
139139- 5.1. Label String Input . . . . . . . . . . . . . . . . . . . . 10
140140- 5.2. Conversion to Unicode . . . . . . . . . . . . . . . . . . 10
141141- 5.3. A-label Input . . . . . . . . . . . . . . . . . . . . . . 10
142142- 5.4. Validation and Character List Testing . . . . . . . . . . 11
143143- 5.5. Punycode Conversion . . . . . . . . . . . . . . . . . . . 13
144144- 5.6. DNS Name Resolution . . . . . . . . . . . . . . . . . . . 13
145145- 6. Security Considerations . . . . . . . . . . . . . . . . . . . 13
146146- 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13
147147- 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 13
148148- 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 14
149149- 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 14
150150- 10.1. Normative References . . . . . . . . . . . . . . . . . . . 14
151151- 10.2. Informative References . . . . . . . . . . . . . . . . . . 15
152152- Appendix A. Summary of Major Changes from IDNA2003 . . . . . . . 17
153153-154154-155155-156156-157157-158158-159159-160160-161161-162162-163163-164164-165165-166166-167167-168168-169169-170170-Klensin Standards Track [Page 3]
171171-172172-RFC 5891 IDNA2008 Protocol August 2010
173173-174174-175175-1. Introduction
176176-177177- This document supplies the protocol definition for Internationalized
178178- Domain Names in Applications (IDNA), with the version specified here
179179- known as IDNA2008. Essential definitions and terminology for
180180- understanding this document and a road map of the collection of
181181- documents that make up IDNA2008 appear in a separate Definitions
182182- document [RFC5890]. Appendix A discusses the relationship between
183183- this specification and the earlier version of IDNA (referred to here
184184- as "IDNA2003"). The rationale for these changes, along with
185185- considerable explanatory material and advice to zone administrators
186186- who support IDNs, is provided in another document, known informally
187187- in this series as the "Rationale document" [RFC5894].
188188-189189- IDNA works by allowing applications to use certain ASCII [ASCII]
190190- string labels (beginning with a special prefix) to represent
191191- non-ASCII name labels. Lower-layer protocols need not be aware of
192192- this; therefore, IDNA does not change any infrastructure. In
193193- particular, IDNA does not depend on any changes to DNS servers,
194194- resolvers, or DNS protocol elements, because the ASCII name service
195195- provided by the existing DNS can be used for IDNA.
196196-197197- IDNA applies only to a specific subset of DNS labels. The base DNS
198198- standards [RFC1034] [RFC1035] and their various updates specify how
199199- to combine labels into fully-qualified domain names and parse labels
200200- out of those names.
201201-202202- This document describes two separate protocols, one for IDN
203203- registration (Section 4) and one for IDN lookup (Section 5). These
204204- two protocols share some terminology, reference data, and operations.
205205-206206-2. Terminology
207207-208208- As mentioned above, terminology used as part of the definition of
209209- IDNA appears in the Definitions document [RFC5890]. It is worth
210210- noting that some of this terminology overlaps with, and is consistent
211211- with, that used in Unicode or other character set standards and the
212212- DNS. Readers of this document are assumed to be familiar with the
213213- associated Definitions document and with the DNS-specific terminology
214214- in RFC 1034 [RFC1034].
215215-216216- The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
217217- "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
218218- document are to be interpreted as described in BCP 14, RFC 2119
219219- [RFC2119].
229229-230230-231231-3. Requirements and Applicability
232232-233233-3.1. Requirements
234234-235235- IDNA makes the following requirements:
236236-237237- 1. Whenever a domain name is put into a domain name slot that is not
238238- IDNA-aware (see Section 2.3.2.6 of the Definitions document
239239- [RFC5890]), it MUST contain only ASCII characters (i.e., its
240240- labels must be either A-labels or NR-LDH labels), unless the DNS
241241- application is not subject to historical recommendations for
242242- "hostname"-style names (see RFC 1034 [RFC1034] and
243243- Section 3.2.1).
244244-245245- 2. Labels MUST be compared using equivalent forms: either both
246246- A-label forms or both U-label forms. Because A-labels and
247247- U-labels can be transformed into each other without loss of
248248- information, these comparisons are equivalent (however, in
249249- practice, comparison of U-labels requires first verifying that
250250- they actually are U-labels and not just Unicode strings). A pair
251251- of A-labels MUST be compared as case-insensitive ASCII (as with
252252- all comparisons of ASCII DNS labels). U-labels MUST be compared
253253- as-is, without case folding or other intermediate steps. While
254254- it is not necessary to validate labels in order to compare them,
255255- successful comparison does not imply validity. In many cases,
256256- not limited to comparison, validation may be important for other
257257- reasons and SHOULD be performed.
258258-259259- 3. Labels being registered MUST conform to the requirements of
260260- Section 4. Labels being looked up and the lookup process MUST
261261- conform to the requirements of Section 5.
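
The comparison rules in requirement 2 can be sketched as follows. This is a minimal illustration, not part of the specification; the function names are hypothetical, and the U-label comparison assumes both inputs are already validated U-labels in the same encoding (UTF-8 here).

```ocaml
(* Hypothetical helper: A-labels are compared as case-insensitive ASCII,
   like any other ASCII DNS label. *)
let a_labels_equal (a : string) (b : string) : bool =
  String.equal (String.lowercase_ascii a) (String.lowercase_ascii b)

(* Hypothetical helper: U-labels are compared as-is, with no case folding
   or other intermediate step. Assumes both strings are UTF-8. *)
let u_labels_equal (a : string) (b : string) : bool =
  String.equal a b
```

Mixing the two forms in one comparison is not covered by either helper; a caller would first convert both labels to the same form.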
262262-263263-3.2. Applicability
264264-265265- IDNA applies to all domain names in all domain name slots in
266266- protocols except where it is explicitly excluded. It does not apply
267267- to domain name slots that do not use the LDH syntax rules as
268268- described in the Definitions document [RFC5890].
269269-270270- Because it uses the DNS, IDNA applies to many protocols that were
271271- specified before it was designed. IDNs occupying domain name slots
272272- in those older protocols MUST be in A-label form until and unless
273273- those protocols and their implementations are explicitly upgraded to
274274- be aware of IDNs and to accept the U-label form. IDNs actually
275275- appearing in DNS queries or responses MUST be A-labels.
285285-286286-287287- IDNA-aware protocols and implementations MAY accept U-labels,
288288- A-labels, or both as those particular protocols specify. IDNA is not
289289- defined for extended label types (see RFC 2671 [RFC2671], Section 3).
290290-291291-3.2.1. DNS Resource Records
292292-293293- IDNA applies only to domain names in the NAME and RDATA fields of DNS
294294- resource records whose CLASS is IN. See the DNS specification
295295- [RFC1035] for precise definitions of these terms.
296296-297297- The application of IDNA to DNS resource records depends entirely on
298298- the CLASS of the record, and not on the TYPE except as noted below.
299299- This will remain true, even as new TYPEs are defined, unless a new
300300- TYPE defines TYPE-specific rules. Special naming conventions for SRV
301301- records (and "underscore labels" more generally) are incompatible
302302- with IDNA coding as discussed in the Definitions document [RFC5890],
303303- especially Section 2.3.2.3. Of course, underscore labels may be part
304304- of a domain that uses IDN labels at higher levels in the tree.
305305-306306-3.2.2. Non-Domain-Name Data Types Stored in the DNS
307307-308308- Although IDNA enables the representation of non-ASCII characters in
309309- domain names, that does not imply that IDNA enables the
310310- representation of non-ASCII characters in other data types that are
311311- stored in domain names, specifically in the RDATA field for types
312312- that have structured RDATA format. For example, an email address
313313- local part is stored in a domain name in the RNAME field as part of
314314- the RDATA of an SOA record (e.g., hostmaster@example.com would be
315315- represented as hostmaster.example.com). IDNA does not update the
316316- existing email standards, which allow only ASCII characters in local
317317- parts. Even though work is in progress to define
318318- internationalization for email addresses [RFC4952], changes to the
319319- email address part of the SOA RDATA would require action in, or
320320- updates to, other standards, specifically those that specify the
321321- format of the SOA RR.
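
The SOA RNAME example above (hostmaster@example.com stored as hostmaster.example.com) can be sketched as a small conversion. This is an illustrative assumption, not text from the specification: the function name is hypothetical, and the sketch declines local parts that contain a dot or a non-ASCII byte rather than attempting any escaping.

```ocaml
(* True when every character of [s] satisfies [p]. *)
let for_all_chars (p : char -> bool) (s : string) : bool =
  let rec go i = i >= String.length s || (p s.[i] && go (i + 1)) in
  go 0

(* Hypothetical helper: turn "local@domain" into the "local.domain" form
   used in the SOA RNAME field. Returns None when there is no usable '@',
   when the local part contains '.', or when it is not pure ASCII. *)
let rname_of_mailbox (mailbox : string) : string option =
  match String.index_opt mailbox '@' with
  | Some i when i > 0 && i < String.length mailbox - 1 ->
      let local = String.sub mailbox 0 i in
      let domain =
        String.sub mailbox (i + 1) (String.length mailbox - i - 1) in
      let ascii c = Char.code c < 128 in
      if String.contains local '.' || not (for_all_chars ascii local)
      then None
      else Some (local ^ "." ^ domain)
  | _ -> None
```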
322322-323323-4. Registration Protocol
324324-325325- This section defines the model for registering an IDN. The model is
326326- implementation independent; any sequence of steps that produces
327327- exactly the same result for all labels is considered a valid
328328- implementation.
329329-330330- Note that, while the registration (this section) and lookup protocols
331331- (Section 5) are very similar in most respects, they are not
332332- identical, and implementers should carefully follow the steps
333333- described in this specification.
341341-342342-343343-4.1. Input to IDNA Registration
344344-345345- Registration processes, especially processing by entities (often
346346- called "registrars") who deal with registrants before the request
347347- actually reaches the zone manager ("registry") are outside the scope
348348- of this definition and may differ significantly depending on local
349349- needs. By the time a string enters the IDNA registration process as
350350- described in this specification, it MUST be in Unicode and in
351351- Normalization Form C (NFC [Unicode-UAX15]). Entities responsible for
352352- zone files ("registries") MUST accept only the exact string for which
353353- registration is requested, free of any mappings or local adjustments.
354354- They MAY accept that input in any of three forms:
355355-356356- 1. As a pair of A-label and U-label.
357357-358358- 2. As an A-label only.
359359-360360- 3. As a U-label only.
361361-362362- The first two of these forms are RECOMMENDED because the use of
363363- A-labels avoids any possibility of ambiguity. The first is normally
364364- preferred over the second because it permits further verification of
365365- user intent (see Section 4.2.1).
366366-367367-4.2. Permitted Character and Label Validation
368368-369369-4.2.1. Input Format
370370-371371- If both the U-label and A-label forms are available, the registry
372372- MUST ensure that the A-label form is in lowercase, perform a
373373- conversion to a U-label, perform the steps and tests described below
374374- on that U-label, and then verify that the A-label produced by the
375375- step in Section 4.4 matches the one provided as input. In addition,
376376- the U-label that was provided as input and the one obtained by
377377- conversion of the A-label MUST match exactly. If, for some reason,
378378- these tests fail, the registration MUST be rejected.
379379-380380- If only an A-label was provided and the conversion to a U-label is
381381- not performed, the registry MUST still verify that the A-label is
382382- superficially valid, i.e., that it does not violate any of the rules
383383- of Punycode encoding [RFC3492] such as the prohibition on trailing
384384- hyphen-minus, the requirement that all characters be ASCII, and so
385385- on. Strings that appear to be A-labels (e.g., they start with
386386- "xn--") and strings that are supplied to the registry in a context
387387- reserved for A-labels (such as a field in a form to be filled out),
388388- but that are not valid A-labels as described in this paragraph, MUST
389389- NOT be placed in DNS zones that support IDNA.
397397-398398-399399- If only an A-label is provided, the conversion to a U-label is not
400400- performed, but the superficial tests described in the previous
401401- paragraph are performed, registration procedures MAY, and usually
402402- will, bypass the tests and actions in the balance of Section 4.2 and
403403- in Sections 4.3 and 4.4.
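
The superficial A-label check described in Section 4.2.1 might look like the sketch below. The function names are hypothetical. The sketch is deliberately stricter than "all characters be ASCII": it assumes the registry has already lowercased the label and accepts only lowercase letters, digits, and hyphen-minus, and it applies the 63-octet DNS label limit. It does not run the Punycode decoder, so it cannot detect every malformed encoding.

```ocaml
(* True when every character of [s] satisfies [p]. *)
let for_all_chars (p : char -> bool) (s : string) : bool =
  let rec go i = i >= String.length s || (p s.[i] && go (i + 1)) in
  go 0

(* Lowercase letter, digit, or hyphen-minus. *)
let is_ldh (c : char) : bool =
  (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c = '-'

(* Hypothetical helper: superficial validity of a putative A-label.
   Checks the "xn--" prefix, a non-empty remainder, the 63-octet limit,
   the absence of a trailing hyphen-minus, and the LDH repertoire. *)
let superficially_valid_a_label (s : string) : bool =
  let n = String.length s in
  n > 4 && n <= 63
  && String.sub s 0 4 = "xn--"
  && s.[n - 1] <> '-'
  && for_all_chars is_ldh s
```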
404404-405405-4.2.2. Rejection of Characters That Are Not Permitted
406406-407407- The candidate Unicode string MUST NOT contain characters that appear
408408- in the "DISALLOWED" and "UNASSIGNED" lists specified in the Tables
409409- document [RFC5892].
410410-411411-4.2.3. Label Validation
412412-413413- The proposed label (in the form of a Unicode string, i.e., a string
414414- that at least superficially appears to be a U-label) is then examined
415415- using tests that require examination of more than one character.
416416- Character order is considered to be the on-the-wire order. That
417417- order may not be the same as the display order.
418418-419419-4.2.3.1. Hyphen Restrictions
420420-421421- The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
422422- the third and fourth character positions and MUST NOT start or end
423423- with a "-" (hyphen).
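
A minimal sketch of the hyphen test, assuming the label has already been decoded into an array of Unicode code points so that "third and fourth character positions" are array indices 2 and 3. The function name and the code point array representation are assumptions made for illustration.

```ocaml
(* U+002D HYPHEN-MINUS. *)
let hyphen = 0x2D

(* Hypothetical helper: true when the label satisfies Section 4.2.3.1,
   i.e. no "--" in character positions 3 and 4, and no leading or
   trailing hyphen. [cps] holds one Unicode code point per element. *)
let hyphens_ok (cps : int array) : bool =
  let n = Array.length cps in
  let double_at_3_4 = n >= 4 && cps.(2) = hyphen && cps.(3) = hyphen in
  let at_edge = n > 0 && (cps.(0) = hyphen || cps.(n - 1) = hyphen) in
  not double_at_3_4 && not at_edge
```

Working on code points rather than UTF-8 bytes matters here: a label whose first two characters are non-ASCII would place the third character at a different byte offset.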
424424-425425-4.2.3.2. Leading Combining Marks
426426-427427- The Unicode string MUST NOT begin with a combining mark or combining
428428- character (see The Unicode Standard, Section 2.11 [Unicode] for an
429429- exact definition).
430430-431431-4.2.3.3. Contextual Rules
432432-433433- The Unicode string MUST NOT contain any characters whose validity is
434434- context-dependent, unless the validity is positively confirmed by a
435435- contextual rule. To check this, each code point identified as
436436- CONTEXTJ or CONTEXTO in the Tables document [RFC5892] MUST have a
437437- non-null rule. If such a code point is missing a rule, the label is
438438- invalid. If the rule exists but the result of applying the rule is
439439- negative or inconclusive, the proposed label is invalid.
440440-441441-4.2.3.4. Labels Containing Characters Written Right to Left
442442-443443- If the proposed label contains any characters from scripts that are
444444- written from right to left, it MUST meet the Bidi criteria [RFC5893].
453453-454454-455455-4.2.4. Registration Validation Requirements
456456-457457- Strings that contain at least one non-ASCII character, have been
458458- produced by the steps above, whose contents pass all of the tests in
459459- Section 4.2.3, and are 63 or fewer characters long in
460460- ASCII-compatible encoding (ACE) form (see Section 4.4), are U-labels.
461461-462462- To summarize, tests are made in Section 4.2 for invalid characters,
463463- invalid combinations of characters, for labels that are invalid even
464464- if the characters they contain are valid individually, and for labels
465465- that do not conform to the restrictions for strings containing
466466- right-to-left characters.
467467-468468-4.3. Registry Restrictions
469469-470470- In addition to the rules and tests above, there are many reasons why
471471- a registry could reject a label. Registries at all levels of the
472472- DNS, not just the top level, are expected to establish policies about
473473- label registrations. Policies are likely to be informed by the local
474474- languages and the scripts that are used to write them and may depend
475475- on many factors including what characters are in the label (for
476476- example, a label may be rejected based on other labels already
477477- registered). See the Rationale document [RFC5894], Section 3.2, for
478478- further discussion and recommendations about registry policies.
479479-480480- The string produced by the steps in Section 4.2 is checked and
481481- processed as appropriate to local registry restrictions. Application
482482- of those registry restrictions may result in the rejection of some
483483- labels or the application of special restrictions to others.
484484-485485-4.4. Punycode Conversion
486486-487487- The resulting U-label is converted to an A-label (defined in Section
488488- 2.3.2.1 of the Definitions document [RFC5890]). The A-label is the
489489- encoding of the U-label according to the Punycode algorithm [RFC3492]
490490- with the ACE prefix "xn--" added at the beginning of the string. The
491491- resulting string must, of course, conform to the length limits
492492- imposed by the DNS. This document does not update or alter the
493493- Punycode algorithm specified in RFC 3492 in any way. RFC 3492 does
494494- make a non-normative reference to the information about the value and
495495- construction of the ACE prefix that appears in RFC 3490 or Nameprep
496496- [RFC3491]. For consistency and reader convenience, IDNA2008
497497- effectively updates that reference to point to this document. That
498498- change does not alter the prefix itself. The prefix, "xn--", is the
499499- same in both sets of documents.
509509-510510-511511- With the exception of the maximum string length test on Punycode
512512- output, the failure conditions identified in the Punycode encoding
513513- procedure cannot occur if the input is a U-label as determined by the
514514- steps in Sections 4.1 through 4.3 above.
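
The conversion in Section 4.4 can be sketched as a Punycode encoder followed by the ACE prefix and the length test. This is a self-contained illustration written from the RFC 3492 encoding procedure, not the code of any particular library; the names encode and to_a_label are hypothetical, the input is assumed to be an already validated U-label given as an array of Unicode scalar values, and integer overflow checks are omitted because OCaml integers are wide enough for label-sized input.

```ocaml
(* Bootstring parameters for Punycode (RFC 3492, Section 5). *)
let base = 36 and tmin = 1 and tmax = 26 and skew = 38 and damp = 700
let initial_bias = 72 and initial_n = 128

(* Bias adaptation (RFC 3492, Section 6.1). *)
let adapt delta numpoints first =
  let delta = if first then delta / damp else delta / 2 in
  let delta = ref (delta + delta / numpoints) in
  let k = ref 0 in
  while !delta > ((base - tmin) * tmax) / 2 do
    delta := !delta / (base - tmin);
    k := !k + base
  done;
  !k + ((base - tmin + 1) * !delta) / (!delta + skew)

(* Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'. *)
let digit d = if d < 26 then Char.chr (d + 97) else Char.chr (d - 26 + 48)

(* Hypothetical helper: Punycode-encode an array of code points. *)
let encode (cps : int array) : string =
  let buf = Buffer.create 16 in
  (* Copy the basic (ASCII) code points first, in order. *)
  Array.iter (fun c -> if c < 128 then Buffer.add_char buf (Char.chr c)) cps;
  let b = Buffer.length buf in
  if b > 0 then Buffer.add_char buf '-';
  let n = ref initial_n and delta = ref 0 and bias = ref initial_bias in
  let h = ref b in
  while !h < Array.length cps do
    (* Smallest code point not yet handled. *)
    let m =
      Array.fold_left
        (fun acc c -> if c >= !n && c < acc then c else acc) max_int cps in
    delta := !delta + (m - !n) * (!h + 1);
    n := m;
    Array.iter (fun c ->
      if c < !n then incr delta;
      if c = !n then begin
        (* Emit delta as a generalized variable-length integer. *)
        let q = ref !delta and k = ref base and more = ref true in
        while !more do
          let t =
            if !k <= !bias then tmin
            else if !k >= !bias + tmax then tmax
            else !k - !bias in
          if !q < t then more := false
          else begin
            Buffer.add_char buf (digit (t + (!q - t) mod (base - t)));
            q := (!q - t) / (base - t);
            k := !k + base
          end
        done;
        Buffer.add_char buf (digit !q);
        bias := adapt !delta (!h + 1) (!h = b);
        delta := 0;
        incr h
      end) cps;
    incr delta;
    incr n
  done;
  Buffer.contents buf

(* Hypothetical helper: add the ACE prefix and enforce the 63-octet limit. *)
let to_a_label (cps : int array) : (string, string) result =
  let a = "xn--" ^ encode cps in
  if String.length a <= 63 then Ok a else Error "A-label exceeds 63 octets"
```

For example, the code points of "münchen" (0x6D, 0xFC, 0x6E, 0x63, 0x68, 0x65, 0x6E) yield Ok "xn--mnchen-3ya".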
515515-516516-4.5. Insertion in the Zone
517517-518518- The label is registered in the DNS by inserting the A-label into a
519519- zone.
520520-521521-5. Domain Name Lookup Protocol
522522-523523- Lookup is different from registration and different tests are applied
524524- on the client. Although some validity checks are necessary to avoid
525525- serious problems with the protocol, the lookup-side tests are more
526526- permissive and rely on the assumption that names that are present in
527527- the DNS are valid. That assumption is, however, a weak one because
528528- the presence of wildcards in the DNS might cause a string that is not
529529- actually registered in the DNS to be successfully looked up.
530530-531531-5.1. Label String Input
532532-533533- The user supplies a string in the local character set, for example,
534534- by typing it, clicking on it, or copying and pasting it from a
535535- resource identifier, e.g., a Uniform Resource Identifier (URI)
536536- [RFC3986] or an Internationalized Resource Identifier (IRI)
537537- [RFC3987], from which the domain name is extracted. Alternately,
538538- some process not directly involving the user may read the string from
539539- a file or obtain it in some other way. Processing in this step and
540540- the one specified in Section 5.2 are local matters, to be
541541- accomplished prior to actual invocation of IDNA.
542542-543543-5.2. Conversion to Unicode
544544-545545- The string is converted from the local character set into Unicode, if
546546- it is not already in Unicode. Depending on local needs, this
547547- conversion may involve mapping some characters into other characters
548548- as well as coding conversions. Those issues are discussed in the
549549- mapping-related sections (Sections 4.2, 4.4, 6, and 7.3) of the
550550- Rationale document [RFC5894] and in the separate Mapping document
551551- [IDNA2008-Mapping]. The result MUST be a Unicode string in NFC form.
552552-553553-5.3. A-label Input
554554-555555- If the input to this procedure appears to be an A-label (i.e., it
556556- starts in "xn--", interpreted case-insensitively), the lookup
557557- application MAY attempt to convert it to a U-label, first ensuring
558558- that the A-label is entirely in lowercase (converting it to lowercase
565565-566566-567567- if necessary), and apply the tests of Section 5.4 and the conversion
568568- of Section 5.5 to that form. If the label is converted to Unicode
569569- (i.e., to U-label form) using the Punycode decoding algorithm, then
570570- the processing specified in those two sections MUST be performed, and
571571- the label MUST be rejected if the resulting label is not identical to
572572- the original. See Section 8.1 of the Rationale document [RFC5894]
573573- for additional discussion on this topic.
574574-575575- Conversion from the A-label and testing that the result is a U-label
576576- SHOULD be performed if the domain name will later be presented to the
577577- user in native character form (this requires that the lookup
578578- application be IDNA-aware). If those steps are not performed, the
579579- lookup process SHOULD at least test to determine that the string is
580580- actually an A-label, examining it for the invalid formats specified
581581- in the Punycode decoding specification. Applications that are not
582582- IDNA-aware will obviously omit that testing; others MAY treat the
583583- string as opaque to avoid the additional processing at the expense of
584584- providing less protection and information to users.
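
The first step of Section 5.3, recognizing a putative A-label case-insensitively and lowercasing it before any further processing, can be sketched as below. The function names are hypothetical, and the sketch stops before the Punycode decoding and round-trip comparison that the section goes on to require.

```ocaml
(* Hypothetical helper: does the label start with "xn--", ignoring case? *)
let has_ace_prefix (s : string) : bool =
  String.length s >= 4
  && String.lowercase_ascii (String.sub s 0 4) = "xn--"

(* Hypothetical helper: return the lowercased label when it appears to be
   an A-label, or None when it should be handled as some other label. *)
let prepare_a_label_input (s : string) : string option =
  if has_ace_prefix s then Some (String.lowercase_ascii s) else None
```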
585585-586586-5.4. Validation and Character List Testing
587587-588588- As with the registration procedure described in Section 4, the
589589- Unicode string is checked to verify that all characters that appear
590590- in it are valid as input to IDNA lookup processing. As discussed
591591- above and in the Rationale document [RFC5894], the lookup check is
592592- more liberal than the registration one. Labels that have not been
593593- fully evaluated for conformance to the applicable rules are referred
594594- to as "putative" labels as discussed in Section 2.3.2.1 of the
595595- Definitions document [RFC5890]. Putative U-labels with any of the
596596- following characteristics MUST be rejected prior to DNS lookup:
597597-598598- o Labels that are not in NFC [Unicode-UAX15].
599599-600600- o Labels containing "--" (two consecutive hyphens) in the third and
601601- fourth character positions.
602602-603603- o Labels whose first character is a combining mark (see The Unicode
604604- Standard, Section 2.11 [Unicode]).
605605-606606- o Labels containing prohibited code points, i.e., those that are
607607- assigned to the "DISALLOWED" category of the Tables document
608608- [RFC5892].
609609-610610- o Labels containing code points that are identified in the Tables
611611- document as "CONTEXTJ", i.e., requiring exceptional contextual
612612- rule processing on lookup, but that do not conform to those rules.
613613- Note that this implies that a rule must be defined, not null: a
621621-622622-623623- character that requires a contextual rule but for which the rule
624624- is null is treated in this step as having failed to conform to the
625625- rule.
626626-627627- o Labels containing code points that are identified in the Tables
628628- document as "CONTEXTO", but for which no such rule appears in the
629629- table of rules. Applications resolving DNS names or carrying out
630630- equivalent operations are not required to test contextual rules
631631- for "CONTEXTO" characters, only to verify that a rule is defined
632632- (although they MAY make such tests to provide better protection or
633633- give better information to the user).
634634-635635- o Labels containing code points that are unassigned in the version
636636- of Unicode being used by the application, i.e., in the UNASSIGNED
637637- category of the Tables document.
638638-639639- This requirement means that the application must use a list of
640640- unassigned characters that is matched to the version of Unicode
641641- that is being used for the other requirements in this section. It
642642- is not required that the application know which version of Unicode
643643- is being used; that information might be part of the operating
644644- environment in which the application is running.
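
The mandatory lookup tests listed above can be sketched as a single pass over the label. Everything named here is an assumption made for illustration: the record of callbacks stands in for the NFC test, the combining-mark test, and the RFC 5892 tables, none of which this sketch implements, and the label is assumed to be an array of Unicode code points.

```ocaml
(* Derived property values from the Tables document. *)
type category = Pvalid | Contextj | Contexto | Disallowed | Unassigned

(* One constructor per rejection condition in the list above. *)
type rejection =
  | Not_nfc
  | Hyphens_in_positions_3_and_4
  | Leading_combining_mark
  | Disallowed_code_point of int
  | Contextj_rule_failed of int
  | Contexto_rule_missing of int
  | Unassigned_code_point of int

(* Hypothetical interface to data this sketch does not provide. *)
type tables = {
  is_nfc : int array -> bool;
  is_combining_mark : int -> bool;
  classify : int -> category;
  (* Label and index of the CONTEXTJ code point; false also covers a
     missing (null) rule. *)
  contextj_ok : int array -> int -> bool;
  contexto_rule_defined : int -> bool;
}

(* Returns Ok () when none of the mandatory rejection conditions apply. *)
let check_for_lookup (t : tables) (cps : int array)
    : (unit, rejection) result =
  let n = Array.length cps in
  if not (t.is_nfc cps) then Error Not_nfc
  else if n >= 4 && cps.(2) = 0x2D && cps.(3) = 0x2D then
    Error Hyphens_in_positions_3_and_4
  else if n > 0 && t.is_combining_mark cps.(0) then
    Error Leading_combining_mark
  else
    let rec go i =
      if i >= n then Ok ()
      else
        let c = cps.(i) in
        match t.classify c with
        | Disallowed -> Error (Disallowed_code_point c)
        | Unassigned -> Error (Unassigned_code_point c)
        | Contextj when not (t.contextj_ok cps i) ->
            Error (Contextj_rule_failed c)
        | Contexto when not (t.contexto_rule_defined c) ->
            Error (Contexto_rule_missing c)
        | Pvalid | Contextj | Contexto -> go (i + 1)
    in
    go 0
```

Passing the tables as a record keeps the check independent of the Unicode version in use, which matches the requirement that the unassigned list track the same version as the other tests.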
645645-646646- In addition, the application SHOULD apply the following test.
647647-648648- o Verification that the string is compliant with the requirements
649649- for right-to-left characters specified in the Bidi document
650650- [RFC5893].
651651-652652- This test may be omitted in special circumstances, such as when the
653653- lookup application knows that the conditions are enforced elsewhere,
654654- because an attempt to look up and resolve such strings will almost
655655- certainly lead to a DNS lookup failure except when wildcards are
656656- present in the zone. However, applying the test is likely to give
657657- much better information about the reason for a lookup failure --
658658- information that may be usefully passed to the user when that is
659659- feasible -- than DNS resolution failure information alone.
660660-661661- For all other strings, the lookup application MUST rely on the
662662- presence or absence of labels in the DNS to determine the validity of
663663- those labels and the validity of the characters they contain. If
664664- they are registered, they are presumed to be valid; if they are not,
665665- their possible validity is not relevant. While a lookup application
666666- may reasonably issue warnings about strings it believes may be
667667- problematic, applications that decline to process a string that
668668- conforms to the rules above (i.e., does not look it up in the DNS)
669669- are not in conformance with this protocol.
677677-678678-679679-5.5. Punycode Conversion
680680-681681- The string that has now been validated for lookup is converted to ACE
682682- form by applying the Punycode algorithm to the string and then adding
683683- the ACE prefix ("xn--").
684684-685685-5.6. DNS Name Resolution
686686-687687- The A-label resulting from the conversion in Section 5.5 or supplied
688688- directly (see Section 5.3) is combined with other labels as needed to
689689- form a fully-qualified domain name that is then looked up in the DNS,
690690- using normal DNS resolver procedures. The lookup can obviously
691691- either succeed (returning information) or fail.
692692-693693-6. Security Considerations
694694-695695- Security Considerations for this version of IDNA are described in the
696696- Definitions document [RFC5890], except for the special issues
697697- associated with right-to-left scripts and characters. The latter are
698698- discussed in the Bidi document [RFC5893].
699699-700700- In order to avoid intentional or accidental attacks from labels that
701701- might be confused with others, special problems in rendering, and so
702702- on, the IDNA model requires that registries exercise care and
703703- thoughtfulness about what labels they choose to permit. That issue
704704- is discussed in Section 4.3 of this document which, in turn, points
705705- to a somewhat more extensive discussion in the Rationale document
706706- [RFC5894].
707707-708708-7. IANA Considerations
709709-710710- IANA actions for this version of IDNA are specified in the Tables
711711- document [RFC5892] and discussed informally in the Rationale document
712712- [RFC5894]. The components of IDNA described in this document do not
713713- require any IANA actions.
714714-715715-8. Contributors
716716-717717- While the listed editor held the pen, the original versions of this
718718- document represent the joint work and conclusions of an ad hoc design
719719- team consisting of the editor and, in alphabetic order, Harald
720720- Alvestrand, Tina Dam, Patrik Faltstrom, and Cary Karp. This document
721721- draws significantly on the original version of IDNA [RFC3490] both
722722- conceptually and for specific text. This second-generation version
723723- would not have been possible without the work that went into that
724724- first version and especially the contributions of its authors Patrik
725725- Faltstrom, Paul Hoffman, and Adam Costello. While Faltstrom was
733733-734734-735735- actively involved in the creation of this version, Hoffman and
736736- Costello were not and should not be held responsible for any errors
737737- or omissions.
738738-739739-9. Acknowledgments
740740-741741- This revision to IDNA would have been impossible without the
742742- accumulated experience since RFC 3490 was published and resulting
743743- comments and complaints of many people in the IETF, ICANN, and other
744744- communities (too many people to list here). Nor would it have been
745745- possible without RFC 3490 itself and the efforts of the Working Group
746746- that defined it. Those people whose contributions are acknowledged
747747- in RFC 3490, RFC 4690 [RFC4690], and the Rationale document [RFC5894]
748748- were particularly important.
749749-750750- Specific textual changes were incorporated into this document after
751751- suggestions from the other contributors, Stephane Bortzmeyer, Vint
752752- Cerf, Lisa Dusseault, Paul Hoffman, Kent Karlsson, James Mitchell,
753753- Erik van der Poel, Marcos Sanz, Andrew Sullivan, Wil Tan, Ken
754754- Whistler, Chris Wright, and other WG participants and reviewers
755755- including Martin Duerst, James Mitchell, Subramanian Moonesamy, Peter
756756- Saint-Andre, Margaret Wasserman, and Dan Winship who caught specific
757757- errors and recommended corrections. Special thanks are due to Paul
758758- Hoffman for permission to extract material to form the basis for
759759- Appendix A from a draft document that he prepared.
760760-761761-10. References
762762-763763-10.1. Normative References
764764-765765- [RFC1034] Mockapetris, P., "Domain names - concepts and
766766- facilities", STD 13, RFC 1034, November 1987.
767767-768768- [RFC1035] Mockapetris, P., "Domain names - implementation and
769769- specification", STD 13, RFC 1035, November 1987.
770770-771771- [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
772772- Requirement Levels", BCP 14, RFC 2119, March 1997.
773773-774774- [RFC3492] Costello, A., "Punycode: A Bootstring encoding of
775775- Unicode for Internationalized Domain Names in
776776- Applications (IDNA)", RFC 3492, March 2003.
777777-778778- [RFC5890] Klensin, J., "Internationalized Domain Names for
779779- Applications (IDNA): Definitions and Document
780780- Framework", RFC 5890, August 2010.
789789-790790-791791- [RFC5892] Faltstrom, P., Ed., "The Unicode Code Points and
792792- Internationalized Domain Names for Applications (IDNA)",
793793- RFC 5892, August 2010.
794794-795795- [RFC5893] Alvestrand, H., Ed. and C. Karp, "Right-to-Left Scripts
796796- for Internationalized Domain Names for Applications
797797- (IDNA)", RFC 5893, August 2010.
798798-799799- [Unicode-UAX15]
800800- The Unicode Consortium, "Unicode Standard Annex #15:
801801- Unicode Normalization Forms", September 2009,
802802- <http://www.unicode.org/reports/tr15/>.
803803-804804-10.2. Informative References
805805-806806- [ASCII] American National Standards Institute (formerly United
807807- States of America Standards Institute), "USA Code for
808808- Information Interchange", ANSI X3.4-1968, 1968. ANSI
809809- X3.4-1968 has been replaced by newer versions with
810810- slight modifications, but the 1968 version remains
811811- definitive for the Internet.
812812-813813- [IDNA2008-Mapping]
814814- Resnick, P. and P. Hoffman, "Mapping Characters in
815815- Internationalized Domain Names for Applications (IDNA)",
816816- Work in Progress, April 2010.
817817-818818- [RFC2671] Vixie, P., "Extension Mechanisms for DNS (EDNS0)",
819819- RFC 2671, August 1999.
820820-821821- [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
822822- "Internationalizing Domain Names in Applications
823823- (IDNA)", RFC 3490, March 2003.
824824-825825- [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
826826- Profile for Internationalized Domain Names (IDN)",
827827- RFC 3491, March 2003.
828828-829829- [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
830830- Resource Identifier (URI): Generic Syntax", STD 66,
831831- RFC 3986, January 2005.
832832-833833- [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
834834- Identifiers (IRIs)", RFC 3987, January 2005.
835835-836836- [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review
837837- and Recommendations for Internationalized Domain Names
838838- (IDNs)", RFC 4690, September 2006.
845845-846846-847847- [RFC4952] Klensin, J. and Y. Ko, "Overview and Framework for
848848- Internationalized Email", RFC 4952, July 2007.
849849-850850- [RFC5894] Klensin, J., "Internationalized Domain Names for
851851- Applications (IDNA): Background, Explanation, and
852852- Rationale", RFC 5894, August 2010.
853853-854854- [Unicode] The Unicode Consortium, "The Unicode Standard, Version
855855- 5.0", 2007. Boston, MA, USA: Addison-Wesley. ISBN
856856- 0-321-48091-0. This printed reference has now been
857857- updated online to reflect additional code points. For
858858- code points, the reference at the time this document was
859859- published is to Unicode 5.2.
901901-902902-903903-Appendix A. Summary of Major Changes from IDNA2003
904904-905905- 1. Update base character set from Unicode 3.2 to Unicode version
906906- agnostic.
907907-908908- 2. Separate the definitions for the "registration" and "lookup"
909909- activities.
910910-911911- 3. Disallow symbol and punctuation characters except where special
912912- exceptions are necessary.
913913-914914- 4. Remove the mapping and normalization steps from the protocol and
915915- have them, instead, done by the applications themselves,
916916- possibly in a local fashion, before invoking the protocol.
917917-918918- 5. Change the way that the protocol specifies which characters are
919919- allowed in labels from "humans decide what the table of code
920920- points contains" to "decisions about code points are based on
921921- Unicode properties plus a small exclusion list created by
922922- humans".
923923-924924- 6. Introduce the new concept of characters that can be used only in
925925- specific contexts.
926926-927927- 7. Allow typical words and names in languages such as Dhivehi and
928928- Yiddish to be expressed.
929929-930930- 8. Make bidirectional domain names (delimited strings of labels,
931931- not just labels standing on their own) display in a less
932932- surprising fashion, whether they appear in obvious domain name
933933- contexts or as part of running text in paragraphs.
934934-935935- 9. Remove the dot separator from the mandatory part of the
936936- protocol.
937937-938938- 10. Make some currently valid labels that are not actually IDNA
939939- labels invalid.
940940-941941-Author's Address
942942-943943- John C Klensin
944944- 1770 Massachusetts Ave, Ste 322
945945- Cambridge, MA 02140
946946- USA
947947-948948- Phone: +1 617 245 1457
949949- EMail: john+ietf@jck.com
955955-
-3923
ocaml-punycode/spec/rfc5892.txt
···11-22-33-44-55-66-77-Internet Engineering Task Force (IETF) P. Faltstrom, Ed.
88-Request for Comments: 5892 Cisco
99-Category: Standards Track August 2010
1010-ISSN: 2070-1721
1111-1212-1313- The Unicode Code Points and
1414- Internationalized Domain Names for Applications (IDNA)
1515-1616-Abstract
1717-1818- This document specifies rules for deciding whether a code point,
1919- considered in isolation or in context, is a candidate for inclusion
2020- in an Internationalized Domain Name (IDN).
2121-2222- It is part of the specification of Internationalizing Domain Names in
2323- Applications 2008 (IDNA2008).
2424-2525-Status of This Memo
2626-2727- This is an Internet Standards Track document.
2828-2929- This document is a product of the Internet Engineering Task Force
3030- (IETF). It represents the consensus of the IETF community. It has
3131- received public review and has been approved for publication by the
3232- Internet Engineering Steering Group (IESG). Further information on
3333- Internet Standards is available in Section 2 of RFC 5741.
3434-3535- Information about the current status of this document, any errata,
3636- and how to provide feedback on it may be obtained at
3737- http://www.rfc-editor.org/info/rfc5892.
3838-3939-Copyright Notice
4040-4141- Copyright (c) 2010 IETF Trust and the persons identified as the
4242- document authors. All rights reserved.
4343-4444- This document is subject to BCP 78 and the IETF Trust's Legal
4545- Provisions Relating to IETF Documents
4646- (http://trustee.ietf.org/license-info) in effect on the date of
4747- publication of this document. Please review these documents
4848- carefully, as they describe your rights and restrictions with respect
4949- to this document. Code Components extracted from this document must
5050- include Simplified BSD License text as described in Section 4.e of
5151- the Trust Legal Provisions and are provided without warranty as
5252- described in the Simplified BSD License.
5353-5454-5555-5656-5757-5858-Faltstrom Standards Track [Page 1]
5959-6060-RFC 5892 IDNA Code Points August 2010
6161-6262-6363-Table of Contents
6464-6565- 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
6666- 2. Category Definitions Used to Calculate Derived Property
6767- Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6868- 2.1. LetterDigits (A) . . . . . . . . . . . . . . . . . . . . . 5
6969- 2.2. Unstable (B) . . . . . . . . . . . . . . . . . . . . . . . 6
7070- 2.3. IgnorableProperties (C) . . . . . . . . . . . . . . . . . 6
7171- 2.4. IgnorableBlocks (D) . . . . . . . . . . . . . . . . . . . 7
7272- 2.5. LDH (E) . . . . . . . . . . . . . . . . . . . . . . . . . 7
7373- 2.6. Exceptions (F) . . . . . . . . . . . . . . . . . . . . . . 7
7474- 2.7. BackwardCompatible (G) . . . . . . . . . . . . . . . . . . 9
7575- 2.8. JoinControl (H) . . . . . . . . . . . . . . . . . . . . . 9
7676- 2.9. OldHangulJamo (I) . . . . . . . . . . . . . . . . . . . . 9
7777- 2.10. Unassigned (J) . . . . . . . . . . . . . . . . . . . . . . 9
7878- 3. Calculation of the Derived Property . . . . . . . . . . . . . 10
7979- 4. Code Points . . . . . . . . . . . . . . . . . . . . . . . . . 10
8080- 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11
8181- 5.1. IDNA-Derived Property Value Registry . . . . . . . . . . . 11
8282- 5.2. IDNA Context Registry . . . . . . . . . . . . . . . . . . 11
8383- 5.2.1. Template for Context Registry . . . . . . . . . . . . 11
8484- 6. Security Considerations . . . . . . . . . . . . . . . . . . . 12
8585- 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12
8686- Appendix A. Contextual Rules Registry . . . . . . . . . . . . . 13
8787- Appendix A.1. ZERO WIDTH NON-JOINER . . . . . . . . . . . . . . . 15
8888- Appendix A.2. ZERO WIDTH JOINER . . . . . . . . . . . . . . . . . 16
8989- Appendix A.3. MIDDLE DOT . . . . . . . . . . . . . . . . . . . . . 16
9090- Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA) . . . . . . . . . 17
9191- Appendix A.5. HEBREW PUNCTUATION GERESH . . . . . . . . . . . . . 17
9292- Appendix A.6. HEBREW PUNCTUATION GERSHAYIM . . . . . . . . . . . . 18
9393- Appendix A.7. KATAKANA MIDDLE DOT . . . . . . . . . . . . . . . . 18
9494- Appendix A.8. ARABIC-INDIC DIGITS . . . . . . . . . . . . . . . . 19
9595- Appendix A.9. EXTENDED ARABIC-INDIC DIGITS . . . . . . . . . . . . 19
9696- Appendix B. Code Points 0x0000 - 0x10FFFF . . . . . . . . . . . 20
9797- Appendix B.1. Code Points in Unicode Character Database (UCD)
9898- Format . . . . . . . . . . . . . . . . . . . . . . . 20
9999- 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 69
100100- 8.1. Normative References . . . . . . . . . . . . . . . . . . . 69
101101- 8.2. Informative References . . . . . . . . . . . . . . . . . . 69
117117-118118-119119-1. Introduction
120120-121121- RFC 4690 [RFC4690] suggests an inclusion-based approach for selecting
122122- the code points from The Unicode Standard [Unicode52] that should be
123123- included in the list of code points that may be used in
124124- Internationalized Domain Names.
125125-126126- Specifically, RFC 4690 [RFC4690] says the following:
127127-128128- The IAB has concluded that there is a consensus within the broader
129129- community that lists of code points should be specified by the use
130130- of an inclusion-based mechanism (i.e., identifying the characters
131131- that are permitted), rather than by excluding a small number of
132132- characters from the total Unicode set as Stringprep [RFC3454] and
133133- Nameprep [RFC3491] do today. That conclusion should be reviewed
134134- by the IETF community and action taken as appropriate.
135135-136136- This document reviews and classifies the collections of code points
137137- in the Unicode character set by examining various properties of the
138138- code points. It then defines an algorithm for determining a derived
139139- property value. It specifies a procedure, and not a table, of code
140140- points so that the algorithm can be used to determine code point sets
141141- independent of the version of Unicode that is in use.
142142-143143- This document is not intended to specify precisely how these property
144144- values are to be applied in IDN labels. That information appears in
145145- the Protocol document [RFC5891], but it is important to understand
146146- that the assignment of a value of this property to a particular
147147- character is not sufficient to determine whether it can be used in a
148148- given label. In particular, some combinations of allowed code points
149149- are not advisable for use in IDNs due to rules specific to a script
150150- or class of characters. The requirement for such rules is linked to
151151- the operations in the Protocol document and especially to the
152152- characters designated as requiring contextual rules.
153153-154154- The value of the property is to be interpreted as follows.
155155-156156- o PROTOCOL VALID: Those that are allowed to be used in IDNs. Code
157157- points with this property value are permitted for general use in
158158- IDNs. However, the fact that a label consists only of code points that
159159- have this property value does not imply that the label can be used
160160- in DNS. See the Protocol document for algorithms to make
161161- decisions about labels in domain names. The abbreviated term
162162- PVALID is used to refer to this value in the rest of this
163163- document.
173173-174174-175175- o CONTEXTUAL RULE REQUIRED: Some characteristics of the character,
176176- such as it being invisible in certain contexts or problematic in
177177- others, require that it not be used in labels unless specific
178178- other characters or properties are present. The abbreviated term
179179- CONTEXT is used to refer to this value in the rest of this
180180- document. There are two subdivisions of CONTEXTUAL RULE REQUIRED,
181181- one for Join_controls (called CONTEXTJ) and one for other characters
182182- (called CONTEXTO). These are discussed in more detail below and
183183- in the Protocol document.
184184-185185- o DISALLOWED: Those that should clearly not be included in IDNs.
186186- Code points with this property value are not permitted in IDNs.
187187-188188- o UNASSIGNED: Those code points that are not designated (i.e., are
189189- unassigned) in the Unicode Standard.
190190-191191- The mechanisms described here allow determination of the value of the
192192- property for future versions of Unicode (including characters added
193193- after Unicode 5.2). Changes in Unicode properties that do not affect
194194- the outcome of this process do not affect IDN. For example, a
195195- character can have its Unicode General_Category value (see
196196- [Unicode52]) change from So to Sm or from Lo to Ll, without affecting
197197- the algorithm results. Moreover, even if such changes were the
198198- result, the BackwardCompatible list (Section 2.7) can be adjusted to
199199- ensure the stability of the results.
200200-201201- Some code points need to be allowed in exceptional circumstances but
202202- should be excluded in all other cases; these rules are also described
203203- in other documents. The most notable of these are the Join Control
204204- characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
205205- NON-JOINER. Both of them have the derived property value CONTEXTJ.
206206- A character with the derived property value CONTEXTJ or CONTEXTO
207207- (CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate
208208- rule has been established and the context of the character is
209209- consistent with that rule. It is invalid to either register a string
210210- containing these characters or even to look one up unless such a
211211- contextual rule is found and satisfied. Please see Appendix A, "The
212212- Contextual Rules Registry", for more information.
213213-214214- This document is part of a series that, together, constitute a
215215- proposal for updating the IDNA standards to resolve issues uncovered
216216- in recent years, cover a broader range of scripts, and provide for
217217- migration to newer versions of Unicode. See the Rationale document
218218- [RFC5894] for a broader discussion.
219219-220220- The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
221221- "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
222222- document are to be interpreted as described in RFC 2119 [RFC2119].
229229-230230-231231-2. Category Definitions Used to Calculate Derived Property Value
232232-233233- The derived property obtains its value based on a two-step procedure.
234234- First, characters are placed in one or more character categories
235235- based on either core properties defined by the Unicode Standard or by
236236- treating the code point as an exception and addressing the code point
237237- by its code point value. These categories are not mutually
238238- exclusive.
239239-240240- In the second step, set operations are used with these categories to
241241- determine the values for an IDN-specific property. Those operations
242242- are specified in Section 3.
243243-244244- Unicode property names and property value names may have short
245245- abbreviations, such as gc for the General_Category property, and Ll
246246- for the Lowercase_Letter property value of the gc property.
247247-248248- In the following specification of categories, the operation that
249249- returns the value of a particular Unicode character property for a
250250- code point is designated by using the formal name of that property
251251- (from PropertyAliases.txt) followed by '(cp)'. For example, the
252252- value of the General_Category property for a code point is indicated
253253- by General_Category(cp).
254254-255255-2.1. LetterDigits (A)
256256-257257- A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}
258258-259259- These rules identify characters commonly used in mnemonics and often
260260- informally described as "language characters". In general, only code
261261- points assigned to this category are suitable for use in IDN.
262262-263263- For more information, see Section 4.5 of The Unicode Standard
264264- [Unicode].
265265-266266- The categories used in this rule are:
267267-268268- o Ll - Lowercase_Letter
269269-270270- o Lu - Uppercase_Letter
271271-272272- o Lo - Other_Letter
273273-274274- o Nd - Decimal_Number
275275-276276- o Lm - Modifier_Letter
285285-286286-287287- o Mn - Nonspacing_Mark
288288-289289- o Mc - Spacing_Mark
290290-291291-2.2. Unstable (B)
292292-293293- B: toNFKC(toCaseFold(toNFKC(cp))) != cp
294294-295295- This category is used to group the characters that are not stable
296296- under Normalization Form K (NFKC) and case folding. In general,
297297- these code points are not suitable for use for IDN.
298298-299299- The toCaseFold() operation is defined in Section 3.13 of The Unicode
300300- Standard [Unicode].
301301-302302- The toNFKC() operation returns the code point in normalization form
303303- KC. For more information, see Section 5 of Unicode Standard Annex
304304- #15 [TR15].
305305-306306- Note that NFKC is used here, whereas Normalization Form C
307307- (NFC) is used in the "IDNA Protocol" document [RFC5891].
308308-309309-2.3. IgnorableProperties (C)
310310-311311- C: Default_Ignorable_Code_Point(cp) = True or
312312- White_Space(cp) = True or
313313- Noncharacter_Code_Point(cp) = True
314314-315315- This category is used to group code points that are not recommended
316316- for use in identifiers. In general, these code points are not
317317- suitable for use in an IDN.
318318-319319- The definition for Default_Ignorable_Code_Point can be found in
320320- DerivedCoreProperties.txt [DerivedCoreProperties] and is at the time
321321- of Unicode 5.2:
322322-323323- Other_Default_Ignorable_Code_Point + Cf (Format characters)
324324- + Variation_Selector - White_Space - FFF9..FFFB (Annotation
325325- Characters) - 0600..0603, 06DD, 070F (exceptional Cf characters
326326- that should be visible)
341341-342342-343343-2.4. IgnorableBlocks (D)
344344-345345- D: Block(cp) is in {Combining Diacritical Marks for Symbols,
346346- Musical Symbols, Ancient Greek Musical Notation}
347347-348348- This category is used to identify code points that are not useful in
349349- mnemonics or that are otherwise impractical for IDN use. In general,
350350- these code points are not suitable for use for IDN.
351351-352352- The definition of blocks can be found in Blocks.txt [BlockNames].
353353-354354-2.5. LDH (E)
355355-356356- E: cp is in {002D, 0030..0039, 0061..007A}
357357-358358- This category is used in the second step to preserve the traditional
359359- "hostname" (LDH -- as described in the Definitions document
360360- [RFC5890]) characters ('-', 0-9, and a-z). In general, these code
361361- points are suitable for use for IDN. Note that there are other rules
362362- regarding the code point U+002D HYPHEN-MINUS that are specified in
363363- the IDNA Protocol Specification [RFC5891].
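
As a non-normative illustration, the LDH membership test above could be sketched in OCaml as follows. The function name is_ldh is an assumption of this sketch and does not appear in this specification; code points are represented as plain integers.

```ocaml
(* Category E (LDH): U+002D, U+0030..U+0039, U+0061..U+007A.
   [is_ldh] is a hypothetical helper name used only for illustration. *)
let is_ldh (cp : int) : bool =
  cp = 0x002D                          (* HYPHEN-MINUS *)
  || (cp >= 0x0030 && cp <= 0x0039)    (* digits 0-9 *)
  || (cp >= 0x0061 && cp <= 0x007A)    (* lowercase a-z *)
```

Note that the uppercase letters A-Z are deliberately outside this set, so a call such as is_ldh (Char.code 'A') returns false under this sketch.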
364364-365365-2.6. Exceptions (F)
366366-367367- F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
368368- 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
369369- 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
370370- 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 0F0B, 3007,
371371- 302E, 302F, 3031, 3032, 3033, 3034, 3035, 303B,
372372- 30FB}
373373-374374- This category explicitly lists code points for which the category
375375- cannot be assigned using only the core property values that exist in
376376- the Unicode standard. The values are according to the table below:
377377-378378- PVALID -- Would otherwise have been DISALLOWED
379379-380380- 00DF; PVALID # LATIN SMALL LETTER SHARP S
381381- 03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA
382382- 06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
383383- 06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
384384- 0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG
385385- 3007; PVALID # IDEOGRAPHIC NUMBER ZERO
397397-398398-399399- CONTEXTO -- Would otherwise have been DISALLOWED
400400-401401- 00B7; CONTEXTO # MIDDLE DOT
402402- 0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
403403- 05F3; CONTEXTO # HEBREW PUNCTUATION GERESH
404404- 05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM
405405- 30FB; CONTEXTO # KATAKANA MIDDLE DOT
406406-407407- CONTEXTO -- Would otherwise have been PVALID
408408-409409- 0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO
410410- 0661; CONTEXTO # ARABIC-INDIC DIGIT ONE
411411- 0662; CONTEXTO # ARABIC-INDIC DIGIT TWO
412412- 0663; CONTEXTO # ARABIC-INDIC DIGIT THREE
413413- 0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR
414414- 0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE
415415- 0666; CONTEXTO # ARABIC-INDIC DIGIT SIX
416416- 0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN
417417- 0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT
418418- 0669; CONTEXTO # ARABIC-INDIC DIGIT NINE
419419- 06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO
420420- 06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE
421421- 06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO
422422- 06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE
423423- 06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR
424424- 06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE
425425- 06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX
426426- 06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN
427427- 06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT
428428- 06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE
429429-430430- DISALLOWED -- Would otherwise have been PVALID
431431-432432- 0640; DISALLOWED # ARABIC TATWEEL
433433- 07FA; DISALLOWED # NKO LAJANYALAN
434434- 302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
435435- 302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
436436- 3031; DISALLOWED # VERTICAL KANA REPEAT MARK
437437- 3032; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
438438- 3033; DISALLOWED # VERTICAL KANA REPEAT MARK UPPER HALF
439439- 3034; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA
440440- 3035; DISALLOWED # VERTICAL KANA REPEAT MARK LOWER HALF
441441- 303B; DISALLOWED # VERTICAL IDEOGRAPHIC ITERATION MARK
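
For illustration only, the Exceptions table above can be encoded as a lookup that returns the listed value for a code point in F and nothing otherwise. The type name derived, its constructors, and the function name exceptions are assumptions of this sketch, not names defined by this document.

```ocaml
(* Derived property values from Section 1; constructor names are
   hypothetical spellings chosen for this sketch. *)
type derived = Pvalid | Contextj | Contexto | Disallowed | Unassigned

(* Exceptions(cp): Some value when cp is in category F, None otherwise. *)
let exceptions (cp : int) : derived option =
  match cp with
  | 0x00DF | 0x03C2 | 0x06FD | 0x06FE | 0x0F0B | 0x3007 -> Some Pvalid
  | 0x00B7 | 0x0375 | 0x05F3 | 0x05F4 | 0x30FB -> Some Contexto
  (* ARABIC-INDIC and EXTENDED ARABIC-INDIC digits *)
  | _ when (cp >= 0x0660 && cp <= 0x0669)
        || (cp >= 0x06F0 && cp <= 0x06F9) -> Some Contexto
  | 0x0640 | 0x07FA | 0x302E | 0x302F | 0x303B -> Some Disallowed
  (* 3031..3035: vertical kana repeat marks *)
  | _ when cp >= 0x3031 && cp <= 0x3035 -> Some Disallowed
  | _ -> None
```

Returning an option keeps "not in the table" distinct from any of the five property values, which is convenient when the table is consulted as the first step of the calculation in Section 3.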
453453-454454-455455-2.7. BackwardCompatible (G)
456456-457457- G: cp is in {}
458458-459459- This category includes the code points whose property values in
460460- versions of Unicode after 5.2 have changed in such a way that the
461461- derived property value would no longer be PVALID or DISALLOWED. If
462462- changes are made in future versions of Unicode so that a code point's
463463- derived property value might change from PVALID or DISALLOWED, then
464464- this table can be updated to hold special exception values so that
465465- the property values for code points stay stable.
466466-467467-2.8. JoinControl (H)
468468-469469- H: Join_Control(cp) = True
470470-471471- This category consists of Join Control characters (i.e., they are not
472472- in LetterDigits (Section 2.1) but are still required in IDN labels
473473- under some circumstances).
474474-475475-2.9. OldHangulJamo (I)
476476-477477- I: Hangul_Syllable_Type(cp) is in {L, V, T}
478478-479479- This category consists of all conjoining Hangul Jamo (Leading Jamo,
480480- Vowel Jamo, and Trailing Jamo).
481481-482482- Elimination of conjoining Hangul Jamo from the set of PVALID
483483- characters results in restricting the set of Korean PVALID characters
484484- just to preformed, modern Hangul syllable characters. Old Hangul
485485- syllables, which must be spelled with sequences of conjoining Hangul
486486- Jamo, are not PVALID for IDNs.
487487-488488-2.10. Unassigned (J)
489489-490490- J: General_Category(cp) is in {Cn} and
491491- Noncharacter_Code_Point(cp) = False
492492-493493- This category consists of code points in the Unicode character set
494494- that are not (yet) assigned. It should be noted that Unicode
495495- distinguishes between "unassigned code points" and "unassigned
496496- characters". The unassigned code points are all but (Cn -
497497- Noncharacters), while the unassigned *characters* are all but (Cn +
498498- Cs).
509509-510510-511511-3. Calculation of the Derived Property
512512-513513- As described above (Section 1) and in more detail in the IDNA
514514- Protocol document [RFC5891], possible values of the IDN property are:
515515-516516- o PVALID
517517-518518- o CONTEXTJ
519519-520520- o CONTEXTO
521521-522522- o DISALLOWED
523523-524524- o UNASSIGNED
525525-526526- The algorithm to calculate the value of the derived property is as
527527- follows. If the name of a rule (such as Exception) is used, that
528528- implies the set of code points that the rule defines, while the same
529529- name as a function call (such as Exception(cp)) implies the value cp
530530- has in the Exceptions table.
531531-532532- If .cp. .in. Exceptions Then Exceptions(cp);
533533- Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
534534- Else If .cp. .in. Unassigned Then UNASSIGNED;
535535- Else If .cp. .in. LDH Then PVALID;
536536- Else If .cp. .in. JoinControl Then CONTEXTJ;
537537- Else If .cp. .in. Unstable Then DISALLOWED;
538538- Else If .cp. .in. IgnorableProperties Then DISALLOWED;
539539- Else If .cp. .in. IgnorableBlocks Then DISALLOWED;
540540- Else If .cp. .in. OldHangulJamo Then DISALLOWED;
541541- Else If .cp. .in. LetterDigits Then PVALID;
542542- Else DISALLOWED;
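
The ordered rule list above can be read as a first-match cascade. The following OCaml sketch is one non-normative way to express it. The record type categories, its field names, the function derive, and the toy value are all assumptions of this sketch; a real implementation would back each test with Unicode Character Database lookups, whereas the toy stubs here cover only a handful of code points so that the example is self-contained.

```ocaml
type derived = Pvalid | Contextj | Contexto | Disallowed | Unassigned

(* One membership test per category of Section 2 (hypothetical names). *)
type categories = {
  exceptions : int -> derived option;           (* F *)
  backward_compatible : int -> derived option;  (* G *)
  unassigned : int -> bool;                     (* J *)
  ldh : int -> bool;                            (* E *)
  join_control : int -> bool;                   (* H *)
  unstable : int -> bool;                       (* B *)
  ignorable_properties : int -> bool;           (* C *)
  ignorable_blocks : int -> bool;               (* D *)
  old_hangul_jamo : int -> bool;                (* I *)
  letter_digits : int -> bool;                  (* A *)
}

(* The rules are tried in the order given in Section 3; the first
   rule that matches decides the value. *)
let derive (c : categories) (cp : int) : derived =
  match c.exceptions cp with
  | Some v -> v
  | None ->
    match c.backward_compatible cp with
    | Some v -> v
    | None ->
      if c.unassigned cp then Unassigned
      else if c.ldh cp then Pvalid
      else if c.join_control cp then Contextj
      else if c.unstable cp then Disallowed
      else if c.ignorable_properties cp then Disallowed
      else if c.ignorable_blocks cp then Disallowed
      else if c.old_hangul_jamo cp then Disallowed
      else if c.letter_digits cp then Pvalid
      else Disallowed

let in_range lo hi cp = cp >= lo && cp <= hi

(* Toy stand-ins, NOT real Unicode data: just enough to exercise
   the cascade on a few ASCII code points, U+00DF and U+200C/U+200D. *)
let toy : categories = {
  exceptions = (fun cp -> if cp = 0x00DF then Some Pvalid else None);
  backward_compatible = (fun _ -> None);
  unassigned = (fun _ -> false);
  ldh = (fun cp ->
    cp = 0x2D || in_range 0x30 0x39 cp || in_range 0x61 0x7A cp);
  join_control = (fun cp -> cp = 0x200C || cp = 0x200D);
  unstable = in_range 0x41 0x5A;  (* A-Z change under case folding *)
  ignorable_properties = (fun cp -> cp = 0x20);
  ignorable_blocks = (fun _ -> false);
  old_hangul_jamo = (fun _ -> false);
  letter_digits = (fun cp ->
    in_range 0x30 0x39 cp || in_range 0x41 0x5A cp || in_range 0x61 0x7A cp);
}
```

With these stubs, derive toy (Char.code 'a') is decided by the LDH rule, while derive toy (Char.code 'A') is decided by the Unstable rule before LetterDigits is ever consulted, which shows why the order of the rules matters.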
543543-544544-4. Code Points
545545-546546- The categories and rules defined in Sections 2 and 3 apply to all
547547- Unicode code points. The table in Appendix B shows, for illustrative
548548- purposes, the consequences of the categories and classification
549549- rules, and the resulting property values.
550550-551551- The list of code points that can be found in Appendix B is
552552- non-normative. Sections 2 and 3 are normative.
565565-566566-567567-5. IANA Considerations
568568-569569-5.1. IDNA-Derived Property Value Registry
570570-571571- IANA has created a registry with the derived properties for the
572572- versions of Unicode released after (and including) version 5.2. The
573573- derived property value is to be calculated in cooperation with a
574574- designated expert [RFC5226] according to the specifications in
575575- Sections 2 and 3 and not by copying the non-normative table found in
576576- Appendix B.
577577-578578- If non-backward-compatible changes or other problems arise during the
579579- creation or designated expert review of the table of derived property
580580- values, they should be flagged for the IESG. Changes to the rules
581581- (as specified in Sections 2 and 3), including BackwardCompatible
582582- (Section 2.7) (a set that is empty at the release of this document),
583583- require IETF Review, as described in RFC 5226 [RFC5226].
584584-585585-5.2. IDNA Context Registry
586586-587587- For characters that are defined in the IDNA derived property value
588588- registry (Section 5.1) as CONTEXTO or CONTEXTJ and that therefore
589589- require a contextual rule, IANA has created and now maintains a list
590590- of approved contextual rules. Additions or changes to these rules
591591- require IETF Review, as described in [RFC5226].
592592-593593- Appendix A contains further discussion and a table from which that
594594- registry can be initialized.
595595-596596-5.2.1. Template for Context Registry
597597-598598- The following information is to be given when a new rule is created.
599599-600600- Name: Unique name of the rule
601601-602602- Code point: Rule that should be applied when this code point
603603- exists in the label
604604-605605- Overview: Description in plain English on what the rule verifies
606606-607607- Lookup: Should the rule be applied at time of lookup?
608608-609609- Rule Set: The set of rules, with a reference to the defining
610610- document.
621621-622622-623623-6. Security Considerations
624624-625625- Security Considerations for this version of IDNA, except for the
626626- special issues associated with right-to-left scripts and characters,
627627- are described in the Definitions document [RFC5890]. Specific issues
628628- for labels containing characters associated with scripts written
629629- right to left appear in the Bidi document [RFC5893].
630630-631631-7. Acknowledgements
632632-633633- This document would not have been possible to produce without input
634634- from many people. The main contributors are (in alphabetical order)
635635- Harald Alvestrand, Vint Cerf, Tina Dam, Mark Davis, Gihan Dias,
636636- Mouhammet Diop, Michael Everson, Asmus Freytag, Debbie Garside, Paul
637637- Hoffman, Kent Karlsson, Cary Karp, Jaeyoun Kim, John Klensin, Olaf
638638- Kolkman, Gervase Markham, Ram Mohan, Lisa Moore, Yngve Pettersen,
639639- Erik van der Poel, Hualin Qian, Rick Reed, Pete Resnick, Lakmal
640640- Silva, Michel Suignard, Andrew Sullivan, Wil Tan, Kenneth Whistler,
641641- Chris Wright, and Yoshiro Yoneya.
677677-678678-679679-Appendix A. Contextual Rules Registry
680680-681681- As discussed in Section 5.2 and in the IANA Considerations section of
682682- the Rationale document [RFC5894], IANA maintains a registry of rules
683683- that define the contexts in which particular PROTOCOL-VALID characters,
684684- that is, characters associated with a requirement for Contextual Information, are
685685- permitted. These rules are expressed as tests on the label in which
686686- the characters appear (all, or any part of, the label may be tested).
687687-688688- The grammatical rules are expressed in pseudo-code. The conventions
689689- used for that pseudo-code are explained here.
690690-691691- Each rule is constructed as a Boolean expression that evaluates to
692692- either True or False. A simple "True;" or "False;" rule sets the
693693- default result value for the rule set. Subsequent conditional rules
694694- that evaluate to True or False may re-set the result value.
695695-696696- A special value "Undefined" is used to deal with any error
697697- conditions, such as an attempt to test a character before the start
698698- of a label or after the end of a label. If any term of a rule
699699- evaluates to Undefined, further evaluation of the rule immediately
700700- terminates, as the result value of the rule will itself be Undefined.
701701-702702- cp represents the code point to be tested.
703703-704704- FirstChar is a special term that denotes the first code point in a
705705- label.
706706-707707- LastChar is a special term that denotes the last code point in a
708708- label.
709709-710710- .eq. represents the equality relation.
711711-712712- A .eq. B evaluates to True if A equals B.
713713-714714- .is. represents checking the position in a label.
715715-716716- A .is. B evaluates to True if A and B have same position in
717717- the same label.
718718-719719- .ne. represents the non-equality relation.
720720-721721- A .ne. B evaluates to True if A is not equal to B.
722722-723723- .in. represents the set inclusion relation.
724724-725725- A .in. B evaluates to True if A is a member of the set B.
733733-734734-735735- A functional notation, Function_Name(cp), is used to express either
736736- string positions within a label, Boolean character property tests of
737737- a code point, or a regular expression match. When such function
738738- names refer to Boolean character property tests, the function names
739739- use the exact Unicode character property name for the property in
740740- question, and "cp" is evaluated as the Unicode value of the code
741741- point to be tested, rather than as its position in the label. When
742742- such function names refer to string positions within a label, "cp" is
743743- evaluated as its position in the label.
744744-745745- RegExpMatch(X) takes as its parameter X a schematic regular
746746- expression consisting of a mix of Unicode character property values
747747- and literal Unicode code points.
748748-749749- Script(cp) returns the value of the Unicode Script property, as
750750- defined in Scripts.txt in the Unicode Character Database.
751751-752752- Canonical_Combining_Class(cp) returns the value of the Unicode
753753- Canonical_Combining_Class property, as defined in UnicodeData.txt in
754754- the Unicode Character Database.
755755-756756- Before(cp) returns the code point of the character immediately
757757- preceding cp in logical order in the string representing the label.
758758- Before(FirstChar) evaluates to Undefined.
759759-760760- After(cp) returns the code point of the character immediately
761761- following cp in logical order in the string representing the label.
762762- After(LastChar) evaluates to Undefined.
763763-764764- Note that "Before" and "After" do not refer to the visual display
765765- order of the character in a label, which may be reversed or otherwise
766766- modified by the bidirectional algorithm for labels including
767767- characters from scripts written right to left. Instead, "Before" and
768768- "After" refer to the network order of the character in the label.
769769-770770- The clauses "Then True" and "Then False" imply exit from the
771771- pseudo-code routine with the corresponding result.
772772-773773- Repeated evaluation for all characters in a label makes use of the
774774- special construct:
775775-776776- For All Characters:
777777-778778- Expression;
779779-780780- End For;
789789-790790-791791- This construct requires repeated evaluation of "Expression" for each
792792- code point in the label, starting from FirstChar and proceeding to
793793- LastChar.
794794-795795- The different fields in the rules are to be interpreted as follows:
796796-797797- Code point:
798798- The code point, or code points, to which this rule is to be
799799- applied. Normally, this implies that if any of the code points in
800800- a label is as defined, then the rules should be applied. If
801801- evaluated to True, the code point is OK as used; if evaluated to
802802- False, it is not OK.
803803-804804- Overview:
805805- A description of the goal with the rule, in plain English.
806806-807807- Lookup:
808808- True if application of this rule is recommended at lookup time;
809809- False otherwise.
810810-811811- Rule Set:
812812- The rule set itself, as described above.
813813-814814-Appendix A.1. ZERO WIDTH NON-JOINER
815815-816816- Code point:
817817- U+200C
818818-819819- Overview:
820820- This may occur in a formally cursive script (such as Arabic) in a
821821- context where it breaks a cursive connection as required for
822822- orthographic rules, as in the Persian language, for example. It
823823- also may occur in Indic scripts in a consonant-conjunct context
824824- (immediately following a virama), to control required display of
825825- such conjuncts.
826826-827827- Lookup:
828828- True
829829-830830- Rule Set:
831831-832832- False;
833833-834834- If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
835835-836836- If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
837837-838838- (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
845845-846846-847847-Appendix A.2. ZERO WIDTH JOINER
848848-849849- Code point:
850850- U+200D
851851-852852- Overview:
853853- This may occur in Indic scripts in a consonant-conjunct context
854854- (immediately following a virama), to control required display of
855855- such conjuncts.
856856-857857- Lookup:
858858- True
859859-860860- Rule Set:
861861-862862- False;
863863-864864- If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
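A minimal OCaml sketch of the rule set above, assuming the caller supplies a lookup for Canonical_Combining_Class. The names `zwj_ok`, `virama_ccc`, and `toy_ccc` are hypothetical, `toy_ccc` knows a single virama and is not a substitute for UnicodeData.txt, and treating an Undefined result as "not permitted" is a choice of this sketch.

```ocaml
(* Canonical_Combining_Class value that Unicode assigns to viramas. *)
let virama_ccc = 9

(* Toy stand-in for a real UnicodeData.txt lookup (assumption): it only
   knows U+094D DEVANAGARI SIGN VIRAMA and returns 0 for everything else. *)
let toy_ccc (cp : int) : int = if cp = 0x094D then virama_ccc else 0

(* Appendix A.2 for the U+200D found at index [i] of [label].
   The default result is False; it becomes True only when the preceding
   code point is a virama. A missing predecessor (Undefined) yields false. *)
let zwj_ok ~(ccc : int -> int) (label : int array) (i : int) : bool =
  i > 0 && i < Array.length label && ccc label.(i - 1) = virama_ccc
```

For example, `zwj_ok ~ccc:toy_ccc [| 0x0915; 0x094D; 0x200D |] 2` holds because the code point before the joiner is a virama.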
865865-866866-Appendix A.3. MIDDLE DOT
867867-868868- Code point:
869869- U+00B7
870870-871871- Overview:
872872- Between 'l' (U+006C) characters only, used to permit the Catalan
873873- character ela geminada to be expressed.
874874-875875- Lookup:
876876- False
877877-878878- Rule Set:
879879-880880- False;
881881-882882- If Before(cp) .eq. U+006C And
883883-884884- After(cp) .eq. U+006C Then True;
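The MIDDLE DOT rule needs no Unicode property data, so it can be sketched in OCaml directly. The function name `middle_dot_ok` and the array-of-code-points label are assumptions of this sketch, as is mapping an Undefined Before or After to a failing result.

```ocaml
(* Appendix A.3 for the U+00B7 found at index [i] of [label].
   The default result is False; it becomes True only when both
   neighbours are U+006C ('l'). A missing neighbour corresponds to
   Undefined in the pseudo-code and is reported here as false. *)
let middle_dot_ok (label : int array) (i : int) : bool =
  let n = Array.length label in
  i > 0 && i + 1 < n
  && label.(i - 1) = 0x006C
  && label.(i + 1) = 0x006C
```

With this definition the label `l·l` (`[| 0x6C; 0xB7; 0x6C |]`, index 1) passes, while a middle dot at the start or end of a label, or next to any other letter, does not.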
901901-902902-903903-Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA)
904904-905905- Code point:
906906- U+0375
907907-908908- Overview:
909909- The script of the following character MUST be Greek.
910910-911911- Lookup:
912912- False
913913-914914- Rule Set:
915915-916916- False;
917917-918918- If Script(After(cp)) .eq. Greek Then True;
919919-920920-Appendix A.5. HEBREW PUNCTUATION GERESH
921921-922922- Code point:
923923- U+05F3
924924-925925- Overview:
926926- The script of the preceding character MUST be Hebrew.
927927-928928- Lookup:
929929- False
930930-931931- Rule Set:
932932-933933- False;
934934-935935- If Script(Before(cp)) .eq. Hebrew Then True;
957957-958958-959959-Appendix A.6. HEBREW PUNCTUATION GERSHAYIM
960960-961961- Code point:
962962- U+05F4
963963-964964- Overview:
965965- The script of the preceding character MUST be Hebrew.
966966-967967- Lookup:
968968- False
969969-970970- Rule Set:
971971-972972- False;
973973-974974- If Script(Before(cp)) .eq. Hebrew Then True;
975975-976976-Appendix A.7. KATAKANA MIDDLE DOT
977977-978978- Code point:
979979- U+30FB
980980-981981- Overview:
982982- Note that the Script of Katakana Middle Dot is not any of
983983- "Hiragana", "Katakana", or "Han". The effect of this rule is to
984984- require at least one character in the label to be in one of those
985985- scripts.
986986-987987- Lookup:
988988- False
989989-990990- Rule Set:
991991-992992- False;
993993-994994- For All Characters:
995995-996996- If Script(cp) .in. {Hiragana, Katakana, Han} Then True;
997997-998998- End For;
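The "For All Characters" loop above can be sketched in OCaml as an existence test over the label. The `script` type, `toy_script_of`, and `katakana_middle_dot_ok` are hypothetical names; `toy_script_of` covers only three small ranges for demonstration and is not a replacement for Scripts.txt.

```ocaml
(* Only the scripts the rule cares about are distinguished. *)
type script = Hiragana | Katakana | Han | Other

(* Toy Script lookup (assumption): a slice of each script, enough for a
   demonstration. U+30FB itself is deliberately not Katakana here,
   matching the note in the Overview. *)
let toy_script_of (cp : int) : script =
  if cp >= 0x3041 && cp <= 0x3096 then Hiragana
  else if cp >= 0x30A1 && cp <= 0x30FA then Katakana
  else if cp >= 0x4E00 && cp <= 0x9FFF then Han
  else Other

(* Appendix A.7: the default is False, and the result becomes True as
   soon as one code point in the label is Hiragana, Katakana, or Han. *)
let katakana_middle_dot_ok ~(script_of : int -> script) (label : int array)
    : bool =
  Array.exists
    (fun cp ->
      match script_of cp with
      | Hiragana | Katakana | Han -> true
      | Other -> false)
    label
```

Passing the lookup as a labelled argument keeps the rule independent of whichever Unicode data source an implementation uses.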
10131013-10141014-10151015-Appendix A.8. ARABIC-INDIC DIGITS
10161016-10171017- Code point:
10181018- 0660..0669
10191019-10201020- Overview:
10211021- Can not be mixed with Extended Arabic-Indic Digits.
10221022-10231023- Lookup:
10241024- False
10251025-10261026- Rule Set:
10271027-10281028- True;
10291029-10301030- For All Characters:
10311031-10321032- If cp .in. 06F0..06F9 Then False;
10331033-10341034- End For;
10351035-10361036-Appendix A.9. EXTENDED ARABIC-INDIC DIGITS
10371037-10381038- Code point:
10391039- 06F0..06F9
10401040-10411041- Overview:
10421042- Can not be mixed with Arabic-Indic Digits.
10431043-10441044- Lookup:
10451045- False
10461046-10471047- Rule Set:
10481048-10491049- True;
10501050-10511051- For All Characters:
10521052-10531053- If cp .in. 0660..0669 Then False;
10541054-10551055- End For;
10691069-10701070-10711071-Appendix B. Code Points 0x0000 - 0x10FFFF
10721072-10731073- If one applies the rules (Section 3) to the code points 0x0000 to
10741074- 0x10FFFF to Unicode 5.2, the result is as follows.
10751075-10761076- This list is non-normative, and only included for illustrative
10771077- purposes. Specifically, what is displayed in the third column is not
10781078- the formal name of the code point (as defined in Section 4.8 of The
10791079- Unicode Standard [Unicode52]). The differences exist, for example,
10801080- for the code points that have the code point value as part of the
10811081- name (for example, CJK UNIFIED IDEOGRAPH-4E00) and the naming of
10821082- Hangul syllables. For many code points, what you see is the official
10831083- name.
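Each entry in the list has the shape `first[..last] ; CATEGORY # comment`. The following OCaml sketch reads one such entry; the type `category` and the functions `category_of_string` and `parse_line` are hypothetical names chosen for the example. CONTEXTJ is included because RFC 5892 defines it as a derived property value alongside the other four.

```ocaml
type category = Pvalid | Contextj | Contexto | Disallowed | Unassigned

let category_of_string = function
  | "PVALID" -> Some Pvalid
  | "CONTEXTJ" -> Some Contextj
  | "CONTEXTO" -> Some Contexto
  | "DISALLOWED" -> Some Disallowed
  | "UNASSIGNED" -> Some Unassigned
  | _ -> None

(* Parse one entry into (first, last, category); a single code point
   gives first = last. Returns None for anything that does not match. *)
let parse_line (line : string) : (int * int * category) option =
  (* Drop the trailing "# comment", if any. *)
  let data =
    match String.index_opt line '#' with
    | Some i -> String.sub line 0 i
    | None -> line
  in
  match String.split_on_char ';' data with
  | [ range; cat ] -> (
      let hex s = int_of_string_opt ("0x" ^ String.trim s) in
      let range = String.trim range in
      let bounds =
        match String.index_opt range '.' with
        | Some i when i + 1 < String.length range && range.[i + 1] = '.' ->
            ( hex (String.sub range 0 i),
              hex (String.sub range (i + 2) (String.length range - i - 2)) )
        | Some _ -> (None, None)
        | None ->
            let v = hex range in
            (v, v)
      in
      match (bounds, category_of_string (String.trim cat)) with
      | (Some lo, Some hi), Some c when lo <= hi -> Some (lo, hi, c)
      | _ -> None)
  | _ -> None
```

Returning an option rather than raising keeps malformed or non-data lines, such as page headers in the plain-text RFC, easy to skip.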
10841084-10851085-Appendix B.1. Code Points in Unicode Character Database (UCD) Format
10861086-10871087-0000..002C ; DISALLOWED # <control>..COMMA
10881088-002D ; PVALID # HYPHEN-MINUS
10891089-002E..002F ; DISALLOWED # FULL STOP..SOLIDUS
10901090-0030..0039 ; PVALID # DIGIT ZERO..DIGIT NINE
10911091-003A..0060 ; DISALLOWED # COLON..GRAVE ACCENT
10921092-0061..007A ; PVALID # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
10931093-007B..00B6 ; DISALLOWED # LEFT CURLY BRACKET..PILCROW SIGN
10941094-00B7 ; CONTEXTO # MIDDLE DOT
10951095-00B8..00DE ; DISALLOWED # CEDILLA..LATIN CAPITAL LETTER THORN
10961096-00DF..00F6 ; PVALID # LATIN SMALL LETTER SHARP S..LATIN SMALL LETT
10971097-00F7 ; DISALLOWED # DIVISION SIGN
10981098-00F8..00FF ; PVALID # LATIN SMALL LETTER O WITH STROKE..LATIN SMAL
10991099-0100 ; DISALLOWED # LATIN CAPITAL LETTER A WITH MACRON
11001100-0101 ; PVALID # LATIN SMALL LETTER A WITH MACRON
11011101-0102 ; DISALLOWED # LATIN CAPITAL LETTER A WITH BREVE
11021102-0103 ; PVALID # LATIN SMALL LETTER A WITH BREVE
11031103-0104 ; DISALLOWED # LATIN CAPITAL LETTER A WITH OGONEK
11041104-0105 ; PVALID # LATIN SMALL LETTER A WITH OGONEK
11051105-0106 ; DISALLOWED # LATIN CAPITAL LETTER C WITH ACUTE
11061106-0107 ; PVALID # LATIN SMALL LETTER C WITH ACUTE
11071107-0108 ; DISALLOWED # LATIN CAPITAL LETTER C WITH CIRCUMFLEX
11081108-0109 ; PVALID # LATIN SMALL LETTER C WITH CIRCUMFLEX
11091109-010A ; DISALLOWED # LATIN CAPITAL LETTER C WITH DOT ABOVE
11101110-010B ; PVALID # LATIN SMALL LETTER C WITH DOT ABOVE
11111111-010C ; DISALLOWED # LATIN CAPITAL LETTER C WITH CARON
11121112-010D ; PVALID # LATIN SMALL LETTER C WITH CARON
11131113-010E ; DISALLOWED # LATIN CAPITAL LETTER D WITH CARON
11141114-010F ; PVALID # LATIN SMALL LETTER D WITH CARON
11151115-0110 ; DISALLOWED # LATIN CAPITAL LETTER D WITH STROKE
11161116-0111 ; PVALID # LATIN SMALL LETTER D WITH STROKE
11171117-0112 ; DISALLOWED # LATIN CAPITAL LETTER E WITH MACRON
11181118-0113 ; PVALID # LATIN SMALL LETTER E WITH MACRON
11251125-11261126-11271127-0114 ; DISALLOWED # LATIN CAPITAL LETTER E WITH BREVE
11281128-0115 ; PVALID # LATIN SMALL LETTER E WITH BREVE
11291129-0116 ; DISALLOWED # LATIN CAPITAL LETTER E WITH DOT ABOVE
11301130-0117 ; PVALID # LATIN SMALL LETTER E WITH DOT ABOVE
11311131-0118 ; DISALLOWED # LATIN CAPITAL LETTER E WITH OGONEK
11321132-0119 ; PVALID # LATIN SMALL LETTER E WITH OGONEK
11331133-011A ; DISALLOWED # LATIN CAPITAL LETTER E WITH CARON
11341134-011B ; PVALID # LATIN SMALL LETTER E WITH CARON
11351135-011C ; DISALLOWED # LATIN CAPITAL LETTER G WITH CIRCUMFLEX
11361136-011D ; PVALID # LATIN SMALL LETTER G WITH CIRCUMFLEX
11371137-011E ; DISALLOWED # LATIN CAPITAL LETTER G WITH BREVE
11381138-011F ; PVALID # LATIN SMALL LETTER G WITH BREVE
11391139-0120 ; DISALLOWED # LATIN CAPITAL LETTER G WITH DOT ABOVE
11401140-0121 ; PVALID # LATIN SMALL LETTER G WITH DOT ABOVE
11411141-0122 ; DISALLOWED # LATIN CAPITAL LETTER G WITH CEDILLA
11421142-0123 ; PVALID # LATIN SMALL LETTER G WITH CEDILLA
11431143-0124 ; DISALLOWED # LATIN CAPITAL LETTER H WITH CIRCUMFLEX
11441144-0125 ; PVALID # LATIN SMALL LETTER H WITH CIRCUMFLEX
11451145-0126 ; DISALLOWED # LATIN CAPITAL LETTER H WITH STROKE
11461146-0127 ; PVALID # LATIN SMALL LETTER H WITH STROKE
11471147-0128 ; DISALLOWED # LATIN CAPITAL LETTER I WITH TILDE
11481148-0129 ; PVALID # LATIN SMALL LETTER I WITH TILDE
11491149-012A ; DISALLOWED # LATIN CAPITAL LETTER I WITH MACRON
11501150-012B ; PVALID # LATIN SMALL LETTER I WITH MACRON
11511151-012C ; DISALLOWED # LATIN CAPITAL LETTER I WITH BREVE
11521152-012D ; PVALID # LATIN SMALL LETTER I WITH BREVE
11531153-012E ; DISALLOWED # LATIN CAPITAL LETTER I WITH OGONEK
11541154-012F ; PVALID # LATIN SMALL LETTER I WITH OGONEK
11551155-0130 ; DISALLOWED # LATIN CAPITAL LETTER I WITH DOT ABOVE
11561156-0131 ; PVALID # LATIN SMALL LETTER DOTLESS I
11571157-0132..0134 ; DISALLOWED # LATIN CAPITAL LIGATURE IJ..LATIN CAPITAL LET
11581158-0135 ; PVALID # LATIN SMALL LETTER J WITH CIRCUMFLEX
11591159-0136 ; DISALLOWED # LATIN CAPITAL LETTER K WITH CEDILLA
11601160-0137..0138 ; PVALID # LATIN SMALL LETTER K WITH CEDILLA..LATIN SMA
11611161-0139 ; DISALLOWED # LATIN CAPITAL LETTER L WITH ACUTE
11621162-013A ; PVALID # LATIN SMALL LETTER L WITH ACUTE
11631163-013B ; DISALLOWED # LATIN CAPITAL LETTER L WITH CEDILLA
11641164-013C ; PVALID # LATIN SMALL LETTER L WITH CEDILLA
11651165-013D ; DISALLOWED # LATIN CAPITAL LETTER L WITH CARON
11661166-013E ; PVALID # LATIN SMALL LETTER L WITH CARON
11671167-013F..0141 ; DISALLOWED # LATIN CAPITAL LETTER L WITH MIDDLE DOT..LATI
11681168-0142 ; PVALID # LATIN SMALL LETTER L WITH STROKE
11691169-0143 ; DISALLOWED # LATIN CAPITAL LETTER N WITH ACUTE
11701170-0144 ; PVALID # LATIN SMALL LETTER N WITH ACUTE
11711171-0145 ; DISALLOWED # LATIN CAPITAL LETTER N WITH CEDILLA
11721172-0146 ; PVALID # LATIN SMALL LETTER N WITH CEDILLA
11731173-0147 ; DISALLOWED # LATIN CAPITAL LETTER N WITH CARON
11741174-0148 ; PVALID # LATIN SMALL LETTER N WITH CARON
11811181-11821182-11831183-0149..014A ; DISALLOWED # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE.
11841184-014B ; PVALID # LATIN SMALL LETTER ENG
11851185-014C ; DISALLOWED # LATIN CAPITAL LETTER O WITH MACRON
11861186-014D ; PVALID # LATIN SMALL LETTER O WITH MACRON
11871187-014E ; DISALLOWED # LATIN CAPITAL LETTER O WITH BREVE
11881188-014F ; PVALID # LATIN SMALL LETTER O WITH BREVE
11891189-0150 ; DISALLOWED # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
11901190-0151 ; PVALID # LATIN SMALL LETTER O WITH DOUBLE ACUTE
11911191-0152 ; DISALLOWED # LATIN CAPITAL LIGATURE OE
11921192-0153 ; PVALID # LATIN SMALL LIGATURE OE
11931193-0154 ; DISALLOWED # LATIN CAPITAL LETTER R WITH ACUTE
11941194-0155 ; PVALID # LATIN SMALL LETTER R WITH ACUTE
11951195-0156 ; DISALLOWED # LATIN CAPITAL LETTER R WITH CEDILLA
11961196-0157 ; PVALID # LATIN SMALL LETTER R WITH CEDILLA
11971197-0158 ; DISALLOWED # LATIN CAPITAL LETTER R WITH CARON
11981198-0159 ; PVALID # LATIN SMALL LETTER R WITH CARON
11991199-015A ; DISALLOWED # LATIN CAPITAL LETTER S WITH ACUTE
12001200-015B ; PVALID # LATIN SMALL LETTER S WITH ACUTE
12011201-015C ; DISALLOWED # LATIN CAPITAL LETTER S WITH CIRCUMFLEX
12021202-015D ; PVALID # LATIN SMALL LETTER S WITH CIRCUMFLEX
12031203-015E ; DISALLOWED # LATIN CAPITAL LETTER S WITH CEDILLA
12041204-015F ; PVALID # LATIN SMALL LETTER S WITH CEDILLA
12051205-0160 ; DISALLOWED # LATIN CAPITAL LETTER S WITH CARON
12061206-0161 ; PVALID # LATIN SMALL LETTER S WITH CARON
12071207-0162 ; DISALLOWED # LATIN CAPITAL LETTER T WITH CEDILLA
12081208-0163 ; PVALID # LATIN SMALL LETTER T WITH CEDILLA
12091209-0164 ; DISALLOWED # LATIN CAPITAL LETTER T WITH CARON
12101210-0165 ; PVALID # LATIN SMALL LETTER T WITH CARON
12111211-0166 ; DISALLOWED # LATIN CAPITAL LETTER T WITH STROKE
12121212-0167 ; PVALID # LATIN SMALL LETTER T WITH STROKE
12131213-0168 ; DISALLOWED # LATIN CAPITAL LETTER U WITH TILDE
12141214-0169 ; PVALID # LATIN SMALL LETTER U WITH TILDE
12151215-016A ; DISALLOWED # LATIN CAPITAL LETTER U WITH MACRON
12161216-016B ; PVALID # LATIN SMALL LETTER U WITH MACRON
12171217-016C ; DISALLOWED # LATIN CAPITAL LETTER U WITH BREVE
12181218-016D ; PVALID # LATIN SMALL LETTER U WITH BREVE
12191219-016E ; DISALLOWED # LATIN CAPITAL LETTER U WITH RING ABOVE
12201220-016F ; PVALID # LATIN SMALL LETTER U WITH RING ABOVE
12211221-0170 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
12221222-0171 ; PVALID # LATIN SMALL LETTER U WITH DOUBLE ACUTE
12231223-0172 ; DISALLOWED # LATIN CAPITAL LETTER U WITH OGONEK
12241224-0173 ; PVALID # LATIN SMALL LETTER U WITH OGONEK
12251225-0174 ; DISALLOWED # LATIN CAPITAL LETTER W WITH CIRCUMFLEX
12261226-0175 ; PVALID # LATIN SMALL LETTER W WITH CIRCUMFLEX
12271227-0176 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH CIRCUMFLEX
12281228-0177 ; PVALID # LATIN SMALL LETTER Y WITH CIRCUMFLEX
12291229-0178..0179 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH DIAERESIS..LATIN
12301230-017A ; PVALID # LATIN SMALL LETTER Z WITH ACUTE
12371237-12381238-12391239-017B ; DISALLOWED # LATIN CAPITAL LETTER Z WITH DOT ABOVE
12401240-017C ; PVALID # LATIN SMALL LETTER Z WITH DOT ABOVE
12411241-017D ; DISALLOWED # LATIN CAPITAL LETTER Z WITH CARON
12421242-017E ; PVALID # LATIN SMALL LETTER Z WITH CARON
12431243-017F ; DISALLOWED # LATIN SMALL LETTER LONG S
12441244-0180 ; PVALID # LATIN SMALL LETTER B WITH STROKE
12451245-0181..0182 ; DISALLOWED # LATIN CAPITAL LETTER B WITH HOOK..LATIN CAPI
12461246-0183 ; PVALID # LATIN SMALL LETTER B WITH TOPBAR
12471247-0184 ; DISALLOWED # LATIN CAPITAL LETTER TONE SIX
12481248-0185 ; PVALID # LATIN SMALL LETTER TONE SIX
12491249-0186..0187 ; DISALLOWED # LATIN CAPITAL LETTER OPEN O..LATIN CAPITAL L
12501250-0188 ; PVALID # LATIN SMALL LETTER C WITH HOOK
12511251-0189..018B ; DISALLOWED # LATIN CAPITAL LETTER AFRICAN D..LATIN CAPITA
12521252-018C..018D ; PVALID # LATIN SMALL LETTER D WITH TOPBAR..LATIN SMAL
12531253-018E..0191 ; DISALLOWED # LATIN CAPITAL LETTER REVERSED E..LATIN CAPIT
12541254-0192 ; PVALID # LATIN SMALL LETTER F WITH HOOK
12551255-0193..0194 ; DISALLOWED # LATIN CAPITAL LETTER G WITH HOOK..LATIN CAPI
12561256-0195 ; PVALID # LATIN SMALL LETTER HV
12571257-0196..0198 ; DISALLOWED # LATIN CAPITAL LETTER IOTA..LATIN CAPITAL LET
12581258-0199..019B ; PVALID # LATIN SMALL LETTER K WITH HOOK..LATIN SMALL
12591259-019C..019D ; DISALLOWED # LATIN CAPITAL LETTER TURNED M..LATIN CAPITAL
12601260-019E ; PVALID # LATIN SMALL LETTER N WITH LONG RIGHT LEG
12611261-019F..01A0 ; DISALLOWED # LATIN CAPITAL LETTER O WITH MIDDLE TILDE..LA
12621262-01A1 ; PVALID # LATIN SMALL LETTER O WITH HORN
12631263-01A2 ; DISALLOWED # LATIN CAPITAL LETTER OI
12641264-01A3 ; PVALID # LATIN SMALL LETTER OI
12651265-01A4 ; DISALLOWED # LATIN CAPITAL LETTER P WITH HOOK
12661266-01A5 ; PVALID # LATIN SMALL LETTER P WITH HOOK
12671267-01A6..01A7 ; DISALLOWED # LATIN LETTER YR..LATIN CAPITAL LETTER TONE T
12681268-01A8 ; PVALID # LATIN SMALL LETTER TONE TWO
12691269-01A9 ; DISALLOWED # LATIN CAPITAL LETTER ESH
12701270-01AA..01AB ; PVALID # LATIN LETTER REVERSED ESH LOOP..LATIN SMALL
12711271-01AC ; DISALLOWED # LATIN CAPITAL LETTER T WITH HOOK
12721272-01AD ; PVALID # LATIN SMALL LETTER T WITH HOOK
12731273-01AE..01AF ; DISALLOWED # LATIN CAPITAL LETTER T WITH RETROFLEX HOOK..
12741274-01B0 ; PVALID # LATIN SMALL LETTER U WITH HORN
12751275-01B1..01B3 ; DISALLOWED # LATIN CAPITAL LETTER UPSILON..LATIN CAPITAL
12761276-01B4 ; PVALID # LATIN SMALL LETTER Y WITH HOOK
12771277-01B5 ; DISALLOWED # LATIN CAPITAL LETTER Z WITH STROKE
12781278-01B6 ; PVALID # LATIN SMALL LETTER Z WITH STROKE
12791279-01B7..01B8 ; DISALLOWED # LATIN CAPITAL LETTER EZH..LATIN CAPITAL LETT
12801280-01B9..01BB ; PVALID # LATIN SMALL LETTER EZH REVERSED..LATIN LETTE
12811281-01BC ; DISALLOWED # LATIN CAPITAL LETTER TONE FIVE
12821282-01BD..01C3 ; PVALID # LATIN SMALL LETTER TONE FIVE..LATIN LETTER R
12831283-01C4..01CD ; DISALLOWED # LATIN CAPITAL LETTER DZ WITH CARON..LATIN CA
12841284-01CE ; PVALID # LATIN SMALL LETTER A WITH CARON
12851285-01CF ; DISALLOWED # LATIN CAPITAL LETTER I WITH CARON
12861286-01D0 ; PVALID # LATIN SMALL LETTER I WITH CARON
12931293-12941294-12951295-01D1 ; DISALLOWED # LATIN CAPITAL LETTER O WITH CARON
12961296-01D2 ; PVALID # LATIN SMALL LETTER O WITH CARON
12971297-01D3 ; DISALLOWED # LATIN CAPITAL LETTER U WITH CARON
12981298-01D4 ; PVALID # LATIN SMALL LETTER U WITH CARON
12991299-01D5 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DIAERESIS AND MA
13001300-01D6 ; PVALID # LATIN SMALL LETTER U WITH DIAERESIS AND MACR
13011301-01D7 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DIAERESIS AND AC
13021302-01D8 ; PVALID # LATIN SMALL LETTER U WITH DIAERESIS AND ACUT
13031303-01D9 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DIAERESIS AND CA
13041304-01DA ; PVALID # LATIN SMALL LETTER U WITH DIAERESIS AND CARO
13051305-01DB ; DISALLOWED # LATIN CAPITAL LETTER U WITH DIAERESIS AND GR
13061306-01DC..01DD ; PVALID # LATIN SMALL LETTER U WITH DIAERESIS AND GRAV
13071307-01DE ; DISALLOWED # LATIN CAPITAL LETTER A WITH DIAERESIS AND MA
13081308-01DF ; PVALID # LATIN SMALL LETTER A WITH DIAERESIS AND MACR
13091309-01E0 ; DISALLOWED # LATIN CAPITAL LETTER A WITH DOT ABOVE AND MA
13101310-01E1 ; PVALID # LATIN SMALL LETTER A WITH DOT ABOVE AND MACR
13111311-01E2 ; DISALLOWED # LATIN CAPITAL LETTER AE WITH MACRON
13121312-01E3 ; PVALID # LATIN SMALL LETTER AE WITH MACRON
13131313-01E4 ; DISALLOWED # LATIN CAPITAL LETTER G WITH STROKE
13141314-01E5 ; PVALID # LATIN SMALL LETTER G WITH STROKE
13151315-01E6 ; DISALLOWED # LATIN CAPITAL LETTER G WITH CARON
13161316-01E7 ; PVALID # LATIN SMALL LETTER G WITH CARON
13171317-01E8 ; DISALLOWED # LATIN CAPITAL LETTER K WITH CARON
13181318-01E9 ; PVALID # LATIN SMALL LETTER K WITH CARON
13191319-01EA ; DISALLOWED # LATIN CAPITAL LETTER O WITH OGONEK
13201320-01EB ; PVALID # LATIN SMALL LETTER O WITH OGONEK
13211321-01EC ; DISALLOWED # LATIN CAPITAL LETTER O WITH OGONEK AND MACRO
13221322-01ED ; PVALID # LATIN SMALL LETTER O WITH OGONEK AND MACRON
13231323-01EE ; DISALLOWED # LATIN CAPITAL LETTER EZH WITH CARON
13241324-01EF..01F0 ; PVALID # LATIN SMALL LETTER EZH WITH CARON..LATIN SMA
13251325-01F1..01F4 ; DISALLOWED # LATIN CAPITAL LETTER DZ..LATIN CAPITAL LETTE
13261326-01F5 ; PVALID # LATIN SMALL LETTER G WITH ACUTE
13271327-01F6..01F8 ; DISALLOWED # LATIN CAPITAL LETTER HWAIR..LATIN CAPITAL LE
13281328-01F9 ; PVALID # LATIN SMALL LETTER N WITH GRAVE
13291329-01FA ; DISALLOWED # LATIN CAPITAL LETTER A WITH RING ABOVE AND A
13301330-01FB ; PVALID # LATIN SMALL LETTER A WITH RING ABOVE AND ACU
13311331-01FC ; DISALLOWED # LATIN CAPITAL LETTER AE WITH ACUTE
13321332-01FD ; PVALID # LATIN SMALL LETTER AE WITH ACUTE
13331333-01FE ; DISALLOWED # LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
13341334-01FF ; PVALID # LATIN SMALL LETTER O WITH STROKE AND ACUTE
13351335-0200 ; DISALLOWED # LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
13361336-0201 ; PVALID # LATIN SMALL LETTER A WITH DOUBLE GRAVE
13371337-0202 ; DISALLOWED # LATIN CAPITAL LETTER A WITH INVERTED BREVE
13381338-0203 ; PVALID # LATIN SMALL LETTER A WITH INVERTED BREVE
13391339-0204 ; DISALLOWED # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
13401340-0205 ; PVALID # LATIN SMALL LETTER E WITH DOUBLE GRAVE
13411341-0206 ; DISALLOWED # LATIN CAPITAL LETTER E WITH INVERTED BREVE
13421342-0207 ; PVALID # LATIN SMALL LETTER E WITH INVERTED BREVE
13491349-13501350-13511351-0208 ; DISALLOWED # LATIN CAPITAL LETTER I WITH DOUBLE GRAVE
13521352-0209 ; PVALID # LATIN SMALL LETTER I WITH DOUBLE GRAVE
13531353-020A ; DISALLOWED # LATIN CAPITAL LETTER I WITH INVERTED BREVE
13541354-020B ; PVALID # LATIN SMALL LETTER I WITH INVERTED BREVE
13551355-020C ; DISALLOWED # LATIN CAPITAL LETTER O WITH DOUBLE GRAVE
13561356-020D ; PVALID # LATIN SMALL LETTER O WITH DOUBLE GRAVE
13571357-020E ; DISALLOWED # LATIN CAPITAL LETTER O WITH INVERTED BREVE
13581358-020F ; PVALID # LATIN SMALL LETTER O WITH INVERTED BREVE
13591359-0210 ; DISALLOWED # LATIN CAPITAL LETTER R WITH DOUBLE GRAVE
13601360-0211 ; PVALID # LATIN SMALL LETTER R WITH DOUBLE GRAVE
13611361-0212 ; DISALLOWED # LATIN CAPITAL LETTER R WITH INVERTED BREVE
13621362-0213 ; PVALID # LATIN SMALL LETTER R WITH INVERTED BREVE
13631363-0214 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DOUBLE GRAVE
13641364-0215 ; PVALID # LATIN SMALL LETTER U WITH DOUBLE GRAVE
13651365-0216 ; DISALLOWED # LATIN CAPITAL LETTER U WITH INVERTED BREVE
13661366-0217 ; PVALID # LATIN SMALL LETTER U WITH INVERTED BREVE
13671367-0218 ; DISALLOWED # LATIN CAPITAL LETTER S WITH COMMA BELOW
13681368-0219 ; PVALID # LATIN SMALL LETTER S WITH COMMA BELOW
13691369-021A ; DISALLOWED # LATIN CAPITAL LETTER T WITH COMMA BELOW
13701370-021B ; PVALID # LATIN SMALL LETTER T WITH COMMA BELOW
13711371-021C ; DISALLOWED # LATIN CAPITAL LETTER YOGH
13721372-021D ; PVALID # LATIN SMALL LETTER YOGH
13731373-021E ; DISALLOWED # LATIN CAPITAL LETTER H WITH CARON
13741374-021F ; PVALID # LATIN SMALL LETTER H WITH CARON
13751375-0220 ; DISALLOWED # LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
13761376-0221 ; PVALID # LATIN SMALL LETTER D WITH CURL
13771377-0222 ; DISALLOWED # LATIN CAPITAL LETTER OU
13781378-0223 ; PVALID # LATIN SMALL LETTER OU
13791379-0224 ; DISALLOWED # LATIN CAPITAL LETTER Z WITH HOOK
13801380-0225 ; PVALID # LATIN SMALL LETTER Z WITH HOOK
13811381-0226 ; DISALLOWED # LATIN CAPITAL LETTER A WITH DOT ABOVE
13821382-0227 ; PVALID # LATIN SMALL LETTER A WITH DOT ABOVE
13831383-0228 ; DISALLOWED # LATIN CAPITAL LETTER E WITH CEDILLA
13841384-0229 ; PVALID # LATIN SMALL LETTER E WITH CEDILLA
13851385-022A ; DISALLOWED # LATIN CAPITAL LETTER O WITH DIAERESIS AND MA
13861386-022B ; PVALID # LATIN SMALL LETTER O WITH DIAERESIS AND MACR
13871387-022C ; DISALLOWED # LATIN CAPITAL LETTER O WITH TILDE AND MACRON
13881388-022D ; PVALID # LATIN SMALL LETTER O WITH TILDE AND MACRON
13891389-022E ; DISALLOWED # LATIN CAPITAL LETTER O WITH DOT ABOVE
13901390-022F ; PVALID # LATIN SMALL LETTER O WITH DOT ABOVE
13911391-0230 ; DISALLOWED # LATIN CAPITAL LETTER O WITH DOT ABOVE AND MA
13921392-0231 ; PVALID # LATIN SMALL LETTER O WITH DOT ABOVE AND MACR
13931393-0232 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH MACRON
13941394-0233..0239 ; PVALID # LATIN SMALL LETTER Y WITH MACRON..LATIN SMAL
13951395-023A..023B ; DISALLOWED # LATIN CAPITAL LETTER A WITH STROKE..LATIN CA
13961396-023C ; PVALID # LATIN SMALL LETTER C WITH STROKE
13971397-023D..023E ; DISALLOWED # LATIN CAPITAL LETTER L WITH BAR..LATIN CAPIT
13981398-023F..0240 ; PVALID # LATIN SMALL LETTER S WITH SWASH TAIL..LATIN
14051405-14061406-14071407-0241 ; DISALLOWED # LATIN CAPITAL LETTER GLOTTAL STOP
14081408-0242 ; PVALID # LATIN SMALL LETTER GLOTTAL STOP
14091409-0243..0246 ; DISALLOWED # LATIN CAPITAL LETTER B WITH STROKE..LATIN CA
14101410-0247 ; PVALID # LATIN SMALL LETTER E WITH STROKE
14111411-0248 ; DISALLOWED # LATIN CAPITAL LETTER J WITH STROKE
14121412-0249 ; PVALID # LATIN SMALL LETTER J WITH STROKE
14131413-024A ; DISALLOWED # LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
14141414-024B ; PVALID # LATIN SMALL LETTER Q WITH HOOK TAIL
14151415-024C ; DISALLOWED # LATIN CAPITAL LETTER R WITH STROKE
14161416-024D ; PVALID # LATIN SMALL LETTER R WITH STROKE
14171417-024E ; DISALLOWED # LATIN CAPITAL LETTER Y WITH STROKE
14181418-024F..02AF ; PVALID # LATIN SMALL LETTER Y WITH STROKE..LATIN SMAL
14191419-02B0..02B8 ; DISALLOWED # MODIFIER LETTER SMALL H..MODIFIER LETTER SMA
14201420-02B9..02C1 ; PVALID # MODIFIER LETTER PRIME..MODIFIER LETTER REVER
14211421-02C2..02C5 ; DISALLOWED # MODIFIER LETTER LEFT ARROWHEAD..MODIFIER LET
14221422-02C6..02D1 ; PVALID # MODIFIER LETTER CIRCUMFLEX ACCENT..MODIFIER
14231423-02D2..02EB ; DISALLOWED # MODIFIER LETTER CENTRED RIGHT HALF RING..MOD
14241424-02EC ; PVALID # MODIFIER LETTER VOICING
14251425-02ED ; DISALLOWED # MODIFIER LETTER UNASPIRATED
14261426-02EE ; PVALID # MODIFIER LETTER DOUBLE APOSTROPHE
14271427-02EF..02FF ; DISALLOWED # MODIFIER LETTER LOW DOWN ARROWHEAD..MODIFIER
14281428-0300..033F ; PVALID # COMBINING GRAVE ACCENT..COMBINING DOUBLE OVE
14291429-0340..0341 ; DISALLOWED # COMBINING GRAVE TONE MARK..COMBINING ACUTE T
14301430-0342 ; PVALID # COMBINING GREEK PERISPOMENI
14311431-0343..0345 ; DISALLOWED # COMBINING GREEK KORONIS..COMBINING GREEK YPO
14321432-0346..034E ; PVALID # COMBINING BRIDGE ABOVE..COMBINING UPWARDS AR
14331433-034F ; DISALLOWED # COMBINING GRAPHEME JOINER
14341434-0350..036F ; PVALID # COMBINING RIGHT ARROWHEAD ABOVE..COMBINING L
14351435-0370 ; DISALLOWED # GREEK CAPITAL LETTER HETA
14361436-0371 ; PVALID # GREEK SMALL LETTER HETA
14371437-0372 ; DISALLOWED # GREEK CAPITAL LETTER ARCHAIC SAMPI
14381438-0373 ; PVALID # GREEK SMALL LETTER ARCHAIC SAMPI
14391439-0374 ; DISALLOWED # GREEK NUMERAL SIGN
14401440-0375 ; CONTEXTO # GREEK LOWER NUMERAL SIGN
14411441-0376 ; DISALLOWED # GREEK CAPITAL LETTER PAMPHYLIAN DIGAMMA
14421442-0377 ; PVALID # GREEK SMALL LETTER PAMPHYLIAN DIGAMMA
14431443-0378..0379 ; UNASSIGNED # <reserved>..<reserved>
14441444-037A ; DISALLOWED # GREEK YPOGEGRAMMENI
14451445-037B..037D ; PVALID # GREEK SMALL REVERSED LUNATE SIGMA SYMBOL..GR
14461446-037E ; DISALLOWED # GREEK QUESTION MARK
14471447-037F..0383 ; UNASSIGNED # <reserved>..<reserved>
14481448-0384..038A ; DISALLOWED # GREEK TONOS..GREEK CAPITAL LETTER IOTA WITH
14491449-038B ; UNASSIGNED # <reserved>
14501450-038C ; DISALLOWED # GREEK CAPITAL LETTER OMICRON WITH TONOS
14511451-038D ; UNASSIGNED # <reserved>
14521452-038E..038F ; DISALLOWED # GREEK CAPITAL LETTER UPSILON WITH TONOS..GRE
14531453-0390 ; PVALID # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND T
14541454-0391..03A1 ; DISALLOWED # GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LE
14611461-14621462-14631463-03A2 ; UNASSIGNED # <reserved>
14641464-03A3..03AB ; DISALLOWED # GREEK CAPITAL LETTER SIGMA..GREEK CAPITAL LE
14651465-03AC..03CE ; PVALID # GREEK SMALL LETTER ALPHA WITH TONOS..GREEK S
14661466-03CF..03D6 ; DISALLOWED # GREEK CAPITAL KAI SYMBOL..GREEK PI SYMBOL
14671467-03D7 ; PVALID # GREEK KAI SYMBOL
14681468-03D8 ; DISALLOWED # GREEK LETTER ARCHAIC KOPPA
14691469-03D9 ; PVALID # GREEK SMALL LETTER ARCHAIC KOPPA
14701470-03DA ; DISALLOWED # GREEK LETTER STIGMA
14711471-03DB ; PVALID # GREEK SMALL LETTER STIGMA
14721472-03DC ; DISALLOWED # GREEK LETTER DIGAMMA
14731473-03DD ; PVALID # GREEK SMALL LETTER DIGAMMA
14741474-03DE ; DISALLOWED # GREEK LETTER KOPPA
14751475-03DF ; PVALID # GREEK SMALL LETTER KOPPA
14761476-03E0 ; DISALLOWED # GREEK LETTER SAMPI
14771477-03E1 ; PVALID # GREEK SMALL LETTER SAMPI
14781478-03E2 ; DISALLOWED # COPTIC CAPITAL LETTER SHEI
14791479-03E3 ; PVALID # COPTIC SMALL LETTER SHEI
14801480-03E4 ; DISALLOWED # COPTIC CAPITAL LETTER FEI
14811481-03E5 ; PVALID # COPTIC SMALL LETTER FEI
14821482-03E6 ; DISALLOWED # COPTIC CAPITAL LETTER KHEI
14831483-03E7 ; PVALID # COPTIC SMALL LETTER KHEI
14841484-03E8 ; DISALLOWED # COPTIC CAPITAL LETTER HORI
14851485-03E9 ; PVALID # COPTIC SMALL LETTER HORI
14861486-03EA ; DISALLOWED # COPTIC CAPITAL LETTER GANGIA
14871487-03EB ; PVALID # COPTIC SMALL LETTER GANGIA
14881488-03EC ; DISALLOWED # COPTIC CAPITAL LETTER SHIMA
14891489-03ED ; PVALID # COPTIC SMALL LETTER SHIMA
14901490-03EE ; DISALLOWED # COPTIC CAPITAL LETTER DEI
14911491-03EF ; PVALID # COPTIC SMALL LETTER DEI
14921492-03F0..03F2 ; DISALLOWED # GREEK KAPPA SYMBOL..GREEK LUNATE SIGMA SYMBO
14931493-03F3 ; PVALID # GREEK LETTER YOT
14941494-03F4..03F7 ; DISALLOWED # GREEK CAPITAL THETA SYMBOL..GREEK CAPITAL LE
14951495-03F8 ; PVALID # GREEK SMALL LETTER SHO
14961496-03F9..03FA ; DISALLOWED # GREEK CAPITAL LUNATE SIGMA SYMBOL..GREEK CAP
14971497-03FB..03FC ; PVALID # GREEK SMALL LETTER SAN..GREEK RHO WITH STROK
14981498-03FD..042F ; DISALLOWED # GREEK CAPITAL REVERSED LUNATE SIGMA SYMBOL..
14991499-0430..045F ; PVALID # CYRILLIC SMALL LETTER A..CYRILLIC SMALL LETT
15001500-0460 ; DISALLOWED # CYRILLIC CAPITAL LETTER OMEGA
15011501-0461 ; PVALID # CYRILLIC SMALL LETTER OMEGA
15021502-0462 ; DISALLOWED # CYRILLIC CAPITAL LETTER YAT
15031503-0463 ; PVALID # CYRILLIC SMALL LETTER YAT
15041504-0464 ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTIFIED E
15051505-0465 ; PVALID # CYRILLIC SMALL LETTER IOTIFIED E
15061506-0466 ; DISALLOWED # CYRILLIC CAPITAL LETTER LITTLE YUS
15071507-0467 ; PVALID # CYRILLIC SMALL LETTER LITTLE YUS
15081508-0468 ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTIFIED LITTLE YUS
15091509-0469 ; PVALID # CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS
15101510-046A ; DISALLOWED # CYRILLIC CAPITAL LETTER BIG YUS
15191519-046B ; PVALID # CYRILLIC SMALL LETTER BIG YUS
15201520-046C ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTIFIED BIG YUS
15211521-046D ; PVALID # CYRILLIC SMALL LETTER IOTIFIED BIG YUS
15221522-046E ; DISALLOWED # CYRILLIC CAPITAL LETTER KSI
15231523-046F ; PVALID # CYRILLIC SMALL LETTER KSI
15241524-0470 ; DISALLOWED # CYRILLIC CAPITAL LETTER PSI
15251525-0471 ; PVALID # CYRILLIC SMALL LETTER PSI
15261526-0472 ; DISALLOWED # CYRILLIC CAPITAL LETTER FITA
15271527-0473 ; PVALID # CYRILLIC SMALL LETTER FITA
15281528-0474 ; DISALLOWED # CYRILLIC CAPITAL LETTER IZHITSA
15291529-0475 ; PVALID # CYRILLIC SMALL LETTER IZHITSA
15301530-0476 ; DISALLOWED # CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE
15311531-0477 ; PVALID # CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GR
15321532-0478 ; DISALLOWED # CYRILLIC CAPITAL LETTER UK
15331533-0479 ; PVALID # CYRILLIC SMALL LETTER UK
15341534-047A ; DISALLOWED # CYRILLIC CAPITAL LETTER ROUND OMEGA
15351535-047B ; PVALID # CYRILLIC SMALL LETTER ROUND OMEGA
15361536-047C ; DISALLOWED # CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
15371537-047D ; PVALID # CYRILLIC SMALL LETTER OMEGA WITH TITLO
15381538-047E ; DISALLOWED # CYRILLIC CAPITAL LETTER OT
15391539-047F ; PVALID # CYRILLIC SMALL LETTER OT
15401540-0480 ; DISALLOWED # CYRILLIC CAPITAL LETTER KOPPA
15411541-0481 ; PVALID # CYRILLIC SMALL LETTER KOPPA
15421542-0482 ; DISALLOWED # CYRILLIC THOUSANDS SIGN
15431543-0483..0487 ; PVALID # COMBINING CYRILLIC TITLO..COMBINING CYRILLIC
15441544-0488..048A ; DISALLOWED # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN..C
15451545-048B ; PVALID # CYRILLIC SMALL LETTER SHORT I WITH TAIL
15461546-048C ; DISALLOWED # CYRILLIC CAPITAL LETTER SEMISOFT SIGN
15471547-048D ; PVALID # CYRILLIC SMALL LETTER SEMISOFT SIGN
15481548-048E ; DISALLOWED # CYRILLIC CAPITAL LETTER ER WITH TICK
15491549-048F ; PVALID # CYRILLIC SMALL LETTER ER WITH TICK
15501550-0490 ; DISALLOWED # CYRILLIC CAPITAL LETTER GHE WITH UPTURN
15511551-0491 ; PVALID # CYRILLIC SMALL LETTER GHE WITH UPTURN
15521552-0492 ; DISALLOWED # CYRILLIC CAPITAL LETTER GHE WITH STROKE
15531553-0493 ; PVALID # CYRILLIC SMALL LETTER GHE WITH STROKE
15541554-0494 ; DISALLOWED # CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
15551555-0495 ; PVALID # CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
15561556-0496 ; DISALLOWED # CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
15571557-0497 ; PVALID # CYRILLIC SMALL LETTER ZHE WITH DESCENDER
15581558-0498 ; DISALLOWED # CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
15591559-0499 ; PVALID # CYRILLIC SMALL LETTER ZE WITH DESCENDER
15601560-049A ; DISALLOWED # CYRILLIC CAPITAL LETTER KA WITH DESCENDER
15611561-049B ; PVALID # CYRILLIC SMALL LETTER KA WITH DESCENDER
15621562-049C ; DISALLOWED # CYRILLIC CAPITAL LETTER KA WITH VERTICAL STR
15631563-049D ; PVALID # CYRILLIC SMALL LETTER KA WITH VERTICAL STROK
15641564-049E ; DISALLOWED # CYRILLIC CAPITAL LETTER KA WITH STROKE
15651565-049F ; PVALID # CYRILLIC SMALL LETTER KA WITH STROKE
15661566-04A0 ; DISALLOWED # CYRILLIC CAPITAL LETTER BASHKIR KA
15751575-04A1 ; PVALID # CYRILLIC SMALL LETTER BASHKIR KA
15761576-04A2 ; DISALLOWED # CYRILLIC CAPITAL LETTER EN WITH DESCENDER
15771577-04A3 ; PVALID # CYRILLIC SMALL LETTER EN WITH DESCENDER
15781578-04A4 ; DISALLOWED # CYRILLIC CAPITAL LIGATURE EN GHE
15791579-04A5 ; PVALID # CYRILLIC SMALL LIGATURE EN GHE
15801580-04A6 ; DISALLOWED # CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
15811581-04A7 ; PVALID # CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
15821582-04A8 ; DISALLOWED # CYRILLIC CAPITAL LETTER ABKHASIAN HA
15831583-04A9 ; PVALID # CYRILLIC SMALL LETTER ABKHASIAN HA
15841584-04AA ; DISALLOWED # CYRILLIC CAPITAL LETTER ES WITH DESCENDER
15851585-04AB ; PVALID # CYRILLIC SMALL LETTER ES WITH DESCENDER
15861586-04AC ; DISALLOWED # CYRILLIC CAPITAL LETTER TE WITH DESCENDER
15871587-04AD ; PVALID # CYRILLIC SMALL LETTER TE WITH DESCENDER
15881588-04AE ; DISALLOWED # CYRILLIC CAPITAL LETTER STRAIGHT U
15891589-04AF ; PVALID # CYRILLIC SMALL LETTER STRAIGHT U
15901590-04B0 ; DISALLOWED # CYRILLIC CAPITAL LETTER STRAIGHT U WITH STRO
15911591-04B1 ; PVALID # CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
15921592-04B2 ; DISALLOWED # CYRILLIC CAPITAL LETTER HA WITH DESCENDER
15931593-04B3 ; PVALID # CYRILLIC SMALL LETTER HA WITH DESCENDER
15941594-04B4 ; DISALLOWED # CYRILLIC CAPITAL LIGATURE TE TSE
15951595-04B5 ; PVALID # CYRILLIC SMALL LIGATURE TE TSE
15961596-04B6 ; DISALLOWED # CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
15971597-04B7 ; PVALID # CYRILLIC SMALL LETTER CHE WITH DESCENDER
15981598-04B8 ; DISALLOWED # CYRILLIC CAPITAL LETTER CHE WITH VERTICAL ST
15991599-04B9 ; PVALID # CYRILLIC SMALL LETTER CHE WITH VERTICAL STRO
16001600-04BA ; DISALLOWED # CYRILLIC CAPITAL LETTER SHHA
16011601-04BB ; PVALID # CYRILLIC SMALL LETTER SHHA
16021602-04BC ; DISALLOWED # CYRILLIC CAPITAL LETTER ABKHASIAN CHE
16031603-04BD ; PVALID # CYRILLIC SMALL LETTER ABKHASIAN CHE
16041604-04BE ; DISALLOWED # CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH D
16051605-04BF ; PVALID # CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DES
16061606-04C0..04C1 ; DISALLOWED # CYRILLIC LETTER PALOCHKA..CYRILLIC CAPITAL L
16071607-04C2 ; PVALID # CYRILLIC SMALL LETTER ZHE WITH BREVE
16081608-04C3 ; DISALLOWED # CYRILLIC CAPITAL LETTER KA WITH HOOK
16091609-04C4 ; PVALID # CYRILLIC SMALL LETTER KA WITH HOOK
16101610-04C5 ; DISALLOWED # CYRILLIC CAPITAL LETTER EL WITH TAIL
16111611-04C6 ; PVALID # CYRILLIC SMALL LETTER EL WITH TAIL
16121612-04C7 ; DISALLOWED # CYRILLIC CAPITAL LETTER EN WITH HOOK
16131613-04C8 ; PVALID # CYRILLIC SMALL LETTER EN WITH HOOK
16141614-04C9 ; DISALLOWED # CYRILLIC CAPITAL LETTER EN WITH TAIL
16151615-04CA ; PVALID # CYRILLIC SMALL LETTER EN WITH TAIL
16161616-04CB ; DISALLOWED # CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
16171617-04CC ; PVALID # CYRILLIC SMALL LETTER KHAKASSIAN CHE
16181618-04CD ; DISALLOWED # CYRILLIC CAPITAL LETTER EM WITH TAIL
16191619-04CE..04CF ; PVALID # CYRILLIC SMALL LETTER EM WITH TAIL..CYRILLIC
16201620-04D0 ; DISALLOWED # CYRILLIC CAPITAL LETTER A WITH BREVE
16211621-04D1 ; PVALID # CYRILLIC SMALL LETTER A WITH BREVE
16221622-04D2 ; DISALLOWED # CYRILLIC CAPITAL LETTER A WITH DIAERESIS
16311631-04D3 ; PVALID # CYRILLIC SMALL LETTER A WITH DIAERESIS
16321632-04D4 ; DISALLOWED # CYRILLIC CAPITAL LIGATURE A IE
16331633-04D5 ; PVALID # CYRILLIC SMALL LIGATURE A IE
16341634-04D6 ; DISALLOWED # CYRILLIC CAPITAL LETTER IE WITH BREVE
16351635-04D7 ; PVALID # CYRILLIC SMALL LETTER IE WITH BREVE
16361636-04D8 ; DISALLOWED # CYRILLIC CAPITAL LETTER SCHWA
16371637-04D9 ; PVALID # CYRILLIC SMALL LETTER SCHWA
16381638-04DA ; DISALLOWED # CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
16391639-04DB ; PVALID # CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
16401640-04DC ; DISALLOWED # CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
16411641-04DD ; PVALID # CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
16421642-04DE ; DISALLOWED # CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
16431643-04DF ; PVALID # CYRILLIC SMALL LETTER ZE WITH DIAERESIS
16441644-04E0 ; DISALLOWED # CYRILLIC CAPITAL LETTER ABKHASIAN DZE
16451645-04E1 ; PVALID # CYRILLIC SMALL LETTER ABKHASIAN DZE
16461646-04E2 ; DISALLOWED # CYRILLIC CAPITAL LETTER I WITH MACRON
16471647-04E3 ; PVALID # CYRILLIC SMALL LETTER I WITH MACRON
16481648-04E4 ; DISALLOWED # CYRILLIC CAPITAL LETTER I WITH DIAERESIS
16491649-04E5 ; PVALID # CYRILLIC SMALL LETTER I WITH DIAERESIS
16501650-04E6 ; DISALLOWED # CYRILLIC CAPITAL LETTER O WITH DIAERESIS
16511651-04E7 ; PVALID # CYRILLIC SMALL LETTER O WITH DIAERESIS
16521652-04E8 ; DISALLOWED # CYRILLIC CAPITAL LETTER BARRED O
16531653-04E9 ; PVALID # CYRILLIC SMALL LETTER BARRED O
16541654-04EA ; DISALLOWED # CYRILLIC CAPITAL LETTER BARRED O WITH DIAERE
16551655-04EB ; PVALID # CYRILLIC SMALL LETTER BARRED O WITH DIAERESI
16561656-04EC ; DISALLOWED # CYRILLIC CAPITAL LETTER E WITH DIAERESIS
16571657-04ED ; PVALID # CYRILLIC SMALL LETTER E WITH DIAERESIS
16581658-04EE ; DISALLOWED # CYRILLIC CAPITAL LETTER U WITH MACRON
16591659-04EF ; PVALID # CYRILLIC SMALL LETTER U WITH MACRON
16601660-04F0 ; DISALLOWED # CYRILLIC CAPITAL LETTER U WITH DIAERESIS
16611661-04F1 ; PVALID # CYRILLIC SMALL LETTER U WITH DIAERESIS
16621662-04F2 ; DISALLOWED # CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
16631663-04F3 ; PVALID # CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
16641664-04F4 ; DISALLOWED # CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
16651665-04F5 ; PVALID # CYRILLIC SMALL LETTER CHE WITH DIAERESIS
16661666-04F6 ; DISALLOWED # CYRILLIC CAPITAL LETTER GHE WITH DESCENDER
16671667-04F7 ; PVALID # CYRILLIC SMALL LETTER GHE WITH DESCENDER
16681668-04F8 ; DISALLOWED # CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
16691669-04F9 ; PVALID # CYRILLIC SMALL LETTER YERU WITH DIAERESIS
16701670-04FA ; DISALLOWED # CYRILLIC CAPITAL LETTER GHE WITH STROKE AND
16711671-04FB ; PVALID # CYRILLIC SMALL LETTER GHE WITH STROKE AND HO
16721672-04FC ; DISALLOWED # CYRILLIC CAPITAL LETTER HA WITH HOOK
16731673-04FD ; PVALID # CYRILLIC SMALL LETTER HA WITH HOOK
16741674-04FE ; DISALLOWED # CYRILLIC CAPITAL LETTER HA WITH STROKE
16751675-04FF ; PVALID # CYRILLIC SMALL LETTER HA WITH STROKE
16761676-0500 ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI DE
16771677-0501 ; PVALID # CYRILLIC SMALL LETTER KOMI DE
16781678-0502 ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI DJE
16871687-0503 ; PVALID # CYRILLIC SMALL LETTER KOMI DJE
16881688-0504 ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI ZJE
16891689-0505 ; PVALID # CYRILLIC SMALL LETTER KOMI ZJE
16901690-0506 ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI DZJE
16911691-0507 ; PVALID # CYRILLIC SMALL LETTER KOMI DZJE
16921692-0508 ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI LJE
16931693-0509 ; PVALID # CYRILLIC SMALL LETTER KOMI LJE
16941694-050A ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI NJE
16951695-050B ; PVALID # CYRILLIC SMALL LETTER KOMI NJE
16961696-050C ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI SJE
16971697-050D ; PVALID # CYRILLIC SMALL LETTER KOMI SJE
16981698-050E ; DISALLOWED # CYRILLIC CAPITAL LETTER KOMI TJE
16991699-050F ; PVALID # CYRILLIC SMALL LETTER KOMI TJE
17001700-0510 ; DISALLOWED # CYRILLIC CAPITAL LETTER REVERSED ZE
17011701-0511 ; PVALID # CYRILLIC SMALL LETTER REVERSED ZE
17021702-0512 ; DISALLOWED # CYRILLIC CAPITAL LETTER EL WITH HOOK
17031703-0513 ; PVALID # CYRILLIC SMALL LETTER EL WITH HOOK
17041704-0514 ; DISALLOWED # CYRILLIC CAPITAL LETTER LHA
17051705-0515 ; PVALID # CYRILLIC SMALL LETTER LHA
17061706-0516 ; DISALLOWED # CYRILLIC CAPITAL LETTER RHA
17071707-0517 ; PVALID # CYRILLIC SMALL LETTER RHA
17081708-0518 ; DISALLOWED # CYRILLIC CAPITAL LETTER YAE
17091709-0519 ; PVALID # CYRILLIC SMALL LETTER YAE
17101710-051A ; DISALLOWED # CYRILLIC CAPITAL LETTER QA
17111711-051B ; PVALID # CYRILLIC SMALL LETTER QA
17121712-051C ; DISALLOWED # CYRILLIC CAPITAL LETTER WE
17131713-051D ; PVALID # CYRILLIC SMALL LETTER WE
17141714-051E ; DISALLOWED # CYRILLIC CAPITAL LETTER ALEUT KA
17151715-051F ; PVALID # CYRILLIC SMALL LETTER ALEUT KA
17161716-0520 ; DISALLOWED # CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK
17171717-0521 ; PVALID # CYRILLIC SMALL LETTER EL WITH MIDDLE HOOK
17181718-0522 ; DISALLOWED # CYRILLIC CAPITAL LETTER EN WITH MIDDLE HOOK
17191719-0523 ; PVALID # CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK
17201720-0524 ; DISALLOWED # CYRILLIC CAPITAL LETTER PE WITH DESCENDER
17211721-0525 ; PVALID # CYRILLIC SMALL LETTER PE WITH DESCENDER
17221722-0526..0530 ; UNASSIGNED # <reserved>..<reserved>
17231723-0531..0556 ; DISALLOWED # ARMENIAN CAPITAL LETTER AYB..ARMENIAN CAPITA
17241724-0557..0558 ; UNASSIGNED # <reserved>..<reserved>
17251725-0559 ; PVALID # ARMENIAN MODIFIER LETTER LEFT HALF RING
17261726-055A..055F ; DISALLOWED # ARMENIAN APOSTROPHE..ARMENIAN ABBREVIATION M
17271727-0560 ; UNASSIGNED # <reserved>
17281728-0561..0586 ; PVALID # ARMENIAN SMALL LETTER AYB..ARMENIAN SMALL LE
17291729-0587 ; DISALLOWED # ARMENIAN SMALL LIGATURE ECH YIWN
17301730-0588 ; UNASSIGNED # <reserved>
17311731-0589..058A ; DISALLOWED # ARMENIAN FULL STOP..ARMENIAN HYPHEN
17321732-058B..0590 ; UNASSIGNED # <reserved>..<reserved>
17331733-0591..05BD ; PVALID # HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
17341734-05BE ; DISALLOWED # HEBREW PUNCTUATION MAQAF
17431743-05BF ; PVALID # HEBREW POINT RAFE
17441744-05C0 ; DISALLOWED # HEBREW PUNCTUATION PASEQ
17451745-05C1..05C2 ; PVALID # HEBREW POINT SHIN DOT..HEBREW POINT SIN DOT
17461746-05C3 ; DISALLOWED # HEBREW PUNCTUATION SOF PASUQ
17471747-05C4..05C5 ; PVALID # HEBREW MARK UPPER DOT..HEBREW MARK LOWER DOT
17481748-05C6 ; DISALLOWED # HEBREW PUNCTUATION NUN HAFUKHA
17491749-05C7 ; PVALID # HEBREW POINT QAMATS QATAN
17501750-05C8..05CF ; UNASSIGNED # <reserved>..<reserved>
17511751-05D0..05EA ; PVALID # HEBREW LETTER ALEF..HEBREW LETTER TAV
17521752-05EB..05EF ; UNASSIGNED # <reserved>..<reserved>
17531753-05F0..05F2 ; PVALID # HEBREW LIGATURE YIDDISH DOUBLE VAV..HEBREW L
17541754-05F3..05F4 ; CONTEXTO # HEBREW PUNCTUATION GERESH..HEBREW PUNCTUATIO
17551755-05F5..05FF ; UNASSIGNED # <reserved>..<reserved>
17561756-0600..0603 ; DISALLOWED # ARABIC NUMBER SIGN..ARABIC SIGN SAFHA
17571757-0604..0605 ; UNASSIGNED # <reserved>..<reserved>
17581758-0606..060F ; DISALLOWED # ARABIC-INDIC CUBE ROOT..ARABIC SIGN MISRA
17591759-0610..061A ; PVALID # ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM..AR
17601760-061B ; DISALLOWED # ARABIC SEMICOLON
17611761-061C..061D ; UNASSIGNED # <reserved>..<reserved>
17621762-061E..061F ; DISALLOWED # ARABIC TRIPLE DOT PUNCTUATION MARK..ARABIC Q
17631763-0620 ; UNASSIGNED # <reserved>
17641764-0621..063F ; PVALID # ARABIC LETTER HAMZA..ARABIC LETTER FARSI YEH
17651765-0640 ; DISALLOWED # ARABIC TATWEEL
17661766-0641..065E ; PVALID # ARABIC LETTER FEH..ARABIC FATHA WITH TWO DOT
17671767-065F ; UNASSIGNED # <reserved>
17681768-0660..0669 ; CONTEXTO # ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT
17691769-066A..066D ; DISALLOWED # ARABIC PERCENT SIGN..ARABIC FIVE POINTED STA
17701770-066E..0674 ; PVALID # ARABIC LETTER DOTLESS BEH..ARABIC LETTER HIG
17711771-0675..0678 ; DISALLOWED # ARABIC LETTER HIGH HAMZA ALEF..ARABIC LETTER
17721772-0679..06D3 ; PVALID # ARABIC LETTER TTEH..ARABIC LETTER YEH BARREE
17731773-06D4 ; DISALLOWED # ARABIC FULL STOP
17741774-06D5..06DC ; PVALID # ARABIC LETTER AE..ARABIC SMALL HIGH SEEN
17751775-06DD..06DE ; DISALLOWED # ARABIC END OF AYAH..ARABIC START OF RUB EL H
17761776-06DF..06E8 ; PVALID # ARABIC SMALL HIGH ROUNDED ZERO..ARABIC SMALL
17771777-06E9 ; DISALLOWED # ARABIC PLACE OF SAJDAH
17781778-06EA..06EF ; PVALID # ARABIC EMPTY CENTRE LOW STOP..ARABIC LETTER
17791779-06F0..06F9 ; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED A
17801780-06FA..06FF ; PVALID # ARABIC LETTER SHEEN WITH DOT BELOW..ARABIC L
17811781-0700..070D ; DISALLOWED # SYRIAC END OF PARAGRAPH..SYRIAC HARKLEAN AST
17821782-070E ; UNASSIGNED # <reserved>
17831783-070F ; DISALLOWED # SYRIAC ABBREVIATION MARK
17841784-0710..074A ; PVALID # SYRIAC LETTER ALAPH..SYRIAC BARREKH
17851785-074B..074C ; UNASSIGNED # <reserved>..<reserved>
17861786-074D..07B1 ; PVALID # SYRIAC LETTER SOGDIAN ZHAIN..THAANA LETTER N
17871787-07B2..07BF ; UNASSIGNED # <reserved>..<reserved>
17881788-07C0..07F5 ; PVALID # NKO DIGIT ZERO..NKO LOW TONE APOSTROPHE
17891789-07F6..07FA ; DISALLOWED # NKO SYMBOL OO DENNEN..NKO LAJANYALAN
17901790-07FB..07FF ; UNASSIGNED # <reserved>..<reserved>
17991799-0800..082D ; PVALID # SAMARITAN LETTER ALAF..SAMARITAN MARK NEQUDA
18001800-082E..082F ; UNASSIGNED # <reserved>..<reserved>
18011801-0830..083E ; DISALLOWED # SAMARITAN PUNCTUATION NEQUDAA..SAMARITAN PUN
18021802-083F..08FF ; UNASSIGNED # <reserved>..<reserved>
18031803-0900..0939 ; PVALID # DEVANAGARI SIGN INVERTED CANDRABINDU..DEVANA
18041804-093A..093B ; UNASSIGNED # <reserved>..<reserved>
18051805-093C..094E ; PVALID # DEVANAGARI SIGN NUKTA..DEVANAGARI VOWEL SIGN
18061806-094F ; UNASSIGNED # <reserved>
18071807-0950..0955 ; PVALID # DEVANAGARI OM..DEVANAGARI VOWEL SIGN CANDRA
18081808-0956..0957 ; UNASSIGNED # <reserved>..<reserved>
18091809-0958..095F ; DISALLOWED # DEVANAGARI LETTER QA..DEVANAGARI LETTER YYA
18101810-0960..0963 ; PVALID # DEVANAGARI LETTER VOCALIC RR..DEVANAGARI VOW
18111811-0964..0965 ; DISALLOWED # DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
18121812-0966..096F ; PVALID # DEVANAGARI DIGIT ZERO..DEVANAGARI DIGIT NINE
18131813-0970 ; DISALLOWED # DEVANAGARI ABBREVIATION SIGN
18141814-0971..0972 ; PVALID # DEVANAGARI SIGN HIGH SPACING DOT..DEVANAGARI
18151815-0973..0978 ; UNASSIGNED # <reserved>..<reserved>
18161816-0979..097F ; PVALID # DEVANAGARI LETTER ZHA..DEVANAGARI LETTER BBA
18171817-0980 ; UNASSIGNED # <reserved>
18181818-0981..0983 ; PVALID # BENGALI SIGN CANDRABINDU..BENGALI SIGN VISAR
18191819-0984 ; UNASSIGNED # <reserved>
18201820-0985..098C ; PVALID # BENGALI LETTER A..BENGALI LETTER VOCALIC L
18211821-098D..098E ; UNASSIGNED # <reserved>..<reserved>
18221822-098F..0990 ; PVALID # BENGALI LETTER E..BENGALI LETTER AI
18231823-0991..0992 ; UNASSIGNED # <reserved>..<reserved>
18241824-0993..09A8 ; PVALID # BENGALI LETTER O..BENGALI LETTER NA
18251825-09A9 ; UNASSIGNED # <reserved>
18261826-09AA..09B0 ; PVALID # BENGALI LETTER PA..BENGALI LETTER RA
18271827-09B1 ; UNASSIGNED # <reserved>
18281828-09B2 ; PVALID # BENGALI LETTER LA
18291829-09B3..09B5 ; UNASSIGNED # <reserved>..<reserved>
18301830-09B6..09B9 ; PVALID # BENGALI LETTER SHA..BENGALI LETTER HA
18311831-09BA..09BB ; UNASSIGNED # <reserved>..<reserved>
18321832-09BC..09C4 ; PVALID # BENGALI SIGN NUKTA..BENGALI VOWEL SIGN VOCAL
18331833-09C5..09C6 ; UNASSIGNED # <reserved>..<reserved>
18341834-09C7..09C8 ; PVALID # BENGALI VOWEL SIGN E..BENGALI VOWEL SIGN AI
18351835-09C9..09CA ; UNASSIGNED # <reserved>..<reserved>
18361836-09CB..09CE ; PVALID # BENGALI VOWEL SIGN O..BENGALI LETTER KHANDA
18371837-09CF..09D6 ; UNASSIGNED # <reserved>..<reserved>
18381838-09D7 ; PVALID # BENGALI AU LENGTH MARK
18391839-09D8..09DB ; UNASSIGNED # <reserved>..<reserved>
18401840-09DC..09DD ; DISALLOWED # BENGALI LETTER RRA..BENGALI LETTER RHA
18411841-09DE ; UNASSIGNED # <reserved>
18421842-09DF ; DISALLOWED # BENGALI LETTER YYA
18431843-09E0..09E3 ; PVALID # BENGALI LETTER VOCALIC RR..BENGALI VOWEL SIG
18441844-09E4..09E5 ; UNASSIGNED # <reserved>..<reserved>
18451845-09E6..09F1 ; PVALID # BENGALI DIGIT ZERO..BENGALI LETTER RA WITH L
18461846-09F2..09FB ; DISALLOWED # BENGALI RUPEE MARK..BENGALI GANDA MARK
18551855-09FC..0A00 ; UNASSIGNED # <reserved>..<reserved>
18561856-0A01..0A03 ; PVALID # GURMUKHI SIGN ADAK BINDI..GURMUKHI SIGN VISA
18571857-0A04 ; UNASSIGNED # <reserved>
18581858-0A05..0A0A ; PVALID # GURMUKHI LETTER A..GURMUKHI LETTER UU
18591859-0A0B..0A0E ; UNASSIGNED # <reserved>..<reserved>
18601860-0A0F..0A10 ; PVALID # GURMUKHI LETTER EE..GURMUKHI LETTER AI
18611861-0A11..0A12 ; UNASSIGNED # <reserved>..<reserved>
18621862-0A13..0A28 ; PVALID # GURMUKHI LETTER OO..GURMUKHI LETTER NA
18631863-0A29 ; UNASSIGNED # <reserved>
18641864-0A2A..0A30 ; PVALID # GURMUKHI LETTER PA..GURMUKHI LETTER RA
18651865-0A31 ; UNASSIGNED # <reserved>
18661866-0A32 ; PVALID # GURMUKHI LETTER LA
18671867-0A33 ; DISALLOWED # GURMUKHI LETTER LLA
18681868-0A34 ; UNASSIGNED # <reserved>
18691869-0A35 ; PVALID # GURMUKHI LETTER VA
18701870-0A36 ; DISALLOWED # GURMUKHI LETTER SHA
18711871-0A37 ; UNASSIGNED # <reserved>
18721872-0A38..0A39 ; PVALID # GURMUKHI LETTER SA..GURMUKHI LETTER HA
18731873-0A3A..0A3B ; UNASSIGNED # <reserved>..<reserved>
18741874-0A3C ; PVALID # GURMUKHI SIGN NUKTA
18751875-0A3D ; UNASSIGNED # <reserved>
18761876-0A3E..0A42 ; PVALID # GURMUKHI VOWEL SIGN AA..GURMUKHI VOWEL SIGN
18771877-0A43..0A46 ; UNASSIGNED # <reserved>..<reserved>
18781878-0A47..0A48 ; PVALID # GURMUKHI VOWEL SIGN EE..GURMUKHI VOWEL SIGN
18791879-0A49..0A4A ; UNASSIGNED # <reserved>..<reserved>
18801880-0A4B..0A4D ; PVALID # GURMUKHI VOWEL SIGN OO..GURMUKHI SIGN VIRAMA
18811881-0A4E..0A50 ; UNASSIGNED # <reserved>..<reserved>
18821882-0A51 ; PVALID # GURMUKHI SIGN UDAAT
18831883-0A52..0A58 ; UNASSIGNED # <reserved>..<reserved>
18841884-0A59..0A5B ; DISALLOWED # GURMUKHI LETTER KHHA..GURMUKHI LETTER ZA
18851885-0A5C ; PVALID # GURMUKHI LETTER RRA
18861886-0A5D ; UNASSIGNED # <reserved>
18871887-0A5E ; DISALLOWED # GURMUKHI LETTER FA
18881888-0A5F..0A65 ; UNASSIGNED # <reserved>..<reserved>
18891889-0A66..0A75 ; PVALID # GURMUKHI DIGIT ZERO..GURMUKHI SIGN YAKASH
18901890-0A76..0A80 ; UNASSIGNED # <reserved>..<reserved>
18911891-0A81..0A83 ; PVALID # GUJARATI SIGN CANDRABINDU..GUJARATI SIGN VIS
18921892-0A84 ; UNASSIGNED # <reserved>
18931893-0A85..0A8D ; PVALID # GUJARATI LETTER A..GUJARATI VOWEL CANDRA E
18941894-0A8E ; UNASSIGNED # <reserved>
18951895-0A8F..0A91 ; PVALID # GUJARATI LETTER E..GUJARATI VOWEL CANDRA O
18961896-0A92 ; UNASSIGNED # <reserved>
18971897-0A93..0AA8 ; PVALID # GUJARATI LETTER O..GUJARATI LETTER NA
18981898-0AA9 ; UNASSIGNED # <reserved>
18991899-0AAA..0AB0 ; PVALID # GUJARATI LETTER PA..GUJARATI LETTER RA
19001900-0AB1 ; UNASSIGNED # <reserved>
19011901-0AB2..0AB3 ; PVALID # GUJARATI LETTER LA..GUJARATI LETTER LLA
19021902-0AB4 ; UNASSIGNED # <reserved>
19111911-0AB5..0AB9 ; PVALID # GUJARATI LETTER VA..GUJARATI LETTER HA
19121912-0ABA..0ABB ; UNASSIGNED # <reserved>..<reserved>
19131913-0ABC..0AC5 ; PVALID # GUJARATI SIGN NUKTA..GUJARATI VOWEL SIGN CAN
19141914-0AC6 ; UNASSIGNED # <reserved>
19151915-0AC7..0AC9 ; PVALID # GUJARATI VOWEL SIGN E..GUJARATI VOWEL SIGN C
19161916-0ACA ; UNASSIGNED # <reserved>
19171917-0ACB..0ACD ; PVALID # GUJARATI VOWEL SIGN O..GUJARATI SIGN VIRAMA
19181918-0ACE..0ACF ; UNASSIGNED # <reserved>..<reserved>
19191919-0AD0 ; PVALID # GUJARATI OM
19201920-0AD1..0ADF ; UNASSIGNED # <reserved>..<reserved>
19211921-0AE0..0AE3 ; PVALID # GUJARATI LETTER VOCALIC RR..GUJARATI VOWEL S
19221922-0AE4..0AE5 ; UNASSIGNED # <reserved>..<reserved>
19231923-0AE6..0AEF ; PVALID # GUJARATI DIGIT ZERO..GUJARATI DIGIT NINE
19241924-0AF0 ; UNASSIGNED # <reserved>
19251925-0AF1 ; DISALLOWED # GUJARATI RUPEE SIGN
19261926-0AF2..0B00 ; UNASSIGNED # <reserved>..<reserved>
19271927-0B01..0B03 ; PVALID # ORIYA SIGN CANDRABINDU..ORIYA SIGN VISARGA
19281928-0B04 ; UNASSIGNED # <reserved>
19291929-0B05..0B0C ; PVALID # ORIYA LETTER A..ORIYA LETTER VOCALIC L
19301930-0B0D..0B0E ; UNASSIGNED # <reserved>..<reserved>
19311931-0B0F..0B10 ; PVALID # ORIYA LETTER E..ORIYA LETTER AI
19321932-0B11..0B12 ; UNASSIGNED # <reserved>..<reserved>
19331933-0B13..0B28 ; PVALID # ORIYA LETTER O..ORIYA LETTER NA
19341934-0B29 ; UNASSIGNED # <reserved>
19351935-0B2A..0B30 ; PVALID # ORIYA LETTER PA..ORIYA LETTER RA
19361936-0B31 ; UNASSIGNED # <reserved>
19371937-0B32..0B33 ; PVALID # ORIYA LETTER LA..ORIYA LETTER LLA
19381938-0B34 ; UNASSIGNED # <reserved>
19391939-0B35..0B39 ; PVALID # ORIYA LETTER VA..ORIYA LETTER HA
19401940-0B3A..0B3B ; UNASSIGNED # <reserved>..<reserved>
19411941-0B3C..0B44 ; PVALID # ORIYA SIGN NUKTA..ORIYA VOWEL SIGN VOCALIC R
19421942-0B45..0B46 ; UNASSIGNED # <reserved>..<reserved>
19431943-0B47..0B48 ; PVALID # ORIYA VOWEL SIGN E..ORIYA VOWEL SIGN AI
19441944-0B49..0B4A ; UNASSIGNED # <reserved>..<reserved>
19451945-0B4B..0B4D ; PVALID # ORIYA VOWEL SIGN O..ORIYA SIGN VIRAMA
19461946-0B4E..0B55 ; UNASSIGNED # <reserved>..<reserved>
19471947-0B56..0B57 ; PVALID # ORIYA AI LENGTH MARK..ORIYA AU LENGTH MARK
19481948-0B58..0B5B ; UNASSIGNED # <reserved>..<reserved>
19491949-0B5C..0B5D ; DISALLOWED # ORIYA LETTER RRA..ORIYA LETTER RHA
19501950-0B5E ; UNASSIGNED # <reserved>
19511951-0B5F..0B63 ; PVALID # ORIYA LETTER YYA..ORIYA VOWEL SIGN VOCALIC L
19521952-0B64..0B65 ; UNASSIGNED # <reserved>..<reserved>
19531953-0B66..0B6F ; PVALID # ORIYA DIGIT ZERO..ORIYA DIGIT NINE
19541954-0B70 ; DISALLOWED # ORIYA ISSHAR
19551955-0B71 ; PVALID # ORIYA LETTER WA
19561956-0B72..0B81 ; UNASSIGNED # <reserved>..<reserved>
19571957-0B82..0B83 ; PVALID # TAMIL SIGN ANUSVARA..TAMIL SIGN VISARGA
19581958-0B84 ; UNASSIGNED # <reserved>
19671967-0B85..0B8A ; PVALID # TAMIL LETTER A..TAMIL LETTER UU
19681968-0B8B..0B8D ; UNASSIGNED # <reserved>..<reserved>
19691969-0B8E..0B90 ; PVALID # TAMIL LETTER E..TAMIL LETTER AI
19701970-0B91 ; UNASSIGNED # <reserved>
19711971-0B92..0B95 ; PVALID # TAMIL LETTER O..TAMIL LETTER KA
19721972-0B96..0B98 ; UNASSIGNED # <reserved>..<reserved>
19731973-0B99..0B9A ; PVALID # TAMIL LETTER NGA..TAMIL LETTER CA
19741974-0B9B ; UNASSIGNED # <reserved>
19751975-0B9C ; PVALID # TAMIL LETTER JA
19761976-0B9D ; UNASSIGNED # <reserved>
19771977-0B9E..0B9F ; PVALID # TAMIL LETTER NYA..TAMIL LETTER TTA
19781978-0BA0..0BA2 ; UNASSIGNED # <reserved>..<reserved>
19791979-0BA3..0BA4 ; PVALID # TAMIL LETTER NNA..TAMIL LETTER TA
19801980-0BA5..0BA7 ; UNASSIGNED # <reserved>..<reserved>
19811981-0BA8..0BAA ; PVALID # TAMIL LETTER NA..TAMIL LETTER PA
19821982-0BAB..0BAD ; UNASSIGNED # <reserved>..<reserved>
19831983-0BAE..0BB9 ; PVALID # TAMIL LETTER MA..TAMIL LETTER HA
19841984-0BBA..0BBD ; UNASSIGNED # <reserved>..<reserved>
19851985-0BBE..0BC2 ; PVALID # TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN UU
19861986-0BC3..0BC5 ; UNASSIGNED # <reserved>..<reserved>
19871987-0BC6..0BC8 ; PVALID # TAMIL VOWEL SIGN E..TAMIL VOWEL SIGN AI
19881988-0BC9 ; UNASSIGNED # <reserved>
19891989-0BCA..0BCD ; PVALID # TAMIL VOWEL SIGN O..TAMIL SIGN VIRAMA
19901990-0BCE..0BCF ; UNASSIGNED # <reserved>..<reserved>
19911991-0BD0 ; PVALID # TAMIL OM
19921992-0BD1..0BD6 ; UNASSIGNED # <reserved>..<reserved>
19931993-0BD7 ; PVALID # TAMIL AU LENGTH MARK
19941994-0BD8..0BE5 ; UNASSIGNED # <reserved>..<reserved>
19951995-0BE6..0BEF ; PVALID # TAMIL DIGIT ZERO..TAMIL DIGIT NINE
19961996-0BF0..0BFA ; DISALLOWED # TAMIL NUMBER TEN..TAMIL NUMBER SIGN
19971997-0BFB..0C00 ; UNASSIGNED # <reserved>..<reserved>
19981998-0C01..0C03 ; PVALID # TELUGU SIGN CANDRABINDU..TELUGU SIGN VISARGA
19991999-0C04 ; UNASSIGNED # <reserved>
20002000-0C05..0C0C ; PVALID # TELUGU LETTER A..TELUGU LETTER VOCALIC L
20012001-0C0D ; UNASSIGNED # <reserved>
20022002-0C0E..0C10 ; PVALID # TELUGU LETTER E..TELUGU LETTER AI
20032003-0C11 ; UNASSIGNED # <reserved>
20042004-0C12..0C28 ; PVALID # TELUGU LETTER O..TELUGU LETTER NA
20052005-0C29 ; UNASSIGNED # <reserved>
20062006-0C2A..0C33 ; PVALID # TELUGU LETTER PA..TELUGU LETTER LLA
20072007-0C34 ; UNASSIGNED # <reserved>
20082008-0C35..0C39 ; PVALID # TELUGU LETTER VA..TELUGU LETTER HA
20092009-0C3A..0C3C ; UNASSIGNED # <reserved>..<reserved>
20102010-0C3D..0C44 ; PVALID # TELUGU SIGN AVAGRAHA..TELUGU VOWEL SIGN VOCA
20112011-0C45 ; UNASSIGNED # <reserved>
20122012-0C46..0C48 ; PVALID # TELUGU VOWEL SIGN E..TELUGU VOWEL SIGN AI
20132013-0C49 ; UNASSIGNED # <reserved>
20142014-0C4A..0C4D ; PVALID # TELUGU VOWEL SIGN O..TELUGU SIGN VIRAMA
20232023-0C4E..0C54 ; UNASSIGNED # <reserved>..<reserved>
20242024-0C55..0C56 ; PVALID # TELUGU LENGTH MARK..TELUGU AI LENGTH MARK
20252025-0C57 ; UNASSIGNED # <reserved>
20262026-0C58..0C59 ; PVALID # TELUGU LETTER TSA..TELUGU LETTER DZA
20272027-0C5A..0C5F ; UNASSIGNED # <reserved>..<reserved>
20282028-0C60..0C63 ; PVALID # TELUGU LETTER VOCALIC RR..TELUGU VOWEL SIGN
20292029-0C64..0C65 ; UNASSIGNED # <reserved>..<reserved>
20302030-0C66..0C6F ; PVALID # TELUGU DIGIT ZERO..TELUGU DIGIT NINE
20312031-0C70..0C77 ; UNASSIGNED # <reserved>..<reserved>
20322032-0C78..0C7F ; DISALLOWED # TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF
20332033-0C80..0C81 ; UNASSIGNED # <reserved>..<reserved>
20342034-0C82..0C83 ; PVALID # KANNADA SIGN ANUSVARA..KANNADA SIGN VISARGA
20352035-0C84 ; UNASSIGNED # <reserved>
20362036-0C85..0C8C ; PVALID # KANNADA LETTER A..KANNADA LETTER VOCALIC L
20372037-0C8D ; UNASSIGNED # <reserved>
20382038-0C8E..0C90 ; PVALID # KANNADA LETTER E..KANNADA LETTER AI
20392039-0C91 ; UNASSIGNED # <reserved>
20402040-0C92..0CA8 ; PVALID # KANNADA LETTER O..KANNADA LETTER NA
20412041-0CA9 ; UNASSIGNED # <reserved>
20422042-0CAA..0CB3 ; PVALID # KANNADA LETTER PA..KANNADA LETTER LLA
20432043-0CB4 ; UNASSIGNED # <reserved>
20442044-0CB5..0CB9 ; PVALID # KANNADA LETTER VA..KANNADA LETTER HA
20452045-0CBA..0CBB ; UNASSIGNED # <reserved>..<reserved>
20462046-0CBC..0CC4 ; PVALID # KANNADA SIGN NUKTA..KANNADA VOWEL SIGN VOCAL
20472047-0CC5 ; UNASSIGNED # <reserved>
20482048-0CC6..0CC8 ; PVALID # KANNADA VOWEL SIGN E..KANNADA VOWEL SIGN AI
20492049-0CC9 ; UNASSIGNED # <reserved>
20502050-0CCA..0CCD ; PVALID # KANNADA VOWEL SIGN O..KANNADA SIGN VIRAMA
20512051-0CCE..0CD4 ; UNASSIGNED # <reserved>..<reserved>
20522052-0CD5..0CD6 ; PVALID # KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
20532053-0CD7..0CDD ; UNASSIGNED # <reserved>..<reserved>
20542054-0CDE ; PVALID # KANNADA LETTER FA
20552055-0CDF ; UNASSIGNED # <reserved>
20562056-0CE0..0CE3 ; PVALID # KANNADA LETTER VOCALIC RR..KANNADA VOWEL SIG
20572057-0CE4..0CE5 ; UNASSIGNED # <reserved>..<reserved>
20582058-0CE6..0CEF ; PVALID # KANNADA DIGIT ZERO..KANNADA DIGIT NINE
20592059-0CF0 ; UNASSIGNED # <reserved>
20602060-0CF1..0CF2 ; DISALLOWED # KANNADA SIGN JIHVAMULIYA..KANNADA SIGN UPADH
20612061-0CF3..0D01 ; UNASSIGNED # <reserved>..<reserved>
20622062-0D02..0D03 ; PVALID # MALAYALAM SIGN ANUSVARA..MALAYALAM SIGN VISA
20632063-0D04 ; UNASSIGNED # <reserved>
20642064-0D05..0D0C ; PVALID # MALAYALAM LETTER A..MALAYALAM LETTER VOCALIC
20652065-0D0D ; UNASSIGNED # <reserved>
20662066-0D0E..0D10 ; PVALID # MALAYALAM LETTER E..MALAYALAM LETTER AI
20672067-0D11 ; UNASSIGNED # <reserved>
20682068-0D12..0D28 ; PVALID # MALAYALAM LETTER O..MALAYALAM LETTER NA
20692069-0D29 ; UNASSIGNED # <reserved>
20702070-0D2A..0D39 ; PVALID # MALAYALAM LETTER PA..MALAYALAM LETTER HA
20792079-0D3A..0D3C ; UNASSIGNED # <reserved>..<reserved>
20802080-0D3D..0D44 ; PVALID # MALAYALAM SIGN AVAGRAHA..MALAYALAM VOWEL SIG
20812081-0D45 ; UNASSIGNED # <reserved>
20822082-0D46..0D48 ; PVALID # MALAYALAM VOWEL SIGN E..MALAYALAM VOWEL SIGN
20832083-0D49 ; UNASSIGNED # <reserved>
20842084-0D4A..0D4D ; PVALID # MALAYALAM VOWEL SIGN O..MALAYALAM SIGN VIRAM
20852085-0D4E..0D56 ; UNASSIGNED # <reserved>..<reserved>
20862086-0D57 ; PVALID # MALAYALAM AU LENGTH MARK
20872087-0D58..0D5F ; UNASSIGNED # <reserved>..<reserved>
20882088-0D60..0D63 ; PVALID # MALAYALAM LETTER VOCALIC RR..MALAYALAM VOWEL
20892089-0D64..0D65 ; UNASSIGNED # <reserved>..<reserved>
20902090-0D66..0D6F ; PVALID # MALAYALAM DIGIT ZERO..MALAYALAM DIGIT NINE
20912091-0D70..0D75 ; DISALLOWED # MALAYALAM NUMBER TEN..MALAYALAM FRACTION THR
20922092-0D76..0D78 ; UNASSIGNED # <reserved>..<reserved>
20932093-0D79 ; DISALLOWED # MALAYALAM DATE MARK
20942094-0D7A..0D7F ; PVALID # MALAYALAM LETTER CHILLU NN..MALAYALAM LETTER
20952095-0D80..0D81 ; UNASSIGNED # <reserved>..<reserved>
20962096-0D82..0D83 ; PVALID # SINHALA SIGN ANUSVARAYA..SINHALA SIGN VISARG
20972097-0D84 ; UNASSIGNED # <reserved>
20982098-0D85..0D96 ; PVALID # SINHALA LETTER AYANNA..SINHALA LETTER AUYANN
20992099-0D97..0D99 ; UNASSIGNED # <reserved>..<reserved>
21002100-0D9A..0DB1 ; PVALID # SINHALA LETTER ALPAPRAANA KAYANNA..SINHALA L
21012101-0DB2 ; UNASSIGNED # <reserved>
21022102-0DB3..0DBB ; PVALID # SINHALA LETTER SANYAKA DAYANNA..SINHALA LETT
21032103-0DBC ; UNASSIGNED # <reserved>
21042104-0DBD ; PVALID # SINHALA LETTER DANTAJA LAYANNA
21052105-0DBE..0DBF ; UNASSIGNED # <reserved>..<reserved>
21062106-0DC0..0DC6 ; PVALID # SINHALA LETTER VAYANNA..SINHALA LETTER FAYAN
21072107-0DC7..0DC9 ; UNASSIGNED # <reserved>..<reserved>
21082108-0DCA ; PVALID # SINHALA SIGN AL-LAKUNA
21092109-0DCB..0DCE ; UNASSIGNED # <reserved>..<reserved>
21102110-0DCF..0DD4 ; PVALID # SINHALA VOWEL SIGN AELA-PILLA..SINHALA VOWEL
21112111-0DD5 ; UNASSIGNED # <reserved>
21122112-0DD6 ; PVALID # SINHALA VOWEL SIGN DIGA PAA-PILLA
21132113-0DD7 ; UNASSIGNED # <reserved>
21142114-0DD8..0DDF ; PVALID # SINHALA VOWEL SIGN GAETTA-PILLA..SINHALA VOW
21152115-0DE0..0DF1 ; UNASSIGNED # <reserved>..<reserved>
21162116-0DF2..0DF3 ; PVALID # SINHALA VOWEL SIGN DIGA GAETTA-PILLA..SINHAL
21172117-0DF4 ; DISALLOWED # SINHALA PUNCTUATION KUNDDALIYA
21182118-0DF5..0E00 ; UNASSIGNED # <reserved>..<reserved>
21192119-0E01..0E32 ; PVALID # THAI CHARACTER KO KAI..THAI CHARACTER SARA A
21202120-0E33 ; DISALLOWED # THAI CHARACTER SARA AM
21212121-0E34..0E3A ; PVALID # THAI CHARACTER SARA I..THAI CHARACTER PHINTH
21222122-0E3B..0E3E ; UNASSIGNED # <reserved>..<reserved>
21232123-0E3F ; DISALLOWED # THAI CURRENCY SYMBOL BAHT
21242124-0E40..0E4E ; PVALID # THAI CHARACTER SARA E..THAI CHARACTER YAMAKK
21252125-0E4F ; DISALLOWED # THAI CHARACTER FONGMAN
21262126-0E50..0E59 ; PVALID # THAI DIGIT ZERO..THAI DIGIT NINE
21352135-0E5A..0E5B ; DISALLOWED # THAI CHARACTER ANGKHANKHU..THAI CHARACTER KH
21362136-0E5C..0E80 ; UNASSIGNED # <reserved>..<reserved>
21372137-0E81..0E82 ; PVALID # LAO LETTER KO..LAO LETTER KHO SUNG
21382138-0E83 ; UNASSIGNED # <reserved>
21392139-0E84 ; PVALID # LAO LETTER KHO TAM
21402140-0E85..0E86 ; UNASSIGNED # <reserved>..<reserved>
21412141-0E87..0E88 ; PVALID # LAO LETTER NGO..LAO LETTER CO
21422142-0E89 ; UNASSIGNED # <reserved>
21432143-0E8A ; PVALID # LAO LETTER SO TAM
21442144-0E8B..0E8C ; UNASSIGNED # <reserved>..<reserved>
21452145-0E8D ; PVALID # LAO LETTER NYO
21462146-0E8E..0E93 ; UNASSIGNED # <reserved>..<reserved>
21472147-0E94..0E97 ; PVALID # LAO LETTER DO..LAO LETTER THO TAM
21482148-0E98 ; UNASSIGNED # <reserved>
21492149-0E99..0E9F ; PVALID # LAO LETTER NO..LAO LETTER FO SUNG
21502150-0EA0 ; UNASSIGNED # <reserved>
21512151-0EA1..0EA3 ; PVALID # LAO LETTER MO..LAO LETTER LO LING
21522152-0EA4 ; UNASSIGNED # <reserved>
21532153-0EA5 ; PVALID # LAO LETTER LO LOOT
21542154-0EA6 ; UNASSIGNED # <reserved>
21552155-0EA7 ; PVALID # LAO LETTER WO
21562156-0EA8..0EA9 ; UNASSIGNED # <reserved>..<reserved>
21572157-0EAA..0EAB ; PVALID # LAO LETTER SO SUNG..LAO LETTER HO SUNG
21582158-0EAC ; UNASSIGNED # <reserved>
21592159-0EAD..0EB2 ; PVALID # LAO LETTER O..LAO VOWEL SIGN AA
21602160-0EB3 ; DISALLOWED # LAO VOWEL SIGN AM
21612161-0EB4..0EB9 ; PVALID # LAO VOWEL SIGN I..LAO VOWEL SIGN UU
21622162-0EBA ; UNASSIGNED # <reserved>
21632163-0EBB..0EBD ; PVALID # LAO VOWEL SIGN MAI KON..LAO SEMIVOWEL SIGN N
21642164-0EBE..0EBF ; UNASSIGNED # <reserved>..<reserved>
21652165-0EC0..0EC4 ; PVALID # LAO VOWEL SIGN E..LAO VOWEL SIGN AI
21662166-0EC5 ; UNASSIGNED # <reserved>
21672167-0EC6 ; PVALID # LAO KO LA
21682168-0EC7 ; UNASSIGNED # <reserved>
21692169-0EC8..0ECD ; PVALID # LAO TONE MAI EK..LAO NIGGAHITA
21702170-0ECE..0ECF ; UNASSIGNED # <reserved>..<reserved>
21712171-0ED0..0ED9 ; PVALID # LAO DIGIT ZERO..LAO DIGIT NINE
21722172-0EDA..0EDB ; UNASSIGNED # <reserved>..<reserved>
21732173-0EDC..0EDD ; DISALLOWED # LAO HO NO..LAO HO MO
21742174-0EDE..0EFF ; UNASSIGNED # <reserved>..<reserved>
21752175-0F00 ; PVALID # TIBETAN SYLLABLE OM
21762176-0F01..0F0A ; DISALLOWED # TIBETAN MARK GTER YIG MGO TRUNCATED A..TIBET
21772177-0F0B ; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG
21782178-0F0C..0F17 ; DISALLOWED # TIBETAN MARK DELIMITER TSHEG BSTAR..TIBETAN
21792179-0F18..0F19 ; PVALID # TIBETAN ASTROLOGICAL SIGN -KHYUD PA..TIBETAN
21802180-0F1A..0F1F ; DISALLOWED # TIBETAN SIGN RDEL DKAR GCIG..TIBETAN SIGN RD
21812181-0F20..0F29 ; PVALID # TIBETAN DIGIT ZERO..TIBETAN DIGIT NINE
21822182-0F2A..0F34 ; DISALLOWED # TIBETAN DIGIT HALF ONE..TIBETAN MARK BSDUS R
21912191-0F35 ; PVALID # TIBETAN MARK NGAS BZUNG NYI ZLA
21922192-0F36 ; DISALLOWED # TIBETAN MARK CARET -DZUD RTAGS BZHI MIG CAN
21932193-0F37 ; PVALID # TIBETAN MARK NGAS BZUNG SGOR RTAGS
21942194-0F38 ; DISALLOWED # TIBETAN MARK CHE MGO
21952195-0F39 ; PVALID # TIBETAN MARK TSA -PHRU
21962196-0F3A..0F3D ; DISALLOWED # TIBETAN MARK GUG RTAGS GYON..TIBETAN MARK AN
21972197-0F3E..0F42 ; PVALID # TIBETAN SIGN YAR TSHES..TIBETAN LETTER GA
21982198-0F43 ; DISALLOWED # TIBETAN LETTER GHA
21992199-0F44..0F47 ; PVALID # TIBETAN LETTER NGA..TIBETAN LETTER JA
22002200-0F48 ; UNASSIGNED # <reserved>
22012201-0F49..0F4C ; PVALID # TIBETAN LETTER NYA..TIBETAN LETTER DDA
22022202-0F4D ; DISALLOWED # TIBETAN LETTER DDHA
22032203-0F4E..0F51 ; PVALID # TIBETAN LETTER NNA..TIBETAN LETTER DA
22042204-0F52 ; DISALLOWED # TIBETAN LETTER DHA
22052205-0F53..0F56 ; PVALID # TIBETAN LETTER NA..TIBETAN LETTER BA
22062206-0F57 ; DISALLOWED # TIBETAN LETTER BHA
22072207-0F58..0F5B ; PVALID # TIBETAN LETTER MA..TIBETAN LETTER DZA
22082208-0F5C ; DISALLOWED # TIBETAN LETTER DZHA
22092209-0F5D..0F68 ; PVALID # TIBETAN LETTER WA..TIBETAN LETTER A
22102210-0F69 ; DISALLOWED # TIBETAN LETTER KSSA
22112211-0F6A..0F6C ; PVALID # TIBETAN LETTER FIXED-FORM RA..TIBETAN LETTER
22122212-0F6D..0F70 ; UNASSIGNED # <reserved>..<reserved>
22132213-0F71..0F72 ; PVALID # TIBETAN VOWEL SIGN AA..TIBETAN VOWEL SIGN I
22142214-0F73 ; DISALLOWED # TIBETAN VOWEL SIGN II
22152215-0F74 ; PVALID # TIBETAN VOWEL SIGN U
22162216-0F75..0F79 ; DISALLOWED # TIBETAN VOWEL SIGN UU..TIBETAN VOWEL SIGN VO
22172217-0F7A..0F80 ; PVALID # TIBETAN VOWEL SIGN E..TIBETAN VOWEL SIGN REV
22182218-0F81 ; DISALLOWED # TIBETAN VOWEL SIGN REVERSED II
22192219-0F82..0F84 ; PVALID # TIBETAN SIGN NYI ZLA NAA DA..TIBETAN MARK HA
22202220-0F85 ; DISALLOWED # TIBETAN MARK PALUTA
22212221-0F86..0F8B ; PVALID # TIBETAN SIGN LCI RTAGS..TIBETAN SIGN GRU MED
22222222-0F8C..0F8F ; UNASSIGNED # <reserved>..<reserved>
22232223-0F90..0F92 ; PVALID # TIBETAN SUBJOINED LETTER KA..TIBETAN SUBJOIN
22242224-0F93 ; DISALLOWED # TIBETAN SUBJOINED LETTER GHA
22252225-0F94..0F97 ; PVALID # TIBETAN SUBJOINED LETTER NGA..TIBETAN SUBJOI
22262226-0F98 ; UNASSIGNED # <reserved>
22272227-0F99..0F9C ; PVALID # TIBETAN SUBJOINED LETTER NYA..TIBETAN SUBJOI
22282228-0F9D ; DISALLOWED # TIBETAN SUBJOINED LETTER DDHA
22292229-0F9E..0FA1 ; PVALID # TIBETAN SUBJOINED LETTER NNA..TIBETAN SUBJOI
22302230-0FA2 ; DISALLOWED # TIBETAN SUBJOINED LETTER DHA
22312231-0FA3..0FA6 ; PVALID # TIBETAN SUBJOINED LETTER NA..TIBETAN SUBJOIN
22322232-0FA7 ; DISALLOWED # TIBETAN SUBJOINED LETTER BHA
22332233-0FA8..0FAB ; PVALID # TIBETAN SUBJOINED LETTER MA..TIBETAN SUBJOIN
22342234-0FAC ; DISALLOWED # TIBETAN SUBJOINED LETTER DZHA
22352235-0FAD..0FB8 ; PVALID # TIBETAN SUBJOINED LETTER WA..TIBETAN SUBJOIN
22362236-0FB9 ; DISALLOWED # TIBETAN SUBJOINED LETTER KSSA
22372237-0FBA..0FBC ; PVALID # TIBETAN SUBJOINED LETTER FIXED-FORM WA..TIBE
22382238-0FBD ; UNASSIGNED # <reserved>
22472247-0FBE..0FC5 ; DISALLOWED # TIBETAN KU RU KHA..TIBETAN SYMBOL RDO RJE
22482248-0FC6 ; PVALID # TIBETAN SYMBOL PADMA GDAN
22492249-0FC7..0FCC ; DISALLOWED # TIBETAN SYMBOL RDO RJE RGYA GRAM..TIBETAN SY
22502250-0FCD ; UNASSIGNED # <reserved>
22512251-0FCE..0FD8 ; DISALLOWED # TIBETAN SIGN RDEL NAG RDEL DKAR..LEFT-FACING
22522252-0FD9..0FFF ; UNASSIGNED # <reserved>..<reserved>
22532253-1000..1049 ; PVALID # MYANMAR LETTER KA..MYANMAR DIGIT NINE
22542254-104A..104F ; DISALLOWED # MYANMAR SIGN LITTLE SECTION..MYANMAR SYMBOL
22552255-1050..109D ; PVALID # MYANMAR LETTER SHA..MYANMAR VOWEL SIGN AITON
22562256-109E..10C5 ; DISALLOWED # MYANMAR SYMBOL SHAN ONE..GEORGIAN CAPITAL LE
22572257-10C6..10CF ; UNASSIGNED # <reserved>..<reserved>
22582258-10D0..10FA ; PVALID # GEORGIAN LETTER AN..GEORGIAN LETTER AIN
22592259-10FB..10FC ; DISALLOWED # GEORGIAN PARAGRAPH SEPARATOR..MODIFIER LETTE
22602260-10FD..10FF ; UNASSIGNED # <reserved>..<reserved>
22612261-1100..11FF ; DISALLOWED # HANGUL CHOSEONG KIYEOK..HANGUL JONGSEONG SSA
22622262-1200..1248 ; PVALID # ETHIOPIC SYLLABLE HA..ETHIOPIC SYLLABLE QWA
22632263-1249 ; UNASSIGNED # <reserved>
22642264-124A..124D ; PVALID # ETHIOPIC SYLLABLE QWI..ETHIOPIC SYLLABLE QWE
22652265-124E..124F ; UNASSIGNED # <reserved>..<reserved>
22662266-1250..1256 ; PVALID # ETHIOPIC SYLLABLE QHA..ETHIOPIC SYLLABLE QHO
22672267-1257 ; UNASSIGNED # <reserved>
22682268-1258 ; PVALID # ETHIOPIC SYLLABLE QHWA
22692269-1259 ; UNASSIGNED # <reserved>
22702270-125A..125D ; PVALID # ETHIOPIC SYLLABLE QHWI..ETHIOPIC SYLLABLE QH
22712271-125E..125F ; UNASSIGNED # <reserved>..<reserved>
22722272-1260..1288 ; PVALID # ETHIOPIC SYLLABLE BA..ETHIOPIC SYLLABLE XWA
22732273-1289 ; UNASSIGNED # <reserved>
22742274-128A..128D ; PVALID # ETHIOPIC SYLLABLE XWI..ETHIOPIC SYLLABLE XWE
22752275-128E..128F ; UNASSIGNED # <reserved>..<reserved>
22762276-1290..12B0 ; PVALID # ETHIOPIC SYLLABLE NA..ETHIOPIC SYLLABLE KWA
22772277-12B1 ; UNASSIGNED # <reserved>
22782278-12B2..12B5 ; PVALID # ETHIOPIC SYLLABLE KWI..ETHIOPIC SYLLABLE KWE
22792279-12B6..12B7 ; UNASSIGNED # <reserved>..<reserved>
22802280-12B8..12BE ; PVALID # ETHIOPIC SYLLABLE KXA..ETHIOPIC SYLLABLE KXO
22812281-12BF ; UNASSIGNED # <reserved>
22822282-12C0 ; PVALID # ETHIOPIC SYLLABLE KXWA
22832283-12C1 ; UNASSIGNED # <reserved>
22842284-12C2..12C5 ; PVALID # ETHIOPIC SYLLABLE KXWI..ETHIOPIC SYLLABLE KX
22852285-12C6..12C7 ; UNASSIGNED # <reserved>..<reserved>
22862286-12C8..12D6 ; PVALID # ETHIOPIC SYLLABLE WA..ETHIOPIC SYLLABLE PHAR
22872287-12D7 ; UNASSIGNED # <reserved>
22882288-12D8..1310 ; PVALID # ETHIOPIC SYLLABLE ZA..ETHIOPIC SYLLABLE GWA
22892289-1311 ; UNASSIGNED # <reserved>
22902290-1312..1315 ; PVALID # ETHIOPIC SYLLABLE GWI..ETHIOPIC SYLLABLE GWE
22912291-1316..1317 ; UNASSIGNED # <reserved>..<reserved>
22922292-1318..135A ; PVALID # ETHIOPIC SYLLABLE GGA..ETHIOPIC SYLLABLE FYA
22932293-135B..135E ; UNASSIGNED # <reserved>..<reserved>
22942294-135F ; PVALID # ETHIOPIC COMBINING GEMINATION MARK
23032303-1360..137C ; DISALLOWED # ETHIOPIC SECTION MARK..ETHIOPIC NUMBER TEN T
23042304-137D..137F ; UNASSIGNED # <reserved>..<reserved>
23052305-1380..138F ; PVALID # ETHIOPIC SYLLABLE SEBATBEIT MWA..ETHIOPIC SY
23062306-1390..1399 ; DISALLOWED # ETHIOPIC TONAL MARK YIZET..ETHIOPIC TONAL MA
23072307-139A..139F ; UNASSIGNED # <reserved>..<reserved>
23082308-13A0..13F4 ; PVALID # CHEROKEE LETTER A..CHEROKEE LETTER YV
23092309-13F5..13FF ; UNASSIGNED # <reserved>..<reserved>
23102310-1400 ; DISALLOWED # CANADIAN SYLLABICS HYPHEN
23112311-1401..166C ; PVALID # CANADIAN SYLLABICS E..CANADIAN SYLLABICS CAR
23122312-166D..166E ; DISALLOWED # CANADIAN SYLLABICS CHI SIGN..CANADIAN SYLLAB
23132313-166F..167F ; PVALID # CANADIAN SYLLABICS QAI..CANADIAN SYLLABICS B
23142314-1680 ; DISALLOWED # OGHAM SPACE MARK
23152315-1681..169A ; PVALID # OGHAM LETTER BEITH..OGHAM LETTER PEITH
23162316-169B..169C ; DISALLOWED # OGHAM FEATHER MARK..OGHAM REVERSED FEATHER M
23172317-169D..169F ; UNASSIGNED # <reserved>..<reserved>
23182318-16A0..16EA ; PVALID # RUNIC LETTER FEHU FEOH FE F..RUNIC LETTER X
23192319-16EB..16F0 ; DISALLOWED # RUNIC SINGLE PUNCTUATION..RUNIC BELGTHOR SYM
23202320-16F1..16FF ; UNASSIGNED # <reserved>..<reserved>
23212321-1700..170C ; PVALID # TAGALOG LETTER A..TAGALOG LETTER YA
23222322-170D ; UNASSIGNED # <reserved>
23232323-170E..1714 ; PVALID # TAGALOG LETTER LA..TAGALOG SIGN VIRAMA
23242324-1715..171F ; UNASSIGNED # <reserved>..<reserved>
23252325-1720..1734 ; PVALID # HANUNOO LETTER A..HANUNOO SIGN PAMUDPOD
23262326-1735..1736 ; DISALLOWED # PHILIPPINE SINGLE PUNCTUATION..PHILIPPINE DO
23272327-1737..173F ; UNASSIGNED # <reserved>..<reserved>
23282328-1740..1753 ; PVALID # BUHID LETTER A..BUHID VOWEL SIGN U
23292329-1754..175F ; UNASSIGNED # <reserved>..<reserved>
23302330-1760..176C ; PVALID # TAGBANWA LETTER A..TAGBANWA LETTER YA
23312331-176D ; UNASSIGNED # <reserved>
23322332-176E..1770 ; PVALID # TAGBANWA LETTER LA..TAGBANWA LETTER SA
23332333-1771 ; UNASSIGNED # <reserved>
23342334-1772..1773 ; PVALID # TAGBANWA VOWEL SIGN I..TAGBANWA VOWEL SIGN U
23352335-1774..177F ; UNASSIGNED # <reserved>..<reserved>
23362336-1780..17B3 ; PVALID # KHMER LETTER KA..KHMER INDEPENDENT VOWEL QAU
23372337-17B4..17B5 ; DISALLOWED # KHMER VOWEL INHERENT AQ..KHMER VOWEL INHEREN
23382338-17B6..17D3 ; PVALID # KHMER VOWEL SIGN AA..KHMER SIGN BATHAMASAT
23392339-17D4..17D6 ; DISALLOWED # KHMER SIGN KHAN..KHMER SIGN CAMNUC PII KUUH
23402340-17D7 ; PVALID # KHMER SIGN LEK TOO
23412341-17D8..17DB ; DISALLOWED # KHMER SIGN BEYYAL..KHMER CURRENCY SYMBOL RIE
23422342-17DC..17DD ; PVALID # KHMER SIGN AVAKRAHASANYA..KHMER SIGN ATTHACA
23432343-17DE..17DF ; UNASSIGNED # <reserved>..<reserved>
23442344-17E0..17E9 ; PVALID # KHMER DIGIT ZERO..KHMER DIGIT NINE
23452345-17EA..17EF ; UNASSIGNED # <reserved>..<reserved>
23462346-17F0..17F9 ; DISALLOWED # KHMER SYMBOL LEK ATTAK SON..KHMER SYMBOL LEK
23472347-17FA..17FF ; UNASSIGNED # <reserved>..<reserved>
23482348-1800..180E ; DISALLOWED # MONGOLIAN BIRGA..MONGOLIAN VOWEL SEPARATOR
23492349-180F ; UNASSIGNED # <reserved>
23502350-1810..1819 ; PVALID # MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
23592359-181A..181F ; UNASSIGNED # <reserved>..<reserved>
23602360-1820..1877 ; PVALID # MONGOLIAN LETTER A..MONGOLIAN LETTER MANCHU
23612361-1878..187F ; UNASSIGNED # <reserved>..<reserved>
23622362-1880..18AA ; PVALID # MONGOLIAN LETTER ALI GALI ANUSVARA ONE..MONG
23632363-18AB..18AF ; UNASSIGNED # <reserved>..<reserved>
23642364-18B0..18F5 ; PVALID # CANADIAN SYLLABICS OY..CANADIAN SYLLABICS CA
23652365-18F6..18FF ; UNASSIGNED # <reserved>..<reserved>
23662366-1900..191C ; PVALID # LIMBU VOWEL-CARRIER LETTER..LIMBU LETTER HA
23672367-191D..191F ; UNASSIGNED # <reserved>..<reserved>
23682368-1920..192B ; PVALID # LIMBU VOWEL SIGN A..LIMBU SUBJOINED LETTER W
23692369-192C..192F ; UNASSIGNED # <reserved>..<reserved>
23702370-1930..193B ; PVALID # LIMBU SMALL LETTER KA..LIMBU SIGN SA-I
23712371-193C..193F ; UNASSIGNED # <reserved>..<reserved>
23722372-1940 ; DISALLOWED # LIMBU SIGN LOO
23732373-1941..1943 ; UNASSIGNED # <reserved>..<reserved>
23742374-1944..1945 ; DISALLOWED # LIMBU EXCLAMATION MARK..LIMBU QUESTION MARK
23752375-1946..196D ; PVALID # LIMBU DIGIT ZERO..TAI LE LETTER AI
23762376-196E..196F ; UNASSIGNED # <reserved>..<reserved>
23772377-1970..1974 ; PVALID # TAI LE LETTER TONE-2..TAI LE LETTER TONE-6
23782378-1975..197F ; UNASSIGNED # <reserved>..<reserved>
23792379-1980..19AB ; PVALID # NEW TAI LUE LETTER HIGH QA..NEW TAI LUE LETT
23802380-19AC..19AF ; UNASSIGNED # <reserved>..<reserved>
23812381-19B0..19C9 ; PVALID # NEW TAI LUE VOWEL SIGN VOWEL SHORTENER..NEW
23822382-19CA..19CF ; UNASSIGNED # <reserved>..<reserved>
23832383-19D0..19DA ; PVALID # NEW TAI LUE DIGIT ZERO..NEW TAI LUE THAM DIG
23842384-19DB..19DD ; UNASSIGNED # <reserved>..<reserved>
23852385-19DE..19FF ; DISALLOWED # NEW TAI LUE SIGN LAE..KHMER SYMBOL DAP-PRAM
23862386-1A00..1A1B ; PVALID # BUGINESE LETTER KA..BUGINESE VOWEL SIGN AE
23872387-1A1C..1A1D ; UNASSIGNED # <reserved>..<reserved>
23882388-1A1E..1A1F ; DISALLOWED # BUGINESE PALLAWA..BUGINESE END OF SECTION
23892389-1A20..1A5E ; PVALID # TAI THAM LETTER HIGH KA..TAI THAM CONSONANT
23902390-1A5F ; UNASSIGNED # <reserved>
23912391-1A60..1A7C ; PVALID # TAI THAM SIGN SAKOT..TAI THAM SIGN KHUEN-LUE
23922392-1A7D..1A7E ; UNASSIGNED # <reserved>..<reserved>
23932393-1A7F..1A89 ; PVALID # TAI THAM COMBINING CRYPTOGRAMMIC DOT..TAI TH
23942394-1A8A..1A8F ; UNASSIGNED # <reserved>..<reserved>
23952395-1A90..1A99 ; PVALID # TAI THAM THAM DIGIT ZERO..TAI THAM THAM DIGI
23962396-1A9A..1A9F ; UNASSIGNED # <reserved>..<reserved>
23972397-1AA0..1AA6 ; DISALLOWED # TAI THAM SIGN WIANG..TAI THAM SIGN REVERSED
23982398-1AA7 ; PVALID # TAI THAM SIGN MAI YAMOK
23992399-1AA8..1AAD ; DISALLOWED # TAI THAM SIGN KAAN..TAI THAM SIGN CAANG
24002400-1AAE..1AFF ; UNASSIGNED # <reserved>..<reserved>
24012401-1B00..1B4B ; PVALID # BALINESE SIGN ULU RICEM..BALINESE LETTER ASY
24022402-1B4C..1B4F ; UNASSIGNED # <reserved>..<reserved>
24032403-1B50..1B59 ; PVALID # BALINESE DIGIT ZERO..BALINESE DIGIT NINE
24042404-1B5A..1B6A ; DISALLOWED # BALINESE PANTI..BALINESE MUSICAL SYMBOL DANG
24052405-1B6B..1B73 ; PVALID # BALINESE MUSICAL SYMBOL COMBINING TEGEH..BAL
24062406-1B74..1B7C ; DISALLOWED # BALINESE MUSICAL SYMBOL RIGHT-HAND OPEN DUG.
24152415-1B7D..1B7F ; UNASSIGNED # <reserved>..<reserved>
24162416-1B80..1BAA ; PVALID # SUNDANESE SIGN PANYECEK..SUNDANESE SIGN PAMA
24172417-1BAB..1BAD ; UNASSIGNED # <reserved>..<reserved>
24182418-1BAE..1BB9 ; PVALID # SUNDANESE LETTER KHA..SUNDANESE DIGIT NINE
24192419-1BBA..1BFF ; UNASSIGNED # <reserved>..<reserved>
24202420-1C00..1C37 ; PVALID # LEPCHA LETTER KA..LEPCHA SIGN NUKTA
24212421-1C38..1C3A ; UNASSIGNED # <reserved>..<reserved>
24222422-1C3B..1C3F ; DISALLOWED # LEPCHA PUNCTUATION TA-ROL..LEPCHA PUNCTUATIO
24232423-1C40..1C49 ; PVALID # LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE
24242424-1C4A..1C4C ; UNASSIGNED # <reserved>..<reserved>
24252425-1C4D..1C7D ; PVALID # LEPCHA LETTER TTA..OL CHIKI AHAD
24262426-1C7E..1C7F ; DISALLOWED # OL CHIKI PUNCTUATION MUCAAD..OL CHIKI PUNCTU
24272427-1C80..1CCF ; UNASSIGNED # <reserved>..<reserved>
24282428-1CD0..1CD2 ; PVALID # VEDIC TONE KARSHANA..VEDIC TONE PRENKHA
24292429-1CD3 ; DISALLOWED # VEDIC SIGN NIHSHVASA
24302430-1CD4..1CF2 ; PVALID # VEDIC SIGN YAJURVEDIC MIDLINE SVARITA..VEDIC
24312431-1CF3..1CFF ; UNASSIGNED # <reserved>..<reserved>
24322432-1D00..1D2B ; PVALID # LATIN LETTER SMALL CAPITAL A..CYRILLIC LETTE
24332433-1D2C..1D2E ; DISALLOWED # MODIFIER LETTER CAPITAL A..MODIFIER LETTER C
24342434-1D2F ; PVALID # MODIFIER LETTER CAPITAL BARRED B
24352435-1D30..1D3A ; DISALLOWED # MODIFIER LETTER CAPITAL D..MODIFIER LETTER C
24362436-1D3B ; PVALID # MODIFIER LETTER CAPITAL REVERSED N
24372437-1D3C..1D4D ; DISALLOWED # MODIFIER LETTER CAPITAL O..MODIFIER LETTER S
24382438-1D4E ; PVALID # MODIFIER LETTER SMALL TURNED I
24392439-1D4F..1D6A ; DISALLOWED # MODIFIER LETTER SMALL K..GREEK SUBSCRIPT SMA
24402440-1D6B..1D77 ; PVALID # LATIN SMALL LETTER UE..LATIN SMALL LETTER TU
24412441-1D78 ; DISALLOWED # MODIFIER LETTER CYRILLIC EN
24422442-1D79..1D9A ; PVALID # LATIN SMALL LETTER INSULAR G..LATIN SMALL LE
24432443-1D9B..1DBF ; DISALLOWED # MODIFIER LETTER SMALL TURNED ALPHA..MODIFIER
24442444-1DC0..1DE6 ; PVALID # COMBINING DOTTED GRAVE ACCENT..COMBINING LAT
24452445-1DE7..1DFC ; UNASSIGNED # <reserved>..<reserved>
24462446-1DFD..1DFF ; PVALID # COMBINING ALMOST EQUAL TO BELOW..COMBINING R
24472447-1E00 ; DISALLOWED # LATIN CAPITAL LETTER A WITH RING BELOW
24482448-1E01 ; PVALID # LATIN SMALL LETTER A WITH RING BELOW
24492449-1E02 ; DISALLOWED # LATIN CAPITAL LETTER B WITH DOT ABOVE
24502450-1E03 ; PVALID # LATIN SMALL LETTER B WITH DOT ABOVE
24512451-1E04 ; DISALLOWED # LATIN CAPITAL LETTER B WITH DOT BELOW
24522452-1E05 ; PVALID # LATIN SMALL LETTER B WITH DOT BELOW
24532453-1E06 ; DISALLOWED # LATIN CAPITAL LETTER B WITH LINE BELOW
24542454-1E07 ; PVALID # LATIN SMALL LETTER B WITH LINE BELOW
24552455-1E08 ; DISALLOWED # LATIN CAPITAL LETTER C WITH CEDILLA AND ACUT
24562456-1E09 ; PVALID # LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
24572457-1E0A ; DISALLOWED # LATIN CAPITAL LETTER D WITH DOT ABOVE
24582458-1E0B ; PVALID # LATIN SMALL LETTER D WITH DOT ABOVE
24592459-1E0C ; DISALLOWED # LATIN CAPITAL LETTER D WITH DOT BELOW
24602460-1E0D ; PVALID # LATIN SMALL LETTER D WITH DOT BELOW
24612461-1E0E ; DISALLOWED # LATIN CAPITAL LETTER D WITH LINE BELOW
24622462-1E0F ; PVALID # LATIN SMALL LETTER D WITH LINE BELOW
24712471-1E10 ; DISALLOWED # LATIN CAPITAL LETTER D WITH CEDILLA
24722472-1E11 ; PVALID # LATIN SMALL LETTER D WITH CEDILLA
24732473-1E12 ; DISALLOWED # LATIN CAPITAL LETTER D WITH CIRCUMFLEX BELOW
24742474-1E13 ; PVALID # LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW
24752475-1E14 ; DISALLOWED # LATIN CAPITAL LETTER E WITH MACRON AND GRAVE
24762476-1E15 ; PVALID # LATIN SMALL LETTER E WITH MACRON AND GRAVE
24772477-1E16 ; DISALLOWED # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
24782478-1E17 ; PVALID # LATIN SMALL LETTER E WITH MACRON AND ACUTE
24792479-1E18 ; DISALLOWED # LATIN CAPITAL LETTER E WITH CIRCUMFLEX BELOW
24802480-1E19 ; PVALID # LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW
24812481-1E1A ; DISALLOWED # LATIN CAPITAL LETTER E WITH TILDE BELOW
24822482-1E1B ; PVALID # LATIN SMALL LETTER E WITH TILDE BELOW
24832483-1E1C ; DISALLOWED # LATIN CAPITAL LETTER E WITH CEDILLA AND BREV
24842484-1E1D ; PVALID # LATIN SMALL LETTER E WITH CEDILLA AND BREVE
24852485-1E1E ; DISALLOWED # LATIN CAPITAL LETTER F WITH DOT ABOVE
24862486-1E1F ; PVALID # LATIN SMALL LETTER F WITH DOT ABOVE
24872487-1E20 ; DISALLOWED # LATIN CAPITAL LETTER G WITH MACRON
24882488-1E21 ; PVALID # LATIN SMALL LETTER G WITH MACRON
24892489-1E22 ; DISALLOWED # LATIN CAPITAL LETTER H WITH DOT ABOVE
24902490-1E23 ; PVALID # LATIN SMALL LETTER H WITH DOT ABOVE
24912491-1E24 ; DISALLOWED # LATIN CAPITAL LETTER H WITH DOT BELOW
24922492-1E25 ; PVALID # LATIN SMALL LETTER H WITH DOT BELOW
24932493-1E26 ; DISALLOWED # LATIN CAPITAL LETTER H WITH DIAERESIS
24942494-1E27 ; PVALID # LATIN SMALL LETTER H WITH DIAERESIS
24952495-1E28 ; DISALLOWED # LATIN CAPITAL LETTER H WITH CEDILLA
24962496-1E29 ; PVALID # LATIN SMALL LETTER H WITH CEDILLA
24972497-1E2A ; DISALLOWED # LATIN CAPITAL LETTER H WITH BREVE BELOW
24982498-1E2B ; PVALID # LATIN SMALL LETTER H WITH BREVE BELOW
24992499-1E2C ; DISALLOWED # LATIN CAPITAL LETTER I WITH TILDE BELOW
25002500-1E2D ; PVALID # LATIN SMALL LETTER I WITH TILDE BELOW
25012501-1E2E ; DISALLOWED # LATIN CAPITAL LETTER I WITH DIAERESIS AND AC
25022502-1E2F ; PVALID # LATIN SMALL LETTER I WITH DIAERESIS AND ACUT
25032503-1E30 ; DISALLOWED # LATIN CAPITAL LETTER K WITH ACUTE
25042504-1E31 ; PVALID # LATIN SMALL LETTER K WITH ACUTE
25052505-1E32 ; DISALLOWED # LATIN CAPITAL LETTER K WITH DOT BELOW
25062506-1E33 ; PVALID # LATIN SMALL LETTER K WITH DOT BELOW
25072507-1E34 ; DISALLOWED # LATIN CAPITAL LETTER K WITH LINE BELOW
25082508-1E35 ; PVALID # LATIN SMALL LETTER K WITH LINE BELOW
25092509-1E36 ; DISALLOWED # LATIN CAPITAL LETTER L WITH DOT BELOW
25102510-1E37 ; PVALID # LATIN SMALL LETTER L WITH DOT BELOW
25112511-1E38 ; DISALLOWED # LATIN CAPITAL LETTER L WITH DOT BELOW AND MA
25122512-1E39 ; PVALID # LATIN SMALL LETTER L WITH DOT BELOW AND MACR
25132513-1E3A ; DISALLOWED # LATIN CAPITAL LETTER L WITH LINE BELOW
25142514-1E3B ; PVALID # LATIN SMALL LETTER L WITH LINE BELOW
25152515-1E3C ; DISALLOWED # LATIN CAPITAL LETTER L WITH CIRCUMFLEX BELOW
25162516-1E3D ; PVALID # LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW
25172517-1E3E ; DISALLOWED # LATIN CAPITAL LETTER M WITH ACUTE
25182518-1E3F ; PVALID # LATIN SMALL LETTER M WITH ACUTE
25272527-1E40 ; DISALLOWED # LATIN CAPITAL LETTER M WITH DOT ABOVE
25282528-1E41 ; PVALID # LATIN SMALL LETTER M WITH DOT ABOVE
25292529-1E42 ; DISALLOWED # LATIN CAPITAL LETTER M WITH DOT BELOW
25302530-1E43 ; PVALID # LATIN SMALL LETTER M WITH DOT BELOW
25312531-1E44 ; DISALLOWED # LATIN CAPITAL LETTER N WITH DOT ABOVE
25322532-1E45 ; PVALID # LATIN SMALL LETTER N WITH DOT ABOVE
25332533-1E46 ; DISALLOWED # LATIN CAPITAL LETTER N WITH DOT BELOW
25342534-1E47 ; PVALID # LATIN SMALL LETTER N WITH DOT BELOW
25352535-1E48 ; DISALLOWED # LATIN CAPITAL LETTER N WITH LINE BELOW
25362536-1E49 ; PVALID # LATIN SMALL LETTER N WITH LINE BELOW
25372537-1E4A ; DISALLOWED # LATIN CAPITAL LETTER N WITH CIRCUMFLEX BELOW
25382538-1E4B ; PVALID # LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW
25392539-1E4C ; DISALLOWED # LATIN CAPITAL LETTER O WITH TILDE AND ACUTE
25402540-1E4D ; PVALID # LATIN SMALL LETTER O WITH TILDE AND ACUTE
25412541-1E4E ; DISALLOWED # LATIN CAPITAL LETTER O WITH TILDE AND DIAERE
25422542-1E4F ; PVALID # LATIN SMALL LETTER O WITH TILDE AND DIAERESI
25432543-1E50 ; DISALLOWED # LATIN CAPITAL LETTER O WITH MACRON AND GRAVE
25442544-1E51 ; PVALID # LATIN SMALL LETTER O WITH MACRON AND GRAVE
25452545-1E52 ; DISALLOWED # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
25462546-1E53 ; PVALID # LATIN SMALL LETTER O WITH MACRON AND ACUTE
25472547-1E54 ; DISALLOWED # LATIN CAPITAL LETTER P WITH ACUTE
25482548-1E55 ; PVALID # LATIN SMALL LETTER P WITH ACUTE
25492549-1E56 ; DISALLOWED # LATIN CAPITAL LETTER P WITH DOT ABOVE
25502550-1E57 ; PVALID # LATIN SMALL LETTER P WITH DOT ABOVE
25512551-1E58 ; DISALLOWED # LATIN CAPITAL LETTER R WITH DOT ABOVE
25522552-1E59 ; PVALID # LATIN SMALL LETTER R WITH DOT ABOVE
25532553-1E5A ; DISALLOWED # LATIN CAPITAL LETTER R WITH DOT BELOW
25542554-1E5B ; PVALID # LATIN SMALL LETTER R WITH DOT BELOW
25552555-1E5C ; DISALLOWED # LATIN CAPITAL LETTER R WITH DOT BELOW AND MA
25562556-1E5D ; PVALID # LATIN SMALL LETTER R WITH DOT BELOW AND MACR
25572557-1E5E ; DISALLOWED # LATIN CAPITAL LETTER R WITH LINE BELOW
25582558-1E5F ; PVALID # LATIN SMALL LETTER R WITH LINE BELOW
25592559-1E60 ; DISALLOWED # LATIN CAPITAL LETTER S WITH DOT ABOVE
25602560-1E61 ; PVALID # LATIN SMALL LETTER S WITH DOT ABOVE
25612561-1E62 ; DISALLOWED # LATIN CAPITAL LETTER S WITH DOT BELOW
25622562-1E63 ; PVALID # LATIN SMALL LETTER S WITH DOT BELOW
25632563-1E64 ; DISALLOWED # LATIN CAPITAL LETTER S WITH ACUTE AND DOT AB
25642564-1E65 ; PVALID # LATIN SMALL LETTER S WITH ACUTE AND DOT ABOV
25652565-1E66 ; DISALLOWED # LATIN CAPITAL LETTER S WITH CARON AND DOT AB
25662566-1E67 ; PVALID # LATIN SMALL LETTER S WITH CARON AND DOT ABOV
25672567-1E68 ; DISALLOWED # LATIN CAPITAL LETTER S WITH DOT BELOW AND DO
25682568-1E69 ; PVALID # LATIN SMALL LETTER S WITH DOT BELOW AND DOT
25692569-1E6A ; DISALLOWED # LATIN CAPITAL LETTER T WITH DOT ABOVE
25702570-1E6B ; PVALID # LATIN SMALL LETTER T WITH DOT ABOVE
25712571-1E6C ; DISALLOWED # LATIN CAPITAL LETTER T WITH DOT BELOW
25722572-1E6D ; PVALID # LATIN SMALL LETTER T WITH DOT BELOW
25732573-1E6E ; DISALLOWED # LATIN CAPITAL LETTER T WITH LINE BELOW
25742574-1E6F ; PVALID # LATIN SMALL LETTER T WITH LINE BELOW
25832583-1E70 ; DISALLOWED # LATIN CAPITAL LETTER T WITH CIRCUMFLEX BELOW
25842584-1E71 ; PVALID # LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW
25852585-1E72 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DIAERESIS BELOW
25862586-1E73 ; PVALID # LATIN SMALL LETTER U WITH DIAERESIS BELOW
25872587-1E74 ; DISALLOWED # LATIN CAPITAL LETTER U WITH TILDE BELOW
25882588-1E75 ; PVALID # LATIN SMALL LETTER U WITH TILDE BELOW
25892589-1E76 ; DISALLOWED # LATIN CAPITAL LETTER U WITH CIRCUMFLEX BELOW
25902590-1E77 ; PVALID # LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW
25912591-1E78 ; DISALLOWED # LATIN CAPITAL LETTER U WITH TILDE AND ACUTE
25922592-1E79 ; PVALID # LATIN SMALL LETTER U WITH TILDE AND ACUTE
25932593-1E7A ; DISALLOWED # LATIN CAPITAL LETTER U WITH MACRON AND DIAER
25942594-1E7B ; PVALID # LATIN SMALL LETTER U WITH MACRON AND DIAERES
25952595-1E7C ; DISALLOWED # LATIN CAPITAL LETTER V WITH TILDE
25962596-1E7D ; PVALID # LATIN SMALL LETTER V WITH TILDE
25972597-1E7E ; DISALLOWED # LATIN CAPITAL LETTER V WITH DOT BELOW
25982598-1E7F ; PVALID # LATIN SMALL LETTER V WITH DOT BELOW
25992599-1E80 ; DISALLOWED # LATIN CAPITAL LETTER W WITH GRAVE
26002600-1E81 ; PVALID # LATIN SMALL LETTER W WITH GRAVE
26012601-1E82 ; DISALLOWED # LATIN CAPITAL LETTER W WITH ACUTE
26022602-1E83 ; PVALID # LATIN SMALL LETTER W WITH ACUTE
26032603-1E84 ; DISALLOWED # LATIN CAPITAL LETTER W WITH DIAERESIS
26042604-1E85 ; PVALID # LATIN SMALL LETTER W WITH DIAERESIS
26052605-1E86 ; DISALLOWED # LATIN CAPITAL LETTER W WITH DOT ABOVE
26062606-1E87 ; PVALID # LATIN SMALL LETTER W WITH DOT ABOVE
26072607-1E88 ; DISALLOWED # LATIN CAPITAL LETTER W WITH DOT BELOW
26082608-1E89 ; PVALID # LATIN SMALL LETTER W WITH DOT BELOW
26092609-1E8A ; DISALLOWED # LATIN CAPITAL LETTER X WITH DOT ABOVE
26102610-1E8B ; PVALID # LATIN SMALL LETTER X WITH DOT ABOVE
26112611-1E8C ; DISALLOWED # LATIN CAPITAL LETTER X WITH DIAERESIS
26122612-1E8D ; PVALID # LATIN SMALL LETTER X WITH DIAERESIS
26132613-1E8E ; DISALLOWED # LATIN CAPITAL LETTER Y WITH DOT ABOVE
26142614-1E8F ; PVALID # LATIN SMALL LETTER Y WITH DOT ABOVE
26152615-1E90 ; DISALLOWED # LATIN CAPITAL LETTER Z WITH CIRCUMFLEX
26162616-1E91 ; PVALID # LATIN SMALL LETTER Z WITH CIRCUMFLEX
26172617-1E92 ; DISALLOWED # LATIN CAPITAL LETTER Z WITH DOT BELOW
26182618-1E93 ; PVALID # LATIN SMALL LETTER Z WITH DOT BELOW
26192619-1E94 ; DISALLOWED # LATIN CAPITAL LETTER Z WITH LINE BELOW
26202620-1E95..1E99 ; PVALID # LATIN SMALL LETTER Z WITH LINE BELOW..LATIN
26212621-1E9A..1E9B ; DISALLOWED # LATIN SMALL LETTER A WITH RIGHT HALF RING..L
26222622-1E9C..1E9D ; PVALID # LATIN SMALL LETTER LONG S WITH DIAGONAL STRO
26232623-1E9E ; DISALLOWED # LATIN CAPITAL LETTER SHARP S
26242624-1E9F ; PVALID # LATIN SMALL LETTER DELTA
26252625-1EA0 ; DISALLOWED # LATIN CAPITAL LETTER A WITH DOT BELOW
26262626-1EA1 ; PVALID # LATIN SMALL LETTER A WITH DOT BELOW
26272627-1EA2 ; DISALLOWED # LATIN CAPITAL LETTER A WITH HOOK ABOVE
26282628-1EA3 ; PVALID # LATIN SMALL LETTER A WITH HOOK ABOVE
26292629-1EA4 ; DISALLOWED # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND A
26302630-1EA5 ; PVALID # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACU
26392639-1EA6 ; DISALLOWED # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND G
26402640-1EA7 ; PVALID # LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRA
26412641-1EA8 ; DISALLOWED # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND H
26422642-1EA9 ; PVALID # LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOO
26432643-1EAA ; DISALLOWED # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND T
26442644-1EAB ; PVALID # LATIN SMALL LETTER A WITH CIRCUMFLEX AND TIL
26452645-1EAC ; DISALLOWED # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND D
26462646-1EAD ; PVALID # LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT
26472647-1EAE ; DISALLOWED # LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
26482648-1EAF ; PVALID # LATIN SMALL LETTER A WITH BREVE AND ACUTE
26492649-1EB0 ; DISALLOWED # LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
26502650-1EB1 ; PVALID # LATIN SMALL LETTER A WITH BREVE AND GRAVE
26512651-1EB2 ; DISALLOWED # LATIN CAPITAL LETTER A WITH BREVE AND HOOK A
26522652-1EB3 ; PVALID # LATIN SMALL LETTER A WITH BREVE AND HOOK ABO
26532653-1EB4 ; DISALLOWED # LATIN CAPITAL LETTER A WITH BREVE AND TILDE
26542654-1EB5 ; PVALID # LATIN SMALL LETTER A WITH BREVE AND TILDE
26552655-1EB6 ; DISALLOWED # LATIN CAPITAL LETTER A WITH BREVE AND DOT BE
26562656-1EB7 ; PVALID # LATIN SMALL LETTER A WITH BREVE AND DOT BELO
26572657-1EB8 ; DISALLOWED # LATIN CAPITAL LETTER E WITH DOT BELOW
26582658-1EB9 ; PVALID # LATIN SMALL LETTER E WITH DOT BELOW
26592659-1EBA ; DISALLOWED # LATIN CAPITAL LETTER E WITH HOOK ABOVE
26602660-1EBB ; PVALID # LATIN SMALL LETTER E WITH HOOK ABOVE
26612661-1EBC ; DISALLOWED # LATIN CAPITAL LETTER E WITH TILDE
26622662-1EBD ; PVALID # LATIN SMALL LETTER E WITH TILDE
26632663-1EBE ; DISALLOWED # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND A
26642664-1EBF ; PVALID # LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACU
26652665-1EC0 ; DISALLOWED # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND G
26662666-1EC1 ; PVALID # LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRA
26672667-1EC2 ; DISALLOWED # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND H
26682668-1EC3 ; PVALID # LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOO
26692669-1EC4 ; DISALLOWED # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND T
26702670-1EC5 ; PVALID # LATIN SMALL LETTER E WITH CIRCUMFLEX AND TIL
26712671-1EC6 ; DISALLOWED # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND D
26722672-1EC7 ; PVALID # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT
26732673-1EC8 ; DISALLOWED # LATIN CAPITAL LETTER I WITH HOOK ABOVE
26742674-1EC9 ; PVALID # LATIN SMALL LETTER I WITH HOOK ABOVE
26752675-1ECA ; DISALLOWED # LATIN CAPITAL LETTER I WITH DOT BELOW
26762676-1ECB ; PVALID # LATIN SMALL LETTER I WITH DOT BELOW
26772677-1ECC ; DISALLOWED # LATIN CAPITAL LETTER O WITH DOT BELOW
26782678-1ECD ; PVALID # LATIN SMALL LETTER O WITH DOT BELOW
26792679-1ECE ; DISALLOWED # LATIN CAPITAL LETTER O WITH HOOK ABOVE
26802680-1ECF ; PVALID # LATIN SMALL LETTER O WITH HOOK ABOVE
26812681-1ED0 ; DISALLOWED # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND A
26822682-1ED1 ; PVALID # LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACU
26832683-1ED2 ; DISALLOWED # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND G
26842684-1ED3 ; PVALID # LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRA
26852685-1ED4 ; DISALLOWED # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND H
26862686-1ED5 ; PVALID # LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOO
26952695-1ED6 ; DISALLOWED # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND T
26962696-1ED7 ; PVALID # LATIN SMALL LETTER O WITH CIRCUMFLEX AND TIL
26972697-1ED8 ; DISALLOWED # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND D
26982698-1ED9 ; PVALID # LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT
26992699-1EDA ; DISALLOWED # LATIN CAPITAL LETTER O WITH HORN AND ACUTE
27002700-1EDB ; PVALID # LATIN SMALL LETTER O WITH HORN AND ACUTE
27012701-1EDC ; DISALLOWED # LATIN CAPITAL LETTER O WITH HORN AND GRAVE
27022702-1EDD ; PVALID # LATIN SMALL LETTER O WITH HORN AND GRAVE
27032703-1EDE ; DISALLOWED # LATIN CAPITAL LETTER O WITH HORN AND HOOK AB
27042704-1EDF ; PVALID # LATIN SMALL LETTER O WITH HORN AND HOOK ABOV
27052705-1EE0 ; DISALLOWED # LATIN CAPITAL LETTER O WITH HORN AND TILDE
27062706-1EE1 ; PVALID # LATIN SMALL LETTER O WITH HORN AND TILDE
27072707-1EE2 ; DISALLOWED # LATIN CAPITAL LETTER O WITH HORN AND DOT BEL
27082708-1EE3 ; PVALID # LATIN SMALL LETTER O WITH HORN AND DOT BELOW
27092709-1EE4 ; DISALLOWED # LATIN CAPITAL LETTER U WITH DOT BELOW
27102710-1EE5 ; PVALID # LATIN SMALL LETTER U WITH DOT BELOW
27112711-1EE6 ; DISALLOWED # LATIN CAPITAL LETTER U WITH HOOK ABOVE
27122712-1EE7 ; PVALID # LATIN SMALL LETTER U WITH HOOK ABOVE
27132713-1EE8 ; DISALLOWED # LATIN CAPITAL LETTER U WITH HORN AND ACUTE
27142714-1EE9 ; PVALID # LATIN SMALL LETTER U WITH HORN AND ACUTE
27152715-1EEA ; DISALLOWED # LATIN CAPITAL LETTER U WITH HORN AND GRAVE
27162716-1EEB ; PVALID # LATIN SMALL LETTER U WITH HORN AND GRAVE
27172717-1EEC ; DISALLOWED # LATIN CAPITAL LETTER U WITH HORN AND HOOK AB
27182718-1EED ; PVALID # LATIN SMALL LETTER U WITH HORN AND HOOK ABOV
27192719-1EEE ; DISALLOWED # LATIN CAPITAL LETTER U WITH HORN AND TILDE
27202720-1EEF ; PVALID # LATIN SMALL LETTER U WITH HORN AND TILDE
27212721-1EF0 ; DISALLOWED # LATIN CAPITAL LETTER U WITH HORN AND DOT BEL
27222722-1EF1 ; PVALID # LATIN SMALL LETTER U WITH HORN AND DOT BELOW
27232723-1EF2 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH GRAVE
27242724-1EF3 ; PVALID # LATIN SMALL LETTER Y WITH GRAVE
27252725-1EF4 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH DOT BELOW
27262726-1EF5 ; PVALID # LATIN SMALL LETTER Y WITH DOT BELOW
27272727-1EF6 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH HOOK ABOVE
27282728-1EF7 ; PVALID # LATIN SMALL LETTER Y WITH HOOK ABOVE
27292729-1EF8 ; DISALLOWED # LATIN CAPITAL LETTER Y WITH TILDE
27302730-1EF9 ; PVALID # LATIN SMALL LETTER Y WITH TILDE
27312731-1EFA ; DISALLOWED # LATIN CAPITAL LETTER MIDDLE-WELSH LL
27322732-1EFB ; PVALID # LATIN SMALL LETTER MIDDLE-WELSH LL
27332733-1EFC ; DISALLOWED # LATIN CAPITAL LETTER MIDDLE-WELSH V
27342734-1EFD ; PVALID # LATIN SMALL LETTER MIDDLE-WELSH V
27352735-1EFE ; DISALLOWED # LATIN CAPITAL LETTER Y WITH LOOP
27362736-1EFF..1F07 ; PVALID # LATIN SMALL LETTER Y WITH LOOP..GREEK SMALL
27372737-1F08..1F0F ; DISALLOWED # GREEK CAPITAL LETTER ALPHA WITH PSILI..GREEK
27382738-1F10..1F15 ; PVALID # GREEK SMALL LETTER EPSILON WITH PSILI..GREEK
27392739-1F16..1F17 ; UNASSIGNED # <reserved>..<reserved>
27402740-1F18..1F1D ; DISALLOWED # GREEK CAPITAL LETTER EPSILON WITH PSILI..GRE
27412741-1F1E..1F1F ; UNASSIGNED # <reserved>..<reserved>
27422742-1F20..1F27 ; PVALID # GREEK SMALL LETTER ETA WITH PSILI..GREEK SMA
27512751-1F28..1F2F ; DISALLOWED # GREEK CAPITAL LETTER ETA WITH PSILI..GREEK C
27522752-1F30..1F37 ; PVALID # GREEK SMALL LETTER IOTA WITH PSILI..GREEK SM
27532753-1F38..1F3F ; DISALLOWED # GREEK CAPITAL LETTER IOTA WITH PSILI..GREEK
27542754-1F40..1F45 ; PVALID # GREEK SMALL LETTER OMICRON WITH PSILI..GREEK
27552755-1F46..1F47 ; UNASSIGNED # <reserved>..<reserved>
27562756-1F48..1F4D ; DISALLOWED # GREEK CAPITAL LETTER OMICRON WITH PSILI..GRE
27572757-1F4E..1F4F ; UNASSIGNED # <reserved>..<reserved>
27582758-1F50..1F57 ; PVALID # GREEK SMALL LETTER UPSILON WITH PSILI..GREEK
27592759-1F58 ; UNASSIGNED # <reserved>
27602760-1F59 ; DISALLOWED # GREEK CAPITAL LETTER UPSILON WITH DASIA
27612761-1F5A ; UNASSIGNED # <reserved>
27622762-1F5B ; DISALLOWED # GREEK CAPITAL LETTER UPSILON WITH DASIA AND
27632763-1F5C ; UNASSIGNED # <reserved>
27642764-1F5D ; DISALLOWED # GREEK CAPITAL LETTER UPSILON WITH DASIA AND
27652765-1F5E ; UNASSIGNED # <reserved>
27662766-1F5F ; DISALLOWED # GREEK CAPITAL LETTER UPSILON WITH DASIA AND
27672767-1F60..1F67 ; PVALID # GREEK SMALL LETTER OMEGA WITH PSILI..GREEK S
27682768-1F68..1F6F ; DISALLOWED # GREEK CAPITAL LETTER OMEGA WITH PSILI..GREEK
27692769-1F70 ; PVALID # GREEK SMALL LETTER ALPHA WITH VARIA
27702770-1F71 ; DISALLOWED # GREEK SMALL LETTER ALPHA WITH OXIA
27712771-1F72 ; PVALID # GREEK SMALL LETTER EPSILON WITH VARIA
27722772-1F73 ; DISALLOWED # GREEK SMALL LETTER EPSILON WITH OXIA
27732773-1F74 ; PVALID # GREEK SMALL LETTER ETA WITH VARIA
27742774-1F75 ; DISALLOWED # GREEK SMALL LETTER ETA WITH OXIA
27752775-1F76 ; PVALID # GREEK SMALL LETTER IOTA WITH VARIA
27762776-1F77 ; DISALLOWED # GREEK SMALL LETTER IOTA WITH OXIA
27772777-1F78 ; PVALID # GREEK SMALL LETTER OMICRON WITH VARIA
27782778-1F79 ; DISALLOWED # GREEK SMALL LETTER OMICRON WITH OXIA
27792779-1F7A ; PVALID # GREEK SMALL LETTER UPSILON WITH VARIA
27802780-1F7B ; DISALLOWED # GREEK SMALL LETTER UPSILON WITH OXIA
27812781-1F7C ; PVALID # GREEK SMALL LETTER OMEGA WITH VARIA
27822782-1F7D ; DISALLOWED # GREEK SMALL LETTER OMEGA WITH OXIA
27832783-1F7E..1F7F ; UNASSIGNED # <reserved>..<reserved>
27842784-1F80..1FAF ; DISALLOWED # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOG
27852785-1FB0..1FB1 ; PVALID # GREEK SMALL LETTER ALPHA WITH VRACHY..GREEK
27862786-1FB2..1FB4 ; DISALLOWED # GREEK SMALL LETTER ALPHA WITH VARIA AND YPOG
27872787-1FB5 ; UNASSIGNED # <reserved>
27882788-1FB6 ; PVALID # GREEK SMALL LETTER ALPHA WITH PERISPOMENI
27892789-1FB7..1FC4 ; DISALLOWED # GREEK SMALL LETTER ALPHA WITH PERISPOMENI AN
27902790-1FC5 ; UNASSIGNED # <reserved>
27912791-1FC6 ; PVALID # GREEK SMALL LETTER ETA WITH PERISPOMENI
27922792-1FC7..1FCF ; DISALLOWED # GREEK SMALL LETTER ETA WITH PERISPOMENI AND
27932793-1FD0..1FD2 ; PVALID # GREEK SMALL LETTER IOTA WITH VRACHY..GREEK S
27942794-1FD3 ; DISALLOWED # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND O
27952795-1FD4..1FD5 ; UNASSIGNED # <reserved>..<reserved>
27962796-1FD6..1FD7 ; PVALID # GREEK SMALL LETTER IOTA WITH PERISPOMENI..GR
27972797-1FD8..1FDB ; DISALLOWED # GREEK CAPITAL LETTER IOTA WITH VRACHY..GREEK
27982798-1FDC ; UNASSIGNED # <reserved>
28072807-1FDD..1FDF ; DISALLOWED # GREEK DASIA AND VARIA..GREEK DASIA AND PERIS
28082808-1FE0..1FE2 ; PVALID # GREEK SMALL LETTER UPSILON WITH VRACHY..GREE
28092809-1FE3 ; DISALLOWED # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AN
28102810-1FE4..1FE7 ; PVALID # GREEK SMALL LETTER RHO WITH PSILI..GREEK SMA
28112811-1FE8..1FEF ; DISALLOWED # GREEK CAPITAL LETTER UPSILON WITH VRACHY..GR
28122812-1FF0..1FF1 ; UNASSIGNED # <reserved>..<reserved>
28132813-1FF2..1FF4 ; DISALLOWED # GREEK SMALL LETTER OMEGA WITH VARIA AND YPOG
28142814-1FF5 ; UNASSIGNED # <reserved>
28152815-1FF6 ; PVALID # GREEK SMALL LETTER OMEGA WITH PERISPOMENI
28162816-1FF7..1FFE ; DISALLOWED # GREEK SMALL LETTER OMEGA WITH PERISPOMENI AN
28172817-1FFF ; UNASSIGNED # <reserved>
28182818-2000..200B ; DISALLOWED # EN QUAD..ZERO WIDTH SPACE
28192819-200C..200D ; CONTEXTJ # ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
28202820-200E..2064 ; DISALLOWED # LEFT-TO-RIGHT MARK..INVISIBLE PLUS
28212821-2065..2069 ; UNASSIGNED # <reserved>..<reserved>
28222822-206A..2071 ; DISALLOWED # INHIBIT SYMMETRIC SWAPPING..SUPERSCRIPT LATI
28232823-2072..2073 ; UNASSIGNED # <reserved>..<reserved>
28242824-2074..208E ; DISALLOWED # SUPERSCRIPT FOUR..SUBSCRIPT RIGHT PARENTHESI
28252825-208F ; UNASSIGNED # <reserved>
28262826-2090..2094 ; DISALLOWED # LATIN SUBSCRIPT SMALL LETTER A..LATIN SUBSCR
28272827-2095..209F ; UNASSIGNED # <reserved>..<reserved>
28282828-20A0..20B8 ; DISALLOWED # EURO-CURRENCY SIGN..TENGE SIGN
28292829-20B9..20CF ; UNASSIGNED # <reserved>..<reserved>
28302830-20D0..20F0 ; DISALLOWED # COMBINING LEFT HARPOON ABOVE..COMBINING ASTE
28312831-20F1..20FF ; UNASSIGNED # <reserved>..<reserved>
28322832-2100..214D ; DISALLOWED # ACCOUNT OF..AKTIESELSKAB
28332833-214E ; PVALID # TURNED SMALL F
28342834-214F..2183 ; DISALLOWED # SYMBOL FOR SAMARITAN SOURCE..ROMAN NUMERAL R
28352835-2184 ; PVALID # LATIN SMALL LETTER REVERSED C
28362836-2185..2189 ; DISALLOWED # ROMAN NUMERAL SIX LATE FORM..VULGAR FRACTION
28372837-218A..218F ; UNASSIGNED # <reserved>..<reserved>
28382838-2190..23E8 ; DISALLOWED # LEFTWARDS ARROW..DECIMAL EXPONENT SYMBOL
28392839-23E9..23FF ; UNASSIGNED # <reserved>..<reserved>
28402840-2400..2426 ; DISALLOWED # SYMBOL FOR NULL..SYMBOL FOR SUBSTITUTE FORM
28412841-2427..243F ; UNASSIGNED # <reserved>..<reserved>
28422842-2440..244A ; DISALLOWED # OCR HOOK..OCR DOUBLE BACKSLASH
28432843-244B..245F ; UNASSIGNED # <reserved>..<reserved>
28442844-2460..26CD ; DISALLOWED # CIRCLED DIGIT ONE..DISABLED CAR
28452845-26CE ; UNASSIGNED # <reserved>
28462846-26CF..26E1 ; DISALLOWED # PICK..RESTRICTED LEFT ENTRY-2
28472847-26E2 ; UNASSIGNED # <reserved>
28482848-26E3 ; DISALLOWED # HEAVY CIRCLE WITH STROKE AND TWO DOTS ABOVE
28492849-26E4..26E7 ; UNASSIGNED # <reserved>..<reserved>
28502850-26E8..26FF ; DISALLOWED # BLACK CROSS ON SHIELD..WHITE FLAG WITH HORIZ
28512851-2700 ; UNASSIGNED # <reserved>
28522852-2701..2704 ; DISALLOWED # UPPER BLADE SCISSORS..WHITE SCISSORS
28532853-2705 ; UNASSIGNED # <reserved>
28542854-2706..2709 ; DISALLOWED # TELEPHONE LOCATION SIGN..ENVELOPE
28632863-270A..270B ; UNASSIGNED # <reserved>..<reserved>
28642864-270C..2727 ; DISALLOWED # VICTORY HAND..WHITE FOUR POINTED STAR
28652865-2728 ; UNASSIGNED # <reserved>
28662866-2729..274B ; DISALLOWED # STRESS OUTLINED WHITE STAR..HEAVY EIGHT TEAR
28672867-274C ; UNASSIGNED # <reserved>
28682868-274D ; DISALLOWED # SHADOWED WHITE CIRCLE
28692869-274E ; UNASSIGNED # <reserved>
28702870-274F..2752 ; DISALLOWED # LOWER RIGHT DROP-SHADOWED WHITE SQUARE..UPPE
28712871-2753..2755 ; UNASSIGNED # <reserved>..<reserved>
28722872-2756..275E ; DISALLOWED # BLACK DIAMOND MINUS WHITE X..HEAVY DOUBLE CO
28732873-275F..2760 ; UNASSIGNED # <reserved>..<reserved>
28742874-2761..2794 ; DISALLOWED # CURVED STEM PARAGRAPH SIGN ORNAMENT..HEAVY W
28752875-2795..2797 ; UNASSIGNED # <reserved>..<reserved>
28762876-2798..27AF ; DISALLOWED # HEAVY SOUTH EAST ARROW..NOTCHED LOWER RIGHT-
28772877-27B0 ; UNASSIGNED # <reserved>
28782878-27B1..27BE ; DISALLOWED # NOTCHED UPPER RIGHT-SHADOWED WHITE RIGHTWARD
28792879-27BF ; UNASSIGNED # <reserved>
28802880-27C0..27CA ; DISALLOWED # THREE DIMENSIONAL ANGLE..VERTICAL BAR WITH H
28812881-27CB ; UNASSIGNED # <reserved>
28822882-27CC ; DISALLOWED # LONG DIVISION
28832883-27CD..27CF ; UNASSIGNED # <reserved>..<reserved>
28842884-27D0..2B4C ; DISALLOWED # WHITE DIAMOND WITH CENTRED DOT..RIGHTWARDS A
28852885-2B4D..2B4F ; UNASSIGNED # <reserved>..<reserved>
28862886-2B50..2B59 ; DISALLOWED # WHITE MEDIUM STAR..HEAVY CIRCLED SALTIRE
28872887-2B5A..2BFF ; UNASSIGNED # <reserved>..<reserved>
28882888-2C00..2C2E ; DISALLOWED # GLAGOLITIC CAPITAL LETTER AZU..GLAGOLITIC CA
28892889-2C2F ; UNASSIGNED # <reserved>
28902890-2C30..2C5E ; PVALID # GLAGOLITIC SMALL LETTER AZU..GLAGOLITIC SMAL
28912891-2C5F ; UNASSIGNED # <reserved>
28922892-2C60 ; DISALLOWED # LATIN CAPITAL LETTER L WITH DOUBLE BAR
28932893-2C61 ; PVALID # LATIN SMALL LETTER L WITH DOUBLE BAR
28942894-2C62..2C64 ; DISALLOWED # LATIN CAPITAL LETTER L WITH MIDDLE TILDE..LA
28952895-2C65..2C66 ; PVALID # LATIN SMALL LETTER A WITH STROKE..LATIN SMAL
28962896-2C67 ; DISALLOWED # LATIN CAPITAL LETTER H WITH DESCENDER
28972897-2C68 ; PVALID # LATIN SMALL LETTER H WITH DESCENDER
28982898-2C69 ; DISALLOWED # LATIN CAPITAL LETTER K WITH DESCENDER
28992899-2C6A ; PVALID # LATIN SMALL LETTER K WITH DESCENDER
29002900-2C6B ; DISALLOWED # LATIN CAPITAL LETTER Z WITH DESCENDER
29012901-2C6C ; PVALID # LATIN SMALL LETTER Z WITH DESCENDER
29022902-2C6D..2C70 ; DISALLOWED # LATIN CAPITAL LETTER ALPHA..LATIN CAPITAL LE
29032903-2C71 ; PVALID # LATIN SMALL LETTER V WITH RIGHT HOOK
29042904-2C72 ; DISALLOWED # LATIN CAPITAL LETTER W WITH HOOK
29052905-2C73..2C74 ; PVALID # LATIN SMALL LETTER W WITH HOOK..LATIN SMALL
29062906-2C75 ; DISALLOWED # LATIN CAPITAL LETTER HALF H
29072907-2C76..2C7B ; PVALID # LATIN SMALL LETTER HALF H..LATIN LETTER SMAL
29082908-2C7C..2C80 ; DISALLOWED # LATIN SUBSCRIPT SMALL LETTER J..COPTIC CAPIT
29092909-2C81 ; PVALID # COPTIC SMALL LETTER ALFA
29102910-2C82 ; DISALLOWED # COPTIC CAPITAL LETTER VIDA
29192919-2C83 ; PVALID # COPTIC SMALL LETTER VIDA
29202920-2C84 ; DISALLOWED # COPTIC CAPITAL LETTER GAMMA
29212921-2C85 ; PVALID # COPTIC SMALL LETTER GAMMA
29222922-2C86 ; DISALLOWED # COPTIC CAPITAL LETTER DALDA
29232923-2C87 ; PVALID # COPTIC SMALL LETTER DALDA
29242924-2C88 ; DISALLOWED # COPTIC CAPITAL LETTER EIE
29252925-2C89 ; PVALID # COPTIC SMALL LETTER EIE
29262926-2C8A ; DISALLOWED # COPTIC CAPITAL LETTER SOU
29272927-2C8B ; PVALID # COPTIC SMALL LETTER SOU
29282928-2C8C ; DISALLOWED # COPTIC CAPITAL LETTER ZATA
29292929-2C8D ; PVALID # COPTIC SMALL LETTER ZATA
29302930-2C8E ; DISALLOWED # COPTIC CAPITAL LETTER HATE
29312931-2C8F ; PVALID # COPTIC SMALL LETTER HATE
29322932-2C90 ; DISALLOWED # COPTIC CAPITAL LETTER THETHE
29332933-2C91 ; PVALID # COPTIC SMALL LETTER THETHE
29342934-2C92 ; DISALLOWED # COPTIC CAPITAL LETTER IAUDA
29352935-2C93 ; PVALID # COPTIC SMALL LETTER IAUDA
29362936-2C94 ; DISALLOWED # COPTIC CAPITAL LETTER KAPA
29372937-2C95 ; PVALID # COPTIC SMALL LETTER KAPA
29382938-2C96 ; DISALLOWED # COPTIC CAPITAL LETTER LAULA
29392939-2C97 ; PVALID # COPTIC SMALL LETTER LAULA
29402940-2C98 ; DISALLOWED # COPTIC CAPITAL LETTER MI
29412941-2C99 ; PVALID # COPTIC SMALL LETTER MI
29422942-2C9A ; DISALLOWED # COPTIC CAPITAL LETTER NI
29432943-2C9B ; PVALID # COPTIC SMALL LETTER NI
29442944-2C9C ; DISALLOWED # COPTIC CAPITAL LETTER KSI
29452945-2C9D ; PVALID # COPTIC SMALL LETTER KSI
29462946-2C9E ; DISALLOWED # COPTIC CAPITAL LETTER O
29472947-2C9F ; PVALID # COPTIC SMALL LETTER O
29482948-2CA0 ; DISALLOWED # COPTIC CAPITAL LETTER PI
29492949-2CA1 ; PVALID # COPTIC SMALL LETTER PI
29502950-2CA2 ; DISALLOWED # COPTIC CAPITAL LETTER RO
29512951-2CA3 ; PVALID # COPTIC SMALL LETTER RO
29522952-2CA4 ; DISALLOWED # COPTIC CAPITAL LETTER SIMA
29532953-2CA5 ; PVALID # COPTIC SMALL LETTER SIMA
29542954-2CA6 ; DISALLOWED # COPTIC CAPITAL LETTER TAU
29552955-2CA7 ; PVALID # COPTIC SMALL LETTER TAU
29562956-2CA8 ; DISALLOWED # COPTIC CAPITAL LETTER UA
29572957-2CA9 ; PVALID # COPTIC SMALL LETTER UA
29582958-2CAA ; DISALLOWED # COPTIC CAPITAL LETTER FI
29592959-2CAB ; PVALID # COPTIC SMALL LETTER FI
29602960-2CAC ; DISALLOWED # COPTIC CAPITAL LETTER KHI
29612961-2CAD ; PVALID # COPTIC SMALL LETTER KHI
29622962-2CAE ; DISALLOWED # COPTIC CAPITAL LETTER PSI
29632963-2CAF ; PVALID # COPTIC SMALL LETTER PSI
29642964-2CB0 ; DISALLOWED # COPTIC CAPITAL LETTER OOU
29652965-2CB1 ; PVALID # COPTIC SMALL LETTER OOU
29662966-2CB2 ; DISALLOWED # COPTIC CAPITAL LETTER DIALECT-P ALEF
29752975-2CB3 ; PVALID # COPTIC SMALL LETTER DIALECT-P ALEF
29762976-2CB4 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC AIN
29772977-2CB5 ; PVALID # COPTIC SMALL LETTER OLD COPTIC AIN
29782978-2CB6 ; DISALLOWED # COPTIC CAPITAL LETTER CRYPTOGRAMMIC EIE
29792979-2CB7 ; PVALID # COPTIC SMALL LETTER CRYPTOGRAMMIC EIE
29802980-2CB8 ; DISALLOWED # COPTIC CAPITAL LETTER DIALECT-P KAPA
29812981-2CB9 ; PVALID # COPTIC SMALL LETTER DIALECT-P KAPA
29822982-2CBA ; DISALLOWED # COPTIC CAPITAL LETTER DIALECT-P NI
29832983-2CBB ; PVALID # COPTIC SMALL LETTER DIALECT-P NI
29842984-2CBC ; DISALLOWED # COPTIC CAPITAL LETTER CRYPTOGRAMMIC NI
29852985-2CBD ; PVALID # COPTIC SMALL LETTER CRYPTOGRAMMIC NI
29862986-2CBE ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC OOU
29872987-2CBF ; PVALID # COPTIC SMALL LETTER OLD COPTIC OOU
29882988-2CC0 ; DISALLOWED # COPTIC CAPITAL LETTER SAMPI
29892989-2CC1 ; PVALID # COPTIC SMALL LETTER SAMPI
29902990-2CC2 ; DISALLOWED # COPTIC CAPITAL LETTER CROSSED SHEI
29912991-2CC3 ; PVALID # COPTIC SMALL LETTER CROSSED SHEI
29922992-2CC4 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC SHEI
29932993-2CC5 ; PVALID # COPTIC SMALL LETTER OLD COPTIC SHEI
29942994-2CC6 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC ESH
29952995-2CC7 ; PVALID # COPTIC SMALL LETTER OLD COPTIC ESH
29962996-2CC8 ; DISALLOWED # COPTIC CAPITAL LETTER AKHMIMIC KHEI
29972997-2CC9 ; PVALID # COPTIC SMALL LETTER AKHMIMIC KHEI
29982998-2CCA ; DISALLOWED # COPTIC CAPITAL LETTER DIALECT-P HORI
29992999-2CCB ; PVALID # COPTIC SMALL LETTER DIALECT-P HORI
30003000-2CCC ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC HORI
30013001-2CCD ; PVALID # COPTIC SMALL LETTER OLD COPTIC HORI
30023002-2CCE ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC HA
30033003-2CCF ; PVALID # COPTIC SMALL LETTER OLD COPTIC HA
30043004-2CD0 ; DISALLOWED # COPTIC CAPITAL LETTER L-SHAPED HA
30053005-2CD1 ; PVALID # COPTIC SMALL LETTER L-SHAPED HA
30063006-2CD2 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC HEI
30073007-2CD3 ; PVALID # COPTIC SMALL LETTER OLD COPTIC HEI
30083008-2CD4 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC HAT
30093009-2CD5 ; PVALID # COPTIC SMALL LETTER OLD COPTIC HAT
30103010-2CD6 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC GANGIA
30113011-2CD7 ; PVALID # COPTIC SMALL LETTER OLD COPTIC GANGIA
30123012-2CD8 ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC DJA
30133013-2CD9 ; PVALID # COPTIC SMALL LETTER OLD COPTIC DJA
30143014-2CDA ; DISALLOWED # COPTIC CAPITAL LETTER OLD COPTIC SHIMA
30153015-2CDB ; PVALID # COPTIC SMALL LETTER OLD COPTIC SHIMA
30163016-2CDC ; DISALLOWED # COPTIC CAPITAL LETTER OLD NUBIAN SHIMA
30173017-2CDD ; PVALID # COPTIC SMALL LETTER OLD NUBIAN SHIMA
30183018-2CDE ; DISALLOWED # COPTIC CAPITAL LETTER OLD NUBIAN NGI
30193019-2CDF ; PVALID # COPTIC SMALL LETTER OLD NUBIAN NGI
30203020-2CE0 ; DISALLOWED # COPTIC CAPITAL LETTER OLD NUBIAN NYI
30213021-2CE1 ; PVALID # COPTIC SMALL LETTER OLD NUBIAN NYI
30223022-2CE2 ; DISALLOWED # COPTIC CAPITAL LETTER OLD NUBIAN WAU
30313031-2CE3..2CE4 ; PVALID # COPTIC SMALL LETTER OLD NUBIAN WAU..COPTIC S
30323032-2CE5..2CEB ; DISALLOWED # COPTIC SYMBOL MI RO..COPTIC CAPITAL LETTER C
30333033-2CEC ; PVALID # COPTIC SMALL LETTER CRYPTOGRAMMIC SHEI
30343034-2CED ; DISALLOWED # COPTIC CAPITAL LETTER CRYPTOGRAMMIC GANGIA
30353035-2CEE..2CF1 ; PVALID # COPTIC SMALL LETTER CRYPTOGRAMMIC GANGIA..CO
30363036-2CF2..2CF8 ; UNASSIGNED # <reserved>..<reserved>
30373037-2CF9..2CFF ; DISALLOWED # COPTIC OLD NUBIAN FULL STOP..COPTIC MORPHOLO
30383038-2D00..2D25 ; PVALID # GEORGIAN SMALL LETTER AN..GEORGIAN SMALL LET
30393039-2D26..2D2F ; UNASSIGNED # <reserved>..<reserved>
30403040-2D30..2D65 ; PVALID # TIFINAGH LETTER YA..TIFINAGH LETTER YAZZ
30413041-2D66..2D6E ; UNASSIGNED # <reserved>..<reserved>
30423042-2D6F ; DISALLOWED # TIFINAGH MODIFIER LETTER LABIALIZATION MARK
30433043-2D70..2D7F ; UNASSIGNED # <reserved>..<reserved>
30443044-2D80..2D96 ; PVALID # ETHIOPIC SYLLABLE LOA..ETHIOPIC SYLLABLE GGW
30453045-2D97..2D9F ; UNASSIGNED # <reserved>..<reserved>
30463046-2DA0..2DA6 ; PVALID # ETHIOPIC SYLLABLE SSA..ETHIOPIC SYLLABLE SSO
30473047-2DA7 ; UNASSIGNED # <reserved>
30483048-2DA8..2DAE ; PVALID # ETHIOPIC SYLLABLE CCA..ETHIOPIC SYLLABLE CCO
30493049-2DAF ; UNASSIGNED # <reserved>
30503050-2DB0..2DB6 ; PVALID # ETHIOPIC SYLLABLE ZZA..ETHIOPIC SYLLABLE ZZO
30513051-2DB7 ; UNASSIGNED # <reserved>
30523052-2DB8..2DBE ; PVALID # ETHIOPIC SYLLABLE CCHA..ETHIOPIC SYLLABLE CC
30533053-2DBF ; UNASSIGNED # <reserved>
30543054-2DC0..2DC6 ; PVALID # ETHIOPIC SYLLABLE QYA..ETHIOPIC SYLLABLE QYO
30553055-2DC7 ; UNASSIGNED # <reserved>
30563056-2DC8..2DCE ; PVALID # ETHIOPIC SYLLABLE KYA..ETHIOPIC SYLLABLE KYO
30573057-2DCF ; UNASSIGNED # <reserved>
30583058-2DD0..2DD6 ; PVALID # ETHIOPIC SYLLABLE XYA..ETHIOPIC SYLLABLE XYO
30593059-2DD7 ; UNASSIGNED # <reserved>
30603060-2DD8..2DDE ; PVALID # ETHIOPIC SYLLABLE GYA..ETHIOPIC SYLLABLE GYO
30613061-2DDF ; UNASSIGNED # <reserved>
30623062-2DE0..2DFF ; PVALID # COMBINING CYRILLIC LETTER BE..COMBINING CYRI
30633063-2E00..2E2E ; DISALLOWED # RIGHT ANGLE SUBSTITUTION MARKER..REVERSED QU
30643064-2E2F ; PVALID # VERTICAL TILDE
30653065-2E30..2E31 ; DISALLOWED # RING POINT..WORD SEPARATOR MIDDLE DOT
30663066-2E32..2E7F ; UNASSIGNED # <reserved>..<reserved>
30673067-2E80..2E99 ; DISALLOWED # CJK RADICAL REPEAT..CJK RADICAL RAP
30683068-2E9A ; UNASSIGNED # <reserved>
30693069-2E9B..2EF3 ; DISALLOWED # CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED
30703070-2EF4..2EFF ; UNASSIGNED # <reserved>..<reserved>
30713071-2F00..2FD5 ; DISALLOWED # KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
30723072-2FD6..2FEF ; UNASSIGNED # <reserved>..<reserved>
30733073-2FF0..2FFB ; DISALLOWED # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RI
30743074-2FFC..2FFF ; UNASSIGNED # <reserved>..<reserved>
30753075-3000..3004 ; DISALLOWED # IDEOGRAPHIC SPACE..JAPANESE INDUSTRIAL STAND
30763076-3005..3007 ; PVALID # IDEOGRAPHIC ITERATION MARK..IDEOGRAPHIC NUMB
30773077-3008..3029 ; DISALLOWED # LEFT ANGLE BRACKET..HANGZHOU NUMERAL NINE
30783078-302A..302D ; PVALID # IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENT
30873087-302E..303B ; DISALLOWED # HANGUL SINGLE DOT TONE MARK..VERTICAL IDEOGR
30883088-303C ; PVALID # MASU MARK
30893089-303D..303F ; DISALLOWED # PART ALTERNATION MARK..IDEOGRAPHIC HALF FILL
30903090-3040 ; UNASSIGNED # <reserved>
30913091-3041..3096 ; PVALID # HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMA
30923092-3097..3098 ; UNASSIGNED # <reserved>..<reserved>
30933093-3099..309A ; PVALID # COMBINING KATAKANA-HIRAGANA VOICED SOUND MAR
30943094-309B..309C ; DISALLOWED # KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKAN
30953095-309D..309E ; PVALID # HIRAGANA ITERATION MARK..HIRAGANA VOICED ITE
30963096-309F..30A0 ; DISALLOWED # HIRAGANA DIGRAPH YORI..KATAKANA-HIRAGANA DOU
30973097-30A1..30FA ; PVALID # KATAKANA LETTER SMALL A..KATAKANA LETTER VO
30983098-30FB ; CONTEXTO # KATAKANA MIDDLE DOT
30993099-30FC..30FE ; PVALID # KATAKANA-HIRAGANA PROLONGED SOUND MARK..KATA
31003100-30FF ; DISALLOWED # KATAKANA DIGRAPH KOTO
31013101-3100..3104 ; UNASSIGNED # <reserved>..<reserved>
31023102-3105..312D ; PVALID # BOPOMOFO LETTER B..BOPOMOFO LETTER IH
31033103-312E..3130 ; UNASSIGNED # <reserved>..<reserved>
31043104-3131..318E ; DISALLOWED # HANGUL LETTER KIYEOK..HANGUL LETTER ARAEAE
31053105-318F ; UNASSIGNED # <reserved>
31063106-3190..319F ; DISALLOWED # IDEOGRAPHIC ANNOTATION LINKING MARK..IDEOGRA
31073107-31A0..31B7 ; PVALID # BOPOMOFO LETTER BU..BOPOMOFO FINAL LETTER H
31083108-31B8..31BF ; UNASSIGNED # <reserved>..<reserved>
31093109-31C0..31E3 ; DISALLOWED # CJK STROKE T..CJK STROKE Q
31103110-31E4..31EF ; UNASSIGNED # <reserved>..<reserved>
31113111-31F0..31FF ; PVALID # KATAKANA LETTER SMALL KU..KATAKANA LETTER SM
31123112-3200..321E ; DISALLOWED # PARENTHESIZED HANGUL KIYEOK..PARENTHESIZED K
31133113-321F ; UNASSIGNED # <reserved>
31143114-3220..32FE ; DISALLOWED # PARENTHESIZED IDEOGRAPH ONE..CIRCLED KATAKAN
31153115-32FF ; UNASSIGNED # <reserved>
31163116-3300..33FF ; DISALLOWED # SQUARE APAATO..SQUARE GAL
31173117-3400..4DB5 ; PVALID # <CJK Ideograph Extension A>..<CJK Ideograph
31183118-4DB6..4DBF ; UNASSIGNED # <reserved>..<reserved>
31193119-4DC0..4DFF ; DISALLOWED # HEXAGRAM FOR THE CREATIVE HEAVEN..HEXAGRAM F
31203120-4E00..9FCB ; PVALID # <CJK Ideograph>..<CJK Ideograph>
31213121-9FCC..9FFF ; UNASSIGNED # <reserved>..<reserved>
31223122-A000..A48C ; PVALID # YI SYLLABLE IT..YI SYLLABLE YYR
31233123-A48D..A48F ; UNASSIGNED # <reserved>..<reserved>
31243124-A490..A4C6 ; DISALLOWED # YI RADICAL QOT..YI RADICAL KE
31253125-A4C7..A4CF ; UNASSIGNED # <reserved>..<reserved>
31263126-A4D0..A4FD ; PVALID # LISU LETTER BA..LISU LETTER TONE MYA JEU
31273127-A4FE..A4FF ; DISALLOWED # LISU PUNCTUATION COMMA..LISU PUNCTUATION FUL
31283128-A500..A60C ; PVALID # VAI SYLLABLE EE..VAI SYLLABLE LENGTHENER
31293129-A60D..A60F ; DISALLOWED # VAI COMMA..VAI QUESTION MARK
31303130-A610..A62B ; PVALID # VAI SYLLABLE NDOLE FA..VAI SYLLABLE NDOLE DO
31313131-A62C..A63F ; UNASSIGNED # <reserved>..<reserved>
31323132-A640 ; DISALLOWED # CYRILLIC CAPITAL LETTER ZEMLYA
31333133-A641 ; PVALID # CYRILLIC SMALL LETTER ZEMLYA
31343134-A642 ; DISALLOWED # CYRILLIC CAPITAL LETTER DZELO
31433143-A643 ; PVALID # CYRILLIC SMALL LETTER DZELO
31443144-A644 ; DISALLOWED # CYRILLIC CAPITAL LETTER REVERSED DZE
31453145-A645 ; PVALID # CYRILLIC SMALL LETTER REVERSED DZE
31463146-A646 ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTA
31473147-A647 ; PVALID # CYRILLIC SMALL LETTER IOTA
31483148-A648 ; DISALLOWED # CYRILLIC CAPITAL LETTER DJERV
31493149-A649 ; PVALID # CYRILLIC SMALL LETTER DJERV
31503150-A64A ; DISALLOWED # CYRILLIC CAPITAL LETTER MONOGRAPH UK
31513151-A64B ; PVALID # CYRILLIC SMALL LETTER MONOGRAPH UK
31523152-A64C ; DISALLOWED # CYRILLIC CAPITAL LETTER BROAD OMEGA
31533153-A64D ; PVALID # CYRILLIC SMALL LETTER BROAD OMEGA
31543154-A64E ; DISALLOWED # CYRILLIC CAPITAL LETTER NEUTRAL YER
31553155-A64F ; PVALID # CYRILLIC SMALL LETTER NEUTRAL YER
31563156-A650 ; DISALLOWED # CYRILLIC CAPITAL LETTER YERU WITH BACK YER
31573157-A651 ; PVALID # CYRILLIC SMALL LETTER YERU WITH BACK YER
31583158-A652 ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTIFIED YAT
31593159-A653 ; PVALID # CYRILLIC SMALL LETTER IOTIFIED YAT
31603160-A654 ; DISALLOWED # CYRILLIC CAPITAL LETTER REVERSED YU
31613161-A655 ; PVALID # CYRILLIC SMALL LETTER REVERSED YU
31623162-A656 ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTIFIED A
31633163-A657 ; PVALID # CYRILLIC SMALL LETTER IOTIFIED A
31643164-A658 ; DISALLOWED # CYRILLIC CAPITAL LETTER CLOSED LITTLE YUS
31653165-A659 ; PVALID # CYRILLIC SMALL LETTER CLOSED LITTLE YUS
31663166-A65A ; DISALLOWED # CYRILLIC CAPITAL LETTER BLENDED YUS
31673167-A65B ; PVALID # CYRILLIC SMALL LETTER BLENDED YUS
31683168-A65C ; DISALLOWED # CYRILLIC CAPITAL LETTER IOTIFIED CLOSED LITT
31693169-A65D ; PVALID # CYRILLIC SMALL LETTER IOTIFIED CLOSED LITTLE
31703170-A65E ; DISALLOWED # CYRILLIC CAPITAL LETTER YN
31713171-A65F ; PVALID # CYRILLIC SMALL LETTER YN
31723172-A660..A661 ; UNASSIGNED # <reserved>..<reserved>
31733173-A662 ; DISALLOWED # CYRILLIC CAPITAL LETTER SOFT DE
31743174-A663 ; PVALID # CYRILLIC SMALL LETTER SOFT DE
31753175-A664 ; DISALLOWED # CYRILLIC CAPITAL LETTER SOFT EL
31763176-A665 ; PVALID # CYRILLIC SMALL LETTER SOFT EL
31773177-A666 ; DISALLOWED # CYRILLIC CAPITAL LETTER SOFT EM
31783178-A667 ; PVALID # CYRILLIC SMALL LETTER SOFT EM
31793179-A668 ; DISALLOWED # CYRILLIC CAPITAL LETTER MONOCULAR O
31803180-A669 ; PVALID # CYRILLIC SMALL LETTER MONOCULAR O
31813181-A66A ; DISALLOWED # CYRILLIC CAPITAL LETTER BINOCULAR O
31823182-A66B ; PVALID # CYRILLIC SMALL LETTER BINOCULAR O
31833183-A66C ; DISALLOWED # CYRILLIC CAPITAL LETTER DOUBLE MONOCULAR O
31843184-A66D..A66F ; PVALID # CYRILLIC SMALL LETTER DOUBLE MONOCULAR O..CO
31853185-A670..A673 ; DISALLOWED # COMBINING CYRILLIC TEN MILLIONS SIGN..SLAVON
31863186-A674..A67B ; UNASSIGNED # <reserved>..<reserved>
31873187-A67C..A67D ; PVALID # COMBINING CYRILLIC KAVYKA..COMBINING CYRILLI
31883188-A67E ; DISALLOWED # CYRILLIC KAVYKA
31893189-A67F ; PVALID # CYRILLIC PAYEROK
31903190-A680 ; DISALLOWED # CYRILLIC CAPITAL LETTER DWE
31993199-A681 ; PVALID # CYRILLIC SMALL LETTER DWE
32003200-A682 ; DISALLOWED # CYRILLIC CAPITAL LETTER DZWE
32013201-A683 ; PVALID # CYRILLIC SMALL LETTER DZWE
32023202-A684 ; DISALLOWED # CYRILLIC CAPITAL LETTER ZHWE
32033203-A685 ; PVALID # CYRILLIC SMALL LETTER ZHWE
32043204-A686 ; DISALLOWED # CYRILLIC CAPITAL LETTER CCHE
32053205-A687 ; PVALID # CYRILLIC SMALL LETTER CCHE
32063206-A688 ; DISALLOWED # CYRILLIC CAPITAL LETTER DZZE
32073207-A689 ; PVALID # CYRILLIC SMALL LETTER DZZE
32083208-A68A ; DISALLOWED # CYRILLIC CAPITAL LETTER TE WITH MIDDLE HOOK
32093209-A68B ; PVALID # CYRILLIC SMALL LETTER TE WITH MIDDLE HOOK
32103210-A68C ; DISALLOWED # CYRILLIC CAPITAL LETTER TWE
32113211-A68D ; PVALID # CYRILLIC SMALL LETTER TWE
32123212-A68E ; DISALLOWED # CYRILLIC CAPITAL LETTER TSWE
32133213-A68F ; PVALID # CYRILLIC SMALL LETTER TSWE
32143214-A690 ; DISALLOWED # CYRILLIC CAPITAL LETTER TSSE
32153215-A691 ; PVALID # CYRILLIC SMALL LETTER TSSE
32163216-A692 ; DISALLOWED # CYRILLIC CAPITAL LETTER TCHE
32173217-A693 ; PVALID # CYRILLIC SMALL LETTER TCHE
32183218-A694 ; DISALLOWED # CYRILLIC CAPITAL LETTER HWE
32193219-A695 ; PVALID # CYRILLIC SMALL LETTER HWE
32203220-A696 ; DISALLOWED # CYRILLIC CAPITAL LETTER SHWE
32213221-A697 ; PVALID # CYRILLIC SMALL LETTER SHWE
32223222-A698..A69F ; UNASSIGNED # <reserved>..<reserved>
32233223-A6A0..A6E5 ; PVALID # BAMUM LETTER A..BAMUM LETTER KI
32243224-A6E6..A6EF ; DISALLOWED # BAMUM LETTER MO..BAMUM LETTER KOGHOM
32253225-A6F0..A6F1 ; PVALID # BAMUM COMBINING MARK KOQNDON..BAMUM COMBININ
32263226-A6F2..A6F7 ; DISALLOWED # BAMUM NJAEMLI..BAMUM QUESTION MARK
32273227-A6F8..A6FF ; UNASSIGNED # <reserved>..<reserved>
32283228-A700..A716 ; DISALLOWED # MODIFIER LETTER CHINESE TONE YIN PING..MODIF
32293229-A717..A71F ; PVALID # MODIFIER LETTER DOT VERTICAL BAR..MODIFIER L
32303230-A720..A722 ; DISALLOWED # MODIFIER LETTER STRESS AND HIGH TONE..LATIN
32313231-A723 ; PVALID # LATIN SMALL LETTER EGYPTOLOGICAL ALEF
32323232-A724 ; DISALLOWED # LATIN CAPITAL LETTER EGYPTOLOGICAL AIN
32333233-A725 ; PVALID # LATIN SMALL LETTER EGYPTOLOGICAL AIN
32343234-A726 ; DISALLOWED # LATIN CAPITAL LETTER HENG
32353235-A727 ; PVALID # LATIN SMALL LETTER HENG
32363236-A728 ; DISALLOWED # LATIN CAPITAL LETTER TZ
32373237-A729 ; PVALID # LATIN SMALL LETTER TZ
32383238-A72A ; DISALLOWED # LATIN CAPITAL LETTER TRESILLO
32393239-A72B ; PVALID # LATIN SMALL LETTER TRESILLO
32403240-A72C ; DISALLOWED # LATIN CAPITAL LETTER CUATRILLO
32413241-A72D ; PVALID # LATIN SMALL LETTER CUATRILLO
32423242-A72E ; DISALLOWED # LATIN CAPITAL LETTER CUATRILLO WITH COMMA
32433243-A72F..A731 ; PVALID # LATIN SMALL LETTER CUATRILLO WITH COMMA..LAT
32443244-A732 ; DISALLOWED # LATIN CAPITAL LETTER AA
32453245-A733 ; PVALID # LATIN SMALL LETTER AA
32463246-A734 ; DISALLOWED # LATIN CAPITAL LETTER AO
32553255-A735 ; PVALID # LATIN SMALL LETTER AO
32563256-A736 ; DISALLOWED # LATIN CAPITAL LETTER AU
32573257-A737 ; PVALID # LATIN SMALL LETTER AU
32583258-A738 ; DISALLOWED # LATIN CAPITAL LETTER AV
32593259-A739 ; PVALID # LATIN SMALL LETTER AV
32603260-A73A ; DISALLOWED # LATIN CAPITAL LETTER AV WITH HORIZONTAL BAR
32613261-A73B ; PVALID # LATIN SMALL LETTER AV WITH HORIZONTAL BAR
32623262-A73C ; DISALLOWED # LATIN CAPITAL LETTER AY
32633263-A73D ; PVALID # LATIN SMALL LETTER AY
32643264-A73E ; DISALLOWED # LATIN CAPITAL LETTER REVERSED C WITH DOT
32653265-A73F ; PVALID # LATIN SMALL LETTER REVERSED C WITH DOT
32663266-A740 ; DISALLOWED # LATIN CAPITAL LETTER K WITH STROKE
32673267-A741 ; PVALID # LATIN SMALL LETTER K WITH STROKE
32683268-A742 ; DISALLOWED # LATIN CAPITAL LETTER K WITH DIAGONAL STROKE
32693269-A743 ; PVALID # LATIN SMALL LETTER K WITH DIAGONAL STROKE
32703270-A744 ; DISALLOWED # LATIN CAPITAL LETTER K WITH STROKE AND DIAGO
32713271-A745 ; PVALID # LATIN SMALL LETTER K WITH STROKE AND DIAGONA
32723272-A746 ; DISALLOWED # LATIN CAPITAL LETTER BROKEN L
32733273-A747 ; PVALID # LATIN SMALL LETTER BROKEN L
32743274-A748 ; DISALLOWED # LATIN CAPITAL LETTER L WITH HIGH STROKE
32753275-A749 ; PVALID # LATIN SMALL LETTER L WITH HIGH STROKE
32763276-A74A ; DISALLOWED # LATIN CAPITAL LETTER O WITH LONG STROKE OVER
32773277-A74B ; PVALID # LATIN SMALL LETTER O WITH LONG STROKE OVERLA
32783278-A74C ; DISALLOWED # LATIN CAPITAL LETTER O WITH LOOP
32793279-A74D ; PVALID # LATIN SMALL LETTER O WITH LOOP
32803280-A74E ; DISALLOWED # LATIN CAPITAL LETTER OO
32813281-A74F ; PVALID # LATIN SMALL LETTER OO
32823282-A750 ; DISALLOWED # LATIN CAPITAL LETTER P WITH STROKE THROUGH D
32833283-A751 ; PVALID # LATIN SMALL LETTER P WITH STROKE THROUGH DES
32843284-A752 ; DISALLOWED # LATIN CAPITAL LETTER P WITH FLOURISH
32853285-A753 ; PVALID # LATIN SMALL LETTER P WITH FLOURISH
32863286-A754 ; DISALLOWED # LATIN CAPITAL LETTER P WITH SQUIRREL TAIL
32873287-A755 ; PVALID # LATIN SMALL LETTER P WITH SQUIRREL TAIL
32883288-A756 ; DISALLOWED # LATIN CAPITAL LETTER Q WITH STROKE THROUGH D
32893289-A757 ; PVALID # LATIN SMALL LETTER Q WITH STROKE THROUGH DES
32903290-A758 ; DISALLOWED # LATIN CAPITAL LETTER Q WITH DIAGONAL STROKE
32913291-A759 ; PVALID # LATIN SMALL LETTER Q WITH DIAGONAL STROKE
32923292-A75A ; DISALLOWED # LATIN CAPITAL LETTER R ROTUNDA
32933293-A75B ; PVALID # LATIN SMALL LETTER R ROTUNDA
32943294-A75C ; DISALLOWED # LATIN CAPITAL LETTER RUM ROTUNDA
32953295-A75D ; PVALID # LATIN SMALL LETTER RUM ROTUNDA
32963296-A75E ; DISALLOWED # LATIN CAPITAL LETTER V WITH DIAGONAL STROKE
32973297-A75F ; PVALID # LATIN SMALL LETTER V WITH DIAGONAL STROKE
32983298-A760 ; DISALLOWED # LATIN CAPITAL LETTER VY
32993299-A761 ; PVALID # LATIN SMALL LETTER VY
33003300-A762 ; DISALLOWED # LATIN CAPITAL LETTER VISIGOTHIC Z
33013301-A763 ; PVALID # LATIN SMALL LETTER VISIGOTHIC Z
33023302-A764 ; DISALLOWED # LATIN CAPITAL LETTER THORN WITH STROKE
33113311-A765 ; PVALID # LATIN SMALL LETTER THORN WITH STROKE
33123312-A766 ; DISALLOWED # LATIN CAPITAL LETTER THORN WITH STROKE THROU
33133313-A767 ; PVALID # LATIN SMALL LETTER THORN WITH STROKE THROUGH
33143314-A768 ; DISALLOWED # LATIN CAPITAL LETTER VEND
33153315-A769 ; PVALID # LATIN SMALL LETTER VEND
33163316-A76A ; DISALLOWED # LATIN CAPITAL LETTER ET
33173317-A76B ; PVALID # LATIN SMALL LETTER ET
33183318-A76C ; DISALLOWED # LATIN CAPITAL LETTER IS
33193319-A76D ; PVALID # LATIN SMALL LETTER IS
33203320-A76E ; DISALLOWED # LATIN CAPITAL LETTER CON
33213321-A76F ; PVALID # LATIN SMALL LETTER CON
33223322-A770 ; DISALLOWED # MODIFIER LETTER US
33233323-A771..A778 ; PVALID # LATIN SMALL LETTER DUM..LATIN SMALL LETTER U
33243324-A779 ; DISALLOWED # LATIN CAPITAL LETTER INSULAR D
33253325-A77A ; PVALID # LATIN SMALL LETTER INSULAR D
33263326-A77B ; DISALLOWED # LATIN CAPITAL LETTER INSULAR F
33273327-A77C ; PVALID # LATIN SMALL LETTER INSULAR F
33283328-A77D..A77E ; DISALLOWED # LATIN CAPITAL LETTER INSULAR G..LATIN CAPITA
33293329-A77F ; PVALID # LATIN SMALL LETTER TURNED INSULAR G
33303330-A780 ; DISALLOWED # LATIN CAPITAL LETTER TURNED L
33313331-A781 ; PVALID # LATIN SMALL LETTER TURNED L
33323332-A782 ; DISALLOWED # LATIN CAPITAL LETTER INSULAR R
33333333-A783 ; PVALID # LATIN SMALL LETTER INSULAR R
33343334-A784 ; DISALLOWED # LATIN CAPITAL LETTER INSULAR S
33353335-A785 ; PVALID # LATIN SMALL LETTER INSULAR S
33363336-A786 ; DISALLOWED # LATIN CAPITAL LETTER INSULAR T
33373337-A787..A788 ; PVALID # LATIN SMALL LETTER INSULAR T..MODIFIER LETTE
33383338-A789..A78B ; DISALLOWED # MODIFIER LETTER COLON..LATIN CAPITAL LETTER
33393339-A78C ; PVALID # LATIN SMALL LETTER SALTILLO
33403340-A78D..A7FA ; UNASSIGNED # <reserved>..<reserved>
33413341-A7FB..A827 ; PVALID # LATIN EPIGRAPHIC LETTER REVERSED F..SYLOTI N
33423342-A828..A82B ; DISALLOWED # SYLOTI NAGRI POETRY MARK-1..SYLOTI NAGRI POE
33433343-A82C..A82F ; UNASSIGNED # <reserved>..<reserved>
33443344-A830..A839 ; DISALLOWED # NORTH INDIC FRACTION ONE QUARTER..NORTH INDI
33453345-A83A..A83F ; UNASSIGNED # <reserved>..<reserved>
33463346-A840..A873 ; PVALID # PHAGS-PA LETTER KA..PHAGS-PA LETTER CANDRABI
33473347-A874..A877 ; DISALLOWED # PHAGS-PA SINGLE HEAD MARK..PHAGS-PA MARK DOU
33483348-A878..A87F ; UNASSIGNED # <reserved>..<reserved>
33493349-A880..A8C4 ; PVALID # SAURASHTRA SIGN ANUSVARA..SAURASHTRA SIGN VI
33503350-A8C5..A8CD ; UNASSIGNED # <reserved>..<reserved>
33513351-A8CE..A8CF ; DISALLOWED # SAURASHTRA DANDA..SAURASHTRA DOUBLE DANDA
33523352-A8D0..A8D9 ; PVALID # SAURASHTRA DIGIT ZERO..SAURASHTRA DIGIT NINE
33533353-A8DA..A8DF ; UNASSIGNED # <reserved>..<reserved>
33543354-A8E0..A8F7 ; PVALID # COMBINING DEVANAGARI DIGIT ZERO..DEVANAGARI
33553355-A8F8..A8FA ; DISALLOWED # DEVANAGARI SIGN PUSHPIKA..DEVANAGARI CARET
33563356-A8FB ; PVALID # DEVANAGARI HEADSTROKE
33573357-A8FC..A8FF ; UNASSIGNED # <reserved>..<reserved>
33583358-A900..A92D ; PVALID # KAYAH LI DIGIT ZERO..KAYAH LI TONE CALYA PLO
33673367-A92E..A92F ; DISALLOWED # KAYAH LI SIGN CWI..KAYAH LI SIGN SHYA
33683368-A930..A953 ; PVALID # REJANG LETTER KA..REJANG VIRAMA
33693369-A954..A95E ; UNASSIGNED # <reserved>..<reserved>
33703370-A95F..A97C ; DISALLOWED # REJANG SECTION MARK..HANGUL CHOSEONG SSANGYE
33713371-A97D..A97F ; UNASSIGNED # <reserved>..<reserved>
33723372-A980..A9C0 ; PVALID # JAVANESE SIGN PANYANGGA..JAVANESE PANGKON
33733373-A9C1..A9CD ; DISALLOWED # JAVANESE LEFT RERENGGAN..JAVANESE TURNED PAD
33743374-A9CE ; UNASSIGNED # <reserved>
33753375-A9CF..A9D9 ; PVALID # JAVANESE PANGRANGKEP..JAVANESE DIGIT NINE
33763376-A9DA..A9DD ; UNASSIGNED # <reserved>..<reserved>
33773377-A9DE..A9DF ; DISALLOWED # JAVANESE PADA TIRTA TUMETES..JAVANESE PADA I
33783378-A9E0..A9FF ; UNASSIGNED # <reserved>..<reserved>
33793379-AA00..AA36 ; PVALID # CHAM LETTER A..CHAM CONSONANT SIGN WA
33803380-AA37..AA3F ; UNASSIGNED # <reserved>..<reserved>
33813381-AA40..AA4D ; PVALID # CHAM LETTER FINAL K..CHAM CONSONANT SIGN FIN
33823382-AA4E..AA4F ; UNASSIGNED # <reserved>..<reserved>
33833383-AA50..AA59 ; PVALID # CHAM DIGIT ZERO..CHAM DIGIT NINE
33843384-AA5A..AA5B ; UNASSIGNED # <reserved>..<reserved>
33853385-AA5C..AA5F ; DISALLOWED # CHAM PUNCTUATION SPIRAL..CHAM PUNCTUATION TR
33863386-AA60..AA76 ; PVALID # MYANMAR LETTER KHAMTI GA..MYANMAR LOGOGRAM K
33873387-AA77..AA79 ; DISALLOWED # MYANMAR SYMBOL AITON EXCLAMATION..MYANMAR SY
33883388-AA7A..AA7B ; PVALID # MYANMAR LETTER AITON RA..MYANMAR SIGN PAO KA
33893389-AA7C..AA7F ; UNASSIGNED # <reserved>..<reserved>
33903390-AA80..AAC2 ; PVALID # TAI VIET LETTER LOW KO..TAI VIET TONE MAI SO
33913391-AAC3..AADA ; UNASSIGNED # <reserved>..<reserved>
33923392-AADB..AADD ; PVALID # TAI VIET SYMBOL KON..TAI VIET SYMBOL SAM
33933393-AADE..AADF ; DISALLOWED # TAI VIET SYMBOL HO HOI..TAI VIET SYMBOL KOI
33943394-AAE0..ABBF ; UNASSIGNED # <reserved>..<reserved>
33953395-ABC0..ABEA ; PVALID # MEETEI MAYEK LETTER KOK..MEETEI MAYEK VOWEL
33963396-ABEB ; DISALLOWED # MEETEI MAYEK CHEIKHEI
33973397-ABEC..ABED ; PVALID # MEETEI MAYEK LUM IYEK..MEETEI MAYEK APUN IYE
33983398-ABEE..ABEF ; UNASSIGNED # <reserved>..<reserved>
33993399-ABF0..ABF9 ; PVALID # MEETEI MAYEK DIGIT ZERO..MEETEI MAYEK DIGIT
34003400-ABFA..ABFF ; UNASSIGNED # <reserved>..<reserved>
34013401-AC00..D7A3 ; PVALID # <Hangul Syllable>..<Hangul Syllable>
34023402-D7A4..D7AF ; UNASSIGNED # <reserved>..<reserved>
34033403-D7B0..D7C6 ; DISALLOWED # HANGUL JUNGSEONG O-YEO..HANGUL JUNGSEONG ARA
34043404-D7C7..D7CA ; UNASSIGNED # <reserved>..<reserved>
34053405-D7CB..D7FB ; DISALLOWED # HANGUL JONGSEONG NIEUN-RIEUL..HANGUL JONGSEO
34063406-D7FC..D7FF ; UNASSIGNED # <reserved>..<reserved>
34073407-D800..FA0D ; DISALLOWED # <Non Private Use High Surrogate>..CJK COMPAT
34083408-FA0E..FA0F ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA0E..CJK COMPAT
34093409-FA10 ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA10
34103410-FA11 ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA11
34113411-FA12 ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA12
34123412-FA13..FA14 ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA13..CJK COMPAT
34133413-FA15..FA1E ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA15..CJK COMPAT
34143414-FA1F ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA1F
34233423-FA20 ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA20
34243424-FA21 ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA21
34253425-FA22 ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA22
34263426-FA23..FA24 ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA23..CJK COMPAT
34273427-FA25..FA26 ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA25..CJK COMPAT
34283428-FA27..FA29 ; PVALID # CJK COMPATIBILITY IDEOGRAPH-FA27..CJK COMPAT
34293429-FA2A..FA2D ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA2A..CJK COMPAT
34303430-FA2E..FA2F ; UNASSIGNED # <reserved>..<reserved>
34313431-FA30..FA6D ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA30..CJK COMPAT
34323432-FA6E..FA6F ; UNASSIGNED # <reserved>..<reserved>
34333433-FA70..FAD9 ; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPAT
34343434-FADA..FAFF ; UNASSIGNED # <reserved>..<reserved>
34353435-FB00..FB06 ; DISALLOWED # LATIN SMALL LIGATURE FF..LATIN SMALL LIGATUR
34363436-FB07..FB12 ; UNASSIGNED # <reserved>..<reserved>
34373437-FB13..FB17 ; DISALLOWED # ARMENIAN SMALL LIGATURE MEN NOW..ARMENIAN SM
34383438-FB18..FB1C ; UNASSIGNED # <reserved>..<reserved>
34393439-FB1D ; DISALLOWED # HEBREW LETTER YOD WITH HIRIQ
34403440-FB1E ; PVALID # HEBREW POINT JUDEO-SPANISH VARIKA
34413441-FB1F..FB36 ; DISALLOWED # HEBREW LIGATURE YIDDISH YOD YOD PATAH..HEBRE
34423442-FB37 ; UNASSIGNED # <reserved>
34433443-FB38..FB3C ; DISALLOWED # HEBREW LETTER TET WITH DAGESH..HEBREW LETTER
34443444-FB3D ; UNASSIGNED # <reserved>
34453445-FB3E ; DISALLOWED # HEBREW LETTER MEM WITH DAGESH
34463446-FB3F ; UNASSIGNED # <reserved>
34473447-FB40..FB41 ; DISALLOWED # HEBREW LETTER NUN WITH DAGESH..HEBREW LETTER
34483448-FB42 ; UNASSIGNED # <reserved>
34493449-FB43..FB44 ; DISALLOWED # HEBREW LETTER FINAL PE WITH DAGESH..HEBREW L
34503450-FB45 ; UNASSIGNED # <reserved>
34513451-FB46..FBB1 ; DISALLOWED # HEBREW LETTER TSADI WITH DAGESH..ARABIC LETT
34523452-FBB2..FBD2 ; UNASSIGNED # <reserved>..<reserved>
34533453-FBD3..FD3F ; DISALLOWED # ARABIC LETTER NG ISOLATED FORM..ORNATE RIGHT
34543454-FD40..FD4F ; UNASSIGNED # <reserved>..<reserved>
34553455-FD50..FD8F ; DISALLOWED # ARABIC LIGATURE TEH WITH JEEM WITH MEEM INIT
34563456-FD90..FD91 ; UNASSIGNED # <reserved>..<reserved>
34573457-FD92..FDC7 ; DISALLOWED # ARABIC LIGATURE MEEM WITH JEEM WITH KHAH INI
34583458-FDC8..FDCF ; UNASSIGNED # <reserved>..<reserved>
34593459-FDD0..FDFD ; DISALLOWED # <noncharacter>..ARABIC LIGATURE BISMILLAH AR
34603460-FDFE..FDFF ; UNASSIGNED # <reserved>..<reserved>
34613461-FE00..FE19 ; DISALLOWED # VARIATION SELECTOR-1..PRESENTATION FORM FOR
34623462-FE1A..FE1F ; UNASSIGNED # <reserved>..<reserved>
34633463-FE20..FE26 ; PVALID # COMBINING LIGATURE LEFT HALF..COMBINING CONJ
34643464-FE27..FE2F ; UNASSIGNED # <reserved>..<reserved>
34653465-FE30..FE52 ; DISALLOWED # PRESENTATION FORM FOR VERTICAL TWO DOT LEADE
34663466-FE53 ; UNASSIGNED # <reserved>
34673467-FE54..FE66 ; DISALLOWED # SMALL SEMICOLON..SMALL EQUALS SIGN
34683468-FE67 ; UNASSIGNED # <reserved>
34693469-FE68..FE6B ; DISALLOWED # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
34703470-FE6C..FE6F ; UNASSIGNED # <reserved>..<reserved>
34793479-FE70..FE72 ; DISALLOWED # ARABIC FATHATAN ISOLATED FORM..ARABIC DAMMAT
34803480-FE73 ; PVALID # ARABIC TAIL FRAGMENT
34813481-FE74 ; DISALLOWED # ARABIC KASRATAN ISOLATED FORM
34823482-FE75 ; UNASSIGNED # <reserved>
34833483-FE76..FEFC ; DISALLOWED # ARABIC FATHA ISOLATED FORM..ARABIC LIGATURE
34843484-FEFD..FEFE ; UNASSIGNED # <reserved>..<reserved>
34853485-FEFF ; DISALLOWED # ZERO WIDTH NO-BREAK SPACE
34863486-FF00 ; UNASSIGNED # <reserved>
34873487-FF01..FFBE ; DISALLOWED # FULLWIDTH EXCLAMATION MARK..HALFWIDTH HANGUL
34883488-FFBF..FFC1 ; UNASSIGNED # <reserved>..<reserved>
34893489-FFC2..FFC7 ; DISALLOWED # HALFWIDTH HANGUL LETTER A..HALFWIDTH HANGUL
34903490-FFC8..FFC9 ; UNASSIGNED # <reserved>..<reserved>
34913491-FFCA..FFCF ; DISALLOWED # HALFWIDTH HANGUL LETTER YEO..HALFWIDTH HANGU
34923492-FFD0..FFD1 ; UNASSIGNED # <reserved>..<reserved>
34933493-FFD2..FFD7 ; DISALLOWED # HALFWIDTH HANGUL LETTER YO..HALFWIDTH HANGUL
34943494-FFD8..FFD9 ; UNASSIGNED # <reserved>..<reserved>
34953495-FFDA..FFDC ; DISALLOWED # HALFWIDTH HANGUL LETTER EU..HALFWIDTH HANGUL
34963496-FFDD..FFDF ; UNASSIGNED # <reserved>..<reserved>
34973497-FFE0..FFE6 ; DISALLOWED # FULLWIDTH CENT SIGN..FULLWIDTH WON SIGN
34983498-FFE7 ; UNASSIGNED # <reserved>
34993499-FFE8..FFEE ; DISALLOWED # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WH
35003500-FFEF..FFF8 ; UNASSIGNED # <reserved>..<reserved>
35013501-FFF9..FFFF ; DISALLOWED # INTERLINEAR ANNOTATION ANCHOR..<noncharacter
35023502-10000..1000B; PVALID # LINEAR B SYLLABLE B008 A..LINEAR B SYLLABLE
35033503-1000C ; UNASSIGNED # <reserved>
35043504-1000D..10026; PVALID # LINEAR B SYLLABLE B036 JO..LINEAR B SYLLABLE
35053505-10027 ; UNASSIGNED # <reserved>
35063506-10028..1003A; PVALID # LINEAR B SYLLABLE B060 RA..LINEAR B SYLLABLE
35073507-1003B ; UNASSIGNED # <reserved>
35083508-1003C..1003D; PVALID # LINEAR B SYLLABLE B017 ZA..LINEAR B SYLLABLE
35093509-1003E ; UNASSIGNED # <reserved>
35103510-1003F..1004D; PVALID # LINEAR B SYLLABLE B020 ZO..LINEAR B SYLLABLE
35113511-1004E..1004F; UNASSIGNED # <reserved>..<reserved>
35123512-10050..1005D; PVALID # LINEAR B SYMBOL B018..LINEAR B SYMBOL B089
35133513-1005E..1007F; UNASSIGNED # <reserved>..<reserved>
35143514-10080..100FA; PVALID # LINEAR B IDEOGRAM B100 MAN..LINEAR B IDEOGRA
35153515-100FB..100FF; UNASSIGNED # <reserved>..<reserved>
35163516-10100..10102; DISALLOWED # AEGEAN WORD SEPARATOR LINE..AEGEAN CHECK MAR
35173517-10103..10106; UNASSIGNED # <reserved>..<reserved>
35183518-10107..10133; DISALLOWED # AEGEAN NUMBER ONE..AEGEAN NUMBER NINETY THOU
35193519-10134..10136; UNASSIGNED # <reserved>..<reserved>
35203520-10137..1018A; DISALLOWED # AEGEAN WEIGHT BASE UNIT..GREEK ZERO SIGN
35213521-1018B..1018F; UNASSIGNED # <reserved>..<reserved>
35223522-10190..1019B; DISALLOWED # ROMAN SEXTANS SIGN..ROMAN CENTURIAL SIGN
35233523-1019C..101CF; UNASSIGNED # <reserved>..<reserved>
35243524-101D0..101FC; DISALLOWED # PHAISTOS DISC SIGN PEDESTRIAN..PHAISTOS DISC
35253525-101FD ; PVALID # PHAISTOS DISC SIGN COMBINING OBLIQUE STROKE
35263526-101FE..1027F; UNASSIGNED # <reserved>..<reserved>
35353535-10280..1029C; PVALID # LYCIAN LETTER A..LYCIAN LETTER X
35363536-1029D..1029F; UNASSIGNED # <reserved>..<reserved>
35373537-102A0..102D0; PVALID # CARIAN LETTER A..CARIAN LETTER UUU3
35383538-102D1..102FF; UNASSIGNED # <reserved>..<reserved>
35393539-10300..1031E; PVALID # OLD ITALIC LETTER A..OLD ITALIC LETTER UU
35403540-1031F ; UNASSIGNED # <reserved>
35413541-10320..10323; DISALLOWED # OLD ITALIC NUMERAL ONE..OLD ITALIC NUMERAL F
35423542-10324..1032F; UNASSIGNED # <reserved>..<reserved>
35433543-10330..10340; PVALID # GOTHIC LETTER AHSA..GOTHIC LETTER PAIRTHRA
35443544-10341 ; DISALLOWED # GOTHIC LETTER NINETY
35453545-10342..10349; PVALID # GOTHIC LETTER RAIDA..GOTHIC LETTER OTHAL
35463546-1034A ; DISALLOWED # GOTHIC LETTER NINE HUNDRED
35473547-1034B..1037F; UNASSIGNED # <reserved>..<reserved>
35483548-10380..1039D; PVALID # UGARITIC LETTER ALPA..UGARITIC LETTER SSU
35493549-1039E ; UNASSIGNED # <reserved>
35503550-1039F ; DISALLOWED # UGARITIC WORD DIVIDER
35513551-103A0..103C3; PVALID # OLD PERSIAN SIGN A..OLD PERSIAN SIGN HA
35523552-103C4..103C7; UNASSIGNED # <reserved>..<reserved>
35533553-103C8..103CF; PVALID # OLD PERSIAN SIGN AURAMAZDAA..OLD PERSIAN SIG
35543554-103D0..103D5; DISALLOWED # OLD PERSIAN WORD DIVIDER..OLD PERSIAN NUMBER
35553555-103D6..103FF; UNASSIGNED # <reserved>..<reserved>
35563556-10400..10427; DISALLOWED # DESERET CAPITAL LETTER LONG I..DESERET CAPIT
35573557-10428..1049D; PVALID # DESERET SMALL LETTER LONG I..OSMANYA LETTER
35583558-1049E..1049F; UNASSIGNED # <reserved>..<reserved>
35593559-104A0..104A9; PVALID # OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
35603560-104AA..107FF; UNASSIGNED # <reserved>..<reserved>
35613561-10800..10805; PVALID # CYPRIOT SYLLABLE A..CYPRIOT SYLLABLE JA
35623562-10806..10807; UNASSIGNED # <reserved>..<reserved>
35633563-10808 ; PVALID # CYPRIOT SYLLABLE JO
35643564-10809 ; UNASSIGNED # <reserved>
35653565-1080A..10835; PVALID # CYPRIOT SYLLABLE KA..CYPRIOT SYLLABLE WO
35663566-10836 ; UNASSIGNED # <reserved>
35673567-10837..10838; PVALID # CYPRIOT SYLLABLE XA..CYPRIOT SYLLABLE XE
35683568-10839..1083B; UNASSIGNED # <reserved>..<reserved>
35693569-1083C ; PVALID # CYPRIOT SYLLABLE ZA
35703570-1083D..1083E; UNASSIGNED # <reserved>..<reserved>
35713571-1083F..10855; PVALID # CYPRIOT SYLLABLE ZO..IMPERIAL ARAMAIC LETTER
35723572-10856 ; UNASSIGNED # <reserved>
35733573-10857..1085F; DISALLOWED # IMPERIAL ARAMAIC SECTION SIGN..IMPERIAL ARAM
35743574-10860..108FF; UNASSIGNED # <reserved>..<reserved>
35753575-10900..10915; PVALID # PHOENICIAN LETTER ALF..PHOENICIAN LETTER TAU
35763576-10916..1091B; DISALLOWED # PHOENICIAN NUMBER ONE..PHOENICIAN NUMBER THR
35773577-1091C..1091E; UNASSIGNED # <reserved>..<reserved>
35783578-1091F ; DISALLOWED # PHOENICIAN WORD SEPARATOR
35793579-10920..10939; PVALID # LYDIAN LETTER A..LYDIAN LETTER C
35803580-1093A..1093E; UNASSIGNED # <reserved>..<reserved>
35813581-1093F ; DISALLOWED # LYDIAN TRIANGULAR MARK
35823582-10940..109FF; UNASSIGNED # <reserved>..<reserved>
35913591-10A00..10A03; PVALID # KHAROSHTHI LETTER A..KHAROSHTHI VOWEL SIGN V
35923592-10A04 ; UNASSIGNED # <reserved>
35933593-10A05..10A06; PVALID # KHAROSHTHI VOWEL SIGN E..KHAROSHTHI VOWEL SI
35943594-10A07..10A0B; UNASSIGNED # <reserved>..<reserved>
35953595-10A0C..10A13; PVALID # KHAROSHTHI VOWEL LENGTH MARK..KHAROSHTHI LET
35963596-10A14 ; UNASSIGNED # <reserved>
35973597-10A15..10A17; PVALID # KHAROSHTHI LETTER CA..KHAROSHTHI LETTER JA
35983598-10A18 ; UNASSIGNED # <reserved>
35993599-10A19..10A33; PVALID # KHAROSHTHI LETTER NYA..KHAROSHTHI LETTER TTT
36003600-10A34..10A37; UNASSIGNED # <reserved>..<reserved>
36013601-10A38..10A3A; PVALID # KHAROSHTHI SIGN BAR ABOVE..KHAROSHTHI SIGN D
36023602-10A3B..10A3E; UNASSIGNED # <reserved>..<reserved>
36033603-10A3F ; PVALID # KHAROSHTHI VIRAMA
36043604-10A40..10A47; DISALLOWED # KHAROSHTHI DIGIT ONE..KHAROSHTHI NUMBER ONE
36053605-10A48..10A4F; UNASSIGNED # <reserved>..<reserved>
36063606-10A50..10A58; DISALLOWED # KHAROSHTHI PUNCTUATION DOT..KHAROSHTHI PUNCT
36073607-10A59..10A5F; UNASSIGNED # <reserved>..<reserved>
36083608-10A60..10A7C; PVALID # OLD SOUTH ARABIAN LETTER HE..OLD SOUTH ARABI
36093609-10A7D..10A7F; DISALLOWED # OLD SOUTH ARABIAN NUMBER ONE..OLD SOUTH ARAB
36103610-10A80..10AFF; UNASSIGNED # <reserved>..<reserved>
36113611-10B00..10B35; PVALID # AVESTAN LETTER A..AVESTAN LETTER HE
36123612-10B36..10B38; UNASSIGNED # <reserved>..<reserved>
36133613-10B39..10B3F; DISALLOWED # AVESTAN ABBREVIATION MARK..LARGE ONE RING OV
36143614-10B40..10B55; PVALID # INSCRIPTIONAL PARTHIAN LETTER ALEPH..INSCRIP
36153615-10B56..10B57; UNASSIGNED # <reserved>..<reserved>
36163616-10B58..10B5F; DISALLOWED # INSCRIPTIONAL PARTHIAN NUMBER ONE..INSCRIPTI
36173617-10B60..10B72; PVALID # INSCRIPTIONAL PAHLAVI LETTER ALEPH..INSCRIPT
36183618-10B73..10B77; UNASSIGNED # <reserved>..<reserved>
36193619-10B78..10B7F; DISALLOWED # INSCRIPTIONAL PAHLAVI NUMBER ONE..INSCRIPTIO
36203620-10B80..10BFF; UNASSIGNED # <reserved>..<reserved>
36213621-10C00..10C48; PVALID # OLD TURKIC LETTER ORKHON A..OLD TURKIC LETTE
36223622-10C49..10E5F; UNASSIGNED # <reserved>..<reserved>
36233623-10E60..10E7E; DISALLOWED # RUMI DIGIT ONE..RUMI FRACTION TWO THIRDS
36243624-10E7F..1107F; UNASSIGNED # <reserved>..<reserved>
36253625-11080..110BA; PVALID # KAITHI SIGN CANDRABINDU..KAITHI SIGN NUKTA
36263626-110BB..110C1; DISALLOWED # KAITHI ABBREVIATION SIGN..KAITHI DOUBLE DAND
36273627-110C2..11FFF; UNASSIGNED # <reserved>..<reserved>
36283628-12000..1236E; PVALID # CUNEIFORM SIGN A..CUNEIFORM SIGN ZUM
36293629-1236F..123FF; UNASSIGNED # <reserved>..<reserved>
36303630-12400..12462; DISALLOWED # CUNEIFORM NUMERIC SIGN TWO ASH..CUNEIFORM NU
36313631-12463..1246F; UNASSIGNED # <reserved>..<reserved>
36323632-12470..12473; DISALLOWED # CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD
36333633-12474..12FFF; UNASSIGNED # <reserved>..<reserved>
36343634-13000..1342E; PVALID # EGYPTIAN HIEROGLYPH A001..EGYPTIAN HIEROGLYP
36353635-1342F..1CFFF; UNASSIGNED # <reserved>..<reserved>
36363636-1D000..1D0F5; DISALLOWED # BYZANTINE MUSICAL SYMBOL PSILI..BYZANTINE MU
36373637-1D0F6..1D0FF; UNASSIGNED # <reserved>..<reserved>
36383638-1D100..1D126; DISALLOWED # MUSICAL SYMBOL SINGLE BARLINE..MUSICAL SYMBO
36473647-1D127..1D128; UNASSIGNED # <reserved>..<reserved>
36483648-1D129..1D1DD; DISALLOWED # MUSICAL SYMBOL MULTIPLE MEASURE REST..MUSICA
36493649-1D1DE..1D1FF; UNASSIGNED # <reserved>..<reserved>
36503650-1D200..1D245; DISALLOWED # GREEK VOCAL NOTATION SYMBOL-1..GREEK MUSICAL
36513651-1D246..1D2FF; UNASSIGNED # <reserved>..<reserved>
36523652-1D300..1D356; DISALLOWED # MONOGRAM FOR EARTH..TETRAGRAM FOR FOSTERING
36533653-1D357..1D35F; UNASSIGNED # <reserved>..<reserved>
36543654-1D360..1D371; DISALLOWED # COUNTING ROD UNIT DIGIT ONE..COUNTING ROD TE
36553655-1D372..1D3FF; UNASSIGNED # <reserved>..<reserved>
36563656-1D400..1D454; DISALLOWED # MATHEMATICAL BOLD CAPITAL A..MATHEMATICAL IT
36573657-1D455 ; UNASSIGNED # <reserved>
36583658-1D456..1D49C; DISALLOWED # MATHEMATICAL ITALIC SMALL I..MATHEMATICAL SC
36593659-1D49D ; UNASSIGNED # <reserved>
36603660-1D49E..1D49F; DISALLOWED # MATHEMATICAL SCRIPT CAPITAL C..MATHEMATICAL
36613661-1D4A0..1D4A1; UNASSIGNED # <reserved>..<reserved>
36623662-1D4A2 ; DISALLOWED # MATHEMATICAL SCRIPT CAPITAL G
36633663-1D4A3..1D4A4; UNASSIGNED # <reserved>..<reserved>
36643664-1D4A5..1D4A6; DISALLOWED # MATHEMATICAL SCRIPT CAPITAL J..MATHEMATICAL
36653665-1D4A7..1D4A8; UNASSIGNED # <reserved>..<reserved>
36663666-1D4A9..1D4AC; DISALLOWED # MATHEMATICAL SCRIPT CAPITAL N..MATHEMATICAL
36673667-1D4AD ; UNASSIGNED # <reserved>
36683668-1D4AE..1D4B9; DISALLOWED # MATHEMATICAL SCRIPT CAPITAL S..MATHEMATICAL
36693669-1D4BA ; UNASSIGNED # <reserved>
36703670-1D4BB ; DISALLOWED # MATHEMATICAL SCRIPT SMALL F
36713671-1D4BC ; UNASSIGNED # <reserved>
36723672-1D4BD..1D4C3; DISALLOWED # MATHEMATICAL SCRIPT SMALL H..MATHEMATICAL SC
36733673-1D4C4 ; UNASSIGNED # <reserved>
36743674-1D4C5..1D505; DISALLOWED # MATHEMATICAL SCRIPT SMALL P..MATHEMATICAL FR
36753675-1D506 ; UNASSIGNED # <reserved>
36763676-1D507..1D50A; DISALLOWED # MATHEMATICAL FRAKTUR CAPITAL D..MATHEMATICAL
36773677-1D50B..1D50C; UNASSIGNED # <reserved>..<reserved>
36783678-1D50D..1D514; DISALLOWED # MATHEMATICAL FRAKTUR CAPITAL J..MATHEMATICAL
36793679-1D515 ; UNASSIGNED # <reserved>
36803680-1D516..1D51C; DISALLOWED # MATHEMATICAL FRAKTUR CAPITAL S..MATHEMATICAL
36813681-1D51D ; UNASSIGNED # <reserved>
36823682-1D51E..1D539; DISALLOWED # MATHEMATICAL FRAKTUR SMALL A..MATHEMATICAL D
36833683-1D53A ; UNASSIGNED # <reserved>
36843684-1D53B..1D53E; DISALLOWED # MATHEMATICAL DOUBLE-STRUCK CAPITAL D..MATHEM
36853685-1D53F ; UNASSIGNED # <reserved>
36863686-1D540..1D544; DISALLOWED # MATHEMATICAL DOUBLE-STRUCK CAPITAL I..MATHEM
36873687-1D545 ; UNASSIGNED # <reserved>
36883688-1D546 ; DISALLOWED # MATHEMATICAL DOUBLE-STRUCK CAPITAL O
36893689-1D547..1D549; UNASSIGNED # <reserved>..<reserved>
36903690-1D54A..1D550; DISALLOWED # MATHEMATICAL DOUBLE-STRUCK CAPITAL S..MATHEM
36913691-1D551 ; UNASSIGNED # <reserved>
36923692-1D552..1D6A5; DISALLOWED # MATHEMATICAL DOUBLE-STRUCK SMALL A..MATHEMAT
36933693-1D6A6..1D6A7; UNASSIGNED # <reserved>..<reserved>
36943694-1D6A8..1D7CB; DISALLOWED # MATHEMATICAL BOLD CAPITAL ALPHA..MATHEMATICA
37033703-1D7CC..1D7CD; UNASSIGNED # <reserved>..<reserved>
37043704-1D7CE..1D7FF; DISALLOWED # MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL M
37053705-1D800..1EFFF; UNASSIGNED # <reserved>..<reserved>
37063706-1F000..1F02B; DISALLOWED # MAHJONG TILE EAST WIND..MAHJONG TILE BACK
37073707-1F02C..1F02F; UNASSIGNED # <reserved>..<reserved>
37083708-1F030..1F093; DISALLOWED # DOMINO TILE HORIZONTAL BACK..DOMINO TILE VER
37093709-1F094..1F0FF; UNASSIGNED # <reserved>..<reserved>
37103710-1F100..1F10A; DISALLOWED # DIGIT ZERO FULL STOP..DIGIT NINE COMMA
37113711-1F10B..1F10F; UNASSIGNED # <reserved>..<reserved>
37123712-1F110..1F12E; DISALLOWED # PARENTHESIZED LATIN CAPITAL LETTER A..CIRCLE
37133713-1F12F..1F130; UNASSIGNED # <reserved>..<reserved>
37143714-1F131 ; DISALLOWED # SQUARED LATIN CAPITAL LETTER B
37153715-1F132..1F13C; UNASSIGNED # <reserved>..<reserved>
37163716-1F13D ; DISALLOWED # SQUARED LATIN CAPITAL LETTER N
37173717-1F13E ; UNASSIGNED # <reserved>
37183718-1F13F ; DISALLOWED # SQUARED LATIN CAPITAL LETTER P
37193719-1F140..1F141; UNASSIGNED # <reserved>..<reserved>
37203720-1F142 ; DISALLOWED # SQUARED LATIN CAPITAL LETTER S
37213721-1F143..1F145; UNASSIGNED # <reserved>..<reserved>
37223722-1F146 ; DISALLOWED # SQUARED LATIN CAPITAL LETTER W
37233723-1F147..1F149; UNASSIGNED # <reserved>..<reserved>
37243724-1F14A..1F14E; DISALLOWED # SQUARED HV..SQUARED PPV
37253725-1F14F..1F156; UNASSIGNED # <reserved>..<reserved>
37263726-1F157 ; DISALLOWED # NEGATIVE CIRCLED LATIN CAPITAL LETTER H
37273727-1F158..1F15E; UNASSIGNED # <reserved>..<reserved>
37283728-1F15F ; DISALLOWED # NEGATIVE CIRCLED LATIN CAPITAL LETTER P
37293729-1F160..1F178; UNASSIGNED # <reserved>..<reserved>
37303730-1F179 ; DISALLOWED # NEGATIVE SQUARED LATIN CAPITAL LETTER J
37313731-1F17A ; UNASSIGNED # <reserved>
37323732-1F17B..1F17C; DISALLOWED # NEGATIVE SQUARED LATIN CAPITAL LETTER L..NEG
37333733-1F17D..1F17E; UNASSIGNED # <reserved>..<reserved>
37343734-1F17F ; DISALLOWED # NEGATIVE SQUARED LATIN CAPITAL LETTER P
37353735-1F180..1F189; UNASSIGNED # <reserved>..<reserved>
37363736-1F18A..1F18D; DISALLOWED # CROSSED NEGATIVE SQUARED LATIN CAPITAL LETTE
37373737-1F18E..1F18F; UNASSIGNED # <reserved>..<reserved>
37383738-1F190 ; DISALLOWED # SQUARE DJ
37393739-1F191..1F1FF; UNASSIGNED # <reserved>..<reserved>
37403740-1F200 ; DISALLOWED # SQUARE HIRAGANA HOKA
37413741-1F201..1F20F; UNASSIGNED # <reserved>..<reserved>
37423742-1F210..1F231; DISALLOWED # SQUARED CJK UNIFIED IDEOGRAPH-624B..SQUARED
37433743-1F232..1F23F; UNASSIGNED # <reserved>..<reserved>
37443744-1F240..1F248; DISALLOWED # TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRA
37453745-1F249..1FFFD; UNASSIGNED # <reserved>..<reserved>
37463746-1FFFE..1FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37473747-20000..2A6D6; PVALID # <CJK Ideograph Extension B>..<CJK Ideograph
37483748-2A6D7..2A6FF; UNASSIGNED # <reserved>..<reserved>
37493749-2A700..2B734; PVALID # <CJK Ideograph Extension C>..<CJK Ideograph
37503750-2B735..2F7FF; UNASSIGNED # <reserved>..<reserved>
37573757-37583758-37593759-2F800..2FA1D; DISALLOWED # CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPA
37603760-2FA1E..2FFFD; UNASSIGNED # <reserved>..<reserved>
37613761-2FFFE..2FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37623762-30000..3FFFD; UNASSIGNED # <reserved>..<reserved>
37633763-3FFFE..3FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37643764-40000..4FFFD; UNASSIGNED # <reserved>..<reserved>
37653765-4FFFE..4FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37663766-50000..5FFFD; UNASSIGNED # <reserved>..<reserved>
37673767-5FFFE..5FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37683768-60000..6FFFD; UNASSIGNED # <reserved>..<reserved>
37693769-6FFFE..6FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37703770-70000..7FFFD; UNASSIGNED # <reserved>..<reserved>
37713771-7FFFE..7FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37723772-80000..8FFFD; UNASSIGNED # <reserved>..<reserved>
37733773-8FFFE..8FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37743774-90000..9FFFD; UNASSIGNED # <reserved>..<reserved>
37753775-9FFFE..9FFFF; DISALLOWED # <noncharacter>..<noncharacter>
37763776-A0000..AFFFD; UNASSIGNED # <reserved>..<reserved>
37773777-AFFFE..AFFFF; DISALLOWED # <noncharacter>..<noncharacter>
37783778-B0000..BFFFD; UNASSIGNED # <reserved>..<reserved>
37793779-BFFFE..BFFFF; DISALLOWED # <noncharacter>..<noncharacter>
37803780-C0000..CFFFD; UNASSIGNED # <reserved>..<reserved>
37813781-CFFFE..CFFFF; DISALLOWED # <noncharacter>..<noncharacter>
37823782-D0000..DFFFD; UNASSIGNED # <reserved>..<reserved>
37833783-DFFFE..DFFFF; DISALLOWED # <noncharacter>..<noncharacter>
37843784-E0000 ; UNASSIGNED # <reserved>
37853785-E0001 ; DISALLOWED # LANGUAGE TAG
37863786-E0002..E001F; UNASSIGNED # <reserved>..<reserved>
37873787-E0020..E007F; DISALLOWED # TAG SPACE..CANCEL TAG
37883788-E0080..E00FF; UNASSIGNED # <reserved>..<reserved>
37893789-E0100..E01EF; DISALLOWED # VARIATION SELECTOR-17..VARIATION SELECTOR-25
37903790-E01F0..EFFFD; UNASSIGNED # <reserved>..<reserved>
37913791-EFFFE..10FFFF; DISALLOWED # <noncharacter>..<noncharacter>
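The rows above are sorted, non-overlapping code point ranges, so classifying a code point reduces to a binary search over the table. The following is a minimal OCaml sketch of that idea, not a definitive implementation: `status`, `rows`, and `lookup` are hypothetical names, and `rows` holds only a handful of ranges copied from this part of the table, so any code point outside them yields `None`.

```ocaml
(* Derived property values that appear in this part of the table. *)
type status = PVALID | DISALLOWED | UNASSIGNED

(* A few rows copied from the table above, kept in ascending order.
   Assumption: a real table would carry every range. *)
let rows =
  [|
    (0x1F000, 0x1F02B, DISALLOWED);
    (0x1F02C, 0x1F02F, UNASSIGNED);
    (0x20000, 0x2A6D6, PVALID);
    (0x2A6D7, 0x2A6FF, UNASSIGNED);
    (0x2F800, 0x2FA1D, DISALLOWED);
    (0xE0001, 0xE0001, DISALLOWED);
  |]

(* Binary search over the sorted ranges; None when no row covers cp. *)
let lookup (cp : int) : status option =
  let rec go lo hi =
    if lo > hi then None
    else
      let mid = lo + ((hi - lo) / 2) in
      let first, last, st = rows.(mid) in
      if cp < first then go lo (mid - 1)
      else if cp > last then go (mid + 1) hi
      else Some st
  in
  go 0 (Array.length rows - 1)
```

With these rows, `lookup 0x20000` falls in the CJK Extension B range, while an ASCII code point such as `0x41` is not covered by the excerpt at all.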
38133813-38143814-38153815-8. References
38163816-38173817-8.1. Normative References
38183818-38193819- [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
38203820- Requirement Levels", BCP 14, RFC 2119, March 1997.
38213821-38223822- [TR15] Davis, M. and M. Duerst, "Unicode Standard Annex #15,
38233823- Unicode Normalization Forms, an integral part of the
38243824- Unicode Standard",
38253825- <http://unicode.org/unicode/reports/tr15/>.
38263826-38273827- [Unicode] The Unicode Consortium, "The Unicode Standard, Version
38283828- 5.0", 2007. Boston, MA, USA: Addison-Wesley. ISBN
38293829- 0-321-48091-0. This printed reference has now been
38303830- updated online to reflect additional code points. For
38313831- code points, the reference at the time this document was
38323832- published is to Unicode 5.2.
38333833-38343834- [Unicode52] The Unicode Consortium. The Unicode Standard, Version
38353835- 5.2.0, defined by: "The Unicode Standard, Version
38363836- 5.2.0", (Mountain View, CA: The Unicode Consortium,
38373837- 2009. ISBN 978-1-936213-00-9).
38383838- <http://www.unicode.org/versions/Unicode5.2.0/>.
38393839-38403840-8.2. Informative References
38413841-38423842- [BlockNames] "Blocks-5.2.0.txt", Unicode Character Database,
38433843- May 2009,
38443844- <http://unicode.org/Public/5.2.0/ucd/Blocks.txt>.
38453845-38463846- [DerivedCoreProperties]
38473847- "DerivedCoreProperties-5.2.0.txt", Unicode Character
38483848- Database, August 2009, <http://unicode.org/Public/5.2.0/
38493849- ucd/DerivedCoreProperties.txt>.
38503850-38513851- [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
38523852- Internationalized Strings ("stringprep")", RFC 3454,
38533853- December 2002.
38543854-38553855- [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
38563856- Profile for Internationalized Domain Names (IDN)",
38573857- RFC 3491, March 2003.
38583858-38593859- [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review
38603860- and Recommendations for Internationalized Domain Names
38613861- (IDNs)", RFC 4690, September 2006.
38693869-38703870-38713871- [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an
38723872- IANA Considerations Section in RFCs", BCP 26, RFC 5226,
38733873- May 2008.
38743874-38753875- [RFC5890] Klensin, J., "Internationalized Domain Names for
38763876- Applications (IDNA): Definitions and Document
38773877- Framework", RFC 5890, August 2010.
38783878-38793879- [RFC5891] Klensin, J., "Internationalized Domain Names in
38803880- Applications (IDNA): Protocol", RFC 5891, August 2010.
38813881-38823882- [RFC5893] Alvestrand, H., Ed. and C. Karp, "Right-to-Left Scripts
38833883- for Internationalized Domain Names for Applications
38843884- (IDNA)", RFC 5893, August 2010.
38853885-38863886- [RFC5894] Klensin, J., "Internationalized Domain Names for
38873887- Applications (IDNA): Background, Explanation, and
38883888- Rationale", RFC 5894, August 2010.
38893889-38903890-Author's Address
38913891-38923892- Patrik Faltstrom (editor)
38933893- Cisco
38943894-38953895- EMail: paf@cisco.com
39233923-
-955
ocaml-punycode/spec/rfc5893.txt
···11-22-33-44-55-66-77-Internet Engineering Task Force (IETF) H. Alvestrand, Ed.
88-Request for Comments: 5893 Google
99-Category: Standards Track C. Karp
1010-ISSN: 2070-1721 Swedish Museum of Natural History
1111- August 2010
1212-1313-1414- Right-to-Left Scripts for
1515- Internationalized Domain Names for Applications (IDNA)
1616-1717-Abstract
1818-1919- The use of right-to-left scripts in Internationalized Domain Names
2020- (IDNs) has presented several challenges. This memo provides a new
2121- Bidi rule for Internationalized Domain Names for Applications (IDNA)
2222- labels, based on the encountered problems with some scripts and some
2323- shortcomings in the 2003 IDNA Bidi criterion.
2424-2525-Status of This Memo
2626-2727- This is an Internet Standards Track document.
2828-2929- This document is a product of the Internet Engineering Task Force
3030- (IETF). It represents the consensus of the IETF community. It has
3131- received public review and has been approved for publication by the
3232- Internet Engineering Steering Group (IESG). Further information on
3333- Internet Standards is available in Section 2 of RFC 5741.
3434-3535- Information about the current status of this document, any errata,
3636- and how to provide feedback on it may be obtained at
3737- http://www.rfc-editor.org/info/rfc5893.
3838-3939-Copyright Notice
4040-4141- Copyright (c) 2010 IETF Trust and the persons identified as the
4242- document authors. All rights reserved.
4343-4444- This document is subject to BCP 78 and the IETF Trust's Legal
4545- Provisions Relating to IETF Documents
4646- (http://trustee.ietf.org/license-info) in effect on the date of
4747- publication of this document. Please review these documents
4848- carefully, as they describe your rights and restrictions with respect
4949- to this document. Code Components extracted from this document must
5050- include Simplified BSD License text as described in Section 4.e of
5151- the Trust Legal Provisions and are provided without warranty as
5252- described in the Simplified BSD License.
5353-5454-5555-5656-5757-5858-Alvestrand & Karp Standards Track [Page 1]
5959-6060-RFC 5893 IDNA Right to Left August 2010
6161-6262-6363-Table of Contents
6464-6565- 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2
6666- 1.1. Purpose and Applicability . . . . . . . . . . . . . . . . 2
6767- 1.2. Background and History . . . . . . . . . . . . . . . . . . 3
6868- 1.3. Structure of the Rest of This Document . . . . . . . . . . 3
6969- 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4
7070- 2. The Bidi Rule . . . . . . . . . . . . . . . . . . . . . . . . 6
7171- 3. The Requirement Set for the Bidi Rule . . . . . . . . . . . . 6
7272- 4. Examples of Issues Found with RFC 3454 . . . . . . . . . . . . 9
7373- 4.1. Dhivehi . . . . . . . . . . . . . . . . . . . . . . . . . 9
7474- 4.2. Yiddish . . . . . . . . . . . . . . . . . . . . . . . . . 10
7575- 4.3. Strings with Numbers . . . . . . . . . . . . . . . . . . . 12
7676- 5. Troublesome Situations and Guidelines . . . . . . . . . . . . 12
7777- 6. Other Issues in Need of Resolution . . . . . . . . . . . . . . 13
7878- 7. Compatibility Considerations . . . . . . . . . . . . . . . . . 14
7979- 7.1. Backwards Compatibility Considerations . . . . . . . . . . 14
8080- 7.2. Forward Compatibility Considerations . . . . . . . . . . . 15
8181- 8. Security Considerations . . . . . . . . . . . . . . . . . . . 15
8282- 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 16
8383- 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
8484- 10.1. Normative References . . . . . . . . . . . . . . . . . . . 16
8585- 10.2. Informative References . . . . . . . . . . . . . . . . . . 17
8686-8787-1. Introduction
8888-8989-1.1. Purpose and Applicability
9090-9191- The purpose of this document is to establish a rule that can be
9292- applied to Internationalized Domain Name (IDN) labels in Unicode form
9393- (U-labels) containing characters from scripts that are written from
9494- right to left. It is part of the revised IDNA protocol [RFC5891].
9595-9696- When labels satisfy the rule, and when certain other conditions are
9797- satisfied, there is only a minimal chance of these labels being
9898- displayed in a confusing way by the Unicode bidirectional display
9999- algorithm.
100100-101101- The other normative documents in the IDNA2008 document set establish
102102- criteria for valid labels, including listing the permitted
103103- characters. This document establishes additional validity criteria
104104- for labels in scripts normally written from right to left.
105105-106106- This specification is not intended to place any requirements on
107107- domain names that do not contain characters from such scripts.
117117-118118-119119-1.2. Background and History
120120-121121- The "Stringprep" specification [RFC3454], part of IDNA2003, made the
122122- following statement in its Section 6 on the Bidi algorithm:
123123-124124- 3) If a string contains any RandALCat character, a RandALCat
125125- character MUST be the first character of the string, and a
126126- RandALCat character MUST be the last character of the string.
127127-128128- (A RandALCat character is a character with unambiguously
129129- right-to-left directionality.)
130130-131131- The reasoning behind this prohibition was to ensure that every
132132- component of a displayed domain name has an unambiguously preferred
133133- direction. However, this made certain words in languages written
134134- with right-to-left scripts invalid as IDN labels, and in at least one
135135- case (Dhivehi) meant that all the words of an entire language were
136136- forbidden as IDN labels.
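The quoted RFC 3454 requirement can be sketched in OCaml to show why it rejects any label that ends in a combining mark. This is an illustration under stated assumptions: the label is given as a list of already-known Bidi classes, "RandALCat" is read as "class R or AL", and the names `bidi`, `is_randalcat`, and `rfc3454_requirement_3` are hypothetical.

```ocaml
(* Only the classes needed for this check; Other stands for everything else. *)
type bidi = L | R | AL | EN | AN | NSM | Other

(* Assumption: RandALCat is approximated as Bidi class R or AL. *)
let is_randalcat = function R | AL -> true | _ -> false

(* Requirement 3 as quoted above: if any RandALCat character is present,
   both the first and the last character must be RandALCat. *)
let rfc3454_requirement_3 (label : bidi list) : bool =
  if not (List.exists is_randalcat label) then true
  else
    match label, List.rev label with
    | first :: _, last :: _ -> is_randalcat first && is_randalcat last
    | _ -> false
```

A label whose classes are `[AL; NSM; AL; NSM]` fails this check because its last character is NSM, which is the situation the Dhivehi and Yiddish examples describe.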
137137-138138- This is illustrated below with examples taken from the Dhivehi and
139139- Yiddish languages, as written with the Thaana and Hebrew scripts,
140140- respectively.
141141-142142- RFC 3454 did not explicitly state the requirement to be fulfilled.
143143- Therefore, it is impossible to determine whether a simple relaxation
144144- of the rule would continue to fulfill the requirement.
145145-146146- While this document specifies rules quite different from RFC 3454,
147147- most reasonable labels that were allowed under RFC 3454 will also be
148148- allowed under this specification (the most important example of
149149- non-permitted labels being labels that mix Arabic and European digits
150150- (AN and EN) inside an RTL label, and labels that use AN in an LTR
151151- label -- see Section 1.4 for terminology), so the operational impact
152152- of using the new rule in the updated IDNA specification is limited.
153153-154154-1.3. Structure of the Rest of This Document
155155-156156- Section 2 defines a rule, the "Bidi rule", which can be used on a
157157- domain name label to check how safe it is to use in a domain name of
158158- possibly mixed directionality. The primary initial use of this rule
159159- is as part of the IDNA2008 protocol [RFC5891].
160160-161161- Section 3 sets out the requirements for defining the Bidi rule.
162162-163163- Section 4 gives detailed examples that serve as justification for the
164164- new rule.
173173-174174-175175- Section 5 to Section 8 describe various situations that can occur
176176- when dealing with domain names with characters of different
177177- directionality.
178178-179179- Only Section 1.4 and Section 2 are normative.
180180-181181-1.4. Terminology
182182-183183- The terminology used to describe IDNA concepts is defined in the
184184- Definitions document [RFC5890].
185185-186186- The terminology used for the Bidi properties of Unicode characters is
187187- taken from the Unicode Standard [Unicode52].
188188-189189- The Unicode Standard specifies a Bidi property for each character.
190190- That property controls the character's behavior in the Unicode
191191- bidirectional algorithm [Unicode-UAX9]. For reference, here are the
192192- values that the Unicode Bidi property can have:
193193-194194- o L - Left to right - most letters in LTR scripts
195195-196196- o R - Right to left - most letters in non-Arabic RTL scripts
197197-198198- o AL - Arabic letters - most letters in the Arabic script
199199-200200- o EN - European Number (0-9, and Extended Arabic-Indic numbers)
201201-202202- o ES - European Number Separator (+ and -)
203203-204204- o ET - European Number Terminator (currency symbols, the hash sign,
205205- the percent sign and so on)
206206-207207- o AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
208208- not the Extended Arabic-Indic numbers
209209-210210- o CS - Common Number Separator (. , / : et al)
211211-212212- o NSM - Nonspacing Mark - most combining accents
213213-214214- o BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)
215215-216216- o B - Paragraph Separator
217217-218218- o S - Segment Separator
219219-220220- o WS - Whitespace, including the SPACE character
221221-222222- o ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT
229229-230230-231231- o LRE, LRO, RLE, RLO, PDF - these are "directional control
232232- characters" and are not used in IDNA labels.
233233-234234- In this memo, we use "network order" to describe the sequence of
235235- characters as transmitted on the wire or stored in a file; the terms
236236- "first", "next", "previous", "beginning", "end", "before", and
237237- "after" are used to refer to the relationship of characters and
238238- labels in network order.
239239-240240- We use "display order" to talk about the sequence of characters as
241241- imaged on a display medium; the terms "left" and "right" are used to
242242- refer to the relationship of characters and labels in display order.
243243-244244- Most of the time, the examples use the abbreviations for the Unicode
245245- Bidi classes to denote the directionality of the characters; the
246246- example string CS L consists of one character of class CS and one
247247- character of class L. In some examples, the convention that
248248- uppercase characters are of class R or AL, and lowercase characters
249249- are of class L is used -- thus, the example string ABC.abc would
250250- consist of three right-to-left characters and three left-to-right
251251- characters.
252252-253253- The directionality of such examples is determined by context -- for
254254- instance, in the sentence "ABC.abc is displayed as CBA.abc", the
255255- first example string is in network order, the second example string
256256- is in display order.
257257-258258- The term "paragraph" is used in the sense of the Unicode Bidi
259259- specification [Unicode-UAX9]. It means "a block of text that has an
260260- overall direction, either left to right or right to left",
261261- approximately; see the "Unicode Bidirectional Algorithm"
262262- [Unicode-UAX9] for details.
263263-264264- "RTL" and "LTR" are abbreviations for "right to left" and "left to
265265- right", respectively.
266266-267267- An RTL label is a label that contains at least one character of type
268268- R, AL, or AN.
269269-270270- An LTR label is any label that is not an RTL label.
271271-272272- A "Bidi domain name" is a domain name that contains at least one RTL
273273- label. (Note: This definition includes domain names containing only
274274- dots and right-to-left characters. Providing a separate category of
275275- "RTL domain names" would not make this specification simpler, so it
276276- has not been done.)
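The three definitions above (RTL label, LTR label, Bidi domain name) translate directly into predicates. The OCaml sketch below assumes each label is already a list of Bidi classes and each domain name a list of labels; the type and function names are illustrative only.

```ocaml
(* Bidi classes that can occur in labels, per the list in this section. *)
type bidi = L | R | AL | EN | ES | ET | AN | CS | NSM | BN | ON

(* An RTL label contains at least one character of type R, AL, or AN. *)
let is_rtl_label (label : bidi list) : bool =
  List.exists (function R | AL | AN -> true | _ -> false) label

(* An LTR label is any label that is not an RTL label. *)
let is_ltr_label (label : bidi list) : bool = not (is_rtl_label label)

(* A Bidi domain name contains at least one RTL label. *)
let is_bidi_domain_name (labels : bidi list list) : bool =
  List.exists is_rtl_label labels
```

Note that under these definitions a label such as `[L; AN]` counts as an RTL label even though it starts with a left-to-right character.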
285285-286286-287287-2. The Bidi Rule
288288-289289- The following rule, consisting of six conditions, applies to labels
290290- in Bidi domain names. The requirements that this rule satisfies are
291291- described in Section 3. All of the conditions must be satisfied for
292292- the rule to be satisfied.
293293-294294- 1. The first character must be a character with Bidi property L, R,
295295- or AL. If it has the R or AL property, it is an RTL label; if it
296296- has the L property, it is an LTR label.
297297-298298- 2. In an RTL label, only characters with the Bidi properties R, AL,
299299- AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.
300300-301301- 3. In an RTL label, the end of the label must be a character with
302302- Bidi property R, AL, EN, or AN, followed by zero or more
303303- characters with Bidi property NSM.
304304-305305- 4. In an RTL label, if an EN is present, no AN may be present, and
306306- vice versa.
307307-308308- 5. In an LTR label, only characters with the Bidi properties L, EN,
309309- ES, CS, ET, ON, BN, or NSM are allowed.
310310-311311- 6. In an LTR label, the end of the label must be a character with
312312- Bidi property L or EN, followed by zero or more characters with
313313- Bidi property NSM.
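The six conditions can be checked mechanically once the Bidi class of every character in a label is known. The following OCaml sketch shows one way to do that. It assumes the classes have already been looked up (no Unicode database is consulted), and `bidi`, `last_base`, and `satisfies_bidi_rule` are hypothetical names rather than an existing API.

```ocaml
(* Bidi classes from Section 1.4. Assumption: the class of each character
   has been looked up elsewhere and is passed in as data. *)
type bidi =
  | L | R | AL | EN | ES | ET | AN | CS | NSM | BN | B | S | WS | ON
  | LRE | LRO | RLE | RLO | PDF

(* Drop leading NSM entries from an already reversed label. *)
let rec strip_nsm_rev = function
  | NSM :: rest -> strip_nsm_rev rest
  | rev -> rev

(* The last character that is not part of the trailing NSM run, if any. *)
let last_base (label : bidi list) : bidi option =
  match strip_nsm_rev (List.rev label) with
  | [] -> None
  | c :: _ -> Some c

let satisfies_bidi_rule (label : bidi list) : bool =
  let all_in allowed = List.for_all (fun c -> List.mem c allowed) label in
  let ends_in allowed =
    match last_base label with
    | Some c -> List.mem c allowed
    | None -> false
  in
  match label with
  | [] -> false
  | (R | AL) :: _ ->
      (* RTL label: conditions 2, 3, and 4. *)
      all_in [ R; AL; AN; EN; ES; CS; ET; ON; BN; NSM ]
      && ends_in [ R; AL; EN; AN ]
      && not (List.mem EN label && List.mem AN label)
  | L :: _ ->
      (* LTR label: conditions 5 and 6. *)
      all_in [ L; EN; ES; CS; ET; ON; BN; NSM ]
      && ends_in [ L; EN ]
  | _ ->
      (* Condition 1: the first character must be L, R, or AL. *)
      false
```

For example, `[AL; NSM; AL; NSM]` is accepted because the trailing NSM is skipped when checking the end of the label, while `[R; EN; AN; R]` is rejected by condition 4 and `[EN; L]` by condition 1.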
314314-315315- The following guarantees can be made based on the above:
316316-317317- o In a domain name consisting of only labels that satisfy the rule,
318318- the requirements of Section 3 are satisfied. Note that even LTR
319319- labels and pure ASCII labels have to be tested.
320320-321321- o In a domain name consisting of only LDH labels (as defined in the
322322- Definitions document [RFC5890]) and labels that satisfy the rule,
323323- the requirements of Section 3 are satisfied as long as a label
324324- that starts with an ASCII digit does not come after a
325325- right-to-left label.
326326-327327- No guarantee is given for other combinations.
328328-329329-3. The Requirement Set for the Bidi Rule
330330-331331- This document, unlike RFC 3454 [RFC3454], provides an explicit
332332- justification for the Bidi rule, and states a set of requirements for
333333- which it is possible to test whether or not the modified rule
334334- fulfills the requirement.
341341-342342-343343- All the text in this document assumes that text containing the labels
344344- under consideration will be displayed using the Unicode bidirectional
345345- algorithm [Unicode-UAX9].
346346-347347- The requirements proposed are these:
348348-349349- o Label Uniqueness: No two labels, when presented in display order
350350- in the same paragraph, should have the same sequence of characters
351351- without also having the same sequence of characters in network
352352- order, both when the paragraph has LTR direction and when the
353353- paragraph has RTL direction. (This is the criterion that is
354354- explicit in RFC 3454). (Note that a label displayed in an RTL
355355- paragraph may display the same as a different label displayed in
356356- an LTR paragraph and still satisfy this criterion.)
357357-358358- o Character Grouping: When displaying a string of labels, using the
359359- Unicode Bidi algorithm to reorder the characters for display, the
360360- characters of each label should remain grouped between the
361361- characters delimiting the labels, both when the string is embedded
362362- in a paragraph with LTR direction and when it is embedded in a
363363- paragraph with RTL direction.
364364-365365- Several stronger statements were considered and rejected, because
366366- they seem to be impossible to fulfill within the constraints of the
367367- Unicode bidirectional algorithm. These include:
368368-369369- o The appearance of a label should be unaffected by its embedding
370370- context. This proved impossible even for ASCII labels; the label
371371- "123-A" will have a different display order in an RTL context than
372372- in an LTR context. (This particular example is, however,
373373- disallowed anyway.)
374374-375375- o The sequence of labels should be consistent with network order.
376376- This proved impossible -- a domain name consisting of the labels
377377- (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
378378- an LTR context. (In an RTL context, it will be displayed as
379379- L4.R3.R2.L1).
380380-381381- o No two domain names should be displayed the same, even under
382382- differing directionality. This was shown to be unsound, since the
383383- domain name (in network order) ABC.abc will have display order
384384- CBA.abc in an LTR context and abc.CBA in an RTL context, while the
385385- domain name (network) abc.ABC will have display order abc.CBA in
386386- an LTR context and CBA.abc in an RTL context.
397397-398398-399399- One possible requirement was thought to be problematic, but turned
400400- out to be satisfied by a string that obeys the proposed rules:
401401-402402- o The Character Grouping requirement should be satisfied when
403403- directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
404404- same paragraph (outside of the labels). Because these controls
405405- affect presentation order in non-obvious ways, by affecting the
406406- "sor" and "eor" properties of the Unicode Bidi algorithm, the
407407- conditions above require extra testing in order to figure out
408408- whether or not they influence the display of the domain name.
409409- Testing found that for the strings allowed under the rule
410410- presented in this document, directional controls do not influence
411411- the display of the domain name.
412412-413413- This is still not stated as a requirement, since it did not seem as
414414- important as the stated requirements, but it is useful to know that
415415- Bidi domain names where the labels satisfy the rule have this
416416- property.
417417-418418- In the following descriptions, first-level bullets are used to
419419- indicate rules or normative statements; second-level bullets are
420420- commentary.
421421-422422- The Character Grouping requirement can be more formally stated as:
423423-424424- o Let "Delimiterchars" be a set of characters with the Unicode Bidi
425425- properties CS, WS, ON. (These are commonly used to delimit labels
426426- -- both the FULL STOP and the space are included. They are not
427427- allowed in domain labels.)
428428-429429- * ET, though it commonly occurs next to domain names in practice,
430430- is problematic: the context R CS L EN ET (for instance A.a1%)
431431- makes the label L EN not satisfy the character grouping
432432- requirement.
433433-434434- * ES commonly occurs in labels as HYPHEN-MINUS, but could also be
435435- used as a delimiter (for instance, the plus sign). It is left
436436- out here.
437437-438438- o Let "unproblematic label" be a label that either satisfies the
439439- requirements or does not contain any character with the Bidi
440440- properties R, AL, or AN and does not begin with a character with
441441- the Bidi property EN. (Informally, "it does not start with a
442442- number".)
453453-454454-455455- A label X satisfies the Character Grouping requirement when, for any
456456- Delimiter Character D1 and D2, and for any label S1 and S2 that is an
457457- unproblematic label or an empty string, the following holds true:
458458-459459- If the string formed by concatenating S1, D1, X, D2, and S2 is
460460- reordered according to the Bidi algorithm, then all the characters of
461461- X in the reordered string are between D1 and D2, and no other
462462- characters are between D1 and D2, both if the overall paragraph
463463- direction is LTR and if the overall paragraph direction is RTL.
464464-465465- Note that the definition is self-referential, since S1 and S2 are
466466- constrained to be "legal" by this definition. This makes testing
467467- changes to proposed rules a little complex, but does not create
468468- problems for testing whether or not a given proposed rule satisfies
469469- the criterion.
470470-471471- The "zero-length" case represents the case where a domain name is
472472- next to something that isn't a domain name, separated by a delimiter
473473- character.
474474-475475- Note about the position of BN: The Unicode bidirectional algorithm
476476- specifies that a BN has an effect on the adjoining characters in
477477- network order, not in display order, and is therefore treated as if
478478- removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule
479479- X9 and Section 5.3). Therefore, the question of "what position does
480480- a BN have after reordering" is not meaningful. It has been ignored
481481- while developing the rules here.
482482-483483- The Label Uniqueness requirement can be formally stated as:
484484-485485- If two non-identical labels X and Y, embedded as for the test above,
486486- displayed in paragraphs with the same directionality, are reordered
487487- by the Bidi algorithm into the same sequence of code points, the
488488- labels X and Y cannot both be legal.
489489-490490-4. Examples of Issues Found with RFC 3454
491491-492492-4.1. Dhivehi
493493-494494- Dhivehi, the official language of the Maldives, is written with the
495495- Thaana script. This script displays some of the characteristics of
496496- the Arabic script, including its directional properties, and the
497497- indication of vowels by the diacritical marking of consonantal base
498498- characters. This marking is obligatory, and both two consecutive
499499- vowels and syllable-final consonants are indicated with unvoiced
500500- combining marks. Every Dhivehi word therefore ends with a combining
501501- mark.
509509-510510-511511- The word for "computer", which is romanized as "konpeetaru", is
512512- written with the following sequence of Unicode code points:
513513-514514- U+0786 THAANA LETTER KAAFU (AL)
515515-516516- U+07AE THAANA OBOFILI (NSM)
517517-518518- U+0782 THAANA LETTER NOONU (AL)
519519-520520- U+07B0 THAANA SUKUN (NSM)
521521-522522- U+0795 THAANA LETTER PAVIYANI (AL)
524524- U+07A9 THAANA EEBEEFILI (NSM)
525525-526526- U+0793 THAANA LETTER TAVIYANI (AL)
527527-528528- U+07A6 THAANA ABAFILI (NSM)
529529-530530- U+0783 THAANA LETTER RAA (AL)
531531-532532- U+07AA THAANA UBUFILI (NSM)
533533-534534- The directionality class of U+07AA in the Unicode database
535535- [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
536536- conformant implementation of the IDNA2003 algorithm will say that
537537- "this is not in RandALCat" and refuse to encode the string.
538538-539539-4.2. Yiddish
540540-541541- Yiddish is one of several languages written with the Hebrew script
542542- (others include Hebrew and Ladino). This is basically a consonantal
543543- alphabet (also termed an "abjad"), but Yiddish is written using an
544544- extended form that is fully vocalic. The vowels are indicated in
545545- several ways, one of which is by repurposing letters that are
546546- consonants in Hebrew. Other letters are used both as vowels and
547547- consonants, with combining marks, called "points", used to
548548- differentiate between them. Finally, some base characters can
549549- indicate several different vowels, which are also disambiguated by
550550- combining marks. Pointed characters can appear in word-final
551551- position and may therefore also be needed at the end of labels. This
552552- is not an invariable attribute of a Yiddish string and there is thus
553553- greater latitude here than there is with Dhivehi.
554554-555555- The organization now known as the "YIVO Institute for Jewish
556556- Research" developed orthographic rules for modern Standard Yiddish
557557- during the 1930s on the basis of work conducted in several venues
558558- since earlier in that century. These are given in, "The Standardized
565565-566566-567567- Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
568568- as normatively descriptive of modern Standard Yiddish in any context
569569- where that notion is deemed relevant. They have been applied
570570- exclusively in all formal Yiddish dictionaries published since their
571571- establishment, and are similarly dominant in academic and
572572- bibliographic regards.
573573-574574- It therefore appears appropriate for this repertoire also to be
575575- supported fully by IDNA. This presents no difficulty with characters
576576- in initial and medial positions, but pointed characters are regularly
577577- used in final position as well. All of the characters in the SYO
578578- repertoire appear in both marked and unmarked form with one
579579- exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
580580- with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
581581- to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
582582- to the Latin letter "f". There is, however, a separate unpointed
583583- allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
584584- character when it appears in final position. The constraint on the
585585- use of the SYO repertoire resulting from the proscription of
586586- combining marks at the end of RTL strings thus reduces to nothing
587587- more, or less, than the equivalent of saying that a string of Latin
588588- characters cannot end with the letter "p". It must also be noted
589589- that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
590590- characteristic of almost all traditional Yiddish orthographies that
591591- predate (or remain in use in parallel to) the SYO, being the first
592592- pointed character to appear in any of them.
593593-594594- A more general instantiation of the basic problem can be seen in the
595595- representation of the YIVO acronym. This acronym is written with the
596596- Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
597597- QAMATS are combining points. The Unicode code points are:
598598-599599- U+05D9 HEBREW LETTER YOD (R)
600600-601601- U+05B4 HEBREW POINT HIRIQ (NSM)
602602-603603- U+05D5 HEBREW LETTER VAV (R)
604604-605605- U+05D0 HEBREW LETTER ALEF (R)
606606-607607- U+05B8 HEBREW POINT QAMATS (NSM)
608608-609609- The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
610610- database is NSM, which again causes the IDNA2003 algorithm to reject
611611- the string.
621621-622622-623623- It may also be noted that all of the combined characters mentioned
624624- above exist in precomposed form at separate positions in the Unicode
625625- chart. However, by invoking Stringprep, the IDNA2003 algorithm also
626626- rejects those code points, for reasons not discussed here.
627627-628628-4.3. Strings with Numbers
629629-630630- By requiring that the first or last character of a string be a member
631631- of category R or AL, the Stringprep specification [RFC3454]
632632- prohibited a string containing right-to-left characters from ending
633633- with a number.
634634-635635- Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
636636- ALEF. Displayed in an LTR context, the first one will be displayed
637637- from left to right as 5 ALEF (with the 5 being considered right to
638638- left because of the leading ALEF), while 5 ALEF will be displayed in
639639- exactly the same order (5 taking the direction from context).
640640- Clearly, only one of those should be permitted as a registered label,
641641- but barring them both seems unnecessary.
642642-643643-5. Troublesome Situations and Guidelines
644644-645645- There are situations in which labels that satisfy the rule above will
646646- be displayed in a surprising fashion. The most important of these is
647647- the case where a label ending in a character with Bidi property AL,
648648- AN, or R occurs before a label beginning with a character of Bidi
649649- property EN. In that case, the number will appear to move into the
650650- label containing the right-to-left character, violating the Character
651651- Grouping requirement.
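
An application that wants to warn about this situation could scan adjacent labels, as in the following OCaml sketch. It is only an illustration under stated assumptions: a domain name is modelled as a list of labels in network order, each label as a list of code points, and the `bidi_class` function is a deliberately partial, hypothetical table (Hebrew and Arabic letters, the digit ranges, and lowercase ASCII) rather than a real Unicode property lookup.

```ocaml
type bidi = L | R | AL | AN | EN | Other

(* Partial, hypothetical Bidi lookup for illustration only. *)
let bidi_class cp =
  if cp >= 0x0030 && cp <= 0x0039 then EN        (* European digits *)
  else if cp >= 0x06F0 && cp <= 0x06F9 then EN   (* extended Arabic-Indic *)
  else if cp >= 0x0660 && cp <= 0x0669 then AN   (* Arabic-Indic digits *)
  else if cp >= 0x05D0 && cp <= 0x05EA then R    (* Hebrew letters *)
  else if cp >= 0x0621 && cp <= 0x064A then AL   (* Arabic letters *)
  else if cp >= 0x61 && cp <= 0x7A then L        (* ASCII lowercase *)
  else Other

let first = function [] -> None | x :: _ -> Some x

let rec last = function
  | [] -> None
  | [x] -> Some x
  | _ :: tl -> last tl

(* True if the label ends in a character of class AL, AN, or R. *)
let ends_rtl label =
  match last label with
  | Some cp -> (match bidi_class cp with R | AL | AN -> true | _ -> false)
  | None -> false

(* True if the label begins with a character of class EN. *)
let starts_en label =
  match first label with
  | Some cp -> bidi_class cp = EN
  | None -> false

(* True if some label ending in AL, AN, or R is immediately followed
   by a label beginning with EN. *)
let rec troublesome = function
  | a :: (b :: _ as rest) -> (ends_rtl a && starts_en b) || troublesome rest
  | _ -> false

let () =
  (* Hebrew label followed by a label starting with the digit 1. *)
  assert (troublesome [[0x05D0; 0x05D1]; [0x31; 0x61]]);
  (* ASCII label followed by a digit label: not flagged. *)
  assert (not (troublesome [[0x61]; [0x31]]))
```

The check only reports the adjacency. What to do about it (warn, refuse, or display differently) is left to the application.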
652652-653653- If the label that occurs after the right-to-left label itself
654654- satisfies the Bidi criterion, the requirements will be satisfied in
655655- all cases (this is the reason why the criterion talks about strings
656656- containing L in some cases). However, the IDNABIS WG concluded that
657657- this could not be required for several reasons:
658658-659659- o There is a large current deployment of ASCII domain names starting
660660- with digits. These cannot possibly be invalidated.
661661-662662- o Domain names are often constructed piecemeal, for instance, by
663663- combining a string with the content of a search list. This may
664664- occur after IDNA processing, and thus in part of the code that is
665665- not IDNA-aware, making detection of the undesirable combination
666666- impossible.
677677-678678-679679- o Even if a label is registered under a "safe" label, there may be a
680680- DNAME [RFC2672] with an "unsafe" label that points to the "safe"
681681- label, thus creating seemingly valid names that would not satisfy
682682- the criterion.
683683-684684- o Wildcards create the odd situation where a label is "valid" (can
685685- be looked up successfully) without the zone owner knowing that
686686- this label exists. So an owner of a zone whose name starts with a
687687- digit and contains a wildcard has no way of controlling whether or
688688- not names with RTL labels in them are looked up in his zone.
689689-690690- Rather than trying to suggest rules that disallow all such
691691- undesirable situations, this document merely warns about the
692692- possibility, and leaves it to application developers to take whatever
693693- measures they deem appropriate to avoid problematic situations.
694694-695695-6. Other Issues in Need of Resolution
696696-697697- This document concerns itself only with the rules that are needed
698698- when dealing with domain names with characters that have differing
699699- Bidi properties, and considers characters only in terms of their Bidi
700700- properties. All other issues with scripts that are written from
701701- right to left must be considered in other contexts.
702702-703703- One such issue is the need to keep numbers separate. Several scripts
704704- are used with multiple sets of numbers -- most commonly they use
705705- Latin numbers and a script-specific set of numbers, but in the case
706706- of Arabic, there are two sets of "Arabic-Indic" digits involved.
707707-708708- The algorithm in this document disallows occurrences of AN-class
709709- characters ("Arabic-Indic digits", U+0660 to U+0669) together with
710710- EN-class characters (which includes "European" digits, U+0030 to
711711- U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
712712- does not help in preventing the mixing of, for instance, Bengali
713713- digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
714714- both of which have Bidi class L. A registry or script community that
715715- wishes to create rules restricting the mixing of digits in a label
716716- will be able to specify these restrictions at the registry level.
717717- Some rules are also specified at the protocol level.
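
The digit ranges named above are enough for a small OCaml sketch of the AN/EN check. It is a simplification and not the full Bidi rule: the function names are illustrative, a label is modelled as a list of code points, and the check is applied to any label without first testing whether the label is an RTL label.

```ocaml
(* AN: Arabic-Indic digits, U+0660 to U+0669. *)
let is_an cp = cp >= 0x0660 && cp <= 0x0669

(* EN: European digits U+0030 to U+0039 and extended Arabic-Indic
   digits U+06F0 to U+06F9. *)
let is_en cp =
  (cp >= 0x0030 && cp <= 0x0039) || (cp >= 0x06F0 && cp <= 0x06F9)

(* True if the label contains both an AN and an EN character. *)
let mixes_an_en cps = List.exists is_an cps && List.exists is_en cps

let () =
  (* ALEF (Arabic), Arabic-Indic ONE, European ONE: mixing is detected. *)
  assert (mixes_an_en [0x0627; 0x0661; 0x31]);
  (* Bengali ZERO with Gujarati ZERO: both are class L, so not detected. *)
  assert (not (mixes_an_en [0x09E6; 0x0AE6]))
```

The second assertion shows the gap the text describes: Bengali and Gujarati digits pass this check, so any restriction on mixing them has to come from a registry-level rule.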
718718-719719- Another set of issues concerns the proper display of IDNs with a
720720- mixture of LTR and RTL labels, or only RTL labels.
721721-722722- It is unrealistic to expect that applications will display domain
723723- names using embedded formatting codes between their labels (for one
724724- thing, no reliable algorithms for identifying domain names in running
725725- text exist); thus, the display order will be determined by the Bidi
726726- algorithm. Thus, a sequence (in network order) of R1.R2.ltr will be
733733-734734-735735- displayed in the order 2R.1R.ltr in an LTR context, which might
736736- surprise someone expecting to see labels displayed in hierarchical
737737- order. People used to working with text that mixes LTR and RTL
738738- strings might not be so surprised by this. Again, this memo does not
739739- attempt to suggest a solution to this problem.
740740-741741-7. Compatibility Considerations
742742-743743-7.1. Backwards Compatibility Considerations
744744-745745- As with any change to an existing standard, it is important to
746746- consider what happens with existing implementations when the change
747747- is introduced. Some troublesome cases include:
748748-749749- o An old program used to input the newly allowed label. If the old
750750- program checks the input against RFC 3454, some labels will not be
751751- allowed, and domain names containing those labels will remain
752752- inaccessible.
753753-754754- o An old program is asked to display the newly allowed label, and
755755- checks it against RFC 3454 before displaying. The program will
756756- perform some kind of fallback, most likely displaying the label in
757757- A-label form.
758758-759759- o An old program tries to display the newly allowed label. If the
760760- old program has code for displaying the last character of a label
761761- that is different from the code used to display the characters in
762762- the middle of the label, the display may be inconsistent and cause
763763- confusion.
764764-765765- One particular example of the last case is if a program chooses to
766766- examine the last character (in network order) of a string in order to
767767- determine its directionality, rather than its first. If it finds an
768768- NSM character and tries to display the string as if it was a
769769- left-to-right string, the resulting display may be interesting, but
770770- not useful.
771771-772772- The editors believe that these cases will have a less harmful impact
773773- in practice than continuing to deny the use of words from the
774774- languages for which these strings are necessary as IDN labels.
775775-776776- This specification does not forbid using leading European digits in
777777- ASCII-only labels, since this would conflict with a large installed
778778- base of such labels, and would increase the scope of the
779779- specification from RTL labels to all labels. The harm resulting from
780780- this limitation of scope is described in Section 5. Registries and
781781- private zone managers can check for this particular condition before
782782- they allow registration of any RTL label. Generally, it is best to
789789-790790-791791- disallow registration of any right-to-left strings in a zone where
792792- the label at the level above begins with a digit.
793793-794794-7.2. Forward Compatibility Considerations
795795-796796- This text is intentionally specified strictly in terms of the Unicode
797797- Bidi properties. The determination that the condition is sufficient
798798- to fulfill the criteria depends on the Unicode Bidi algorithm; it is
799799- unlikely that drastic changes will be made to this algorithm.
800800-801801- However, the determination of validity for any string depends on the
802802- Unicode Bidi property values, which are not declared immutable by the
803803- Unicode Consortium. Furthermore, the behavior of the algorithm for
804804- any given character is likely to be linguistically and culturally
805805- sensitive, so while it should occur rarely, it is possible that later
806806- versions of the Unicode Standard may change the Bidi properties
807807- assigned to certain Unicode characters.
808808-809809- This memo does not propose a solution for this problem.
810810-811811-8. Security Considerations
812812-813813- The display behavior of mixed-direction text can be extremely
814814- surprising to users who are not used to it; for instance, cut and
815815- paste of a piece of text can cause the text to display differently at
816816- the destination, if the destination is in another directionality
817817- context, and adding a character in one place of a text can cause
818818- characters some distance from the point of insertion to change their
819819- display position. This is, however, not a phenomenon unique to the
820820- display of domain names.
821821-822822- The new IDNA protocol, and particularly these new Bidi rules, will
823823- allow some strings to be used in IDNA contexts that are not allowed
824824- today. It is possible that differences in the interpretation of
825825- labels between implementations of IDNA2003 and IDNA2008 could pose a
826826- security risk, but it is difficult to envision any specific
827827- instantiation of this.
828828-829829- Any rational attempt to compute, for instance, a hash over an
830830- identifier processed by IDNA would use network order for its
831831- computation, and thus be unaffected by the new rules proposed here.
832832-833833- While it is not believed to pose a problem, if display routines had
834834- been written with specific knowledge of the RFC 3454 IDNA
835835- prohibitions, it is possible that the potential problems noted under
836836- "Backwards Compatibility Considerations" could cause new kinds of
837837- confusion.
845845-846846-847847-9. Acknowledgements
848848-849849- While the listed editors held the pen, this document represents the
850850- joint work and conclusions of an ad hoc design team. In addition to
851851- the editors, this consisted of, in alphabetic order, Tina Dam, Patrik
852852- Faltstrom, and John Klensin. Many further specific contributions and
853853- helpful comments were received from the people listed below, and
854854- others who have contributed to the development and use of the IDNA
855855- protocols.
856856-857857- The particular formulation of the Bidi rule in Section 2 was
858858- suggested by Matitiahu Allouche.
859859-860860- The team wishes, in particular, to thank Roozbeh Pournader for
861861- calling its attention to the issue with the Thaana script, Paul
862862- Hoffman for pointing out the need to be explicit about backwards
863863- compatibility considerations, Ken Whistler for suggesting the basis
864864- of the formalized "Character Grouping" requirement, Mark Davis for
865865- commentary, Erik van der Poel for careful review, comments, and
866866- verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
867867- Resnick for reviews, and Vint Cerf for chairing the working group and
868868- contributing massively to getting the documents finished.
869869-870870-10. References
871871-872872-10.1. Normative References
873873-874874- [RFC5890] Klensin, J., "Internationalized Domain Names for
875875- Applications (IDNA): Definitions and Document
876876- Framework", RFC 5890, August 2010.
877877-878878- [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:
879879- Unicode Bidirectional Algorithm", September 2009,
880880- <http://www.unicode.org/reports/tr9/>.
881881-882882- [Unicode52] The Unicode Consortium. The Unicode Standard, Version
883883- 5.2.0, defined by: "The Unicode Standard, Version
884884- 5.2.0", (Mountain View, CA: The Unicode Consortium,
885885- 2009. ISBN 978-1-936213-00-9).
886886- <http://www.unicode.org/versions/Unicode5.2.0/>.
901901-902902-903903-10.2. Informative References
904904-905905- [RFC2672] Crawford, M., "Non-Terminal DNS Name Redirection",
906906- RFC 2672, August 1999.
907907-908908- [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
909909- Internationalized Strings ("stringprep")", RFC 3454,
910910- December 2002.
911911-912912- [RFC5891] Klensin, J., "Internationalized Domain Names in
913913- Applications (IDNA): Protocol", RFC 5891, August 2010.
914914-915915- [SYO] "The Standardized Yiddish Orthography: Rules of
916916- Yiddish Spelling, 6th ed., New York, ISBN
917917- 0-914512-25-0", 1999.
918918-919919-Authors' Addresses
920920-921921- Harald Tveit Alvestrand (editor)
922922- Google
923923- Beddingen 10
924924- Trondheim, 7014
925925- Norway
926926-927927- EMail: harald@alvestrand.no
928928-929929-930930- Cary Karp
931931- Swedish Museum of Natural History
932932- Frescativ. 40
933933- Stockholm, 10405
934934- Sweden
935935-936936- Phone: +46 8 5195 4055
937937- Fax:
938938- EMail: ck@nic.museum