cue/literal: reject invalid UTF-8 in Unquote

The scanner already rejects invalid UTF-8 byte sequences per the spec,
but literal.Unquote accepted them. For example, "\xb0" (a lone
continuation byte) was accepted by Unquote even though the parser
rejected it, because unquoteChar passed through the RuneError from
DecodeRuneInString without checking for the size==1 invalid-byte signal.
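
As an illustration of the size==1 signal, here is a minimal sketch using only
the standard library. The helper name isInvalidByte is hypothetical and is not
part of cue/literal; it only shows how an invalid byte can be told apart from
a validly encoded U+FFFD, which decodes to the same rune with size 3.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isInvalidByte reports whether s starts with a byte that does not begin a
// valid UTF-8 sequence. Hypothetical helper, for illustration only.
func isInvalidByte(s string) bool {
	r, size := utf8.DecodeRuneInString(s)
	// A validly encoded U+FFFD also decodes to RuneError, but with size 3.
	// Only size 1 signals an invalid byte; empty input yields size 0.
	return r == utf8.RuneError && size == 1
}

func main() {
	for _, s := range []string{"\xb0", "\xff", "\uFFFD", "é"} {
		r, size := utf8.DecodeRuneInString(s)
		fmt.Printf("%q: rune=%U size=%d invalid=%t\n", s, r, size, isInvalidByte(s))
	}
}
```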

Fix unquoteChar to return errInvalidUTF8 when DecodeRuneInString
indicates an invalid byte, and update the isSimple fast path to also
bail out on RuneError (covering both invalid UTF-8 and valid U+FFFD,
the latter being handled correctly by the slow path).
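
To sketch why the fast path bails out on RuneError in general: ranging over a
string yields utf8.RuneError both for an invalid byte and for a valid U+FFFD,
so the loop alone cannot tell them apart. The functions needsSlowPath and
firstInvalid below are hypothetical, simplified stand-ins, not the actual
isSimple or unquoteChar code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsSlowPath is a simplified, hypothetical version of the fast-path check:
// it returns true as soon as a rune needs closer inspection.
func needsSlowPath(s string, quote rune) bool {
	for _, r := range s {
		// range produces RuneError for an invalid byte and for a valid
		// U+FFFD alike, so both cases are deferred to the slow path.
		if r == quote || r == '\\' || r == 0 || r == utf8.RuneError {
			return true
		}
	}
	return false
}

// firstInvalid plays the role of the slow path: it returns the offset of the
// first invalid byte in s, or -1 if s is valid UTF-8.
func firstInvalid(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	for _, s := range []string{"abc", "a\uFFFD", "a\xb0"} {
		fmt.Printf("%q: slow=%t firstInvalid=%d\n", s, needsSlowPath(s, '"'), firstInvalid(s))
	}
}
```

With this split, a string containing a valid U+FFFD takes the slow path and is
accepted there, while a string containing a stray byte is rejected.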

Found by the fuzzer.

Signed-off-by: Daniel Martí <mvdan@mvdan.cc>
Change-Id: I047928b8cfd881bb69cfe06028afa20ee16ec537
Reviewed-on: https://review.gerrithub.io/c/cue-lang/cue/+/1235297
Unity-Result: CUE porcuepine <cue.porcuepine@gmail.com>
Reviewed-by: Matthew Sackman <matthew@cue.works>
TryBot-Result: CUEcueckoo <cueckoo@cuelang.org>

Daniel Martí 2 months ago dc8d0f70 88f4d9de

+10 -4

2 changed files

+5 -4

cue/literal/string.go

@@ -32,6 +32,7 @@
 	// TODO: making this an error is optional according to RFC 4627. But we
 	// could make it not an error if this ever results in an issue.
 	errSurrogate = errors.New("unmatched surrogate pair")
+	errInvalidUTF8 = errors.New("invalid UTF-8 encoding")
 	errEscapedLastNewline = errors.New("last newline of multiline string cannot be escaped")
 )
 
@@ -285,7 +286,7 @@
 	// faster than converting to code points. At the very least there should
 	// be an ASCII fast path.
 	for _, r := range s {
-		if r == quote || r == '\\' || r == 0 {
+		if r == quote || r == '\\' || r == 0 || r == utf8.RuneError {
 			return false
 		}
 		if surHigh <= r && r < surEnd {
@@ -345,10 +346,10 @@
 		}
 		return terminatedByQuote, false, "", nil
 	case c >= utf8.RuneSelf:
-		// TODO: consider handling surrogate values. These are discarded by
-		// DecodeRuneInString. It is technically correct to disallow it, but
-		// some JSON parsers allow this anyway.
 		r, size := utf8.DecodeRuneInString(s)
+		if r == utf8.RuneError && size == 1 {
+			return 0, false, s, errInvalidUTF8
+		}
 		return r, true, s[size:], nil
 	case c != '\\':
 		if c == 0 {

cue/literal/string_test.go

@@ -72,6 +72,11 @@
 		{"\"\x00\"", "", errSyntax},
 		{"'\x00'", "", errSyntax},
 
+		// Invalid UTF-8 bytes in the source are rejected, matching the scanner.
+		{"\"\xb0\"", "", errInvalidUTF8},
+		{"'\xb0'", "", errInvalidUTF8},
+		{"\"\xff\"", "", errInvalidUTF8},
+
 		{`"\\"`, "\\", nil},
 		{`"\'"`, "", errSyntax},
 		{`"\q"`, "", errSyntax},
