* [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

This should unbreak "make check" on machines where char is unsigned.
Blue, please give it a whirl.

The JSON parser is still as broken as ever.  Left for another day.

v2:
- Rebased, trivial conflicts in PATCH 1/4.
- Make mod_utf8_codepoint() treat empty input as invalid sequence of
  length zero (both when n==0 and when n>0 && *s==0).  No code in this
  series passes empty input.
- Some commit messages and comments improved.

Markus Armbruster (4):
  unicode: New mod_utf8_codepoint()
  check-qjson: Improve a few comments, delete bogus ones
  check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

 include/qemu-common.h |   3 +
 qobject/qjson.c       | 102 ++++++++---------
 tests/check-qjson.c   | 308 ++++++++++++++++++++++++++++++--------------------
 util/Makefile.objs    |   2 +-
 util/unicode.c        | 100 ++++++++++++++++
 5 files changed, 333 insertions(+), 182 deletions(-)
 create mode 100644 util/unicode.c

-- 
1.7.11.7
* [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() 2013-04-11 16:07 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster @ 2013-04-11 16:07 ` Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 2/4] check-qjson: Improve a few comments, delete bogus ones Markus Armbruster ` (5 subsequent siblings) 6 siblings, 0 replies; 12+ messages in thread From: Markus Armbruster @ 2013-04-11 16:07 UTC (permalink / raw) To: qemu-devel; +Cc: blauwirbel, aliguori, lersek Signed-off-by: Markus Armbruster <armbru@redhat.com> --- include/qemu-common.h | 3 ++ util/Makefile.objs | 2 +- util/unicode.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 104 insertions(+), 1 deletion(-) create mode 100644 util/unicode.c diff --git a/include/qemu-common.h b/include/qemu-common.h index 31fff22..3b1873e 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -442,6 +442,9 @@ int64_t pow2floor(int64_t value); int uleb128_encode_small(uint8_t *out, uint32_t n); int uleb128_decode_small(const uint8_t *in, uint32_t *n); +/* unicode.c */ +int mod_utf8_codepoint(const char *s, size_t n, char **end); + /* * Hexdump a buffer to a file. An optional string prefix is added to every line */ diff --git a/util/Makefile.objs b/util/Makefile.objs index 557bda7..c5652f5 100644 --- a/util/Makefile.objs +++ b/util/Makefile.objs @@ -1,4 +1,4 @@ -util-obj-y = osdep.o cutils.o qemu-timer-common.o +util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o event_notifier-win32.o util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o event_notifier-posix.o util-obj-y += envlist.o path.o host-utils.o cache-utils.o module.o diff --git a/util/unicode.c b/util/unicode.c new file mode 100644 index 0000000..d1c8658 --- /dev/null +++ b/util/unicode.c @@ -0,0 +1,100 @@ +/* + * Dealing with Unicode + * + * Copyright (C) 2013 Red Hat, Inc. 
+ * + * Authors: + * Markus Armbruster <armbru@redhat.com> + * + * This work is licensed under the terms of the GNU GPL, version 2 or + * later. See the COPYING file in the top-level directory. + */ + +#include "qemu-common.h" + +/** + * mod_utf8_codepoint: + * @s: string encoded in modified UTF-8 + * @n: maximum number of bytes to read from @s, if less than 6 + * @end: set to end of sequence on return + * + * Convert the modified UTF-8 sequence at the start of @s. Modified + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as + * "\xC0\x80". + * + * If @n is zero or @s points to a zero byte, the sequence is invalid, + * and @end is set to @s. + * + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation + * byte, the sequence is invalid, and @end is set to @s + 1 + * + * Else, the first byte determines how many continuation bytes are + * expected. If there are fewer, the sequence is invalid, and @end is + * set to @s + 1 + actual number of continuation bytes. Else, the + * sequence is well-formed, and @end is set to @s + 1 + expected + * number of continuation bytes. + * + * A well-formed sequence is valid unless it encodes a codepoint + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66 + * noncharacters, a surrogate codepoint, or is overlong. Except the + * overlong sequence "\xC0\x80" is valid. + * + * Conversion succeeds if and only if the sequence is valid. + * + * Returns: the Unicode codepoint on success, -1 on failure. 
+ */ +int mod_utf8_codepoint(const char *s, size_t n, char **end) +{ + static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 }; + const unsigned char *p; + unsigned byte, mask, len, i; + int cp; + + if (n == 0 || *s == 0) { + /* empty sequence */ + *end = (char *)s; + return -1; + } + + p = (const unsigned char *)s; + byte = *p++; + if (byte < 0x80) { + cp = byte; /* one byte sequence */ + } else if (byte >= 0xFE) { + cp = -1; /* impossible bytes 0xFE, 0xFF */ + } else if ((byte & 0x40) == 0) { + cp = -1; /* unexpected continuation byte */ + } else { + /* multi-byte sequence */ + len = 0; + for (mask = 0x80; byte & mask; mask >>= 1) { + len++; + } + assert(len > 1 && len < 7); + cp = byte & (mask - 1); + for (i = 1; i < len; i++) { + byte = i < n ? *p : 0; + if ((byte & 0xC0) != 0x80) { + cp = -1; /* continuation byte missing */ + goto out; + } + p++; + cp <<= 6; + cp |= byte & 0x3F; + } + if (cp > 0x10FFFF) { + cp = -1; /* beyond Unicode range */ + } else if ((cp >= 0xFDD0 && cp <= 0xFDEF) + || (cp & 0xFFFE) == 0xFFFE) { + cp = -1; /* noncharacter */ + } else if (cp >= 0xD800 && cp <= 0xDFFF) { + cp = -1; /* surrogate code point */ + } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) { + cp = -1; /* overlong, not \xC0\x80 */ + } + } + +out: + *end = (char *)p; + return cp; +} -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 12+ messages in thread
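The validity rules in the doc comment of mod_utf8_codepoint() (Unicode range, 66 noncharacters, surrogates) can be checked in isolation. The following is a minimal standalone sketch, not code from the patch; the helper names is_noncharacter(), is_valid_codepoint() and count_noncharacters() are hypothetical and exist only for illustration. It reuses the same range and mask tests as the patch and omits the overlong check, which needs the sequence length.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not in the patch): true for Unicode's
 * noncharacters, i.e. U+FDD0..U+FDEF plus the last two code points
 * of each of the 17 planes (U+xxFFFE, U+xxFFFF).
 */
bool is_noncharacter(int cp)
{
    assert(cp >= 0 && cp <= 0x10FFFF);
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/*
 * Hypothetical helper: the code point checks applied after decoding,
 * without the overlong check.
 */
bool is_valid_codepoint(int cp)
{
    if (cp < 0 || cp > 0x10FFFF) {
        return false;           /* outside Unicode range */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return false;           /* surrogate code point */
    }
    return !is_noncharacter(cp);
}

/* Count noncharacters by brute force over the whole Unicode range. */
int count_noncharacters(void)
{
    int cp, n = 0;

    for (cp = 0; cp <= 0x10FFFF; cp++) {
        if (is_noncharacter(cp)) {
            n++;
        }
    }
    return n;
}
```

The brute-force count is 32 (U+FDD0..U+FDEF) plus 2 per plane times 17 planes, which matches the "66 noncharacters" mentioned in the doc comment.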
* [Qemu-devel] [PATCH 2/4] check-qjson: Improve a few comments, delete bogus ones 2013-04-11 16:07 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster @ 2013-04-11 16:07 ` Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster ` (4 subsequent siblings) 6 siblings, 0 replies; 12+ messages in thread From: Markus Armbruster @ 2013-04-11 16:07 UTC (permalink / raw) To: qemu-devel; +Cc: blauwirbel, aliguori, lersek Signed-off-by: Markus Armbruster <armbru@redhat.com> --- tests/check-qjson.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/tests/check-qjson.c b/tests/check-qjson.c index ec85a0c..91b4e5d 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -4,7 +4,7 @@ * * Authors: * Anthony Liguori <aliguori@us.ibm.com> - * Markus Armbruster <armbru@redhat.com>, + * Markus Armbruster <armbru@redhat.com> * * This work is licensed under the terms of the GNU LGPL, version 2.1 or later. * See the COPYING.LIB file in the top-level directory. 
@@ -285,31 +285,31 @@ static void utf8_string(void) }, /* 2.3 Other boundary conditions */ { - /* U+D7FF */ + /* last one before surrogate range: U+D7FF */ "\"\xED\x9F\xBF\"", "\xED\x9F\xBF", "\"\\uD7FF\"", }, { - /* U+E000 */ + /* first one after surrogate range: U+E000 */ "\"\xEE\x80\x80\"", "\xEE\x80\x80", "\"\\uE000\"", }, { - /* U+FFFD */ + /* last one in BMP: U+FFFD */ "\"\xEF\xBF\xBD\"", "\xEF\xBF\xBD", "\"\\uFFFD\"", }, { - /* U+10FFFF */ + /* last one in last plane: U+10FFFF */ "\"\xF4\x8F\xBF\xBF\"", "\xF4\x8F\xBF\xBF", "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */ }, { - /* U+110000 */ + /* first one beyond Unicode range: U+110000 */ "\"\xF4\x90\x80\x80\"", "\xF4\x90\x80\x80", "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ @@ -462,8 +462,7 @@ static void utf8_string(void) }, /* 3.3.4 5-byte sequence with last byte missing (U+0000) */ { - /* invalid */ - "\"\xF8\x80\x80\x80\"", /* bug: not corrected */ + "\"\xF8\x80\x80\x80\"", NULL, /* bug: rejected */ "\"\\u8000\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ "\xF8\x80\x80\x80", @@ -570,7 +569,12 @@ static void utf8_string(void) "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ "\xFC\x80\x80\x80\x80\xAF", }, - /* 4.2 Maximum overlong sequences */ + /* + * 4.2 Maximum overlong sequences + * Highest Unicode value that is still resulting in an + * overlong sequence if represented with the given number of + * bytes. This is a boundary test for safe UTF-8 decoders. + */ { /* \U+007F */ "\"\xC1\xBF\"", -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 12+ messages in thread
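The U+10FFFF test case above wants the formatter to emit "\\uDBFF\\uDFFF", a UTF-16 surrogate pair. As a rough illustration of where those two values come from, here is a minimal sketch under the usual UTF-16 rules; utf16_surrogates() and json_escape_supplementary() are hypothetical names, not functions from QEMU.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: split a code point beyond the BMP
 * (U+10000..U+10FFFF) into its high and low surrogate.
 */
void utf16_surrogates(unsigned cp, unsigned *hi, unsigned *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                  /* 20 significant bits remain */
    *hi = 0xD800 + (cp >> 10);      /* upper 10 bits */
    *lo = 0xDC00 + (cp & 0x3FF);    /* lower 10 bits */
}

/*
 * Hypothetical helper: format the JSON escape for such a code point
 * into buf.  Returns what snprintf() returns.
 */
int json_escape_supplementary(unsigned cp, char *buf, size_t size)
{
    unsigned hi, lo;

    utf16_surrogates(cp, &hi, &lo);
    return snprintf(buf, size, "\\u%04X\\u%04X", hi, lo);
}
```

For U+10FFFF, subtracting 0x10000 leaves 0xFFFFF, so both 10-bit halves are 0x3FF, giving 0xDBFF and 0xDFFF.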
* [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Test cases cover the two noncharacters in the BMP.  Add tests for the
other 64 noncharacters.

Three existing test cases involve noncharacters U+FFFF and U+10FFFF.
Instead of deleting them as now duplicates, adjust them to use U+FFFC
and U+10FFFD.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 tests/check-qjson.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 12 deletions(-)

diff --git a/tests/check-qjson.c b/tests/check-qjson.c index 91b4e5d..54074a9 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -158,7 +158,7 @@ static void utf8_string(void) * consider using overlong encoding \xC0\x80 for U+0000 ("modified * UTF-8"). * - * Test cases are scraped from Markus Kuhn's UTF-8 decoder + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder * capability and stress test at * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt */ @@ -256,11 +256,19 @@ static void utf8_string(void) "\xDF\xBF", "\"\\u07FF\"", }, - /* 2.2.3 3 bytes U+FFFF */ + /* + * 2.2.3 3 bytes U+FFFC + * The last possible sequence is actually U+FFFF. But that's + * a noncharacter, and already covered by its own test case + * under 5.3. Same for U+FFFE. 
U+FFFD is the last character + * in the BMP, and covered under 2.3. Because of U+FFFD's + * special role as replacement character, it's worth testing + * U+FFFC here. + */ { - "\"\xEF\xBF\xBF\"", - "\xEF\xBF\xBF", - "\"\\uFFFF\"", + "\"\xEF\xBF\xBC\"", + "\xEF\xBF\xBC", + "\"\\uFFFC\"", }, /* 2.2.4 4 bytes U+1FFFFF */ { @@ -303,10 +311,10 @@ static void utf8_string(void) "\"\\uFFFD\"", }, { - /* last one in last plane: U+10FFFF */ - "\"\xF4\x8F\xBF\xBF\"", - "\xF4\x8F\xBF\xBF", - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */ + /* last one in last plane: U+10FFFD */ + "\"\xF4\x8F\xBF\xBD\"", + "\xF4\x8F\xBF\xBD", + "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ }, { /* first one beyond Unicode range: U+110000 */ @@ -589,9 +597,14 @@ static void utf8_string(void) "\"\\u07FF\"", }, { - /* \U+FFFF */ - "\"\xF0\x8F\xBF\xBF\"", - "\xF0\x8F\xBF\xBF", /* bug: not corrected */ + /* + * \U+FFFC + * The actual maximum would be U+FFFF, but that's a + * noncharacter. Testing U+FFFC seems more useful. See + * also 2.2.3 + */ + "\"\xF0\x8F\xBF\xBC\"", + "\xF0\x8F\xBF\xBC", /* bug: not corrected */ "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ }, { @@ -736,6 +749,7 @@ static void utf8_string(void) "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ }, /* 5.3 Other illegal code positions */ + /* BMP noncharacters */ { /* \U+FFFE */ "\"\xEF\xBF\xBE\"", @@ -748,6 +762,64 @@ static void utf8_string(void) "\xEF\xBF\xBF", /* bug: not corrected */ "\"\\uFFFF\"", /* bug: not corrected */ }, + { + /* U+FDD0 */ + "\"\xEF\xB7\x90\"", + "\xEF\xB7\x90", /* bug: not corrected */ + "\"\\uFDD0\"", /* bug: not corrected */ + }, + { + /* U+FDEF */ + "\"\xEF\xB7\xAF\"", + "\xEF\xB7\xAF", /* bug: not corrected */ + "\"\\uFDEF\"", /* bug: not corrected */ + }, + /* Plane 1 .. 16 noncharacters */ + { + /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... 
U+10FFFE U+10FFFF */ + "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"", + /* bug: not corrected */ + "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", + /* bug: not corrected */ + "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" + "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" + "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" + "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" + "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" + "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" + "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" + "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", + }, {} }; int i; -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 12+ messages in thread
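The byte strings in the new noncharacter test cases are plain UTF-8 encodings of the code points named in the comments. For readers who want to re-derive them, here is a minimal sketch of a standard UTF-8 encoder; utf8_encode() is a hypothetical helper, not part of QEMU or of this series. It deliberately performs no validity checks beyond the range assertion, so it also encodes noncharacters and surrogates, which is what the test inputs need.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: encode cp (U+0000..U+10FFFF) as standard
 * UTF-8 into out[], returning the number of bytes written (1..4).
 * No check for surrogates or noncharacters; no modified UTF-8.
 */
size_t utf8_encode(unsigned cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}
```

Feeding it U+FDD0, U+1FFFE and U+10FFFF reproduces the "\xEF\xB7\x90", "\xF0\x9F\xBF\xBE" and "\xF4\x8F\xBF\xBF" sequences used in the test table.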
* [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Known bugs in to_json():

* A start byte for a three-byte sequence followed by fewer than two
  continuation bytes is split into one-byte sequences.

* Start bytes for sequences longer than three bytes get misinterpreted
  as start bytes for three-byte sequences.  Continuation bytes beyond
  byte three become one-byte sequences.

  This means all characters outside the BMP are decoded incorrectly.

* One-byte sequences with the MSB set are put into the JSON string
  verbatim when char is unsigned, producing invalid UTF-8.  When char
  is signed, they're replaced by "\\uFFFF" instead.

  This includes \xFE, \xFF, and stray continuation bytes.

* Overlong sequences are happily accepted, unless screwed up by the
  bugs above.

* Likewise, sequences encoding surrogate code points or noncharacters.

* Unlike other control characters, ASCII DEL is not escaped.  Except
  in overlong encodings.

My rewrite fixes them as follows:

* Malformed UTF-8 sequences are replaced.

  Except the overlong encoding \xC0\x80 of U+0000 is still accepted.
  Permits embedding NUL characters in C strings.  This trick is known
  as "Modified UTF-8".

* Sequences encoding code points beyond Unicode range are replaced.

* Sequences encoding code points beyond the BMP produce a surrogate
  pair.

* Sequences encoding surrogate code points are replaced.

* Sequences encoding noncharacters are replaced.

* ASCII DEL is now always escaped. The replacement character is U+FFFD. Signed-off-by: Markus Armbruster <armbru@redhat.com> --- qobject/qjson.c | 102 +++++++++++-------------- tests/check-qjson.c | 216 ++++++++++++++++++++++++---------------------------- 2 files changed, 145 insertions(+), 173 deletions(-) diff --git a/qobject/qjson.c b/qobject/qjson.c index 83a6b4f..19085a1 100644 --- a/qobject/qjson.c +++ b/qobject/qjson.c @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) case QTYPE_QSTRING: { QString *val = qobject_to_qstring(obj); const char *ptr; + int cp; + char buf[16]; + char *end; ptr = qstring_get_str(val); qstring_append(str, "\""); - while (*ptr) { - if ((ptr[0] & 0xE0) == 0xE0 && - (ptr[1] & 0x80) && (ptr[2] & 0x80)) { - uint16_t wchar; - char escape[7]; - - wchar = (ptr[0] & 0x0F) << 12; - wchar |= (ptr[1] & 0x3F) << 6; - wchar |= (ptr[2] & 0x3F); - ptr += 2; - - snprintf(escape, sizeof(escape), "\\u%04X", wchar); - qstring_append(str, escape); - } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) { - uint16_t wchar; - char escape[7]; - - wchar = (ptr[0] & 0x1F) << 6; - wchar |= (ptr[1] & 0x3F); - ptr++; - - snprintf(escape, sizeof(escape), "\\u%04X", wchar); - qstring_append(str, escape); - } else switch (ptr[0]) { - case '\"': - qstring_append(str, "\\\""); - break; - case '\\': - qstring_append(str, "\\\\"); - break; - case '\b': - qstring_append(str, "\\b"); - break; - case '\f': - qstring_append(str, "\\f"); - break; - case '\n': - qstring_append(str, "\\n"); - break; - case '\r': - qstring_append(str, "\\r"); - break; - case '\t': - qstring_append(str, "\\t"); - break; - default: { - if (ptr[0] <= 0x1F) { - char escape[7]; - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); - qstring_append(str, escape); - } else { - char buf[2] = { ptr[0], 0 }; - qstring_append(str, buf); - } - break; + + for (; *ptr; ptr = end) { + cp = mod_utf8_codepoint(ptr, 6, &end); + switch (cp) { + case '\"': + 
qstring_append(str, "\\\""); + break; + case '\\': + qstring_append(str, "\\\\"); + break; + case '\b': + qstring_append(str, "\\b"); + break; + case '\f': + qstring_append(str, "\\f"); + break; + case '\n': + qstring_append(str, "\\n"); + break; + case '\r': + qstring_append(str, "\\r"); + break; + case '\t': + qstring_append(str, "\\t"); + break; + default: + if (cp < 0) { + cp = 0xFFFD; /* replacement character */ } + if (cp > 0xFFFF) { + /* beyond BMP; need a surrogate pair */ + snprintf(buf, sizeof(buf), "\\u%04X\\u%04X", + 0xD800 + ((cp - 0x10000) >> 10), + 0xDC00 + ((cp - 0x10000) & 0x3FF)); + } else if (cp < 0x20 || cp >= 0x7F) { + snprintf(buf, sizeof(buf), "\\u%04X", cp); + } else { + buf[0] = cp; + buf[1] = 0; } - ptr++; - } + qstring_append(str, buf); + } + }; + qstring_append(str, "\""); break; } diff --git a/tests/check-qjson.c b/tests/check-qjson.c index 54074a9..4e74548 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -144,13 +144,10 @@ static void utf8_string(void) * The JSON parser rejects some invalid sequences, but accepts * others without correcting the problem. * - * The JSON formatter replaces some invalid sequences by U+FFFF (a - * noncharacter), and goes wonky for others. - * - * For both directions, we should either reject all invalid - * sequences, or minimize overlong sequences and replace all other - * invalid sequences by a suitable replacement character. A - * common choice for replacement is U+FFFD. + * We should either reject all invalid sequences, or minimize + * overlong sequences and replace all other invalid sequences by a + * suitable replacement character. A common choice for + * replacement is U+FFFD. * * Problem: we can't easily deal with embedded U+0000. 
Parsing * the JSON string "this \\u0000" is fun" yields "this \0 is fun", @@ -175,16 +172,10 @@ static void utf8_string(void) * - bug: rejected * JSON parser rejects invalid sequence(s) * We may choose to define this as feature - * - bug: want "\"...\"" - * JSON formatter produces incorrect result, this is the - * correct one, assuming replacement character U+FFFF - * - bug: want "..." (no \") + * - bug: want "..." * JSON parser produces incorrect result, this is the * correct one, assuming replacement character U+FFFF * We may choose to reject instead of replace - * Not marked explicitly, but trivial to find: - * - JSON formatter replacing invalid sequence by \\uFFFF is a - * bug if we want it to fail for invalid sequences. */ /* 1 Some correct UTF-8 text */ @@ -209,7 +200,8 @@ static void utf8_string(void) { "\"\\u0000\"", "", /* bug: want overlong "\xC0\x80" */ - "\"\"", /* bug: want "\"\\u0000\"" */ + "\"\\u0000\"", + "\xC0\x80", }, /* 2.1.2 2 bytes U+0080 */ { @@ -227,20 +219,20 @@ static void utf8_string(void) { "\"\xF0\x90\x80\x80\"", "\xF0\x90\x80\x80", - "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ + "\"\\uD800\\uDC00\"", }, /* 2.1.5 5 bytes U+200000 */ { "\"\xF8\x88\x80\x80\x80\"", - NULL, /* bug: rejected */ - "\"\\u8200\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xF8\x88\x80\x80\x80", }, /* 2.1.6 6 bytes U+4000000 */ { "\"\xFC\x84\x80\x80\x80\x80\"", - NULL, /* bug: rejected */ - "\"\\uC100\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFC\x84\x80\x80\x80\x80", }, /* 2.2 Last possible sequence of a certain length */ @@ -248,7 +240,7 @@ static void utf8_string(void) { "\"\x7F\"", "\x7F", - "\"\177\"", + "\"\\u007F\"", }, /* 2.2.2 2 bytes U+07FF */ { @@ -273,22 +265,22 @@ static void utf8_string(void) /* 2.2.4 4 bytes U+1FFFFF */ { "\"\xF7\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\u7FFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ 
+ NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xF7\xBF\xBF\xBF", }, /* 2.2.5 5 bytes U+3FFFFFF */ { "\"\xFB\xBF\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\uBFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFB\xBF\xBF\xBF\xBF", }, /* 2.2.6 6 bytes U+7FFFFFFF */ { "\"\xFD\xBF\xBF\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\uDFFF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFD\xBF\xBF\xBF\xBF\xBF", }, /* 2.3 Other boundary conditions */ @@ -314,13 +306,13 @@ static void utf8_string(void) /* last one in last plane: U+10FFFD */ "\"\xF4\x8F\xBF\xBD\"", "\xF4\x8F\xBF\xBD", - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ + "\"\\uDBFF\\uDFFD\"" }, { /* first one beyond Unicode range: U+110000 */ "\"\xF4\x90\x80\x80\"", "\xF4\x90\x80\x80", - "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3 Malformed sequences */ /* 3.1 Unexpected continuation bytes */ @@ -328,49 +320,49 @@ static void utf8_string(void) { "\"\x80\"", "\x80", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.1.2 Last continuation byte */ { "\"\xBF\"", "\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.1.3 2 continuation bytes */ { "\"\x80\xBF\"", "\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\"", }, /* 3.1.4 3 continuation bytes */ { "\"\x80\xBF\x80\"", "\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.5 4 continuation bytes */ { "\"\x80\xBF\x80\xBF\"", "\x80\xBF\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.6 5 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\"", "\x80\xBF\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.7 6 continuation bytes */ { 
"\"\x80\xBF\x80\xBF\x80\xBF\"", "\x80\xBF\x80\xBF\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.8 7 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\xBF\x80\"", "\x80\xBF\x80\xBF\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.9 Sequence of all 64 possible continuation bytes */ { @@ -391,14 +383,14 @@ static void utf8_string(void) "\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF" "\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7" "\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF", - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"" }, /* 3.2 Lonely start characters */ /* 3.2.1 All 32 first bytes of 2-byte sequences, followed by space */ @@ -408,10 +400,10 @@ static void utf8_string(void) "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 " "\xD8 \xD9 \xDA \xDB \xDC \xDD \xDE \xDF \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF 
\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xC0 \xC1 \xC2 \xC3 \xC4 \xC5 \xC6 \xC7 " "\xC8 \xC9 \xCA \xCB \xCC \xCD \xCE \xCF " "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 " @@ -424,28 +416,28 @@ static void utf8_string(void) /* bug: not corrected */ "\xE0 \xE1 \xE2 \xE3 \xE4 \xE5 \xE6 \xE7 " "\xE8 \xE9 \xEA \xEB \xEC \xED \xEE \xEF ", - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", }, /* 3.2.3 All 8 first bytes of 4-byte sequences, followed by space */ { "\"\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 ", }, /* 3.2.4 All 4 first bytes of 5-byte sequences, followed by space */ { "\"\xF8 \xF9 \xFA \xFB \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xF8 \xF9 \xFA \xFB ", }, /* 3.2.5 All 2 first bytes of 6-byte sequences, followed by space */ { "\"\xFC \xFD \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \"", "\xFC \xFD ", }, /* 3.3 Sequences with last continuation byte missing */ @@ -453,66 +445,66 @@ static void utf8_string(void) { "\"\xC0\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + 
"\"\\uFFFD\"", "\xC0", }, /* 3.3.2 3-byte sequence with last byte missing (U+0000) */ { "\"\xE0\x80\"", "\xE0\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.3 4-byte sequence with last byte missing (U+0000) */ { "\"\xF0\x80\x80\"", "\xF0\x80\x80", /* bug: not corrected */ - "\"\\u0000\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.4 5-byte sequence with last byte missing (U+0000) */ { "\"\xF8\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80", }, /* 3.3.5 6-byte sequence with last byte missing (U+0000) */ { "\"\xFC\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80", }, /* 3.3.6 2-byte sequence with last byte missing (U+07FF) */ { "\"\xDF\"", "\xDF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.3.7 3-byte sequence with last byte missing (U+FFFF) */ { "\"\xEF\xBF\"", "\xEF\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.8 4-byte sequence with last byte missing (U+1FFFFF) */ { "\"\xF7\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\u7FFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF7\xBF\xBF", }, /* 3.3.9 5-byte sequence with last byte missing (U+3FFFFFF) */ { "\"\xFB\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uBFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFB\xBF\xBF\xBF", }, /* 3.3.10 6-byte sequence with last byte missing (U+7FFFFFFF) */ { "\"\xFD\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uDFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"", */ + "\"\\uFFFD\"", "\xFD\xBF\xBF\xBF\xBF", }, /* 3.4 Concatenation of incomplete sequences */ @@ -520,10 +512,8 @@ static void utf8_string(void) "\"\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80" 
"\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */ - "\"\\u0020\\uFFFF\\u0000\\u8000\\uFFFF\\uC000\\uFFFF\\uFFFF" - "\\u07EF\\uFFFF\\u7FFF\\uBFFF\\uFFFF\\uDFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", "\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80" "\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF", }, @@ -531,20 +521,19 @@ static void utf8_string(void) { "\"\xFE\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xFE", }, { "\"\xFF\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xFF", }, { "\"\xFE\xFE\xFF\xFF\"", NULL, /* bug: rejected */ - /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */ - "\"\\uEFBF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", "\xFE\xFE\xFF\xFF", }, /* 4 Overlong sequences */ @@ -552,29 +541,29 @@ static void utf8_string(void) { "\"\xC0\xAF\"", NULL, /* bug: rejected */ - "\"\\u002F\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xC0\xAF", }, { "\"\xE0\x80\xAF\"", "\xE0\x80\xAF", /* bug: not corrected */ - "\"\\u002F\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", }, { "\"\xF0\x80\x80\xAF\"", "\xF0\x80\x80\xAF", /* bug: not corrected */ - "\"\\u0000\\uFFFF\"" /* bug: want "\"/\"" */ + "\"\\uFFFD\"", }, { "\"\xF8\x80\x80\x80\xAF\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80\xAF", }, { "\"\xFC\x80\x80\x80\x80\xAF\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80\xAF", }, /* @@ -587,14 +576,14 @@ static void utf8_string(void) /* \U+007F */ "\"\xC1\xBF\"", NULL, /* bug: rejected */ - "\"\\u007F\"", /* bug: want "\"\177\"" */ + "\"\\uFFFD\"", "\xC1\xBF", }, { /* \U+07FF */ "\"\xE0\x9F\xBF\"", "\xE0\x9F\xBF", /* bug: not corrected */ - 
"\"\\u07FF\"", + "\"\\uFFFD\"", }, { /* @@ -605,20 +594,20 @@ static void utf8_string(void) */ "\"\xF0\x8F\xBF\xBC\"", "\xF0\x8F\xBF\xBC", /* bug: not corrected */ - "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+1FFFFF */ "\"\xF8\x87\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\u81FF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF8\x87\xBF\xBF\xBF", }, { /* \U+3FFFFFF */ "\"\xFC\x83\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uC0FF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFC\x83\xBF\xBF\xBF\xBF", }, /* 4.3 Overlong representation of the NUL character */ @@ -633,26 +622,26 @@ static void utf8_string(void) /* \U+0000 */ "\"\xE0\x80\x80\"", "\xE0\x80\x80", /* bug: not corrected */ - "\"\\u0000\"", + "\"\\uFFFD\"", }, { /* \U+0000 */ "\"\xF0\x80\x80\x80\"", "\xF0\x80\x80\x80", /* bug: not corrected */ - "\"\\u0000\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", }, { /* \U+0000 */ "\"\xF8\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80\x80", }, { /* \U+0000 */ "\"\xFC\x80\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80\x80", }, /* 5 Illegal code positions */ @@ -661,92 +650,92 @@ static void utf8_string(void) /* \U+D800 */ "\"\xED\xA0\x80\"", "\xED\xA0\x80", /* bug: not corrected */ - "\"\\uD800\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DB7F */ "\"\xED\xAD\xBF\"", "\xED\xAD\xBF", /* bug: not corrected */ - "\"\\uDB7F\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DB80 */ "\"\xED\xAE\x80\"", "\xED\xAE\x80", /* bug: not corrected */ - "\"\\uDB80\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DBFF */ "\"\xED\xAF\xBF\"", "\xED\xAF\xBF", /* bug: not corrected */ - "\"\\uDBFF\"", /* bug: want "\"\\uFFFF\"" */ + 
"\"\\uFFFD\"", }, { /* \U+DC00 */ "\"\xED\xB0\x80\"", "\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDC00\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DF80 */ "\"\xED\xBE\x80\"", "\xED\xBE\x80", /* bug: not corrected */ - "\"\\uDF80\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DFFF */ "\"\xED\xBF\xBF\"", "\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 5.2 Paired UTF-16 surrogates */ { /* \U+D800\U+DC00 */ "\"\xED\xA0\x80\xED\xB0\x80\"", "\xED\xA0\x80\xED\xB0\x80", /* bug: not corrected */ - "\"\\uD800\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+D800\U+DFFF */ "\"\xED\xA0\x80\xED\xBF\xBF\"", "\xED\xA0\x80\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uD800\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB7F\U+DC00 */ "\"\xED\xAD\xBF\xED\xB0\x80\"", "\xED\xAD\xBF\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDB7F\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB7F\U+DFFF */ "\"\xED\xAD\xBF\xED\xBF\xBF\"", "\xED\xAD\xBF\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDB7F\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB80\U+DC00 */ "\"\xED\xAE\x80\xED\xB0\x80\"", "\xED\xAE\x80\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDB80\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB80\U+DFFF */ "\"\xED\xAE\x80\xED\xBF\xBF\"", "\xED\xAE\x80\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDB80\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DBFF\U+DC00 */ "\"\xED\xAF\xBF\xED\xB0\x80\"", "\xED\xAF\xBF\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDBFF\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DBFF\U+DFFF */ "\"\xED\xAF\xBF\xED\xBF\xBF\"", "\xED\xAF\xBF\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDBFF\\uDFFF\"", /* bug: 
want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, /* 5.3 Other illegal code positions */ /* BMP noncharacters */ @@ -754,25 +743,25 @@ static void utf8_string(void) /* \U+FFFE */ "\"\xEF\xBF\xBE\"", "\xEF\xBF\xBE", /* bug: not corrected */ - "\"\\uFFFE\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* \U+FFFF */ "\"\xEF\xBF\xBF\"", "\xEF\xBF\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* U+FDD0 */ "\"\xEF\xB7\x90\"", "\xEF\xB7\x90", /* bug: not corrected */ - "\"\\uFDD0\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* U+FDEF */ "\"\xEF\xB7\xAF\"", "\xEF\xB7\xAF", /* bug: not corrected */ - "\"\\uFDEF\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, /* Plane 1 .. 16 noncharacters */ { @@ -810,15 +799,10 @@ static void utf8_string(void) "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", - /* bug: not corrected */ - "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" - "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" - "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" - "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" - "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" - "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" - "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" - "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, {} }; @@ -856,8 +840,8 @@ static void utf8_string(void) qobject_decref(obj); /* - * Disabled, because json_out currently contains the crap - * qobject_to_json() produces. + * Disabled, because qobject_from_json() is buggy, and I can't + * be bothered to add the expected incorrect results. 
* FIXME Enable once these bugs have been fixed. */ if (0 && json_out != json_in) { -- 1.7.11.7
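The expected formatter output in the test table above replaces several classes of code point with U+FFFD: surrogates, the noncharacters U+FDD0..U+FDEF, the U+xxFFFE/U+xxFFFF pair of each plane, and anything past U+10FFFF. A minimal sketch of that classification, assuming the raw decoded value is available as an int (negative on a decoding error); `cp_is_replaced()` is a hypothetical name, not a function in this series:

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of QEMU: true if a decoded code point
 * would be emitted as the replacement character U+FFFD according to
 * the expectations in the test table.
 */
static bool cp_is_replaced(int cp)
{
    if (cp < 0 || cp > 0x10FFFF) {
        return true;            /* decoding error or beyond Unicode range */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return true;            /* surrogate code point */
    }
    if (cp >= 0xFDD0 && cp <= 0xFDEF) {
        return true;            /* BMP noncharacter block */
    }
    if ((cp & 0xFFFE) == 0xFFFE) {
        return true;            /* U+xxFFFE and U+xxFFFF in every plane */
    }
    return false;
}
```

Under these assumptions U+10FFFD passes through while U+10FFFE, U+10FFFF and U+110000 are replaced, which agrees with the 2.3 boundary cases and the "Plane 1 .. 16 noncharacters" entry.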
* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-04-11 16:11 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Rats, forgot --subject-prefix="PATCH v2". My apologies!
* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Laszlo Ersek @ 2013-04-11 17:03 UTC
To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

On 04/11/13 18:07, Markus Armbruster wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.
>
> The JSON parser is still as broken as ever. Left for another day.
>
> v2:
> - Rebased, trivial conflicts in PATCH 1/4.
> - Make mod_utf8_codepoint() treat empty input as invalid sequence of
>   length zero (both when n==0 and when n>0 && *s==0). No code in this
>   series passes empty input.
> - Some commit messages and comments improved.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Improve a few comments, delete bogus ones
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++---------
>  tests/check-qjson.c   | 308 ++++++++++++++++++++++++++++++--------------------
>  util/Makefile.objs    |   2 +-
>  util/unicode.c        | 100 ++++++++++++++++
>  5 files changed, 333 insertions(+), 182 deletions(-)
>  create mode 100644 util/unicode.c
>

I compared this v2 series patch-wise to v1.

Reviewed-by: Laszlo Ersek <lersek@redhat.com>
* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Blue Swirl @ 2013-04-13 19:54 UTC
To: Markus Armbruster; +Cc: Anthony Liguori, Laszlo Ersek, qemu-devel

Thanks, applied all.

On Thu, Apr 11, 2013 at 4:07 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.
>
> The JSON parser is still as broken as ever. Left for another day.
>
> v2:
> - Rebased, trivial conflicts in PATCH 1/4.
> - Make mod_utf8_codepoint() treat empty input as invalid sequence of
>   length zero (both when n==0 and when n>0 && *s==0). No code in this
>   series passes empty input.
> - Some commit messages and comments improved.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Improve a few comments, delete bogus ones
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++---------
>  tests/check-qjson.c   | 308 ++++++++++++++++++++++++++++++--------------------
>  util/Makefile.objs    |   2 +-
>  util/unicode.c        | 100 ++++++++++++++++
>  5 files changed, 333 insertions(+), 182 deletions(-)
>  create mode 100644 util/unicode.c
>
> --
> 1.7.11.7
>
* [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-03-14 17:49 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori

This should unbreak "make check" on machines where char is unsigned.
Blue, please give it a whirl.

The JSON parser is still as broken as ever. Left for another day.

Markus Armbruster (4):
  unicode: New mod_utf8_codepoint()
  check-qjson: Fix up a few bogus comments
  check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

 include/qemu-common.h |   3 +
 qobject/qjson.c       | 102 ++++++++----------
 tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++--------------------
 util/Makefile.objs    |   1 +
 util/unicode.c        |  96 +++++++++++++++++
 5 files changed, 306 insertions(+), 176 deletions(-)
 create mode 100644 util/unicode.c

--
1.7.11.7
* [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
From: Markus Armbruster @ 2013-03-14 17:49 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori

Known bugs in to_json():

* A start byte for a three-byte sequence followed by less than two
  continuation bytes is split into one-byte sequences.

* Start bytes for sequences longer than three bytes get misinterpreted
  as start bytes for three-byte sequences. Continuation bytes beyond
  byte three become one-byte sequences.

  This means all characters outside the BMP are decoded incorrectly.

* One-byte sequences with the MSB are put into the JSON string
  verbatim when char is unsigned, producing invalid UTF-8. When char
  is signed, they're replaced by "\\uFFFF" instead.

  This includes \xFE, \xFF, and stray continuation bytes.

* Overlong sequences are happily accepted, unless screwed up by the
  bugs above.

* Likewise, sequences encoding surrogate code points or noncharacters.

* Unlike other control characters, ASCII DEL is not escaped. Except
  in overlong encodings.

My rewrite fixes them as follows:

* Malformed UTF-8 sequences are replaced.

  Except the overlong encoding \xC0\x80 of U+0000 is still accepted.
  Permits embedding NUL characters in C strings. This trick is known
  as "Modified UTF-8".

* Sequences encoding code points beyond Unicode range are replaced.

* Sequences encoding code points beyond the BMP produce a surrogate
  pair.

* Sequences encoding surrogate code points are replaced.

* Sequences encoding noncharacters are replaced.

* ASCII DEL is now always escaped.

The replacement character is U+FFFD.
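The dependence on char signedness described above can be reproduced in isolation. The sketch below mimics the fallback branch of the old loop (the `<= 0x1F` test followed by `snprintf()` into a 7-byte buffer) with the signedness made explicit. The function names are illustrative and not QEMU code, and the cast to unsigned is added here only so that the `%X` conversion is well defined:

```c
#include <stdio.h>

/*
 * Hypothetical helpers mimicking the default branch of the old
 * to_json() string loop for a single byte.  Not QEMU code.
 */

/* char is signed: bytes >= 0x80 are negative, so "<= 0x1F" is true */
static void old_escape_signed(signed char c, char out[7])
{
    if (c <= 0x1F) {
        /* sign extension gives FFFFFFxx; the 7-byte limit cuts it off */
        snprintf(out, 7, "\\u%04X", (unsigned)(int)c);
    } else {
        out[0] = c;
        out[1] = 0;
    }
}

/* char is unsigned: bytes >= 0x80 fail the test and are copied as-is */
static void old_escape_unsigned(unsigned char c, char out[7])
{
    if (c <= 0x1F) {
        snprintf(out, 7, "\\u%04X", (unsigned)c);
    } else {
        out[0] = (char)c;
        out[1] = 0;
    }
}
```

With a typical 32-bit int, the byte \x80 comes out of the signed variant as the six characters `\uFFFF` and out of the unsigned variant as the raw byte, which would explain why the expected strings in check-qjson only matched on hosts where char is signed.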
Signed-off-by: Markus Armbruster <armbru@redhat.com> --- qobject/qjson.c | 102 +++++++++++-------------- tests/check-qjson.c | 216 ++++++++++++++++++++++++---------------------------- 2 files changed, 145 insertions(+), 173 deletions(-) diff --git a/qobject/qjson.c b/qobject/qjson.c index 83a6b4f..19085a1 100644 --- a/qobject/qjson.c +++ b/qobject/qjson.c @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) case QTYPE_QSTRING: { QString *val = qobject_to_qstring(obj); const char *ptr; + int cp; + char buf[16]; + char *end; ptr = qstring_get_str(val); qstring_append(str, "\""); - while (*ptr) { - if ((ptr[0] & 0xE0) == 0xE0 && - (ptr[1] & 0x80) && (ptr[2] & 0x80)) { - uint16_t wchar; - char escape[7]; - - wchar = (ptr[0] & 0x0F) << 12; - wchar |= (ptr[1] & 0x3F) << 6; - wchar |= (ptr[2] & 0x3F); - ptr += 2; - - snprintf(escape, sizeof(escape), "\\u%04X", wchar); - qstring_append(str, escape); - } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) { - uint16_t wchar; - char escape[7]; - - wchar = (ptr[0] & 0x1F) << 6; - wchar |= (ptr[1] & 0x3F); - ptr++; - - snprintf(escape, sizeof(escape), "\\u%04X", wchar); - qstring_append(str, escape); - } else switch (ptr[0]) { - case '\"': - qstring_append(str, "\\\""); - break; - case '\\': - qstring_append(str, "\\\\"); - break; - case '\b': - qstring_append(str, "\\b"); - break; - case '\f': - qstring_append(str, "\\f"); - break; - case '\n': - qstring_append(str, "\\n"); - break; - case '\r': - qstring_append(str, "\\r"); - break; - case '\t': - qstring_append(str, "\\t"); - break; - default: { - if (ptr[0] <= 0x1F) { - char escape[7]; - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); - qstring_append(str, escape); - } else { - char buf[2] = { ptr[0], 0 }; - qstring_append(str, buf); - } - break; + + for (; *ptr; ptr = end) { + cp = mod_utf8_codepoint(ptr, 6, &end); + switch (cp) { + case '\"': + qstring_append(str, "\\\""); + break; + case '\\': + qstring_append(str, 
"\\\\"); + break; + case '\b': + qstring_append(str, "\\b"); + break; + case '\f': + qstring_append(str, "\\f"); + break; + case '\n': + qstring_append(str, "\\n"); + break; + case '\r': + qstring_append(str, "\\r"); + break; + case '\t': + qstring_append(str, "\\t"); + break; + default: + if (cp < 0) { + cp = 0xFFFD; /* replacement character */ } + if (cp > 0xFFFF) { + /* beyond BMP; need a surrogate pair */ + snprintf(buf, sizeof(buf), "\\u%04X\\u%04X", + 0xD800 + ((cp - 0x10000) >> 10), + 0xDC00 + ((cp - 0x10000) & 0x3FF)); + } else if (cp < 0x20 || cp >= 0x7F) { + snprintf(buf, sizeof(buf), "\\u%04X", cp); + } else { + buf[0] = cp; + buf[1] = 0; } - ptr++; - } + qstring_append(str, buf); + } + }; + qstring_append(str, "\""); break; } diff --git a/tests/check-qjson.c b/tests/check-qjson.c index efec1b2..595ddc0 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -144,13 +144,10 @@ static void utf8_string(void) * The JSON parser rejects some invalid sequences, but accepts * others without correcting the problem. * - * The JSON formatter replaces some invalid sequences by U+FFFF (a - * noncharacter), and goes wonky for others. - * - * For both directions, we should either reject all invalid - * sequences, or minimize overlong sequences and replace all other - * invalid sequences by a suitable replacement character. A - * common choice for replacement is U+FFFD. + * We should either reject all invalid sequences, or minimize + * overlong sequences and replace all other invalid sequences by a + * suitable replacement character. A common choice for + * replacement is U+FFFD. * * Problem: we can't easily deal with embedded U+0000. 
Parsing * the JSON string "this \\u0000" is fun" yields "this \0 is fun", @@ -175,16 +172,10 @@ static void utf8_string(void) * - bug: rejected * JSON parser rejects invalid sequence(s) * We may choose to define this as feature - * - bug: want "\"...\"" - * JSON formatter produces incorrect result, this is the - * correct one, assuming replacement character U+FFFF - * - bug: want "..." (no \") + * - bug: want "..." * JSON parser produces incorrect result, this is the * correct one, assuming replacement character U+FFFF * We may choose to reject instead of replace - * Not marked explicitly, but trivial to find: - * - JSON formatter replacing invalid sequence by \\uFFFF is a - * bug if we want it to fail for invalid sequences. */ /* 1 Some correct UTF-8 text */ @@ -209,7 +200,8 @@ static void utf8_string(void) { "\"\\u0000\"", "", /* bug: want overlong "\xC0\x80" */ - "\"\"", /* bug: want "\"\\u0000\"" */ + "\"\\u0000\"", + "\xC0\x80", }, /* 2.1.2 2 bytes U+0080 */ { @@ -227,20 +219,20 @@ static void utf8_string(void) { "\"\xF0\x90\x80\x80\"", "\xF0\x90\x80\x80", - "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ + "\"\\uD800\\uDC00\"", }, /* 2.1.5 5 bytes U+200000 */ { "\"\xF8\x88\x80\x80\x80\"", - NULL, /* bug: rejected */ - "\"\\u8200\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xF8\x88\x80\x80\x80", }, /* 2.1.6 6 bytes U+4000000 */ { "\"\xFC\x84\x80\x80\x80\x80\"", - NULL, /* bug: rejected */ - "\"\\uC100\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFC\x84\x80\x80\x80\x80", }, /* 2.2 Last possible sequence of a certain length */ @@ -248,7 +240,7 @@ static void utf8_string(void) { "\"\x7F\"", "\x7F", - "\"\177\"", + "\"\\u007F\"", }, /* 2.2.2 2 bytes U+07FF */ { @@ -265,22 +257,22 @@ static void utf8_string(void) /* 2.2.4 4 bytes U+1FFFFF */ { "\"\xF7\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\u7FFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ 
+ NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xF7\xBF\xBF\xBF", }, /* 2.2.5 5 bytes U+3FFFFFF */ { "\"\xFB\xBF\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\uBFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFB\xBF\xBF\xBF\xBF", }, /* 2.2.6 6 bytes U+7FFFFFFF */ { "\"\xFD\xBF\xBF\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\uDFFF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFD\xBF\xBF\xBF\xBF\xBF", }, /* 2.3 Other boundary conditions */ @@ -306,13 +298,13 @@ static void utf8_string(void) /* U+10FFFD */ "\"\xF4\x8F\xBF\xBD\"", "\xF4\x8F\xBF\xBD", - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ + "\"\\uDBFF\\uDFFD\"" }, { /* U+110000 */ "\"\xF4\x90\x80\x80\"", "\xF4\x90\x80\x80", - "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3 Malformed sequences */ /* 3.1 Unexpected continuation bytes */ @@ -320,49 +312,49 @@ static void utf8_string(void) { "\"\x80\"", "\x80", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.1.2 Last continuation byte */ { "\"\xBF\"", "\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.1.3 2 continuation bytes */ { "\"\x80\xBF\"", "\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\"", }, /* 3.1.4 3 continuation bytes */ { "\"\x80\xBF\x80\"", "\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.5 4 continuation bytes */ { "\"\x80\xBF\x80\xBF\"", "\x80\xBF\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.6 5 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\"", "\x80\xBF\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.7 6 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\xBF\"", 
"\x80\xBF\x80\xBF\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.8 7 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\xBF\x80\"", "\x80\xBF\x80\xBF\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.9 Sequence of all 64 possible continuation bytes */ { @@ -383,14 +375,14 @@ static void utf8_string(void) "\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF" "\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7" "\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF", - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"" }, /* 3.2 Lonely start characters */ /* 3.2.1 All 32 first bytes of 2-byte sequences, followed by space */ @@ -400,10 +392,10 @@ static void utf8_string(void) "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 " "\xD8 \xD9 \xDA \xDB \xDC \xDD \xDE \xDF \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF 
\\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xC0 \xC1 \xC2 \xC3 \xC4 \xC5 \xC6 \xC7 " "\xC8 \xC9 \xCA \xCB \xCC \xCD \xCE \xCF " "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 " @@ -416,28 +408,28 @@ static void utf8_string(void) /* bug: not corrected */ "\xE0 \xE1 \xE2 \xE3 \xE4 \xE5 \xE6 \xE7 " "\xE8 \xE9 \xEA \xEB \xEC \xED \xEE \xEF ", - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", }, /* 3.2.3 All 8 first bytes of 4-byte sequences, followed by space */ { "\"\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 ", }, /* 3.2.4 All 4 first bytes of 5-byte sequences, followed by space */ { "\"\xF8 \xF9 \xFA \xFB \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xF8 \xF9 \xFA \xFB ", }, /* 3.2.5 All 2 first bytes of 6-byte sequences, followed by space */ { "\"\xFC \xFD \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \"", "\xFC \xFD ", }, /* 3.3 Sequences with last continuation byte missing */ @@ -445,66 +437,66 @@ static void utf8_string(void) { "\"\xC0\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xC0", }, /* 3.3.2 3-byte 
sequence with last byte missing (U+0000) */ { "\"\xE0\x80\"", "\xE0\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.3 4-byte sequence with last byte missing (U+0000) */ { "\"\xF0\x80\x80\"", "\xF0\x80\x80", /* bug: not corrected */ - "\"\\u0000\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.4 5-byte sequence with last byte missing (U+0000) */ { "\"\xF8\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80", }, /* 3.3.5 6-byte sequence with last byte missing (U+0000) */ { "\"\xFC\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80", }, /* 3.3.6 2-byte sequence with last byte missing (U+07FF) */ { "\"\xDF\"", "\xDF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.3.7 3-byte sequence with last byte missing (U+FFFF) */ { "\"\xEF\xBF\"", "\xEF\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.8 4-byte sequence with last byte missing (U+1FFFFF) */ { "\"\xF7\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\u7FFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF7\xBF\xBF", }, /* 3.3.9 5-byte sequence with last byte missing (U+3FFFFFF) */ { "\"\xFB\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uBFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFB\xBF\xBF\xBF", }, /* 3.3.10 6-byte sequence with last byte missing (U+7FFFFFFF) */ { "\"\xFD\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uDFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"", */ + "\"\\uFFFD\"", "\xFD\xBF\xBF\xBF\xBF", }, /* 3.4 Concatenation of incomplete sequences */ @@ -512,10 +504,8 @@ static void utf8_string(void) "\"\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80" "\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF\"", NULL, /* bug: 
rejected */ - /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */ - "\"\\u0020\\uFFFF\\u0000\\u8000\\uFFFF\\uC000\\uFFFF\\uFFFF" - "\\u07EF\\uFFFF\\u7FFF\\uBFFF\\uFFFF\\uDFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", "\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80" "\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF", }, @@ -523,20 +513,19 @@ static void utf8_string(void) { "\"\xFE\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xFE", }, { "\"\xFF\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xFF", }, { "\"\xFE\xFE\xFF\xFF\"", NULL, /* bug: rejected */ - /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */ - "\"\\uEFBF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", "\xFE\xFE\xFF\xFF", }, /* 4 Overlong sequences */ @@ -544,29 +533,29 @@ static void utf8_string(void) { "\"\xC0\xAF\"", NULL, /* bug: rejected */ - "\"\\u002F\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xC0\xAF", }, { "\"\xE0\x80\xAF\"", "\xE0\x80\xAF", /* bug: not corrected */ - "\"\\u002F\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", }, { "\"\xF0\x80\x80\xAF\"", "\xF0\x80\x80\xAF", /* bug: not corrected */ - "\"\\u0000\\uFFFF\"" /* bug: want "\"/\"" */ + "\"\\uFFFD\"", }, { "\"\xF8\x80\x80\x80\xAF\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80\xAF", }, { "\"\xFC\x80\x80\x80\x80\xAF\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80\xAF", }, /* 4.2 Maximum overlong sequences */ @@ -574,33 +563,33 @@ static void utf8_string(void) /* \U+007F */ "\"\xC1\xBF\"", NULL, /* bug: rejected */ - "\"\\u007F\"", /* bug: want "\"\177\"" */ + "\"\\uFFFD\"", "\xC1\xBF", }, { /* \U+07FF */ "\"\xE0\x9F\xBF\"", "\xE0\x9F\xBF", /* bug: not corrected */ - "\"\\u07FF\"", + "\"\\uFFFD\"", }, { /* \U+FFFD */ 
"\"\xF0\x8F\xBF\xBD\"", "\xF0\x8F\xBF\xBD", /* bug: not corrected */ - "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+1FFFFF */ "\"\xF8\x87\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\u81FF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF8\x87\xBF\xBF\xBF", }, { /* \U+3FFFFFF */ "\"\xFC\x83\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uC0FF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFC\x83\xBF\xBF\xBF\xBF", }, /* 4.3 Overlong representation of the NUL character */ @@ -615,26 +604,26 @@ static void utf8_string(void) /* \U+0000 */ "\"\xE0\x80\x80\"", "\xE0\x80\x80", /* bug: not corrected */ - "\"\\u0000\"", + "\"\\uFFFD\"", }, { /* \U+0000 */ "\"\xF0\x80\x80\x80\"", "\xF0\x80\x80\x80", /* bug: not corrected */ - "\"\\u0000\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", }, { /* \U+0000 */ "\"\xF8\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80\x80", }, { /* \U+0000 */ "\"\xFC\x80\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80\x80", }, /* 5 Illegal code positions */ @@ -643,92 +632,92 @@ static void utf8_string(void) /* \U+D800 */ "\"\xED\xA0\x80\"", "\xED\xA0\x80", /* bug: not corrected */ - "\"\\uD800\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DB7F */ "\"\xED\xAD\xBF\"", "\xED\xAD\xBF", /* bug: not corrected */ - "\"\\uDB7F\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DB80 */ "\"\xED\xAE\x80\"", "\xED\xAE\x80", /* bug: not corrected */ - "\"\\uDB80\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DBFF */ "\"\xED\xAF\xBF\"", "\xED\xAF\xBF", /* bug: not corrected */ - "\"\\uDBFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DC00 */ "\"\xED\xB0\x80\"", "\xED\xB0\x80", /* bug: not corrected */ - 
"\"\\uDC00\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DF80 */ "\"\xED\xBE\x80\"", "\xED\xBE\x80", /* bug: not corrected */ - "\"\\uDF80\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DFFF */ "\"\xED\xBF\xBF\"", "\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 5.2 Paired UTF-16 surrogates */ { /* \U+D800\U+DC00 */ "\"\xED\xA0\x80\xED\xB0\x80\"", "\xED\xA0\x80\xED\xB0\x80", /* bug: not corrected */ - "\"\\uD800\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+D800\U+DFFF */ "\"\xED\xA0\x80\xED\xBF\xBF\"", "\xED\xA0\x80\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uD800\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB7F\U+DC00 */ "\"\xED\xAD\xBF\xED\xB0\x80\"", "\xED\xAD\xBF\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDB7F\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB7F\U+DFFF */ "\"\xED\xAD\xBF\xED\xBF\xBF\"", "\xED\xAD\xBF\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDB7F\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB80\U+DC00 */ "\"\xED\xAE\x80\xED\xB0\x80\"", "\xED\xAE\x80\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDB80\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB80\U+DFFF */ "\"\xED\xAE\x80\xED\xBF\xBF\"", "\xED\xAE\x80\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDB80\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DBFF\U+DC00 */ "\"\xED\xAF\xBF\xED\xB0\x80\"", "\xED\xAF\xBF\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDBFF\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DBFF\U+DFFF */ "\"\xED\xAF\xBF\xED\xBF\xBF\"", "\xED\xAF\xBF\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, /* 5.3 Other illegal code positions */ /* 
BMP noncharacters */ @@ -736,25 +725,25 @@ static void utf8_string(void) /* \U+FFFE */ "\"\xEF\xBF\xBE\"", "\xEF\xBF\xBE", /* bug: not corrected */ - "\"\\uFFFE\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* \U+FFFF */ "\"\xEF\xBF\xBF\"", "\xEF\xBF\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, { /* U+FDD0 */ "\"\xEF\xB7\x90\"", "\xEF\xB7\x90", /* bug: not corrected */ - "\"\\uFDD0\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* U+FDEF */ "\"\xEF\xB7\xAF\"", "\xEF\xB7\xAF", /* bug: not corrected */ - "\"\\uFDEF\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, /* Plane 1 .. 16 noncharacters */ { @@ -792,15 +781,10 @@ static void utf8_string(void) "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", - /* bug: not corrected */ - "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" - "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" - "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" - "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" - "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" - "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" - "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" - "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, {} }; @@ -838,8 +822,8 @@ static void utf8_string(void) qobject_decref(obj); /* - * Disabled, because json_out currently contains the crap - * qobject_to_json() produces. + * Disabled, because qobject_from_json() is buggy, and I can't + * be bothered to add the expected incorrect results. * FIXME Enable once these bugs have been fixed. 
*/ if (0 && json_out != json_in) { -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite 2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster @ 2013-03-21 20:44 ` Laszlo Ersek 2013-03-22 13:15 ` Laszlo Ersek 1 sibling, 0 replies; 12+ messages in thread From: Laszlo Ersek @ 2013-03-21 20:44 UTC (permalink / raw) To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel On 03/14/13 18:49, Markus Armbruster wrote: > Known bugs in to_json(): > My rewrite fixes them as follows: I'll try to review this sometime later. Patch review doesn't scale *at all*. I've spent hours on the first 3 patches. You should just be given pull req rights. I'd need 36 hour days. /me throws his hands up Laszlo ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite 2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster 2013-03-21 20:44 ` Laszlo Ersek @ 2013-03-22 13:15 ` Laszlo Ersek 2013-03-22 14:51 ` Markus Armbruster 1 sibling, 1 reply; 12+ messages in thread From: Laszlo Ersek @ 2013-03-22 13:15 UTC (permalink / raw) To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel comments below On 03/14/13 18:49, Markus Armbruster wrote: > diff --git a/qobject/qjson.c b/qobject/qjson.c > index 83a6b4f..19085a1 100644 > --- a/qobject/qjson.c > +++ b/qobject/qjson.c > @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) > case QTYPE_QSTRING: { > QString *val = qobject_to_qstring(obj); > const char *ptr; > + int cp; > + char buf[16]; > + char *end; > > ptr = qstring_get_str(val); > qstring_append(str, "\""); > - while (*ptr) { > - if ((ptr[0] & 0xE0) == 0xE0 && > - (ptr[1] & 0x80) && (ptr[2] & 0x80)) { > - uint16_t wchar; > - char escape[7]; > - > - wchar = (ptr[0] & 0x0F) << 12; > - wchar |= (ptr[1] & 0x3F) << 6; > - wchar |= (ptr[2] & 0x3F); > - ptr += 2; > - > - snprintf(escape, sizeof(escape), "\\u%04X", wchar); > - qstring_append(str, escape); > - } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) { > - uint16_t wchar; > - char escape[7]; > - > - wchar = (ptr[0] & 0x1F) << 6; > - wchar |= (ptr[1] & 0x3F); > - ptr++; > - > - snprintf(escape, sizeof(escape), "\\u%04X", wchar); > - qstring_append(str, escape); > - } else switch (ptr[0]) { > - case '\"': > - qstring_append(str, "\\\""); > - break; > - case '\\': > - qstring_append(str, "\\\\"); > - break; > - case '\b': > - qstring_append(str, "\\b"); > - break; > - case '\f': > - qstring_append(str, "\\f"); > - break; > - case '\n': > - qstring_append(str, "\\n"); > - break; > - case '\r': > - qstring_append(str, "\\r"); > - break; > - case '\t': > - qstring_append(str, "\\t"); > - 
break; > - default: { > - if (ptr[0] <= 0x1F) { > - char escape[7]; > - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); > - qstring_append(str, escape); > - } else { > - char buf[2] = { ptr[0], 0 }; > - qstring_append(str, buf); > - } > - break; > + > + for (; *ptr; ptr = end) { > + cp = mod_utf8_codepoint(ptr, 6, &end); This provides more background: you never call mod_utf8_codepoint() with '\0' at offset 0. So handling that in mod_utf8_codepoint() may not be that important. If a '\0' is found at offset >= 1, it will correctly trigger the /* continuation byte missing */ branch in mod_utf8_codepoint(). The retval is -1, and *end is left pointing to the NUL byte. (This is consistent with mod_utf8_codepoint()'s docs.) The -1 (incomplete sequence) produces the replacement character below, and the next time around *ptr is '\0', so we finish the loop. Seems OK. ( An alternative interface for mod_utf8_codepoint() might be something like: size_t alternative(const char *ptr, int *cp, size_t n); Resembling read() somewhat: - the return value would be the number of bytes consumed (it can't be negative (= fatal error), because we guarantee progress). 0 is EOF and only possible when "n" is 0. - "ptr" is the source, - "cp" is the output code point, -1 if invalid, - "n" is the bytes available in the source / requested to process at most. Encountering a \0 in the byte stream would be an error (*cp = -1), but would not terminate parsing per se. Then the loop would look like: processed = 0; while (processed < full) { int cp; rd = alternative(ptr + processed, &cp, full - processed); g_assert(rd > 0); /* look at cp */ processed += rd; } But of course I'm not suggesting to rewrite the function! 
) > + switch (cp) { > + case '\"': > + qstring_append(str, "\\\""); > + break; > + case '\\': > + qstring_append(str, "\\\\"); > + break; > + case '\b': > + qstring_append(str, "\\b"); > + break; > + case '\f': > + qstring_append(str, "\\f"); > + break; > + case '\n': > + qstring_append(str, "\\n"); > + break; > + case '\r': > + qstring_append(str, "\\r"); > + break; > + case '\t': > + qstring_append(str, "\\t"); > + break; The C standard also names \a (alert) and \v (vertical tab); I'm not sure about their JSON notation. (The (cp < 0x20) condition catches them below of course.) > + default: > + if (cp < 0) { > + cp = 0xFFFD; /* replacement character */ > } > + if (cp > 0xFFFF) { > + /* beyond BMP; need a surrogate pair */ > + snprintf(buf, sizeof(buf), "\\u%04X\\u%04X", > + 0xD800 + ((cp - 0x10000) >> 10), > + 0xDC00 + ((cp - 0x10000) & 0x3FF)); Seems like we write 13 bytes into buf, OK. Also cp is never greater than 0x10FFFF, hence the difference is at most 0xFFFFF. The RHS surrogate half can go up to 0xDFFF, the LHS up to 0xD800+0x3FF == 0xDBFF. Good. > + } else if (cp < 0x20 || cp >= 0x7F) { > + snprintf(buf, sizeof(buf), "\\u%04X", cp); > + } else { > + buf[0] = cp; > + buf[1] = 0; > } > - ptr++; > - } > + qstring_append(str, buf); > + } > + }; > + > qstring_append(str, "\""); > break; > } Seems OK. > diff --git a/tests/check-qjson.c b/tests/check-qjson.c > index efec1b2..595ddc0 100644 > --- a/tests/check-qjson.c > +++ b/tests/check-qjson.c I'll trust you on that one :) Reviewed-by: Laszlo Ersek <lersek@redhat.com> ^ permalink raw reply [flat|nested] 12+ messages in thread
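The surrogate-pair arithmetic checked in the review above can be tried in isolation. The following is a minimal sketch, not code from the patch: format_surrogate_pair() is a hypothetical helper name, and it assumes the caller has already established that cp lies beyond the BMP and is at most 0x10FFFF, as the review argues is the case in to_json().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: format a code point
 * beyond the BMP as a JSON "\uXXXX\uXXXX" surrogate-pair escape.
 * Returns what snprintf() returns, i.e. 12 characters on success,
 * so buf needs room for at least 13 bytes including the NUL.
 */
static int format_surrogate_pair(char *buf, size_t size, int cp)
{
    int v;

    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    v = cp - 0x10000;                       /* at most 0xFFFFF (20 bits) */
    return snprintf(buf, size, "\\u%04X\\u%04X",
                    0xD800 + (v >> 10),     /* high half: 0xD800..0xDBFF */
                    0xDC00 + (v & 0x3FF));  /* low half:  0xDC00..0xDFFF */
}
```

With a 16-byte buffer, as in the patch, the 13 bytes written always fit. For example, U+1F600 comes out as \uD83D\uDE00 and U+10FFFF as \uDBFF\uDFFF.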
* Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite 2013-03-22 13:15 ` Laszlo Ersek @ 2013-03-22 14:51 ` Markus Armbruster 0 siblings, 0 replies; 12+ messages in thread From: Markus Armbruster @ 2013-03-22 14:51 UTC (permalink / raw) To: Laszlo Ersek; +Cc: blauwirbel, aliguori, qemu-devel Laszlo Ersek <lersek@redhat.com> writes: > comments below > > On 03/14/13 18:49, Markus Armbruster wrote: > >> diff --git a/qobject/qjson.c b/qobject/qjson.c >> index 83a6b4f..19085a1 100644 >> --- a/qobject/qjson.c >> +++ b/qobject/qjson.c >> @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) >> case QTYPE_QSTRING: { >> QString *val = qobject_to_qstring(obj); >> const char *ptr; >> + int cp; >> + char buf[16]; >> + char *end; >> >> ptr = qstring_get_str(val); >> qstring_append(str, "\""); >> - while (*ptr) { >> - if ((ptr[0] & 0xE0) == 0xE0 && >> - (ptr[1] & 0x80) && (ptr[2] & 0x80)) { >> - uint16_t wchar; >> - char escape[7]; >> - >> - wchar = (ptr[0] & 0x0F) << 12; >> - wchar |= (ptr[1] & 0x3F) << 6; >> - wchar |= (ptr[2] & 0x3F); >> - ptr += 2; >> - >> - snprintf(escape, sizeof(escape), "\\u%04X", wchar); >> - qstring_append(str, escape); >> - } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) { >> - uint16_t wchar; >> - char escape[7]; >> - >> - wchar = (ptr[0] & 0x1F) << 6; >> - wchar |= (ptr[1] & 0x3F); >> - ptr++; >> - >> - snprintf(escape, sizeof(escape), "\\u%04X", wchar); >> - qstring_append(str, escape); >> - } else switch (ptr[0]) { >> - case '\"': >> - qstring_append(str, "\\\""); >> - break; >> - case '\\': >> - qstring_append(str, "\\\\"); >> - break; >> - case '\b': >> - qstring_append(str, "\\b"); >> - break; >> - case '\f': >> - qstring_append(str, "\\f"); >> - break; >> - case '\n': >> - qstring_append(str, "\\n"); >> - break; >> - case '\r': >> - qstring_append(str, "\\r"); >> - break; >> - case '\t': >> - qstring_append(str, "\\t"); >> - break; >> - default: { >> - if 
(ptr[0] <= 0x1F) { >> - char escape[7]; >> - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); >> - qstring_append(str, escape); >> - } else { >> - char buf[2] = { ptr[0], 0 }; >> - qstring_append(str, buf); >> - } >> - break; >> + >> + for (; *ptr; ptr = end) { >> + cp = mod_utf8_codepoint(ptr, 6, &end); > > This provides more background: you never call mod_utf8_codepoint() with > '\0' at offset 0. So handling that in mod_utf8_codepoint() may not be > that important. Yes, this caller doesn't care. Doesn't mean we shouldn't try to come up with a sane function contract. Note the use of literal 6. It means "unlimited". Perfectly safe because the string is nul-terminated. > If a '\0' is found at offset >= 1, it will correctly trigger the /* > continuation byte missing */ branch in mod_utf8_codepoint(). The retval > is -1, and *end is left pointing to the NUL byte. (This is consistent > with mod_utf8_codepoint()'s docs.) > > The -1 (incomplete sequence) produces the replacement character below, > and the next time around *ptr is '\0', so we finish the loop. Seems OK. > > ( > An alternative interface for mod_utf8_codepoint() might be something like: > > size_t alternative(const char *ptr, int *cp, size_t n); > > Resembling read() somewhat: > - the return value would be the number of bytes consumed (it can't be > negative (= fatal error), because we guarantee progress). 0 is EOF and > only possible when "n" is 0. > - "ptr" is the source, > - "cp" is the output code point, -1 if invalid, > - "n" is the bytes available in the source / requested to process at most. > > Encountering a \0 in the byte stream would be an error (*cp = -1), but > would not terminate parsing per se. > > Then the loop would look like: > > processed = 0; > while (processed < full) { > int cp; > > rd = alternative(ptr + processed, &cp, full - processed); > g_assert(rd > 0); > > /* look at cp */ > > processed += rd; > } > > But of course I'm not suggesting to rewrite the function! 
> ) I'll keep this in mind when deciding how I want to handle '\0'. >> + switch (cp) { >> + case '\"': >> + qstring_append(str, "\\\""); >> + break; >> + case '\\': >> + qstring_append(str, "\\\\"); >> + break; >> + case '\b': >> + qstring_append(str, "\\b"); >> + break; >> + case '\f': >> + qstring_append(str, "\\f"); >> + break; >> + case '\n': >> + qstring_append(str, "\\n"); >> + break; >> + case '\r': >> + qstring_append(str, "\\r"); >> + break; >> + case '\t': >> + qstring_append(str, "\\t"); >> + break; > > The C standard also names \a (alert) and \v (vertical tab); I'm not sure > about their JSON notation. (The (cp < 0x20) condition catches them below > of course.) JSON RFC 4627 defines only the seven above plus '\/'. Escaping '/' that way makes no sense for us, so the old code doesn't, and mine doesn't either. >> + default: >> + if (cp < 0) { >> + cp = 0xFFFD; /* replacement character */ >> } >> + if (cp > 0xFFFF) { >> + /* beyond BMP; need a surrogate pair */ >> + snprintf(buf, sizeof(buf), "\\u%04X\\u%04X", >> + 0xD800 + ((cp - 0x10000) >> 10), >> + 0xDC00 + ((cp - 0x10000) & 0x3FF)); > > Seems like we write 13 bytes into buf, OK. Also cp is never greater than > 0x10FFFF, hence the difference is at most 0xFFFFF. The RHS surrogate > half can go up to 0xDFFF, the LHS up to 0xD800+0x3FF == 0xDBFF. Good. Exactly. >> + } else if (cp < 0x20 || cp >= 0x7F) { >> + snprintf(buf, sizeof(buf), "\\u%04X", cp); >> + } else { >> + buf[0] = cp; >> + buf[1] = 0; >> } >> - ptr++; >> - } >> + qstring_append(str, buf); >> + } >> + }; >> + >> qstring_append(str, "\""); >> break; >> } > > Seems OK. > > >> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >> index efec1b2..595ddc0 100644 >> --- a/tests/check-qjson.c >> +++ b/tests/check-qjson.c > > I'll trust you on that one :) Waah, you don't want another case of bleeding eyes?!? > Reviewed-by: Laszlo Ersek <lersek@redhat.com> Thanks! ^ permalink raw reply [flat|nested] 12+ messages in thread
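For reference, the read()-like interface floated in the review could look roughly like the sketch below. This is an illustration under stated assumptions, not the mod_utf8_codepoint() from this series: it follows plain UTF-8 rules (1 to 4 byte sequences, overlong forms and surrogates rejected), whereas the real function decodes modified UTF-8, and the choice to consume exactly the bytes examined on error is mine.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the proposed interface: decode one code point from at most
 * n bytes at ptr, store it in *cp (-1 if the sequence is invalid), and
 * return the number of bytes consumed.  The return value is 0 only when
 * n is 0, so a caller looping over a buffer always makes progress.
 * Assumption: plain UTF-8 rules; a NUL byte is reported as invalid.
 */
static size_t alternative(const char *ptr, int *cp, size_t n)
{
    static const int min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)ptr;
    size_t len, i;
    int c;

    *cp = -1;
    if (n == 0) {
        return 0;                   /* EOF */
    }
    if (s[0] == 0) {
        return 1;                   /* NUL: invalid, but keep going */
    }
    if (s[0] < 0x80) {
        *cp = s[0];                 /* ASCII */
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        len = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4;
        c = s[0] & 0x07;
    } else {
        return 1;                   /* stray continuation or bad lead byte */
    }
    for (i = 1; i < len; i++) {
        if (i == n || (s[i] & 0xC0) != 0x80) {
            return i;               /* continuation byte missing */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c >= min_cp[len] && !(c >= 0xD800 && c <= 0xDFFF) && c <= 0x10FFFF) {
        *cp = c;                    /* not overlong, not a surrogate */
    }
    return len;
}
```

A caller would loop as in the review: add the return value to processed until it reaches the buffer length, and substitute U+FFFD whenever *cp comes back as -1.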
end of thread, other threads:[~2013-04-13 19:54 UTC | newest] Thread overview: 12+ messages 2013-04-11 16:07 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 2/4] check-qjson: Improve a few comments, delete bogus ones Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster 2013-04-11 16:07 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster 2013-04-11 16:11 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster 2013-04-11 17:03 ` Laszlo Ersek 2013-04-13 19:54 ` Blue Swirl -- strict thread matches above, loose matches on Subject: below -- 2013-03-14 17:49 Markus Armbruster 2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster 2013-03-21 20:44 ` Laszlo Ersek 2013-03-22 13:15 ` Laszlo Ersek 2013-03-22 14:51 ` Markus Armbruster