* [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

This should unbreak "make check" on machines where char is unsigned.
Blue, please give it a whirl.

The JSON parser is still as broken as ever. Left for another day.

v2:
- Rebased, trivial conflicts in PATCH 1/4.
- Make mod_utf8_codepoint() treat empty input as invalid sequence of
  length zero (both when n==0 and when n>0 && *s==0). No code in this
  series passes empty input.
- Some commit messages and comments improved.

Markus Armbruster (4):
  unicode: New mod_utf8_codepoint()
  check-qjson: Improve a few comments, delete bogus ones
  check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

 include/qemu-common.h |   3 +
 qobject/qjson.c       | 102 ++++++++---------
 tests/check-qjson.c   | 308 ++++++++++++++++++++++++++++++--------------------
 util/Makefile.objs    |   2 +-
 util/unicode.c        | 100 ++++++++++++++++
 5 files changed, 333 insertions(+), 182 deletions(-)
 create mode 100644 util/unicode.c

-- 
1.7.11.7
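Why the signedness of char matters for the old formatter can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical and not part of the series, int is assumed to be 32 bits wide, and converting 0x80..0xFF to signed char is assumed to yield a negative value (true on common platforms). It mimics the old default branch, which tests "ptr[0] <= 0x1F" and formats into a 7-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the old range test as seen with a signed char. */
static int old_test_with_signed_char(unsigned char byte)
{
    signed char c = (signed char)byte;  /* 0x80..0xFF turn negative */
    return c <= 0x1F;
}

/* Hypothetical helper: the same test as seen with an unsigned char. */
static int old_test_with_unsigned_char(unsigned char byte)
{
    return byte <= 0x1F;
}

/*
 * Format the sign-extended byte as "\uXXXX", keep only the six
 * characters that fit a 7-byte escape buffer, and report whether the
 * result is "\uFFFF".  Assumes a 32-bit int.
 */
static int signed_escape_is_uFFFF(unsigned char byte)
{
    char full[16];
    char escape[7];
    int promoted = (signed char)byte;   /* what a signed char promotes to */
    int n;

    n = snprintf(full, sizeof(full), "\\u%04X", (unsigned)promoted);
    assert(n > 0 && (size_t)n < sizeof(full));
    memcpy(escape, full, 6);            /* truncate like a char[7] buffer */
    escape[6] = '\0';
    return strcmp(escape, "\\uFFFF") == 0;
}
```

With signed char, a byte such as 0x80 passes the "control character" test and its sign-extended value is cut down to "\uFFFF". With unsigned char the test fails and the byte would be copied verbatim, so a test suite that expects one behaviour cannot pass on both kinds of machine.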
* [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 include/qemu-common.h |   3 ++
 util/Makefile.objs    |   2 +-
 util/unicode.c        | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+), 1 deletion(-)
 create mode 100644 util/unicode.c

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 31fff22..3b1873e 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -442,6 +442,9 @@ int64_t pow2floor(int64_t value);
 int uleb128_encode_small(uint8_t *out, uint32_t n);
 int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
+/* unicode.c */
+int mod_utf8_codepoint(const char *s, size_t n, char **end);
+
 /*
  * Hexdump a buffer to a file. An optional string prefix is added to every line
  */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 557bda7..c5652f5 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
 util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o event_notifier-win32.o
 util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o event_notifier-posix.o
 util-obj-y += envlist.o path.o host-utils.o cache-utils.o module.o
diff --git a/util/unicode.c b/util/unicode.c
new file mode 100644
index 0000000..d1c8658
--- /dev/null
+++ b/util/unicode.c
@@ -0,0 +1,100 @@
+/*
+ * Dealing with Unicode
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Authors:
+ * Markus Armbruster <armbru@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+
+/**
+ * mod_utf8_codepoint:
+ * @s: string encoded in modified UTF-8
+ * @n: maximum number of bytes to read from @s, if less than 6
+ * @end: set to end of sequence on return
+ *
+ * Convert the modified UTF-8 sequence at the start of @s. Modified
+ * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
+ * "\xC0\x80".
+ *
+ * If @n is zero or @s points to a zero byte, the sequence is invalid,
+ * and @end is set to @s.
+ *
+ * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
+ * byte, the sequence is invalid, and @end is set to @s + 1
+ *
+ * Else, the first byte determines how many continuation bytes are
+ * expected. If there are fewer, the sequence is invalid, and @end is
+ * set to @s + 1 + actual number of continuation bytes. Else, the
+ * sequence is well-formed, and @end is set to @s + 1 + expected
+ * number of continuation bytes.
+ *
+ * A well-formed sequence is valid unless it encodes a codepoint
+ * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
+ * noncharacters, a surrogate codepoint, or is overlong. Except the
+ * overlong sequence "\xC0\x80" is valid.
+ *
+ * Conversion succeeds if and only if the sequence is valid.
+ *
+ * Returns: the Unicode codepoint on success, -1 on failure.
+ */
+int mod_utf8_codepoint(const char *s, size_t n, char **end)
+{
+    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
+    const unsigned char *p;
+    unsigned byte, mask, len, i;
+    int cp;
+
+    if (n == 0 || *s == 0) {
+        /* empty sequence */
+        *end = (char *)s;
+        return -1;
+    }
+
+    p = (const unsigned char *)s;
+    byte = *p++;
+    if (byte < 0x80) {
+        cp = byte; /* one byte sequence */
+    } else if (byte >= 0xFE) {
+        cp = -1; /* impossible bytes 0xFE, 0xFF */
+    } else if ((byte & 0x40) == 0) {
+        cp = -1; /* unexpected continuation byte */
+    } else {
+        /* multi-byte sequence */
+        len = 0;
+        for (mask = 0x80; byte & mask; mask >>= 1) {
+            len++;
+        }
+        assert(len > 1 && len < 7);
+        cp = byte & (mask - 1);
+        for (i = 1; i < len; i++) {
+            byte = i < n ? *p : 0;
+            if ((byte & 0xC0) != 0x80) {
+                cp = -1; /* continuation byte missing */
+                goto out;
+            }
+            p++;
+            cp <<= 6;
+            cp |= byte & 0x3F;
+        }
+        if (cp > 0x10FFFF) {
+            cp = -1; /* beyond Unicode range */
+        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
+                   || (cp & 0xFFFE) == 0xFFFE) {
+            cp = -1; /* noncharacter */
+        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
+            cp = -1; /* surrogate code point */
+        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
+            cp = -1; /* overlong, not \xC0\x80 */
+        }
+    }
+
+out:
+    *end = (char *)p;
+    return cp;
+}
-- 
1.7.11.7
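The "first byte determines how many continuation bytes are expected" rule can be looked at in isolation. Below is a minimal standalone sketch of just that step, lifted from the mask loop in the patch; the function names expected_seq_len() and lead_payload() are made up for illustration and do not exist in QEMU.

```c
#include <assert.h>

/*
 * Classify a lead byte the way the first branches of the decoder do.
 * Returns 1 for ASCII, 2..6 for a multi-byte start byte, and 0 for a
 * byte that cannot start a sequence (continuation byte, 0xFE, 0xFF).
 */
static unsigned expected_seq_len(unsigned char byte)
{
    unsigned mask, len = 0;

    if (byte < 0x80) {
        return 1;
    }
    if (byte >= 0xFE || (byte & 0x40) == 0) {
        return 0;
    }
    /* Count the leading one bits: 110xxxxx -> 2, 1110xxxx -> 3, ... */
    for (mask = 0x80; byte & mask; mask >>= 1) {
        len++;
    }
    assert(len > 1 && len < 7);
    return len;
}

/*
 * Payload bits of a valid start byte: everything below its first zero
 * bit.  Only meaningful when expected_seq_len() returned 2..6.
 */
static unsigned lead_payload(unsigned char byte)
{
    unsigned mask;

    for (mask = 0x80; byte & mask; mask >>= 1) {
        /* skip the leading one bits */
    }
    return byte & (mask - 1);
}
```

For example, 0xE2 (binary 1110 0010) announces a three-byte sequence and contributes the payload bits 0010; each continuation byte then adds six more bits.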
* [Qemu-devel] [PATCH 2/4] check-qjson: Improve a few comments, delete bogus ones
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 tests/check-qjson.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index ec85a0c..91b4e5d 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -4,7 +4,7 @@
  *
  * Authors:
  * Anthony Liguori <aliguori@us.ibm.com>
- * Markus Armbruster <armbru@redhat.com>,
+ * Markus Armbruster <armbru@redhat.com>
  *
  * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
  * See the COPYING.LIB file in the top-level directory.
@@ -285,31 +285,31 @@ static void utf8_string(void)
         },
         /* 2.3 Other boundary conditions */
         {
-            /* U+D7FF */
+            /* last one before surrogate range: U+D7FF */
             "\"\xED\x9F\xBF\"",
             "\xED\x9F\xBF",
             "\"\\uD7FF\"",
         },
         {
-            /* U+E000 */
+            /* first one after surrogate range: U+E000 */
             "\"\xEE\x80\x80\"",
             "\xEE\x80\x80",
             "\"\\uE000\"",
         },
         {
-            /* U+FFFD */
+            /* last one in BMP: U+FFFD */
             "\"\xEF\xBF\xBD\"",
             "\xEF\xBF\xBD",
             "\"\\uFFFD\"",
         },
         {
-            /* U+10FFFF */
+            /* last one in last plane: U+10FFFF */
             "\"\xF4\x8F\xBF\xBF\"",
             "\xF4\x8F\xBF\xBF",
             "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
         },
         {
-            /* U+110000 */
+            /* first one beyond Unicode range: U+110000 */
             "\"\xF4\x90\x80\x80\"",
             "\xF4\x90\x80\x80",
             "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
@@ -462,8 +462,7 @@ static void utf8_string(void)
         },
         /* 3.3.4 5-byte sequence with last byte missing (U+0000) */
         {
-            /* invalid */
-            "\"\xF8\x80\x80\x80\"", /* bug: not corrected */
+            "\"\xF8\x80\x80\x80\"",
             NULL, /* bug: rejected */
             "\"\\u8000\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
             "\xF8\x80\x80\x80",
@@ -570,7 +569,12 @@ static void utf8_string(void)
             "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */
             "\xFC\x80\x80\x80\x80\xAF",
         },
-        /* 4.2 Maximum overlong sequences */
+        /*
+         * 4.2 Maximum overlong sequences
+         * Highest Unicode value that is still resulting in an
+         * overlong sequence if represented with the given number of
+         * bytes. This is a boundary test for safe UTF-8 decoders.
+         */
         {
             /* \U+007F */
             "\"\xC1\xBF\"",
-- 
1.7.11.7
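The "maximum overlong sequence" boundary described in the new 4.2 comment can be computed rather than memorised. The sketch below is an illustration only: it reuses the thresholds from min_cp[] in PATCH 1/4, while the function names is_overlong(), max_overlong() and forced_len_byte() are hypothetical helpers, not QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest code point that genuinely needs 2, 3, 4, 5 or 6 bytes. */
static const uint32_t min_cp[5] = {
    0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/* True if encoding cp in len bytes uses more bytes than necessary. */
static int is_overlong(uint32_t cp, unsigned len)
{
    assert(len >= 2 && len <= 6);
    return cp < min_cp[len - 2];
}

/* Largest code point that is still overlong when forced into len bytes. */
static uint32_t max_overlong(unsigned len)
{
    assert(len >= 2 && len <= 6);
    return min_cp[len - 2] - 1;
}

/*
 * Byte number idx (0-based) of cp forced into exactly len bytes,
 * without any validity checks.  The caller must make sure cp fits.
 */
static unsigned char forced_len_byte(uint32_t cp, unsigned len, unsigned idx)
{
    unsigned shift;

    assert(len >= 2 && len <= 6 && idx < len);
    shift = 6 * (len - 1 - idx);
    if (idx == 0) {
        /* start byte: len leading one bits, then the top payload bits */
        return (unsigned char)(((0xFFu << (8 - len)) & 0xFFu) | (cp >> shift));
    }
    /* continuation byte: 10xxxxxx */
    return (unsigned char)(0x80u | ((cp >> shift) & 0x3Fu));
}
```

Forcing max_overlong(2), which is U+007F, into two bytes gives \xC1\xBF, the first test string under 4.2; the other lengths give the \xE0\x9F\xBF, \xF8\x87\xBF\xBF\xBF and \xFC\x83\xBF\xBF\xBF\xBF cases in the same way.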
* [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Test cases cover the two noncharacters in the BMP. Add tests for the
other 64 noncharacters.

Three existing test cases involve noncharacters U+FFFF and U+10FFFF.
Instead of deleting them as now duplicates, adjust them to use U+FFFC
and U+10FFFD.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 tests/check-qjson.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 12 deletions(-)

diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index 91b4e5d..54074a9 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -158,7 +158,7 @@ static void utf8_string(void)
      * consider using overlong encoding \xC0\x80 for U+0000 ("modified
      * UTF-8").
      *
-     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
+     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
      * capability and stress test at
      * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
      */
@@ -256,11 +256,19 @@ static void utf8_string(void)
             "\xDF\xBF",
             "\"\\u07FF\"",
         },
-        /* 2.2.3 3 bytes U+FFFF */
+        /*
+         * 2.2.3 3 bytes U+FFFC
+         * The last possible sequence is actually U+FFFF. But that's
+         * a noncharacter, and already covered by its own test case
+         * under 5.3. Same for U+FFFE. U+FFFD is the last character
+         * in the BMP, and covered under 2.3. Because of U+FFFD's
+         * special role as replacement character, it's worth testing
+         * U+FFFC here.
+         */
         {
-            "\"\xEF\xBF\xBF\"",
-            "\xEF\xBF\xBF",
-            "\"\\uFFFF\"",
+            "\"\xEF\xBF\xBC\"",
+            "\xEF\xBF\xBC",
+            "\"\\uFFFC\"",
         },
         /* 2.2.4 4 bytes U+1FFFFF */
         {
@@ -303,10 +311,10 @@ static void utf8_string(void)
             "\"\\uFFFD\"",
         },
         {
-            /* last one in last plane: U+10FFFF */
-            "\"\xF4\x8F\xBF\xBF\"",
-            "\xF4\x8F\xBF\xBF",
-            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
+            /* last one in last plane: U+10FFFD */
+            "\"\xF4\x8F\xBF\xBD\"",
+            "\xF4\x8F\xBF\xBD",
+            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
         },
         {
             /* first one beyond Unicode range: U+110000 */
@@ -589,9 +597,14 @@ static void utf8_string(void)
             "\"\\u07FF\"",
         },
         {
-            /* \U+FFFF */
-            "\"\xF0\x8F\xBF\xBF\"",
-            "\xF0\x8F\xBF\xBF", /* bug: not corrected */
+            /*
+             * \U+FFFC
+             * The actual maximum would be U+FFFF, but that's a
+             * noncharacter. Testing U+FFFC seems more useful. See
+             * also 2.2.3
+             */
+            "\"\xF0\x8F\xBF\xBC\"",
+            "\xF0\x8F\xBF\xBC", /* bug: not corrected */
             "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
         },
         {
@@ -736,6 +749,7 @@ static void utf8_string(void)
             "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
         },
         /* 5.3 Other illegal code positions */
+        /* BMP noncharacters */
         {
             /* \U+FFFE */
             "\"\xEF\xBF\xBE\"",
@@ -748,6 +762,64 @@ static void utf8_string(void)
             "\xEF\xBF\xBF", /* bug: not corrected */
             "\"\\uFFFF\"", /* bug: not corrected */
         },
+        {
+            /* U+FDD0 */
+            "\"\xEF\xB7\x90\"",
+            "\xEF\xB7\x90", /* bug: not corrected */
+            "\"\\uFDD0\"", /* bug: not corrected */
+        },
+        {
+            /* U+FDEF */
+            "\"\xEF\xB7\xAF\"",
+            "\xEF\xB7\xAF", /* bug: not corrected */
+            "\"\\uFDEF\"", /* bug: not corrected */
+        },
+        /* Plane 1 .. 16 noncharacters */
+        {
+            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
+            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
+            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
+            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
+            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
+            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
+            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
+            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
+            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
+            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
+            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
+            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
+            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
+            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
+            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
+            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
+            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
+            /* bug: not corrected */
+            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
+            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
+            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
+            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
+            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
+            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
+            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
+            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
+            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
+            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
+            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
+            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
+            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
+            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
+            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
+            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
+            /* bug: not corrected */
+            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
+            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
+            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
+            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
+            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
+            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
+            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
+            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
+        },
         {}
     };
     int i;
-- 
1.7.11.7
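The count of 66 noncharacters, and hence the "other 64" beyond U+FFFE and U+FFFF, can be checked mechanically. A minimal sketch follows; it uses the same predicate as the noncharacter branch of mod_utf8_codepoint() in PATCH 1/4, but the function names is_noncharacter() and count_noncharacters() are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Noncharacters are U+FDD0..U+FDEF plus the last two code points of
 * every plane (U+xxFFFE and U+xxFFFF).
 */
static int is_noncharacter(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Walk the whole Unicode range and count the noncharacters. */
static unsigned count_noncharacters(void)
{
    uint32_t cp;
    unsigned count = 0;

    for (cp = 0; cp <= 0x10FFFF; cp++) {
        count += is_noncharacter(cp);
    }
    return count;
}
```

The total is 32 (the U+FDD0..U+FDEF block) plus 2 for each of the 17 planes. Note that U+FFFD and U+10FFFD, which the adjusted test cases use, are ordinary code points under this predicate.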
* [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
From: Markus Armbruster @ 2013-04-11 16:07 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Known bugs in to_json():

* A start byte for a three-byte sequence followed by fewer than two
  continuation bytes is split into one-byte sequences.

* Start bytes for sequences longer than three bytes get misinterpreted
  as start bytes for three-byte sequences. Continuation bytes beyond
  byte three become one-byte sequences.

  This means all characters outside the BMP are decoded incorrectly.

* One-byte sequences with the MSB set are put into the JSON string
  verbatim when char is unsigned, producing invalid UTF-8. When char
  is signed, they're replaced by "\\uFFFF" instead.

  This includes \xFE, \xFF, and stray continuation bytes.

* Overlong sequences are happily accepted, unless screwed up by the
  bugs above.

* Likewise, sequences encoding surrogate code points or noncharacters.

* Unlike other control characters, ASCII DEL is not escaped. Except
  in overlong encodings.

My rewrite fixes them as follows:

* Malformed UTF-8 sequences are replaced.

  Except the overlong encoding \xC0\x80 of U+0000 is still accepted.
  Permits embedding NUL characters in C strings. This trick is known
  as "Modified UTF-8".

* Sequences encoding code points beyond Unicode range are replaced.

* Sequences encoding code points beyond the BMP produce a surrogate
  pair.

* Sequences encoding surrogate code points are replaced.

* Sequences encoding noncharacters are replaced.

* ASCII DEL is now always escaped.

The replacement character is U+FFFD.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/qjson.c     | 102 +++++++++++--------------
 tests/check-qjson.c | 216 ++++++++++++++++++++++++----------------------------
 2 files changed, 145 insertions(+), 173 deletions(-)

diff --git a/qobject/qjson.c b/qobject/qjson.c
index 83a6b4f..19085a1 100644
--- a/qobject/qjson.c
+++ b/qobject/qjson.c
@@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
     case QTYPE_QSTRING: {
         QString *val = qobject_to_qstring(obj);
         const char *ptr;
+        int cp;
+        char buf[16];
+        char *end;
 
         ptr = qstring_get_str(val);
         qstring_append(str, "\"");
-        while (*ptr) {
-            if ((ptr[0] & 0xE0) == 0xE0 &&
-                (ptr[1] & 0x80) && (ptr[2] & 0x80)) {
-                uint16_t wchar;
-                char escape[7];
-
-                wchar = (ptr[0] & 0x0F) << 12;
-                wchar |= (ptr[1] & 0x3F) << 6;
-                wchar |= (ptr[2] & 0x3F);
-                ptr += 2;
-
-                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
-                qstring_append(str, escape);
-            } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) {
-                uint16_t wchar;
-                char escape[7];
-
-                wchar = (ptr[0] & 0x1F) << 6;
-                wchar |= (ptr[1] & 0x3F);
-                ptr++;
-
-                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
-                qstring_append(str, escape);
-            } else switch (ptr[0]) {
-                case '\"':
-                    qstring_append(str, "\\\"");
-                    break;
-                case '\\':
-                    qstring_append(str, "\\\\");
-                    break;
-                case '\b':
-                    qstring_append(str, "\\b");
-                    break;
-                case '\f':
-                    qstring_append(str, "\\f");
-                    break;
-                case '\n':
-                    qstring_append(str, "\\n");
-                    break;
-                case '\r':
-                    qstring_append(str, "\\r");
-                    break;
-                case '\t':
-                    qstring_append(str, "\\t");
-                    break;
-                default: {
-                    if (ptr[0] <= 0x1F) {
-                        char escape[7];
-                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
-                        qstring_append(str, escape);
-                    } else {
-                        char buf[2] = { ptr[0], 0 };
-                        qstring_append(str, buf);
-                    }
-                    break;
+
+        for (; *ptr; ptr = end) {
+            cp = mod_utf8_codepoint(ptr, 6, &end);
+            switch (cp) {
+            case '\"':
+                qstring_append(str, "\\\"");
+                break;
+            case '\\':
+                qstring_append(str, "\\\\");
+                break;
+            case '\b':
+                qstring_append(str, "\\b");
+                break;
+            case '\f':
+                qstring_append(str, "\\f");
+                break;
+            case '\n':
+                qstring_append(str, "\\n");
+                break;
+            case '\r':
+                qstring_append(str, "\\r");
+                break;
+            case '\t':
+                qstring_append(str, "\\t");
+                break;
+            default:
+                if (cp < 0) {
+                    cp = 0xFFFD; /* replacement character */
                 }
+                if (cp > 0xFFFF) {
+                    /* beyond BMP; need a surrogate pair */
+                    snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
+                             0xD800 + ((cp - 0x10000) >> 10),
+                             0xDC00 + ((cp - 0x10000) & 0x3FF));
+                } else if (cp < 0x20 || cp >= 0x7F) {
+                    snprintf(buf, sizeof(buf), "\\u%04X", cp);
+                } else {
+                    buf[0] = cp;
+                    buf[1] = 0;
                 }
-            ptr++;
-        }
+                qstring_append(str, buf);
+            }
+        };
+
         qstring_append(str, "\"");
         break;
     }
diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index 54074a9..4e74548 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -144,13 +144,10 @@ static void utf8_string(void)
      * The JSON parser rejects some invalid sequences, but accepts
      * others without correcting the problem.
      *
-     * The JSON formatter replaces some invalid sequences by U+FFFF (a
-     * noncharacter), and goes wonky for others.
-     *
-     * For both directions, we should either reject all invalid
-     * sequences, or minimize overlong sequences and replace all other
-     * invalid sequences by a suitable replacement character. A
-     * common choice for replacement is U+FFFD.
+     * We should either reject all invalid sequences, or minimize
+     * overlong sequences and replace all other invalid sequences by a
+     * suitable replacement character. A common choice for
+     * replacement is U+FFFD.
      *
      * Problem: we can't easily deal with embedded U+0000.
Parsing * the JSON string "this \\u0000" is fun" yields "this \0 is fun", @@ -175,16 +172,10 @@ static void utf8_string(void) * - bug: rejected * JSON parser rejects invalid sequence(s) * We may choose to define this as feature - * - bug: want "\"...\"" - * JSON formatter produces incorrect result, this is the - * correct one, assuming replacement character U+FFFF - * - bug: want "..." (no \") + * - bug: want "..." * JSON parser produces incorrect result, this is the * correct one, assuming replacement character U+FFFF * We may choose to reject instead of replace - * Not marked explicitly, but trivial to find: - * - JSON formatter replacing invalid sequence by \\uFFFF is a - * bug if we want it to fail for invalid sequences. */ /* 1 Some correct UTF-8 text */ @@ -209,7 +200,8 @@ static void utf8_string(void) { "\"\\u0000\"", "", /* bug: want overlong "\xC0\x80" */ - "\"\"", /* bug: want "\"\\u0000\"" */ + "\"\\u0000\"", + "\xC0\x80", }, /* 2.1.2 2 bytes U+0080 */ { @@ -227,20 +219,20 @@ static void utf8_string(void) { "\"\xF0\x90\x80\x80\"", "\xF0\x90\x80\x80", - "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ + "\"\\uD800\\uDC00\"", }, /* 2.1.5 5 bytes U+200000 */ { "\"\xF8\x88\x80\x80\x80\"", - NULL, /* bug: rejected */ - "\"\\u8200\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xF8\x88\x80\x80\x80", }, /* 2.1.6 6 bytes U+4000000 */ { "\"\xFC\x84\x80\x80\x80\x80\"", - NULL, /* bug: rejected */ - "\"\\uC100\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFC\x84\x80\x80\x80\x80", }, /* 2.2 Last possible sequence of a certain length */ @@ -248,7 +240,7 @@ static void utf8_string(void) { "\"\x7F\"", "\x7F", - "\"\177\"", + "\"\\u007F\"", }, /* 2.2.2 2 bytes U+07FF */ { @@ -273,22 +265,22 @@ static void utf8_string(void) /* 2.2.4 4 bytes U+1FFFFF */ { "\"\xF7\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\u7FFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ 
+ NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xF7\xBF\xBF\xBF", }, /* 2.2.5 5 bytes U+3FFFFFF */ { "\"\xFB\xBF\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\uBFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFB\xBF\xBF\xBF\xBF", }, /* 2.2.6 6 bytes U+7FFFFFFF */ { "\"\xFD\xBF\xBF\xBF\xBF\xBF\"", - NULL, /* bug: rejected */ - "\"\\uDFFF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + NULL, /* bug: rejected */ + "\"\\uFFFD\"", "\xFD\xBF\xBF\xBF\xBF\xBF", }, /* 2.3 Other boundary conditions */ @@ -314,13 +306,13 @@ static void utf8_string(void) /* last one in last plane: U+10FFFD */ "\"\xF4\x8F\xBF\xBD\"", "\xF4\x8F\xBF\xBD", - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ + "\"\\uDBFF\\uDFFD\"" }, { /* first one beyond Unicode range: U+110000 */ "\"\xF4\x90\x80\x80\"", "\xF4\x90\x80\x80", - "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3 Malformed sequences */ /* 3.1 Unexpected continuation bytes */ @@ -328,49 +320,49 @@ static void utf8_string(void) { "\"\x80\"", "\x80", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.1.2 Last continuation byte */ { "\"\xBF\"", "\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.1.3 2 continuation bytes */ { "\"\x80\xBF\"", "\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\"", }, /* 3.1.4 3 continuation bytes */ { "\"\x80\xBF\x80\"", "\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.5 4 continuation bytes */ { "\"\x80\xBF\x80\xBF\"", "\x80\xBF\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.6 5 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\"", "\x80\xBF\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.7 6 continuation bytes */ { 
"\"\x80\xBF\x80\xBF\x80\xBF\"", "\x80\xBF\x80\xBF\x80\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.8 7 continuation bytes */ { "\"\x80\xBF\x80\xBF\x80\xBF\x80\"", "\x80\xBF\x80\xBF\x80\xBF\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, /* 3.1.9 Sequence of all 64 possible continuation bytes */ { @@ -391,14 +383,14 @@ static void utf8_string(void) "\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF" "\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7" "\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF", - "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"" }, /* 3.2 Lonely start characters */ /* 3.2.1 All 32 first bytes of 2-byte sequences, followed by space */ @@ -408,10 +400,10 @@ static void utf8_string(void) "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 " "\xD8 \xD9 \xDA \xDB \xDC \xDD \xDE \xDF \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF 
\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xC0 \xC1 \xC2 \xC3 \xC4 \xC5 \xC6 \xC7 " "\xC8 \xC9 \xCA \xCB \xCC \xCD \xCE \xCF " "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 " @@ -424,28 +416,28 @@ static void utf8_string(void) /* bug: not corrected */ "\xE0 \xE1 \xE2 \xE3 \xE4 \xE5 \xE6 \xE7 " "\xE8 \xE9 \xEA \xEB \xEC \xED \xEE \xEF ", - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF " - "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD " + "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", }, /* 3.2.3 All 8 first bytes of 4-byte sequences, followed by space */ { "\"\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 ", }, /* 3.2.4 All 4 first bytes of 5-byte sequences, followed by space */ { "\"\xF8 \xF9 \xFA \xFB \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \"", "\xF8 \xF9 \xFA \xFB ", }, /* 3.2.5 All 2 first bytes of 6-byte sequences, followed by space */ { "\"\xFC \xFD \"", NULL, /* bug: rejected */ - "\"\\uFFFF \\uFFFF \"", + "\"\\uFFFD \\uFFFD \"", "\xFC \xFD ", }, /* 3.3 Sequences with last continuation byte missing */ @@ -453,66 +445,66 @@ static void utf8_string(void) { "\"\xC0\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + 
"\"\\uFFFD\"", "\xC0", }, /* 3.3.2 3-byte sequence with last byte missing (U+0000) */ { "\"\xE0\x80\"", "\xE0\x80", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.3 4-byte sequence with last byte missing (U+0000) */ { "\"\xF0\x80\x80\"", "\xF0\x80\x80", /* bug: not corrected */ - "\"\\u0000\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.4 5-byte sequence with last byte missing (U+0000) */ { "\"\xF8\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80", }, /* 3.3.5 6-byte sequence with last byte missing (U+0000) */ { "\"\xFC\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80", }, /* 3.3.6 2-byte sequence with last byte missing (U+07FF) */ { "\"\xDF\"", "\xDF", /* bug: not corrected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", }, /* 3.3.7 3-byte sequence with last byte missing (U+FFFF) */ { "\"\xEF\xBF\"", "\xEF\xBF", /* bug: not corrected */ - "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 3.3.8 4-byte sequence with last byte missing (U+1FFFFF) */ { "\"\xF7\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\u7FFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF7\xBF\xBF", }, /* 3.3.9 5-byte sequence with last byte missing (U+3FFFFFF) */ { "\"\xFB\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uBFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFB\xBF\xBF\xBF", }, /* 3.3.10 6-byte sequence with last byte missing (U+7FFFFFFF) */ { "\"\xFD\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uDFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"", */ + "\"\\uFFFD\"", "\xFD\xBF\xBF\xBF\xBF", }, /* 3.4 Concatenation of incomplete sequences */ @@ -520,10 +512,8 @@ static void utf8_string(void) "\"\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80" 
"\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF" - "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */ - "\"\\u0020\\uFFFF\\u0000\\u8000\\uFFFF\\uC000\\uFFFF\\uFFFF" - "\\u07EF\\uFFFF\\u7FFF\\uBFFF\\uFFFF\\uDFFF\\uFFFF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", "\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80" "\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF", }, @@ -531,20 +521,19 @@ static void utf8_string(void) { "\"\xFE\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xFE", }, { "\"\xFF\"", NULL, /* bug: rejected */ - "\"\\uFFFF\"", + "\"\\uFFFD\"", "\xFF", }, { "\"\xFE\xFE\xFF\xFF\"", NULL, /* bug: rejected */ - /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */ - "\"\\uEFBF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", "\xFE\xFE\xFF\xFF", }, /* 4 Overlong sequences */ @@ -552,29 +541,29 @@ static void utf8_string(void) { "\"\xC0\xAF\"", NULL, /* bug: rejected */ - "\"\\u002F\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xC0\xAF", }, { "\"\xE0\x80\xAF\"", "\xE0\x80\xAF", /* bug: not corrected */ - "\"\\u002F\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", }, { "\"\xF0\x80\x80\xAF\"", "\xF0\x80\x80\xAF", /* bug: not corrected */ - "\"\\u0000\\uFFFF\"" /* bug: want "\"/\"" */ + "\"\\uFFFD\"", }, { "\"\xF8\x80\x80\x80\xAF\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80\xAF", }, { "\"\xFC\x80\x80\x80\x80\xAF\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80\xAF", }, /* @@ -587,14 +576,14 @@ static void utf8_string(void) /* \U+007F */ "\"\xC1\xBF\"", NULL, /* bug: rejected */ - "\"\\u007F\"", /* bug: want "\"\177\"" */ + "\"\\uFFFD\"", "\xC1\xBF", }, { /* \U+07FF */ "\"\xE0\x9F\xBF\"", "\xE0\x9F\xBF", /* bug: not corrected */ - 
"\"\\u07FF\"", + "\"\\uFFFD\"", }, { /* @@ -605,20 +594,20 @@ static void utf8_string(void) */ "\"\xF0\x8F\xBF\xBC\"", "\xF0\x8F\xBF\xBC", /* bug: not corrected */ - "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+1FFFFF */ "\"\xF8\x87\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\u81FF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xF8\x87\xBF\xBF\xBF", }, { /* \U+3FFFFFF */ "\"\xFC\x83\xBF\xBF\xBF\xBF\"", NULL, /* bug: rejected */ - "\"\\uC0FF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", "\xFC\x83\xBF\xBF\xBF\xBF", }, /* 4.3 Overlong representation of the NUL character */ @@ -633,26 +622,26 @@ static void utf8_string(void) /* \U+0000 */ "\"\xE0\x80\x80\"", "\xE0\x80\x80", /* bug: not corrected */ - "\"\\u0000\"", + "\"\\uFFFD\"", }, { /* \U+0000 */ "\"\xF0\x80\x80\x80\"", "\xF0\x80\x80\x80", /* bug: not corrected */ - "\"\\u0000\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", }, { /* \U+0000 */ "\"\xF8\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", "\xF8\x80\x80\x80\x80", }, { /* \U+0000 */ "\"\xFC\x80\x80\x80\x80\x80\"", NULL, /* bug: rejected */ - "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */ + "\"\\uFFFD\"", "\xFC\x80\x80\x80\x80\x80", }, /* 5 Illegal code positions */ @@ -661,92 +650,92 @@ static void utf8_string(void) /* \U+D800 */ "\"\xED\xA0\x80\"", "\xED\xA0\x80", /* bug: not corrected */ - "\"\\uD800\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DB7F */ "\"\xED\xAD\xBF\"", "\xED\xAD\xBF", /* bug: not corrected */ - "\"\\uDB7F\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DB80 */ "\"\xED\xAE\x80\"", "\xED\xAE\x80", /* bug: not corrected */ - "\"\\uDB80\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DBFF */ "\"\xED\xAF\xBF\"", "\xED\xAF\xBF", /* bug: not corrected */ - "\"\\uDBFF\"", /* bug: want "\"\\uFFFF\"" */ + 
"\"\\uFFFD\"", }, { /* \U+DC00 */ "\"\xED\xB0\x80\"", "\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDC00\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DF80 */ "\"\xED\xBE\x80\"", "\xED\xBE\x80", /* bug: not corrected */ - "\"\\uDF80\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, { /* \U+DFFF */ "\"\xED\xBF\xBF\"", "\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDFFF\"", /* bug: want "\"\\uFFFF\"" */ + "\"\\uFFFD\"", }, /* 5.2 Paired UTF-16 surrogates */ { /* \U+D800\U+DC00 */ "\"\xED\xA0\x80\xED\xB0\x80\"", "\xED\xA0\x80\xED\xB0\x80", /* bug: not corrected */ - "\"\\uD800\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+D800\U+DFFF */ "\"\xED\xA0\x80\xED\xBF\xBF\"", "\xED\xA0\x80\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uD800\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB7F\U+DC00 */ "\"\xED\xAD\xBF\xED\xB0\x80\"", "\xED\xAD\xBF\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDB7F\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB7F\U+DFFF */ "\"\xED\xAD\xBF\xED\xBF\xBF\"", "\xED\xAD\xBF\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDB7F\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB80\U+DC00 */ "\"\xED\xAE\x80\xED\xB0\x80\"", "\xED\xAE\x80\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDB80\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DB80\U+DFFF */ "\"\xED\xAE\x80\xED\xBF\xBF\"", "\xED\xAE\x80\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDB80\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DBFF\U+DC00 */ "\"\xED\xAF\xBF\xED\xB0\x80\"", "\xED\xAF\xBF\xED\xB0\x80", /* bug: not corrected */ - "\"\\uDBFF\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, { /* \U+DBFF\U+DFFF */ "\"\xED\xAF\xBF\xED\xBF\xBF\"", "\xED\xAF\xBF\xED\xBF\xBF", /* bug: not corrected */ - "\"\\uDBFF\\uDFFF\"", /* bug: 
want "\"\\uFFFF\\uFFFF\"" */ + "\"\\uFFFD\\uFFFD\"", }, /* 5.3 Other illegal code positions */ /* BMP noncharacters */ @@ -754,25 +743,25 @@ static void utf8_string(void) /* \U+FFFE */ "\"\xEF\xBF\xBE\"", "\xEF\xBF\xBE", /* bug: not corrected */ - "\"\\uFFFE\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* \U+FFFF */ "\"\xEF\xBF\xBF\"", "\xEF\xBF\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* U+FDD0 */ "\"\xEF\xB7\x90\"", "\xEF\xB7\x90", /* bug: not corrected */ - "\"\\uFDD0\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, { /* U+FDEF */ "\"\xEF\xB7\xAF\"", "\xEF\xB7\xAF", /* bug: not corrected */ - "\"\\uFDEF\"", /* bug: not corrected */ + "\"\\uFFFD\"", }, /* Plane 1 .. 16 noncharacters */ { @@ -810,15 +799,10 @@ static void utf8_string(void) "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", - /* bug: not corrected */ - "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" - "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" - "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" - "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" - "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" - "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" - "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" - "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", + "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD" + "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"", }, {} }; @@ -856,8 +840,8 @@ static void utf8_string(void) qobject_decref(obj); /* - * Disabled, because json_out currently contains the crap - * qobject_to_json() produces. + * Disabled, because qobject_from_json() is buggy, and I can't + * be bothered to add the expected incorrect results. 
* FIXME Enable once these bugs have been fixed. */ if (0 && json_out != json_in) { -- 1.7.11.7
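The expected strings in the test table above follow one rule: each invalid sequence in the input turns into a single \uFFFD in the formatted output. A minimal sketch of that rule follows. The helpers sketch_codepoint(), sketch_escape() and sketch_selftest() are hypothetical names, not the patch's mod_utf8_codepoint() or to_json(). The sketch assumes a lead byte fixes the sequence length (up to six bytes), ignores modified UTF-8's "\xC0\x80" encoding of NUL, and picks its own policy for which valid characters get escaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, NOT the patch's mod_utf8_codepoint().
 * Decode one sequence from s[0..n-1].  Store the number of bytes
 * consumed in *len.  Return the code point, or -1 if invalid.
 */
static long sketch_codepoint(const char *s, size_t n, size_t *len)
{
    /* Smallest code point that needs a sequence of the given length */
    static const long min_cp[7] = {
        0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
    };
    const unsigned char *p = (const unsigned char *)s;
    size_t want, i;
    long cp;

    if (n == 0) {
        *len = 0;
        return -1;
    }
    *len = 1;
    if (p[0] < 0x80) {
        return p[0];
    }
    if (p[0] < 0xC0 || p[0] >= 0xFE) {
        return -1;              /* stray continuation byte, 0xFE, 0xFF */
    }
    if (p[0] < 0xE0) {
        want = 2; cp = p[0] & 0x1F;
    } else if (p[0] < 0xF0) {
        want = 3; cp = p[0] & 0x0F;
    } else if (p[0] < 0xF8) {
        want = 4; cp = p[0] & 0x07;
    } else if (p[0] < 0xFC) {
        want = 5; cp = p[0] & 0x03;
    } else {
        want = 6; cp = p[0] & 0x01;
    }
    for (i = 1; i < want; i++) {
        if (i >= n || (p[i] & 0xC0) != 0x80) {
            *len = i;
            return -1;          /* truncated: one error for the lot */
        }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *len = want;
    if (cp < min_cp[want] || cp > 0x10FFFF      /* overlong, too big */
        || (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogate */
        || (cp >= 0xFDD0 && cp <= 0xFDEF)       /* noncharacter */
        || (cp & 0xFFFE) == 0xFFFE) {           /* noncharacter */
        return -1;
    }
    return cp;
}

/* Escape s into out[0..cap-1] (no enclosing quotes); return length. */
static size_t sketch_escape(const char *s, char *out, size_t cap)
{
    size_t n = strlen(s), used = 0, len;
    unsigned long u;
    long cp;

    if (cap == 0) {
        return 0;
    }
    out[0] = '\0';
    /* 13 = longest escape (a surrogate pair, 12 bytes) plus NUL */
    while (n > 0 && cap - used >= 13) {
        cp = sketch_codepoint(s, n, &len);
        s += len;
        n -= len;
        u = cp < 0 ? 0xFFFD : (unsigned long)cp;    /* replace invalid */
        if (u == '"' || u == '\\') {
            used += (size_t)snprintf(out + used, cap - used, "\\%c", (int)u);
        } else if (u >= 0x20 && u < 0x7F) {
            used += (size_t)snprintf(out + used, cap - used, "%c", (int)u);
        } else if (u > 0xFFFF) {
            u -= 0x10000;       /* split into a UTF-16 surrogate pair */
            used += (size_t)snprintf(out + used, cap - used,
                                     "\\u%04lX\\u%04lX",
                                     0xD800 + (u >> 10), 0xDC00 + (u & 0x3FF));
        } else {
            used += (size_t)snprintf(out + used, cap - used, "\\u%04lX", u);
        }
    }
    return used;
}

/* Two cases taken from the test table: 3.3.2 and an overlong '/'. */
void sketch_selftest(void)
{
    char buf[64];

    sketch_escape("\xE0\x80", buf, sizeof(buf));
    assert(strcmp(buf, "\\uFFFD") == 0);
    sketch_escape("\xC0\xAF", buf, sizeof(buf));
    assert(strcmp(buf, "\\uFFFD") == 0);
}
```

The sketch reports a truncated sequence as one error covering all bytes consumed so far, which is what makes the concatenated case in 3.4 come out as ten \uFFFD rather than one per byte.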
* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-04-11 16:11 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori, lersek

Rats, forgot --subject-prefix="PATCH v2". My apologies!

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Laszlo Ersek @ 2013-04-11 17:03 UTC
To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

On 04/11/13 18:07, Markus Armbruster wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.
>
> The JSON parser is still as broken as ever. Left for another day.
>
> v2:
> - Rebased, trivial conflicts in PATCH 1/4.
> - Make mod_utf8_codepoint() treat empty input as invalid sequence of
>   length zero (both when n==0 and when n>0 && *s==0). No code in this
>   series passes empty input.
> - Some commit messages and comments improved.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Improve a few comments, delete bogus ones
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++---------
>  tests/check-qjson.c   | 308 ++++++++++++++++++++++++++++++--------------------
>  util/Makefile.objs    |   2 +-
>  util/unicode.c        | 100 ++++++++++++++++
>  5 files changed, 333 insertions(+), 182 deletions(-)
>  create mode 100644 util/unicode.c
>

I compared this v2 series patch-wise to v1.

Reviewed-by: Laszlo Ersek <lersek@redhat.com>

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Blue Swirl @ 2013-04-13 19:54 UTC
To: Markus Armbruster; +Cc: Anthony Liguori, Laszlo Ersek, qemu-devel

Thanks, applied all.

On Thu, Apr 11, 2013 at 4:07 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.
>
> The JSON parser is still as broken as ever. Left for another day.
>
> v2:
> - Rebased, trivial conflicts in PATCH 1/4.
> - Make mod_utf8_codepoint() treat empty input as invalid sequence of
>   length zero (both when n==0 and when n>0 && *s==0). No code in this
>   series passes empty input.
> - Some commit messages and comments improved.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Improve a few comments, delete bogus ones
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++---------
>  tests/check-qjson.c   | 308 ++++++++++++++++++++++++++++++--------------------
>  util/Makefile.objs    |   2 +-
>  util/unicode.c        | 100 ++++++++++++++++
>  5 files changed, 333 insertions(+), 182 deletions(-)
>  create mode 100644 util/unicode.c
>
> --
> 1.7.11.7
>

* [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-03-14 17:49 UTC
To: qemu-devel; +Cc: blauwirbel, aliguori

This should unbreak "make check" on machines where char is unsigned.
Blue, please give it a whirl.

The JSON parser is still as broken as ever. Left for another day.

Markus Armbruster (4):
  unicode: New mod_utf8_codepoint()
  check-qjson: Fix up a few bogus comments
  check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

 include/qemu-common.h |   3 +
 qobject/qjson.c       | 102 ++++++++----------
 tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++---------------------
 util/Makefile.objs    |   1 +
 util/unicode.c        |  96 +++++++++++++++++
 5 files changed, 306 insertions(+), 176 deletions(-)
 create mode 100644 util/unicode.c

--
1.7.11.7

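The cover letter above ties the "make check" failure to hosts where plain char is unsigned, but does not spell out which expression misbehaved. As a general illustration of that hazard only (hypothetical helper names, not code from this series), here is a byte test whose result depends on the signedness of char, next to a version that does not:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical: the result depends on whether plain char is signed. */
static bool naive_is_non_ascii(const char *s)
{
    int c = *s;         /* byte 0xC0 becomes -64 where char is signed */

    return c >= 0x80;
}

/* Same test, independent of the signedness of plain char. */
static bool portable_is_non_ascii(const char *s)
{
    return (unsigned char)*s >= 0x80;
}

/* CHAR_MIN is 0 exactly when plain char is unsigned. */
void signedness_selftest(void)
{
    assert(portable_is_non_ascii("\xC0"));
    assert(naive_is_non_ascii("\xC0") == (CHAR_MIN == 0));
}
```

Code written and tested only on one kind of host can bake either outcome into its expected results, which is why converting to unsigned char before comparing or shifting bytes is the usual precaution.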
* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Blue Swirl @ 2013-03-17 19:55 UTC
To: Markus Armbruster; +Cc: aliguori, qemu-devel

On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.

With the patches applied there are no errors, thanks.
Tested-by: Blue Swirl <blauwirbel@gmail.com>

Though test-coroutine seems to hang, maybe fallout from recent
coroutine changes.

>
> The JSON parser is still as broken as ever. Left for another day.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Fix up a few bogus comments
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++----------
>  tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++---------------------
>  util/Makefile.objs    |   1 +
>  util/unicode.c        |  96 +++++++++++++++++
>  5 files changed, 306 insertions(+), 176 deletions(-)
>  create mode 100644 util/unicode.c
>
> --
> 1.7.11.7
>

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-03-18 9:58 UTC
To: Blue Swirl; +Cc: aliguori, qemu-devel

Blue Swirl <blauwirbel@gmail.com> writes:

> On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> This should unbreak "make check" on machines where char is unsigned.
>> Blue, please give it a whirl.
>
> With the patches applied there are no errors, thanks.
> Tested-by: Blue Swirl <blauwirbel@gmail.com>

Thanks!

> Though test-coroutine seems to hang, maybe fallout from recent
> coroutine changes.

I've seen rtc-test hang intermittently; haven't gotten around to
digging for roots.

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Blue Swirl @ 2013-03-23 14:44 UTC
To: Markus Armbruster; +Cc: aliguori, qemu-devel

On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.

Patches no longer apply, please rebase.

>
> The JSON parser is still as broken as ever. Left for another day.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Fix up a few bogus comments
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++----------
>  tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++---------------------
>  util/Makefile.objs    |   1 +
>  util/unicode.c        |  96 +++++++++++++++++
>  5 files changed, 306 insertions(+), 176 deletions(-)
>  create mode 100644 util/unicode.c
>
> --
> 1.7.11.7
>

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
From: Markus Armbruster @ 2013-04-11 16:12 UTC
To: Blue Swirl; +Cc: aliguori, qemu-devel

Blue Swirl <blauwirbel@gmail.com> writes:

> On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> This should unbreak "make check" on machines where char is unsigned.
>> Blue, please give it a whirl.
>
> Patches no longer apply, please rebase.

Sent.
