[Qemu-devel] [PATCH 0/4] Fix JSON string formatter

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
@ 2013-03-14 17:49 Markus Armbruster
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Markus Armbruster @ 2013-03-14 17:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, aliguori

This should unbreak "make check" on machines where char is unsigned.
Blue, please give it a whirl.

The JSON parser is still as broken as ever.  Left for another day.

Markus Armbruster (4):
  unicode: New mod_utf8_codepoint()
  check-qjson: Fix up a few bogus comments
  check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

 include/qemu-common.h |   3 +
 qobject/qjson.c       | 102 ++++++++----------
 tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++---------------------
 util/Makefile.objs    |   1 +
 util/unicode.c        |  96 +++++++++++++++++
 5 files changed, 306 insertions(+), 176 deletions(-)
 create mode 100644 util/unicode.c

-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()
  2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
@ 2013-03-14 17:49 ` Markus Armbruster
  2013-03-21 19:37   ` Laszlo Ersek
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments Markus Armbruster
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Markus Armbruster @ 2013-03-14 17:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, aliguori


Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 include/qemu-common.h |  3 ++
 util/Makefile.objs    |  1 +
 util/unicode.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+)
 create mode 100644 util/unicode.c

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 5e13708..28c10a0 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -442,4 +442,7 @@ int64_t pow2floor(int64_t value);
 int uleb128_encode_small(uint8_t *out, uint32_t n);
 int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
+/* unicode.c */
+int mod_utf8_codepoint(const char *s, size_t n, char **end);
+
 #endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index cad5ce8..65c8c12 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -9,3 +9,4 @@ util-obj-y += error.o qemu-error.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o
 util-obj-y += qemu-option.o qemu-progress.o
+util-obj-y += unicode.o
diff --git a/util/unicode.c b/util/unicode.c
new file mode 100644
index 0000000..954bcf7
--- /dev/null
+++ b/util/unicode.c
@@ -0,0 +1,96 @@
+/*
+ * Dealing with Unicode
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster <armbru@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+
+/**
+ * mod_utf8_codepoint:
+ * @s: string encoded in modified UTF-8
+ * @n: maximum number of bytes to read from @s, if less than 6
+ * @end: set to end of sequence on return
+ *
+ * Convert the modified UTF-8 sequence at the start of @s.  Modified
+ * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
+ * "\xC0\x80".
+ *
+ * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
+ * byte, the sequence is invalid, and @end is set to @s + 1
+ *
+ * Else, the first byte determines how many continuation bytes are
+ * expected.  If there are fewer, the sequence is invalid, and @end is
+ * set to @s + 1 + actual number of continuation bytes.  Else, the
+ * sequence is well-formed, and @end is set to @s + 1 + expected
+ * number of continuation bytes.
+ *
+ * A well-formed sequence is valid unless it encodes a code point
+ * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
+ * noncharacters, a surrogate code point, or is overlong.  Except the
+ * overlong sequence "\xC0\x80" is valid.
+ *
+ * Conversion succeeds if and only if the sequence is valid.
+ *
+ * Returns: the Unicode code point on success, -1 on failure.
+ */
+int mod_utf8_codepoint(const char *s, size_t n, char **end)
+{
+    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
+    const unsigned char *p;
+    unsigned byte, mask, len, i;
+    int cp;
+
+    if (n == 0) {
+        *end = (char *)s;
+        return 0;
+    }
+
+    p = (const unsigned char *)s;
+    byte = *p++;
+    if (byte < 0x80) {
+        cp = byte;              /* one byte sequence */
+    } else if (byte >= 0xFE) {
+        cp = -1;                /* impossible bytes 0xFE, 0xFF */
+    } else if ((byte & 0x40) == 0) {
+        cp = -1;                /* unexpected continuation byte */
+    } else {
+        /* multi-byte sequence */
+        len = 0;
+        for (mask = 0x80; byte & mask; mask >>= 1) {
+            len++;
+        }
+        assert(len > 1 && len < 7);
+        cp = byte & (mask - 1);
+        for (i = 1; i < len; i++) {
+            byte = i < n ? *p : 0;
+            if ((byte & 0xC0) != 0x80) {
+                cp = -1;        /* continuation byte missing */
+                goto out;
+            }
+            p++;
+            cp <<= 6;
+            cp |= byte & 0x3F;
+        }
+        if (cp > 0x10FFFF) {
+            cp = -1;            /* beyond Unicode range */
+        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
+                   || (cp & 0xFFFE) == 0xFFFE) {
+            cp = -1;            /* noncharacter */
+        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
+            cp = -1;            /* surrogate code point */
+        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
+            cp = -1;            /* overlong, not \xC0\x80 */
+        }
+    }
+
+out:
+    *end = (char *)p;
+    return cp;
+}
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments
  2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
@ 2013-03-14 17:49 ` Markus Armbruster
  2013-03-21 20:06   ` Laszlo Ersek
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Markus Armbruster @ 2013-03-14 17:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, aliguori


Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 tests/check-qjson.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index ec85a0c..852124a 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -4,7 +4,7 @@
  *
  * Authors:
  *  Anthony Liguori   <aliguori@us.ibm.com>
- *  Markus Armbruster <armbru@redhat.com>,
+ *  Markus Armbruster <armbru@redhat.com>
  *
  * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
  * See the COPYING.LIB file in the top-level directory.
@@ -462,8 +462,7 @@ static void utf8_string(void)
         },
         /* 3.3.4  5-byte sequence with last byte missing (U+0000) */
         {
-            /* invalid */
-            "\"\xF8\x80\x80\x80\"", /* bug: not corrected */
+            "\"\xF8\x80\x80\x80\"",
             NULL,                   /* bug: rejected */
             "\"\\u8000\\uFFFF\"",   /* bug: want "\"\\uFFFF\"" */
             "\xF8\x80\x80\x80",
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments Markus Armbruster
@ 2013-03-14 17:49 ` Markus Armbruster
  2013-03-21 20:22   ` Laszlo Ersek
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Markus Armbruster @ 2013-03-14 17:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, aliguori

These are all broken, too.

A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
noncharacters some more instead of what they're supposed to test.  Use
U+FFFD and U+10FFFD instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 tests/check-qjson.c | 85 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 13 deletions(-)

diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index 852124a..efec1b2 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -158,7 +158,7 @@ static void utf8_string(void)
      * consider using overlong encoding \xC0\x80 for U+0000 ("modified
      * UTF-8").
      *
-     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
+     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
      * capability and stress test at
      * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
      */
@@ -256,11 +256,11 @@ static void utf8_string(void)
             "\xDF\xBF",
             "\"\\u07FF\"",
         },
-        /* 2.2.3  3 bytes U+FFFF */
+        /* 2.2.3  3 bytes U+FFFD */
         {
-            "\"\xEF\xBF\xBF\"",
-            "\xEF\xBF\xBF",
-            "\"\\uFFFF\"",
+            "\"\xEF\xBF\xBD\"",
+            "\xEF\xBF\xBD",
+            "\"\\uFFFD\"",
         },
         /* 2.2.4  4 bytes U+1FFFFF */
         {
@@ -303,10 +303,10 @@ static void utf8_string(void)
             "\"\\uFFFD\"",
         },
         {
-            /* U+10FFFF */
-            "\"\xF4\x8F\xBF\xBF\"",
-            "\xF4\x8F\xBF\xBF",
-            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
+            /* U+10FFFD */
+            "\"\xF4\x8F\xBF\xBD\"",
+            "\xF4\x8F\xBF\xBD",
+            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
         },
         {
             /* U+110000 */
@@ -584,9 +584,9 @@ static void utf8_string(void)
             "\"\\u07FF\"",
         },
         {
-            /* \U+FFFF */
-            "\"\xF0\x8F\xBF\xBF\"",
-            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
+            /* \U+FFFD */
+            "\"\xF0\x8F\xBF\xBD\"",
+            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
             "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
         },
         {
@@ -731,6 +731,7 @@ static void utf8_string(void)
             "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
         },
         /* 5.3  Other illegal code positions */
+        /* BMP noncharacters */
         {
             /* \U+FFFE */
             "\"\xEF\xBF\xBE\"",
@@ -741,7 +742,65 @@ static void utf8_string(void)
             /* \U+FFFF */
             "\"\xEF\xBF\xBF\"",
             "\xEF\xBF\xBF",     /* bug: not corrected */
-            "\"\\uFFFF\"",      /* bug: not corrected */
+            "\"\\uFFFF\"",
+        },
+        {
+            /* U+FDD0 */
+            "\"\xEF\xB7\x90\"",
+            "\xEF\xB7\x90",     /* bug: not corrected */
+            "\"\\uFDD0\"",      /* bug: not corrected */
+        },
+        {
+            /* U+FDEF */
+            "\"\xEF\xB7\xAF\"",
+            "\xEF\xB7\xAF",     /* bug: not corrected */
+            "\"\\uFDEF\"",      /* bug: not corrected */
+        },
+        /* Plane 1 .. 16 noncharacters */
+        {
+            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
+            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
+            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
+            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
+            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
+            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
+            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
+            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
+            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
+            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
+            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
+            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
+            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
+            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
+            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
+            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
+            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
+            /* bug: not corrected */
+            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
+            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
+            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
+            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
+            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
+            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
+            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
+            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
+            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
+            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
+            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
+            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
+            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
+            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
+            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
+            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
+            /* bug: not corrected */
+            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
+            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
+            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
+            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
+            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
+            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
+            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
+            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
         },
         {}
     };
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
  2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
                   ` (2 preceding siblings ...)
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster
@ 2013-03-14 17:49 ` Markus Armbruster
  2013-03-21 20:44   ` Laszlo Ersek
  2013-03-22 13:15   ` Laszlo Ersek
  2013-03-17 19:55 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Blue Swirl
  2013-03-23 14:44 ` Blue Swirl
  5 siblings, 2 replies; 20+ messages in thread
From: Markus Armbruster @ 2013-03-14 17:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, aliguori

Known bugs in to_json():

* A start byte for a three-byte sequence followed by less than two
  continuation bytes is split into one-byte sequences.

* Start bytes for sequences longer than three bytes get misinterpreted
  as start bytes for three-byte sequences.  Continuation bytes beyond
  byte three become one-byte sequences.

  This means all characters outside the BMP are decoded incorrectly.

* One-byte sequences with the MSB are put into the JSON string
  verbatim when char is unsigned, producing invalid UTF-8.  When char
  is signed, they're replaced by "\\uFFFF" instead.

  This includes \xFE, \xFF, and stray continuation bytes.

* Overlong sequences are happily accepted, unless screwed up by the
  bugs above.

* Likewise, sequences encoding surrogate code points or noncharacters.

* Unlike other control characters, ASCII DEL is not escaped.  Except
  in overlong encodings.

My rewrite fixes them as follows:

* Malformed UTF-8 sequences are replaced.

  Except the overlong encoding \xC0\x80 of U+0000 is still accepted.
  Permits embedding NUL characters in C strings.  This trick is known
  as "Modified UTF-8".

* Sequences encoding code points beyond Unicode range are replaced.

* Sequences encoding code points beyond the BMP produce a surrogate
  pair.

* Sequences encoding surrogate code points are replaced.

* Sequences encoding noncharacters are replaced.

* ASCII DEL is now always escaped.

The replacement character is U+FFFD.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/qjson.c     | 102 +++++++++++--------------
 tests/check-qjson.c | 216 ++++++++++++++++++++++++----------------------------
 2 files changed, 145 insertions(+), 173 deletions(-)

diff --git a/qobject/qjson.c b/qobject/qjson.c
index 83a6b4f..19085a1 100644
--- a/qobject/qjson.c
+++ b/qobject/qjson.c
@@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
     case QTYPE_QSTRING: {
         QString *val = qobject_to_qstring(obj);
         const char *ptr;
+        int cp;
+        char buf[16];
+        char *end;
 
         ptr = qstring_get_str(val);
         qstring_append(str, "\"");
-        while (*ptr) {
-            if ((ptr[0] & 0xE0) == 0xE0 &&
-                (ptr[1] & 0x80) && (ptr[2] & 0x80)) {
-                uint16_t wchar;
-                char escape[7];
-
-                wchar  = (ptr[0] & 0x0F) << 12;
-                wchar |= (ptr[1] & 0x3F) << 6;
-                wchar |= (ptr[2] & 0x3F);
-                ptr += 2;
-
-                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
-                qstring_append(str, escape);
-            } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) {
-                uint16_t wchar;
-                char escape[7];
-
-                wchar  = (ptr[0] & 0x1F) << 6;
-                wchar |= (ptr[1] & 0x3F);
-                ptr++;
-
-                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
-                qstring_append(str, escape);
-            } else switch (ptr[0]) {
-                case '\"':
-                    qstring_append(str, "\\\"");
-                    break;
-                case '\\':
-                    qstring_append(str, "\\\\");
-                    break;
-                case '\b':
-                    qstring_append(str, "\\b");
-                    break;
-                case '\f':
-                    qstring_append(str, "\\f");
-                    break;
-                case '\n':
-                    qstring_append(str, "\\n");
-                    break;
-                case '\r':
-                    qstring_append(str, "\\r");
-                    break;
-                case '\t':
-                    qstring_append(str, "\\t");
-                    break;
-                default: {
-                    if (ptr[0] <= 0x1F) {
-                        char escape[7];
-                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
-                        qstring_append(str, escape);
-                    } else {
-                        char buf[2] = { ptr[0], 0 };
-                        qstring_append(str, buf);
-                    }
-                    break;
+
+        for (; *ptr; ptr = end) {
+            cp = mod_utf8_codepoint(ptr, 6, &end);
+            switch (cp) {
+            case '\"':
+                qstring_append(str, "\\\"");
+                break;
+            case '\\':
+                qstring_append(str, "\\\\");
+                break;
+            case '\b':
+                qstring_append(str, "\\b");
+                break;
+            case '\f':
+                qstring_append(str, "\\f");
+                break;
+            case '\n':
+                qstring_append(str, "\\n");
+                break;
+            case '\r':
+                qstring_append(str, "\\r");
+                break;
+            case '\t':
+                qstring_append(str, "\\t");
+                break;
+            default:
+                if (cp < 0) {
+                    cp = 0xFFFD; /* replacement character */
                 }
+                if (cp > 0xFFFF) {
+                    /* beyond BMP; need a surrogate pair */
+                    snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
+                             0xD800 + ((cp - 0x10000) >> 10),
+                             0xDC00 + ((cp - 0x10000) & 0x3FF));
+                } else if (cp < 0x20 || cp >= 0x7F) {
+                    snprintf(buf, sizeof(buf), "\\u%04X", cp);
+                } else {
+                    buf[0] = cp;
+                    buf[1] = 0;
                 }
-            ptr++;
-        }
+                qstring_append(str, buf);
+            }
+        };
+
         qstring_append(str, "\"");
         break;
     }
diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index efec1b2..595ddc0 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -144,13 +144,10 @@ static void utf8_string(void)
      * The JSON parser rejects some invalid sequences, but accepts
      * others without correcting the problem.
      *
-     * The JSON formatter replaces some invalid sequences by U+FFFF (a
-     * noncharacter), and goes wonky for others.
-     *
-     * For both directions, we should either reject all invalid
-     * sequences, or minimize overlong sequences and replace all other
-     * invalid sequences by a suitable replacement character.  A
-     * common choice for replacement is U+FFFD.
+     * We should either reject all invalid sequences, or minimize
+     * overlong sequences and replace all other invalid sequences by a
+     * suitable replacement character.  A common choice for
+     * replacement is U+FFFD.
      *
      * Problem: we can't easily deal with embedded U+0000.  Parsing
      * the JSON string "this \\u0000" is fun" yields "this \0 is fun",
@@ -175,16 +172,10 @@ static void utf8_string(void)
          * - bug: rejected
          *   JSON parser rejects invalid sequence(s)
          *   We may choose to define this as feature
-         * - bug: want "\"...\""
-         *   JSON formatter produces incorrect result, this is the
-         *   correct one, assuming replacement character U+FFFF
-         * - bug: want "..." (no \")
+         * - bug: want "..."
          *   JSON parser produces incorrect result, this is the
          *   correct one, assuming replacement character U+FFFF
          *   We may choose to reject instead of replace
-         * Not marked explicitly, but trivial to find:
-         * - JSON formatter replacing invalid sequence by \\uFFFF is a
-         *   bug if we want it to fail for invalid sequences.
          */
 
         /* 1  Some correct UTF-8 text */
@@ -209,7 +200,8 @@ static void utf8_string(void)
         {
             "\"\\u0000\"",
             "",                 /* bug: want overlong "\xC0\x80" */
-            "\"\"",             /* bug: want "\"\\u0000\"" */
+            "\"\\u0000\"",
+            "\xC0\x80",
         },
         /* 2.1.2  2 bytes U+0080 */
         {
@@ -227,20 +219,20 @@ static void utf8_string(void)
         {
             "\"\xF0\x90\x80\x80\"",
             "\xF0\x90\x80\x80",
-            "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
+            "\"\\uD800\\uDC00\"",
         },
         /* 2.1.5  5 bytes U+200000 */
         {
             "\"\xF8\x88\x80\x80\x80\"",
-            NULL,                        /* bug: rejected */
-            "\"\\u8200\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            NULL,               /* bug: rejected */
+            "\"\\uFFFD\"",
             "\xF8\x88\x80\x80\x80",
         },
         /* 2.1.6  6 bytes U+4000000 */
         {
             "\"\xFC\x84\x80\x80\x80\x80\"",
-            NULL,                               /* bug: rejected */
-            "\"\\uC100\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            NULL,               /* bug: rejected */
+            "\"\\uFFFD\"",
             "\xFC\x84\x80\x80\x80\x80",
         },
         /* 2.2  Last possible sequence of a certain length */
@@ -248,7 +240,7 @@ static void utf8_string(void)
         {
             "\"\x7F\"",
             "\x7F",
-            "\"\177\"",
+            "\"\\u007F\"",
         },
         /* 2.2.2  2 bytes U+07FF */
         {
@@ -265,22 +257,22 @@ static void utf8_string(void)
         /* 2.2.4  4 bytes U+1FFFFF */
         {
             "\"\xF7\xBF\xBF\xBF\"",
-            NULL,                 /* bug: rejected */
-            "\"\\u7FFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            NULL,               /* bug: rejected */
+            "\"\\uFFFD\"",
             "\xF7\xBF\xBF\xBF",
         },
         /* 2.2.5  5 bytes U+3FFFFFF */
         {
             "\"\xFB\xBF\xBF\xBF\xBF\"",
-            NULL,                        /* bug: rejected */
-            "\"\\uBFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            NULL,               /* bug: rejected */
+            "\"\\uFFFD\"",
             "\xFB\xBF\xBF\xBF\xBF",
         },
         /* 2.2.6  6 bytes U+7FFFFFFF */
         {
             "\"\xFD\xBF\xBF\xBF\xBF\xBF\"",
-            NULL,                               /* bug: rejected */
-            "\"\\uDFFF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            NULL,               /* bug: rejected */
+            "\"\\uFFFD\"",
             "\xFD\xBF\xBF\xBF\xBF\xBF",
         },
         /* 2.3  Other boundary conditions */
@@ -306,13 +298,13 @@ static void utf8_string(void)
             /* U+10FFFD */
             "\"\xF4\x8F\xBF\xBD\"",
             "\xF4\x8F\xBF\xBD",
-            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
+            "\"\\uDBFF\\uDFFD\""
         },
         {
             /* U+110000 */
             "\"\xF4\x90\x80\x80\"",
             "\xF4\x90\x80\x80",
-            "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         /* 3  Malformed sequences */
         /* 3.1  Unexpected continuation bytes */
@@ -320,49 +312,49 @@ static void utf8_string(void)
         {
             "\"\x80\"",
             "\x80",             /* bug: not corrected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
         },
         /* 3.1.2  Last continuation byte */
         {
             "\"\xBF\"",
             "\xBF",             /* bug: not corrected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
         },
         /* 3.1.3  2 continuation bytes */
         {
             "\"\x80\xBF\"",
             "\x80\xBF",         /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\"",
         },
         /* 3.1.4  3 continuation bytes */
         {
             "\"\x80\xBF\x80\"",
             "\x80\xBF\x80",     /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\"",
         },
         /* 3.1.5  4 continuation bytes */
         {
             "\"\x80\xBF\x80\xBF\"",
             "\x80\xBF\x80\xBF", /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
         },
         /* 3.1.6  5 continuation bytes */
         {
             "\"\x80\xBF\x80\xBF\x80\"",
             "\x80\xBF\x80\xBF\x80", /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
         },
         /* 3.1.7  6 continuation bytes */
         {
             "\"\x80\xBF\x80\xBF\x80\xBF\"",
             "\x80\xBF\x80\xBF\x80\xBF", /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
         },
         /* 3.1.8  7 continuation bytes */
         {
             "\"\x80\xBF\x80\xBF\x80\xBF\x80\"",
             "\x80\xBF\x80\xBF\x80\xBF\x80", /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
         },
         /* 3.1.9  Sequence of all 64 possible continuation bytes */
         {
@@ -383,14 +375,14 @@ static void utf8_string(void)
             "\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF"
             "\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7"
             "\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF",
-            "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-            "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\""
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\""
         },
         /* 3.2  Lonely start characters */
         /* 3.2.1  All 32 first bytes of 2-byte sequences, followed by space */
@@ -400,10 +392,10 @@ static void utf8_string(void)
             "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 "
             "\xD8 \xD9 \xDA \xDB \xDC \xDD \xDE \xDF \"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF "
-            "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF "
-            "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF "
-            "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"",
+            "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD "
+            "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD "
+            "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD "
+            "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"",
             "\xC0 \xC1 \xC2 \xC3 \xC4 \xC5 \xC6 \xC7 "
             "\xC8 \xC9 \xCA \xCB \xCC \xCD \xCE \xCF "
             "\xD0 \xD1 \xD2 \xD3 \xD4 \xD5 \xD6 \xD7 "
@@ -416,28 +408,28 @@ static void utf8_string(void)
             /* bug: not corrected */
             "\xE0 \xE1 \xE2 \xE3 \xE4 \xE5 \xE6 \xE7 "
             "\xE8 \xE9 \xEA \xEB \xEC \xED \xEE \xEF ",
-            "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF "
-            "\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"",
+            "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD "
+            "\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"",
         },
         /* 3.2.3  All 8 first bytes of 4-byte sequences, followed by space */
         {
             "\"\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 \"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \\uFFFF \"",
+            "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \\uFFFD \"",
             "\xF0 \xF1 \xF2 \xF3 \xF4 \xF5 \xF6 \xF7 ",
         },
         /* 3.2.4  All 4 first bytes of 5-byte sequences, followed by space */
         {
             "\"\xF8 \xF9 \xFA \xFB \"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF \\uFFFF \\uFFFF \\uFFFF \"",
+            "\"\\uFFFD \\uFFFD \\uFFFD \\uFFFD \"",
             "\xF8 \xF9 \xFA \xFB ",
         },
         /* 3.2.5  All 2 first bytes of 6-byte sequences, followed by space */
         {
             "\"\xFC \xFD \"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF \\uFFFF \"",
+            "\"\\uFFFD \\uFFFD \"",
             "\xFC \xFD ",
         },
         /* 3.3  Sequences with last continuation byte missing */
@@ -445,66 +437,66 @@ static void utf8_string(void)
         {
             "\"\xC0\"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
             "\xC0",
         },
         /* 3.3.2  3-byte sequence with last byte missing (U+0000) */
         {
             "\"\xE0\x80\"",
             "\xE0\x80",           /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         /* 3.3.3  4-byte sequence with last byte missing (U+0000) */
         {
             "\"\xF0\x80\x80\"",
             "\xF0\x80\x80",     /* bug: not corrected */
-            "\"\\u0000\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         /* 3.3.4  5-byte sequence with last byte missing (U+0000) */
         {
             "\"\xF8\x80\x80\x80\"",
             NULL,                   /* bug: rejected */
-            "\"\\u8000\\uFFFF\"",   /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
             "\xF8\x80\x80\x80",
         },
         /* 3.3.5  6-byte sequence with last byte missing (U+0000) */
         {
             "\"\xFC\x80\x80\x80\x80\"",
             NULL,                        /* bug: rejected */
-            "\"\\uC000\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
             "\xFC\x80\x80\x80\x80",
         },
         /* 3.3.6  2-byte sequence with last byte missing (U+07FF) */
         {
             "\"\xDF\"",
             "\xDF",             /* bug: not corrected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
         },
         /* 3.3.7  3-byte sequence with last byte missing (U+FFFF) */
         {
             "\"\xEF\xBF\"",
             "\xEF\xBF",           /* bug: not corrected */
-            "\"\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         /* 3.3.8  4-byte sequence with last byte missing (U+1FFFFF) */
         {
             "\"\xF7\xBF\xBF\"",
             NULL,               /* bug: rejected */
-            "\"\\u7FFF\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
             "\xF7\xBF\xBF",
         },
         /* 3.3.9  5-byte sequence with last byte missing (U+3FFFFFF) */
         {
             "\"\xFB\xBF\xBF\xBF\"",
             NULL,                 /* bug: rejected */
-            "\"\\uBFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
             "\xFB\xBF\xBF\xBF",
         },
         /* 3.3.10  6-byte sequence with last byte missing (U+7FFFFFFF) */
         {
             "\"\xFD\xBF\xBF\xBF\xBF\"",
             NULL,                        /* bug: rejected */
-            "\"\\uDFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"", */
+            "\"\\uFFFD\"",
             "\xFD\xBF\xBF\xBF\xBF",
         },
         /* 3.4  Concatenation of incomplete sequences */
@@ -512,10 +504,8 @@ static void utf8_string(void)
             "\"\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80"
             "\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF\"",
             NULL,               /* bug: rejected */
-            /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF"
-               "\\uFFFF\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */
-            "\"\\u0020\\uFFFF\\u0000\\u8000\\uFFFF\\uC000\\uFFFF\\uFFFF"
-            "\\u07EF\\uFFFF\\u7FFF\\uBFFF\\uFFFF\\uDFFF\\uFFFF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
             "\xC0\xE0\x80\xF0\x80\x80\xF8\x80\x80\x80\xFC\x80\x80\x80\x80"
             "\xDF\xEF\xBF\xF7\xBF\xBF\xFB\xBF\xBF\xBF\xFD\xBF\xBF\xBF\xBF",
         },
@@ -523,20 +513,19 @@ static void utf8_string(void)
         {
             "\"\xFE\"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
             "\xFE",
         },
         {
             "\"\xFF\"",
             NULL,               /* bug: rejected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
             "\xFF",
         },
         {
             "\"\xFE\xFE\xFF\xFF\"",
             NULL,                 /* bug: rejected */
-            /* bug: want "\"\\uFFFF\\uFFFF\\uFFFF\\uFFFF\"" */
-            "\"\\uEFBF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
             "\xFE\xFE\xFF\xFF",
         },
         /* 4  Overlong sequences */
@@ -544,29 +533,29 @@ static void utf8_string(void)
         {
             "\"\xC0\xAF\"",
             NULL,               /* bug: rejected */
-            "\"\\u002F\"",      /* bug: want "\"/\"" */
+            "\"\\uFFFD\"",
             "\xC0\xAF",
         },
         {
             "\"\xE0\x80\xAF\"",
             "\xE0\x80\xAF",     /* bug: not corrected */
-            "\"\\u002F\"",      /* bug: want "\"/\"" */
+            "\"\\uFFFD\"",
         },
         {
             "\"\xF0\x80\x80\xAF\"",
             "\xF0\x80\x80\xAF",  /* bug: not corrected */
-            "\"\\u0000\\uFFFF\"" /* bug: want "\"/\"" */
+            "\"\\uFFFD\"",
         },
         {
             "\"\xF8\x80\x80\x80\xAF\"",
             NULL,                        /* bug: rejected */
-            "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */
+            "\"\\uFFFD\"",
             "\xF8\x80\x80\x80\xAF",
         },
         {
             "\"\xFC\x80\x80\x80\x80\xAF\"",
             NULL,                               /* bug: rejected */
-            "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"/\"" */
+            "\"\\uFFFD\"",
             "\xFC\x80\x80\x80\x80\xAF",
         },
         /* 4.2  Maximum overlong sequences */
@@ -574,33 +563,33 @@ static void utf8_string(void)
             /* \U+007F */
             "\"\xC1\xBF\"",
             NULL,               /* bug: rejected */
-            "\"\\u007F\"",      /* bug: want "\"\177\"" */
+            "\"\\uFFFD\"",
             "\xC1\xBF",
         },
         {
             /* \U+07FF */
             "\"\xE0\x9F\xBF\"",
             "\xE0\x9F\xBF",     /* bug: not corrected */
-            "\"\\u07FF\"",
+            "\"\\uFFFD\"",
         },
         {
             /* \U+FFFD */
             "\"\xF0\x8F\xBF\xBD\"",
             "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
-            "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+1FFFFF */
             "\"\xF8\x87\xBF\xBF\xBF\"",
             NULL,                        /* bug: rejected */
-            "\"\\u81FF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
             "\xF8\x87\xBF\xBF\xBF",
         },
         {
             /* \U+3FFFFFF */
             "\"\xFC\x83\xBF\xBF\xBF\xBF\"",
             NULL,                               /* bug: rejected */
-            "\"\\uC0FF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
             "\xFC\x83\xBF\xBF\xBF\xBF",
         },
         /* 4.3  Overlong representation of the NUL character */
@@ -615,26 +604,26 @@ static void utf8_string(void)
             /* \U+0000 */
             "\"\xE0\x80\x80\"",
             "\xE0\x80\x80",     /* bug: not corrected */
-            "\"\\u0000\"",
+            "\"\\uFFFD\"",
         },
         {
             /* \U+0000 */
             "\"\xF0\x80\x80\x80\"",
             "\xF0\x80\x80\x80",   /* bug: not corrected */
-            "\"\\u0000\\uFFFF\"", /* bug: want "\"\\u0000\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+0000 */
             "\"\xF8\x80\x80\x80\x80\"",
             NULL,                        /* bug: rejected */
-            "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */
+            "\"\\uFFFD\"",
             "\xF8\x80\x80\x80\x80",
         },
         {
             /* \U+0000 */
             "\"\xFC\x80\x80\x80\x80\x80\"",
             NULL,                               /* bug: rejected */
-            "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */
+            "\"\\uFFFD\"",
             "\xFC\x80\x80\x80\x80\x80",
         },
         /* 5  Illegal code positions */
@@ -643,92 +632,92 @@ static void utf8_string(void)
             /* \U+D800 */
             "\"\xED\xA0\x80\"",
             "\xED\xA0\x80",     /* bug: not corrected */
-            "\"\\uD800\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+DB7F */
             "\"\xED\xAD\xBF\"",
             "\xED\xAD\xBF",     /* bug: not corrected */
-            "\"\\uDB7F\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+DB80 */
             "\"\xED\xAE\x80\"",
             "\xED\xAE\x80",     /* bug: not corrected */
-            "\"\\uDB80\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+DBFF */
             "\"\xED\xAF\xBF\"",
             "\xED\xAF\xBF",     /* bug: not corrected */
-            "\"\\uDBFF\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+DC00 */
             "\"\xED\xB0\x80\"",
             "\xED\xB0\x80",     /* bug: not corrected */
-            "\"\\uDC00\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+DF80 */
             "\"\xED\xBE\x80\"",
             "\xED\xBE\x80",     /* bug: not corrected */
-            "\"\\uDF80\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+DFFF */
             "\"\xED\xBF\xBF\"",
             "\xED\xBF\xBF",     /* bug: not corrected */
-            "\"\\uDFFF\"",      /* bug: want "\"\\uFFFF\"" */
+            "\"\\uFFFD\"",
         },
         /* 5.2  Paired UTF-16 surrogates */
         {
             /* \U+D800\U+DC00 */
             "\"\xED\xA0\x80\xED\xB0\x80\"",
             "\xED\xA0\x80\xED\xB0\x80", /* bug: not corrected */
-            "\"\\uD800\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+D800\U+DFFF */
             "\"\xED\xA0\x80\xED\xBF\xBF\"",
             "\xED\xA0\x80\xED\xBF\xBF", /* bug: not corrected */
-            "\"\\uD800\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+DB7F\U+DC00 */
             "\"\xED\xAD\xBF\xED\xB0\x80\"",
             "\xED\xAD\xBF\xED\xB0\x80", /* bug: not corrected */
-            "\"\\uDB7F\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+DB7F\U+DFFF */
             "\"\xED\xAD\xBF\xED\xBF\xBF\"",
             "\xED\xAD\xBF\xED\xBF\xBF", /* bug: not corrected */
-            "\"\\uDB7F\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+DB80\U+DC00 */
             "\"\xED\xAE\x80\xED\xB0\x80\"",
             "\xED\xAE\x80\xED\xB0\x80", /* bug: not corrected */
-            "\"\\uDB80\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+DB80\U+DFFF */
             "\"\xED\xAE\x80\xED\xBF\xBF\"",
             "\xED\xAE\x80\xED\xBF\xBF", /* bug: not corrected */
-            "\"\\uDB80\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+DBFF\U+DC00 */
             "\"\xED\xAF\xBF\xED\xB0\x80\"",
             "\xED\xAF\xBF\xED\xB0\x80", /* bug: not corrected */
-            "\"\\uDBFF\\uDC00\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         {
             /* \U+DBFF\U+DFFF */
             "\"\xED\xAF\xBF\xED\xBF\xBF\"",
             "\xED\xAF\xBF\xED\xBF\xBF", /* bug: not corrected */
-            "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
+            "\"\\uFFFD\\uFFFD\"",
         },
         /* 5.3  Other illegal code positions */
         /* BMP noncharacters */
@@ -736,25 +725,25 @@ static void utf8_string(void)
             /* \U+FFFE */
             "\"\xEF\xBF\xBE\"",
             "\xEF\xBF\xBE",     /* bug: not corrected */
-            "\"\\uFFFE\"",      /* bug: not corrected */
+            "\"\\uFFFD\"",
         },
         {
             /* \U+FFFF */
             "\"\xEF\xBF\xBF\"",
             "\xEF\xBF\xBF",     /* bug: not corrected */
-            "\"\\uFFFF\"",
+            "\"\\uFFFD\"",
         },
         {
             /* U+FDD0 */
             "\"\xEF\xB7\x90\"",
             "\xEF\xB7\x90",     /* bug: not corrected */
-            "\"\\uFDD0\"",      /* bug: not corrected */
+            "\"\\uFFFD\"",
         },
         {
             /* U+FDEF */
             "\"\xEF\xB7\xAF\"",
             "\xEF\xB7\xAF",     /* bug: not corrected */
-            "\"\\uFDEF\"",      /* bug: not corrected */
+            "\"\\uFFFD\"",
         },
         /* Plane 1 .. 16 noncharacters */
         {
@@ -792,15 +781,10 @@ static void utf8_string(void)
             "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
             "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
             "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
-            /* bug: not corrected */
-            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
-            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
-            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
-            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
-            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
-            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
-            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
-            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
+            "\"\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD"
+            "\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\\uFFFD\"",
         },
         {}
     };
@@ -838,8 +822,8 @@ static void utf8_string(void)
         qobject_decref(obj);
 
         /*
-         * Disabled, because json_out currently contains the crap
-         * qobject_to_json() produces.
+         * Disabled, because qobject_from_json() is buggy, and I can't
+         * be bothered to add the expected incorrect results.
          * FIXME Enable once these bugs have been fixed.
          */
         if (0 && json_out != json_in) {
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
  2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
                   ` (3 preceding siblings ...)
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
@ 2013-03-17 19:55 ` Blue Swirl
  2013-03-18  9:58   ` Markus Armbruster
  2013-03-23 14:44 ` Blue Swirl
  5 siblings, 1 reply; 20+ messages in thread
From: Blue Swirl @ 2013-03-17 19:55 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: aliguori, qemu-devel

On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.

With the patches applied there are no errors, thanks.
Tested-by: Blue Swirl <blauwirbel@gmail.com>

Though test-coroutine seems to hang, maybe fallout from recent
coroutine changes.

>
> The JSON parser is still as broken as ever.  Left for another day.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Fix up a few bogus comments
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++----------
>  tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++---------------------
>  util/Makefile.objs    |   1 +
>  util/unicode.c        |  96 +++++++++++++++++
>  5 files changed, 306 insertions(+), 176 deletions(-)
>  create mode 100644 util/unicode.c
>
> --
> 1.7.11.7
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
  2013-03-17 19:55 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Blue Swirl
@ 2013-03-18  9:58   ` Markus Armbruster
  0 siblings, 0 replies; 20+ messages in thread
From: Markus Armbruster @ 2013-03-18  9:58 UTC (permalink / raw)
  To: Blue Swirl; +Cc: aliguori, qemu-devel

Blue Swirl <blauwirbel@gmail.com> writes:

> On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> This should unbreak "make check" on machines where char is unsigned.
>> Blue, please give it a whirl.
>
> With the patches applied there are no errors, thanks.
> Tested-by: Blue Swirl <blauwirbel@gmail.com>

Thanks!

> Though test-coroutine seems to hang, maybe fallout from recent
> coroutine changes.

I've seen rtc-test hang intermittently; haven't gotten around to digging
for roots.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
@ 2013-03-21 19:37   ` Laszlo Ersek
  2013-03-22  9:23     ` Markus Armbruster
  0 siblings, 1 reply; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-21 19:37 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

I've (re-)read the UTF-8 article in wikipedia now... comments below

On 03/14/13 18:49, Markus Armbruster wrote:

> diff --git a/util/unicode.c b/util/unicode.c
> new file mode 100644
> index 0000000..954bcf7

> +/**
> + * mod_utf8_codepoint:
> + * @s: string encoded in modified UTF-8
> + * @n: maximum number of bytes to read from @s, if less than 6
> + * @end: set to end of sequence on return
> + *
> + * Convert the modified UTF-8 sequence at the start of @s.  Modified
> + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + *
> + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
> + * byte, the sequence is invalid, and @end is set to @s + 1
> + *
> + * Else, the first byte determines how many continuation bytes are
> + * expected.  If there are fewer, the sequence is invalid, and @end is
> + * set to @s + 1 + actual number of continuation bytes.  Else, the
> + * sequence is well-formed, and @end is set to @s + 1 + expected
> + * number of continuation bytes.

The wording also covers (number of cont. bytes == 0), OK.

... One point that wasn't clear to me until I read the code too is that
"there are fewer" covers both "out of bytes" and "byte available, but
it's not a cont. byte". Subtle :)

The way "*end" is set ensures progress.

> + *
> + * A well-formed sequence is valid unless it encodes a code point
> + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
> + * noncharacters, a surrogate code point, or is overlong.  Except the
> + * overlong sequence "\xC0\x80" is valid.
> + *
> + * Conversion succeeds if and only if the sequence is valid.
> + *
> + * Returns: the Unicode code point on success, -1 on failure.
> + */
> +int mod_utf8_codepoint(const char *s, size_t n, char **end)

The type of "end" follows the strtol() prototype. I guess I'd prefer
"const char **end", and "unsigned char" all around actually. (The input
is binary garbage.) Anyway this is irrelevant. ("const char **end" would
be probably quite inconvenient for the caller,
<http://www.c-faq.com/ansi/constmismatch.html>.)

> +{
> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
> +    const unsigned char *p;
> +    unsigned byte, mask, len, i;
> +    int cp;
> +
> +    if (n == 0) {
> +        *end = (char *)s;
> +        return 0;
> +    }

This is a special case (we return the code point U+0000 after looking at
zero bytes); we can probably expect the caller to know about n==0.

> +
> +    p = (const unsigned char *)s;
> +    byte = *p++;

We have n>=1 bytes here, and pointing one past the last one (in case
n==1) is allowed.

> +    if (byte < 0x80) {
> +        cp = byte;              /* one byte sequence */

So, since this is modified UTF-8, the U+0000 code point is represented
with the overlong "\xC0\x80" sequence, and a bare '\0' can never be part
of the Modified UTF-8 byte stream. (According to wp, '\0' is used in C
and similar to zero-terminate the string, but I think that's outside the
scope of the wire format.)

If we find a '\0' here, we report it as code point U+0000 instead of
rejecting it. One could maybe argue that it's a violation of the
interface contract (namely, due to '\0' being at offset #0, the caller
should have passed in n==0), but I assume from the description that the
caller need not have any idea about the contents, knowing the size
should be enough.

IOW I'm not sure about the intended use of this function, but if the
caller is allowed to throw any binary data at it (with correctly
communicated size of course), then we can misreport U+0000 here. (*)

> +    } else if (byte >= 0xFE) {
> +        cp = -1;                /* impossible bytes 0xFE, 0xFF */

OK, binary 1111111x is as first byte misformed.

> +    } else if ((byte & 0x40) == 0) {
> +        cp = -1;                /* unexpected continuation byte */

We know here that byte >= 0x80, and continuation bytes look like
10xxxxxx, OK.

> +    } else {
> +        /* multi-byte sequence */

We have one of:

110xxxxx
1110xxxx
11110xxx
111110xx
1111110x

> +        len = 0;
> +        for (mask = 0x80; byte & mask; mask >>= 1) {
> +            len++;
> +        }
> +        assert(len > 1 && len < 7);

OK, [2..6].

(Maybe use g_assert()? :))

> +        cp = byte & (mask - 1);

OK, the only bit set in "mask" matches the leftmost clear bit in "byte".
(mask-1) selects the xxxx bits in "byte".

> +        for (i = 1; i < len; i++) {

Runs at least once, and as many times as we need continuation bytes. OK.

> +            byte = i < n ? *p : 0;

"p" is valid to evaluate, "*p" may not be, so we check first. OK.

> +            if ((byte & 0xC0) != 0x80) {
> +                cp = -1;        /* continuation byte missing */
> +                goto out;
> +            }

Right, if a byte is available, it must look 10xxxxxx (binary). If we're
out of bytes, we also take this branch. *end will be left one past the
actual cont. bytes.

> +            p++;
> +            cp <<= 6;
> +            cp |= byte & 0x3F;
> +        }

We consume the cont. byte: p++ is valid to evaluate, and we shift in the
low six bits from the cont byte into the codepoint.

The loop runs at most 5 times (for len==6), shifting in (len-1)*6 bits
at most, in addition to the initial cp assignment. The initial cp
assignment clusters the masked-in bits at the LSB end:

    mask == (0x80 >> len) - 1

hence the number of masked-in bits in the original cp assignment is
7-len (which difference falls into [1..5]). So the total number of bits
shifted into "cp" is

    7-len + (len-1)*6 == 1 + 5 * len

Since len <= 6, the above is <= 31, which is perfect for our "int"
(int32_t in practice).

> +        if (cp > 0x10FFFF) {
> +            cp = -1;            /* beyond Unicode range */

OK.

> +        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
> +                   || (cp & 0xFFFE) == 0xFFFE) {
> +            cp = -1;            /* noncharacter */

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

Interesting. OK.

> +        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
> +            cp = -1;            /* surrogate code point */

OK.

> +        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
> +            cp = -1;            /* overlong, not \xC0\x80 */
> +        }
> +    }
> +
> +out:
> +    *end = (char *)p;
> +    return cp;
> +}
> 

I think if we want to adhere to Modified UTF-8 strictly, then (*) should
be fixed. If not, I can give my Rb; this is nice code.

Laszlo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments Markus Armbruster
@ 2013-03-21 20:06   ` Laszlo Ersek
  2013-03-22 13:27     ` Markus Armbruster
  0 siblings, 1 reply; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-21 20:06 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

I don't understand what's going on here.

On 03/14/13 18:49, Markus Armbruster wrote:
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>  tests/check-qjson.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
> index ec85a0c..852124a 100644
> --- a/tests/check-qjson.c
> +++ b/tests/check-qjson.c
> @@ -4,7 +4,7 @@
>   *
>   * Authors:
>   *  Anthony Liguori   <aliguori@us.ibm.com>
> - *  Markus Armbruster <armbru@redhat.com>,
> + *  Markus Armbruster <armbru@redhat.com>
>   *
>   * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>   * See the COPYING.LIB file in the top-level directory.
> @@ -462,8 +462,7 @@ static void utf8_string(void)
>          },
>          /* 3.3.4  5-byte sequence with last byte missing (U+0000) */
>          {
> -            /* invalid */
> -            "\"\xF8\x80\x80\x80\"", /* bug: not corrected */
> +            "\"\xF8\x80\x80\x80\"",

In this test case, we use an invalid UTF-8 sequence in a JSON string
literal (json_in). So "/* invalid */" could be justified; perhaps it's
just too laconic.

The "/* bug: not corrected */" comment seems indeed wrong, "json_in" is
*input*, there's nothing to correct on it.

>              NULL,                   /* bug: rejected */

When the JSON parser rejects the invalid sequence, it's actually
correct. So why the "bug" comment? Are we expecting (according to the
comment in utf8_string()) U+FFFD?

>              "\"\\u8000\\uFFFF\"",   /* bug: want "\"\\uFFFF\"" */
>              "\xF8\x80\x80\x80",

In this test we're trying to format a UTF-8 byte sequence (utf8_in) as a
JSON string. The source is invalid. The JSON formatter should either
reject it, or emit an U+FFFD in its place. The actual JSON output is
probably wrong, hence the "bug" part is OK, but I thought what we
expected is not U+FFFF but U+FFFD. Hence that part of the comment is
wrong. ... Hm, OK the leading comment has some notes on this as well.

So what your patch does here is:
- remove the halfway OK comment "/* invalid */" -- I think it wasn't
really wrong, but I won't miss it,
- removes an in fact bogus comment,
- (removes a runaway comma).

My eyes are bleeding.

Reviewed-by: Laszlo Ersek <lersek@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster
@ 2013-03-21 20:22   ` Laszlo Ersek
  2013-03-22 14:37     ` Markus Armbruster
  0 siblings, 1 reply; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-21 20:22 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

On 03/14/13 18:49, Markus Armbruster wrote:
> These are all broken, too.

What are "these"? And how are they broken? And how does the patch fix them?

> 
> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
> noncharacters some more instead of what they're supposed to test.  Use
> U+FFFD and U+10FFFD instead.
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>  tests/check-qjson.c | 85 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 13 deletions(-)

I'm confused about the commit message. There are three paragraphs in it
(the title, the first paragraph, and the 2nd paragraph). This patch
modifies different tests:

> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
> index 852124a..efec1b2 100644
> --- a/tests/check-qjson.c
> +++ b/tests/check-qjson.c
> @@ -158,7 +158,7 @@ static void utf8_string(void)
>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>       * UTF-8").
>       *
> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>       * capability and stress test at
>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>       */
> @@ -256,11 +256,11 @@ static void utf8_string(void)
>              "\xDF\xBF",
>              "\"\\u07FF\"",
>          },
> -        /* 2.2.3  3 bytes U+FFFF */
> +        /* 2.2.3  3 bytes U+FFFD */
>          {
> -            "\"\xEF\xBF\xBF\"",
> -            "\xEF\xBF\xBF",
> -            "\"\\uFFFF\"",
> +            "\"\xEF\xBF\xBD\"",
> +            "\xEF\xBF\xBD",
> +            "\"\\uFFFD\"",
>          },

This is under "2.2  Last possible sequence of a certain length". I guess
this is where you say "last possible sequence of a certain length,
encoding a character (= non-noncharacter)". OK, p#2.


>          /* 2.2.4  4 bytes U+1FFFFF */
>          {
> @@ -303,10 +303,10 @@ static void utf8_string(void)
>              "\"\\uFFFD\"",
>          },
>          {
> -            /* U+10FFFF */
> -            "\"\xF4\x8F\xBF\xBF\"",
> -            "\xF4\x8F\xBF\xBF",
> -            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
> +            /* U+10FFFD */
> +            "\"\xF4\x8F\xBF\xBD\"",
> +            "\xF4\x8F\xBF\xBD",
> +            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
>          },
>          {
>              /* U+110000 */

Under "2.3  Other boundary conditions". Not a non-character any longer,
but also not a boundary condition. At least not the original one. Still
covered by the ...FFFD part of the commit message, p#2.


> @@ -584,9 +584,9 @@ static void utf8_string(void)
>              "\"\\u07FF\"",
>          },
>          {
> -            /* \U+FFFF */
> -            "\"\xF0\x8F\xBF\xBF\"",
> -            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
> +            /* \U+FFFD */
> +            "\"\xF0\x8F\xBF\xBD\"",
> +            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
>              "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>          },
>          {

Under "4.2  Maximum overlong sequences". What does that even mean? "In
some sense maximum codepoints, all represented as overlong sequences"? P#2.

> @@ -731,6 +731,7 @@ static void utf8_string(void)
>              "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
>          },
>          /* 5.3  Other illegal code positions */
> +        /* BMP noncharacters */
>          {
>              /* \U+FFFE */
>              "\"\xEF\xBF\xBE\"",
> @@ -741,7 +742,65 @@ static void utf8_string(void)
>              /* \U+FFFF */
>              "\"\xEF\xBF\xBF\"",
>              "\xEF\xBF\xBF",     /* bug: not corrected */
> -            "\"\\uFFFF\"",      /* bug: not corrected */
> +            "\"\\uFFFF\"",
> +        },
> +        {
> +            /* U+FDD0 */
> +            "\"\xEF\xB7\x90\"",
> +            "\xEF\xB7\x90",     /* bug: not corrected */
> +            "\"\\uFDD0\"",      /* bug: not corrected */
> +        },
> +        {
> +            /* U+FDEF */
> +            "\"\xEF\xB7\xAF\"",
> +            "\xEF\xB7\xAF",     /* bug: not corrected */
> +            "\"\\uFDEF\"",      /* bug: not corrected */
> +        },
> +        /* Plane 1 .. 16 noncharacters */
> +        {
> +            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
> +            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
> +            /* bug: not corrected */
> +            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
> +            /* bug: not corrected */
> +            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
> +            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
> +            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
> +            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
> +            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
> +            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
> +            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
> +            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
>          },
>          {}
>      };
> 

This is probably p#0 (the title).

Ah. Have you removed the noncharacters from the other tests, but made up
for them at the end with new noncharacter tests?

Reviewed-by: Laszlo Ersek <lersek@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
@ 2013-03-21 20:44   ` Laszlo Ersek
  2013-03-22 13:15   ` Laszlo Ersek
  1 sibling, 0 replies; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-21 20:44 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

On 03/14/13 18:49, Markus Armbruster wrote:
> Known bugs in to_json():

> My rewrite fixes them as follows:

I'll try to review this sometime later. Patch review doesn't scale *at
all*. I've spent hours on the first 3 patches. You should just be given
pull req rights. I'd need 36 hour days.

/me throws his hands up

Laszlo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()
  2013-03-21 19:37   ` Laszlo Ersek
@ 2013-03-22  9:23     ` Markus Armbruster
  2013-03-22 11:46       ` Laszlo Ersek
  0 siblings, 1 reply; 20+ messages in thread
From: Markus Armbruster @ 2013-03-22  9:23 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: blauwirbel, aliguori, qemu-devel

Laszlo Ersek <lersek@redhat.com> writes:

> I've (re-)read the UTF-8 article in wikipedia now... comments below
>
> On 03/14/13 18:49, Markus Armbruster wrote:
>
>> diff --git a/util/unicode.c b/util/unicode.c
>> new file mode 100644
>> index 0000000..954bcf7
>
>> +/**
>> + * mod_utf8_codepoint:
>> + * @s: string encoded in modified UTF-8
>> + * @n: maximum number of bytes to read from @s, if less than 6
>> + * @end: set to end of sequence on return
>> + *
>> + * Convert the modified UTF-8 sequence at the start of @s.  Modified
>> + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
>> + * "\xC0\x80".
>> + *
>> + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
>> + * byte, the sequence is invalid, and @end is set to @s + 1
>> + *
>> + * Else, the first byte determines how many continuation bytes are
>> + * expected.  If there are fewer, the sequence is invalid, and @end is
>> + * set to @s + 1 + actual number of continuation bytes.  Else, the
>> + * sequence is well-formed, and @end is set to @s + 1 + expected
>> + * number of continuation bytes.
>
> The wording also covers (number of cont. bytes == 0), OK.
>
> ... One point that wasn't clear to me until I read the code too is that
> "there are fewer" covers both "out of bytes" and "byte available, but
> it's not a cont. byte". Subtle :)

Guilty as charged :)

> The way "*end" is set ensures progress.

Yes.

>> + *
>> + * A well-formed sequence is valid unless it encodes a code point
>> + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
>> + * noncharacters, a surrogate code point, or is overlong.  Except the
>> + * overlong sequence "\xC0\x80" is valid.
>> + *
>> + * Conversion succeeds if and only if the sequence is valid.
>> + *
>> + * Returns: the Unicode code point on success, -1 on failure.
>> + */
>> +int mod_utf8_codepoint(const char *s, size_t n, char **end)
>
> The type of "end" follows the strtol() prototype. I guess I'd prefer
> "const char **end", and "unsigned char" all around actually. (The input
> is binary garbage.) Anyway this is irrelevant. ("const char **end" would
> be probably quite inconvenient for the caller,
> <http://www.c-faq.com/ansi/constmismatch.html>.)

C's type system is far too inexpressive to do const properly.  Best we
could do within the constraints, I guess.  Results in tension between
"declare unchanging things const" (to help compilers as well as human
readers) and "avoid type casts" (because they defeat the type checker
and reduce legibility).

I prefer to err on the side of avoiding casts.  In particular, I try to
make my interfaces so that they can be used without casts as much as
possible.

Thus, I made the first argument const, but not the last.  I guess
similar thinking prevailed when the library part of C was constified;
see strtol() precedence you quoted.

>> +{
>> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
>> +    const unsigned char *p;
>> +    unsigned byte, mask, len, i;
>> +    int cp;
>> +
>> +    if (n == 0) {
>> +        *end = (char *)s;
>> +        return 0;
>> +    }
>
> This is a special case (we return the code point U+0000 after looking at
> zero bytes); we can probably expect the caller to know about n==0.

We could make it an error instead.  What's your gut feeling?

>> +
>> +    p = (const unsigned char *)s;
>> +    byte = *p++;
>
> We have n>=1 bytes here, and pointing one past the last one (in case
> n==1) is allowed.
>
>> +    if (byte < 0x80) {
>> +        cp = byte;              /* one byte sequence */
>
> So, since this is modified UTF-8, the U+0000 code point is represented
> with the overlong "\xC0\x80" sequence, and a bare '\0' can never be part
> of the Modified UTF-8 byte stream. (According to wp, '\0' is used in C
> and similar to zero-terminate the string, but I think that's outside the
> scope of the wire format.)
>
> If we find a '\0' here, we report it as code point U+0000 instead of
> rejecting it. One could maybe argue that it's a violation of the
> interface contract (namely, due to '\0' being at offset #0, the caller
> should have passed in n==0), but I assume from the description that the
> caller need not have any idea about the contents, knowing the size
> should be enough.
>
> IOW I'm not sure about the intended use of this function, but if the
> caller is allowed to throw any binary data at it (with correctly
> communicated size of course), then we can misreport U+0000 here. (*)

Good point.

'\0' should be treated as string terminator, exactly like reaching @n.

Let's check what my current code does:

    n == 0:
        set @end to @s, and return 0

    n > 0 && *s == 0:
        set @end to @s + 1, and return 0

Conclusion: you caught a bug.  Thanks, will fix!

>> +    } else if (byte >= 0xFE) {
>> +        cp = -1;                /* impossible bytes 0xFE, 0xFF */
>
> OK, binary 1111111x is as first byte misformed.

Correct.

>> +    } else if ((byte & 0x40) == 0) {
>> +        cp = -1;                /* unexpected continuation byte */
>
> We know here that byte >= 0x80, and continuation bytes look like
> 10xxxxxx, OK.

Correct.

>> +    } else {
>> +        /* multi-byte sequence */
>
> We have one of:
>
> 110xxxxx
> 1110xxxx
> 11110xxx
> 111110xx
> 1111110x

Correct.

>> +        len = 0;
>> +        for (mask = 0x80; byte & mask; mask >>= 1) {
>> +            len++;
>> +        }
>> +        assert(len > 1 && len < 7);
>
> OK, [2..6].
>
> (Maybe use g_assert()? :))

Now you're teasing me!

>> +        cp = byte & (mask - 1);
>
> OK, the only bit set in "mask" matches the leftmost clear bit in "byte".
> (mask-1) selects the xxxx bits in "byte".

Correct.

>> +        for (i = 1; i < len; i++) {
>
> Runs at least once, and as many times as we need continuation bytes. OK.

Correct.

>> +            byte = i < n ? *p : 0;
>
> "p" is valid to evaluate, "*p" may not be, so we check first. OK.
>
>> +            if ((byte & 0xC0) != 0x80) {
>> +                cp = -1;        /* continuation byte missing */
>> +                goto out;
>> +            }
>
> Right, if a byte is available, it must look 10xxxxxx (binary). If we're
> out of bytes, we also take this branch. *end will be left one past the
> actual cont. bytes.

Correct.

>> +            p++;
>> +            cp <<= 6;
>> +            cp |= byte & 0x3F;
>> +        }
>
> We consume the cont. byte: p++ is valid to evaluate, and we shift in the
> low six bits from the cont byte into the codepoint.
>
> The loop runs at most 5 times (for len==6), shifting in (len-1)*6 bits
> at most, in addition to the initial cp assignment. The initial cp
> assignment clusters the masked-in bits at the LSB end:
>
>     mask == (0x80 >> len) - 1
>
> hence the number of masked-in bits in the original cp assignment is
> 7-len (which difference falls into [1..5]). So the total number of bits
> shifted into "cp" is
>
>     7-len + (len-1)*6 == 1 + 5 * len
>
> Since len <= 6, the above is <= 31, which is perfect for our "int"
> (int32_t in practice).

Exactly!

>> +        if (cp > 0x10FFFF) {
>> +            cp = -1;            /* beyond Unicode range */
>
> OK.
>
>> +        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
>> +                   || (cp & 0xFFFE) == 0xFFFE) {
>> +            cp = -1;            /* noncharacter */
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
>
> Interesting. OK.
>
>> +        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
>> +            cp = -1;            /* surrogate code point */
>
> OK.
>
>> +        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
>> +            cp = -1;            /* overlong, not \xC0\x80 */
>> +        }
>> +    }
>> +
>> +out:
>> +    *end = (char *)p;
>> +    return cp;
>> +}
>> 
>
> I think if we want to adhere to Modified UTF-8 strictly, then (*) should
> be fixed. If not, I can give my Rb; this is nice code.

I think it needs fixing.

Thanks for the thorough review!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()
  2013-03-22  9:23     ` Markus Armbruster
@ 2013-03-22 11:46       ` Laszlo Ersek
  0 siblings, 0 replies; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-22 11:46 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

On 03/22/13 10:23, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
>> On 03/14/13 18:49, Markus Armbruster wrote:

>>> +{
>>> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
>>> +    const unsigned char *p;
>>> +    unsigned byte, mask, len, i;
>>> +    int cp;
>>> +
>>> +    if (n == 0) {
>>> +        *end = (char *)s;
>>> +        return 0;
>>> +    }
>>
>> This is a special case (we return the code point U+0000 after looking at
>> zero bytes); we can probably expect the caller to know about n==0.
> 
> We could make it an error instead.  What's your gut feeling?

(If the question still stands -- maybe it doesn't any more, considering
future handling of '\0':) I guess this function would be called in a
loop, with increasing "s" and decreasing "n" values. "end" can only be
checked after the first call. If you write a loop that checks "end" in
the controlling expression, then accepting n==0 without error is useful.
If you write a loop that checks "n" in the controlling expression, then
refusing n==0 is OK. I'd probably write the latter kind of loop (I like
pre-testing more), but I can't say in general :)

Laszlo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
  2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
  2013-03-21 20:44   ` Laszlo Ersek
@ 2013-03-22 13:15   ` Laszlo Ersek
  2013-03-22 14:51     ` Markus Armbruster
  1 sibling, 1 reply; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-22 13:15 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

comments below

On 03/14/13 18:49, Markus Armbruster wrote:

> diff --git a/qobject/qjson.c b/qobject/qjson.c
> index 83a6b4f..19085a1 100644
> --- a/qobject/qjson.c
> +++ b/qobject/qjson.c
> @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
>      case QTYPE_QSTRING: {
>          QString *val = qobject_to_qstring(obj);
>          const char *ptr;
> +        int cp;
> +        char buf[16];
> +        char *end;
>  
>          ptr = qstring_get_str(val);
>          qstring_append(str, "\"");
> -        while (*ptr) {
> -            if ((ptr[0] & 0xE0) == 0xE0 &&
> -                (ptr[1] & 0x80) && (ptr[2] & 0x80)) {
> -                uint16_t wchar;
> -                char escape[7];
> -
> -                wchar  = (ptr[0] & 0x0F) << 12;
> -                wchar |= (ptr[1] & 0x3F) << 6;
> -                wchar |= (ptr[2] & 0x3F);
> -                ptr += 2;
> -
> -                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
> -                qstring_append(str, escape);
> -            } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) {
> -                uint16_t wchar;
> -                char escape[7];
> -
> -                wchar  = (ptr[0] & 0x1F) << 6;
> -                wchar |= (ptr[1] & 0x3F);
> -                ptr++;
> -
> -                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
> -                qstring_append(str, escape);
> -            } else switch (ptr[0]) {
> -                case '\"':
> -                    qstring_append(str, "\\\"");
> -                    break;
> -                case '\\':
> -                    qstring_append(str, "\\\\");
> -                    break;
> -                case '\b':
> -                    qstring_append(str, "\\b");
> -                    break;
> -                case '\f':
> -                    qstring_append(str, "\\f");
> -                    break;
> -                case '\n':
> -                    qstring_append(str, "\\n");
> -                    break;
> -                case '\r':
> -                    qstring_append(str, "\\r");
> -                    break;
> -                case '\t':
> -                    qstring_append(str, "\\t");
> -                    break;
> -                default: {
> -                    if (ptr[0] <= 0x1F) {
> -                        char escape[7];
> -                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
> -                        qstring_append(str, escape);
> -                    } else {
> -                        char buf[2] = { ptr[0], 0 };
> -                        qstring_append(str, buf);
> -                    }
> -                    break;
> +
> +        for (; *ptr; ptr = end) {
> +            cp = mod_utf8_codepoint(ptr, 6, &end);

This provides more background: you never call mod_utf8_codepoint() with
'\0' at offset 0. So handling that in mod_utf8_codepoint() may not be
that important.

If a '\0' is found at offset >= 1, it will correctly trigger the /*
continuation byte missing */ branch in mod_utf8_codepoint(). The retval
is -1, and *end is left pointing to the NUL byte. (This is consistent
with mod_utf8_codepoint()'s docs.)

The -1 (incomplete sequence) produces the replacement character below,
and the next time around *ptr is '\0', so we finish the loop. Seems OK.

(
An alternative interface for mod_utf8_codepoint() might be something like:

    size_t alternative(const char *ptr, int *cp, size_t n);

Resembling read() somewhat:
- the return value would be the number of bytes consumed (it can't be
negative (= fatal error), because we guarantee progress). 0 is EOF and
only possible when "n" is 0.
- "ptr" is the source,
- "cp" is the output code point, -1 if invalid,
- "n" is the bytes available in the source / requested to process at most.

Encountering a \0 in the byte stream would be an error (*cp = -1), but
would not terminate parsing per se.

Then the loop would look like:

    processed = 0;
    while (processed < full) {
      int cp;

      rd = alternative(ptr + processed, &cp, full - processed);
      g_assert(rd > 0);

      /* look at cp */

      processed += rd;
  }

But of course I'm not suggesting to rewrite the function!
)

> +            switch (cp) {
> +            case '\"':
> +                qstring_append(str, "\\\"");
> +                break;
> +            case '\\':
> +                qstring_append(str, "\\\\");
> +                break;
> +            case '\b':
> +                qstring_append(str, "\\b");
> +                break;
> +            case '\f':
> +                qstring_append(str, "\\f");
> +                break;
> +            case '\n':
> +                qstring_append(str, "\\n");
> +                break;
> +            case '\r':
> +                qstring_append(str, "\\r");
> +                break;
> +            case '\t':
> +                qstring_append(str, "\\t");
> +                break;

The C standard also names \a (alert) and \v (vertical tab); I'm not sure
about their JSON notation. (The (cp < 0x20) condition catches them below
of course.)

> +            default:
> +                if (cp < 0) {
> +                    cp = 0xFFFD; /* replacement character */
>                  }
> +                if (cp > 0xFFFF) {
> +                    /* beyond BMP; need a surrogate pair */
> +                    snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
> +                             0xD800 + ((cp - 0x10000) >> 10),
> +                             0xDC00 + ((cp - 0x10000) & 0x3FF));

Seems like we write 13 bytes into buf, OK. Also cp is never greater than
0x10FFFF, hence the difference is at most 0xFFFFF. The RHS surrogate
half can go up to 0xDFFF, the LHS up to 0xD800+0x3FF == 0xDBFF. Good.


> +                } else if (cp < 0x20 || cp >= 0x7F) {
> +                    snprintf(buf, sizeof(buf), "\\u%04X", cp);
> +                } else {
> +                    buf[0] = cp;
> +                    buf[1] = 0;
>                  }
> -            ptr++;
> -        }
> +                qstring_append(str, buf);
> +            }
> +        };
> +
>          qstring_append(str, "\"");
>          break;
>      }

Seems OK.


> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
> index efec1b2..595ddc0 100644
> --- a/tests/check-qjson.c
> +++ b/tests/check-qjson.c

I'll trust you on that one :)

Reviewed-by: Laszlo Ersek <lersek@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments
  2013-03-21 20:06   ` Laszlo Ersek
@ 2013-03-22 13:27     ` Markus Armbruster
  0 siblings, 0 replies; 20+ messages in thread
From: Markus Armbruster @ 2013-03-22 13:27 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: blauwirbel, aliguori, qemu-devel

Laszlo Ersek <lersek@redhat.com> writes:

> I don't understand what's going on here.
>
> On 03/14/13 18:49, Markus Armbruster wrote:
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>  tests/check-qjson.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>> 
>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>> index ec85a0c..852124a 100644
>> --- a/tests/check-qjson.c
>> +++ b/tests/check-qjson.c
>> @@ -4,7 +4,7 @@
>>   *
>>   * Authors:
>>   *  Anthony Liguori   <aliguori@us.ibm.com>
>> - *  Markus Armbruster <armbru@redhat.com>,
>> + *  Markus Armbruster <armbru@redhat.com>
>>   *
>>   * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>>   * See the COPYING.LIB file in the top-level directory.
>> @@ -462,8 +462,7 @@ static void utf8_string(void)
>>          },
>>          /* 3.3.4  5-byte sequence with last byte missing (U+0000) */
>>          {
>> -            /* invalid */
>> -            "\"\xF8\x80\x80\x80\"", /* bug: not corrected */
>> +            "\"\xF8\x80\x80\x80\"",
>
> In this test case, we use an invalid UTF-8 sequence in a JSON string
> literal (json_in). So "/* invalid */" could be justified; perhaps it's
> just too laconic.

There are many more invalid sequences in other test cases.  I have no
idea why I tacked /* invalid */ to this one.

For what it's worth, the comment right above should make it clear enough
that the sequence is invalid.

> The "/* bug: not corrected */" comment seems indeed wrong, "json_in" is
> *input*, there's nothing to correct on it.

Exactly.

>>              NULL,                   /* bug: rejected */
>
> When the JSON parser rejects the invalid sequence, it's actually
> correct. So why the "bug" comment? Are we expecting (according to the
> comment in utf8_string()) U+FFFD?

        /*
         * Bug markers used here:
         * - bug: not corrected
         *   JSON parser fails to correct invalid sequence(s)
         * - bug: rejected
         *   JSON parser rejects invalid sequence(s)
--->     *   We may choose to define this as feature
         * - bug: want "..."
         *   JSON parser produces incorrect result, this is the
         *   correct one, assuming replacement character U+FFFF
         *   We may choose to reject instead of replace
         */

The comments here take care not to pass judgement on what the JSON
parser should do for invalid sequences.  Obviously, it should either
consistently reject them, or consistently replace them with a suitable
replacement character.

If I understand Anthony correctly, his advice is to always reject.
That's what I intend to do should I actually get around to fixing the
parser.

>>              "\"\\u8000\\uFFFF\"",   /* bug: want "\"\\uFFFF\"" */
>>              "\xF8\x80\x80\x80",
>
> In this test we're trying to format a UTF-8 byte sequence (utf8_in) as a
> JSON string. The source is invalid. The JSON formatter should either
> reject it, or emit an U+FFFD in its place. The actual JSON output is
> probably wrong, hence the "bug" part is OK, but I thought what we
> expected is not U+FFFF but U+FFFD. Hence that part of the comment is
> wrong. ... Hm, OK the leading comment has some notes on this as well.

When I wrote this test case, I thought the JSON parser *intentionally*
replaced by U+FFFF.  Only later I learned that FFFF comes from
unintended sign extension %-}

> So what your patch does here is:
> - remove the halfway OK comment "/* invalid */" -- I think it wasn't
> really wrong, but I won't miss it,
> - removes an in fact bogus comment,
> - (removes a runaway comma).
>
> My eyes are bleeding.

/me hands over tissues

> Reviewed-by: Laszlo Ersek <lersek@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  2013-03-21 20:22   ` Laszlo Ersek
@ 2013-03-22 14:37     ` Markus Armbruster
  2013-03-22 14:52       ` Laszlo Ersek
  0 siblings, 1 reply; 20+ messages in thread
From: Markus Armbruster @ 2013-03-22 14:37 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: blauwirbel, aliguori, qemu-devel

Laszlo Ersek <lersek@redhat.com> writes:

> On 03/14/13 18:49, Markus Armbruster wrote:
>> These are all broken, too.
>
> What are "these"? And how are they broken? And how does the patch fix them?

"These" refers to the subject: noncharacters other than U+FFFE, U+FFFF.

I agree that I should better explain how they're broken, and what the
patch does to fix them.  Will fix on respin.

>> 
>> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
>> noncharacters some more instead of what they're supposed to test.  Use
>> U+FFFD and U+10FFFD instead.
>> 
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>  tests/check-qjson.c | 85
>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 72 insertions(+), 13 deletions(-)
>
> I'm confused about the commit message. There are three paragraphs in it
> (the title, the first paragraph, and the 2nd paragraph). This patch
> modifies different tests:
>
>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>> index 852124a..efec1b2 100644
>> --- a/tests/check-qjson.c
>> +++ b/tests/check-qjson.c
>> @@ -158,7 +158,7 @@ static void utf8_string(void)
>>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>       * UTF-8").
>>       *
>> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
>> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>>       * capability and stress test at
>>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>       */
>> @@ -256,11 +256,11 @@ static void utf8_string(void)
>>              "\xDF\xBF",
>>              "\"\\u07FF\"",
>>          },
>> -        /* 2.2.3  3 bytes U+FFFF */
>> +        /* 2.2.3  3 bytes U+FFFD */
>>          {
>> -            "\"\xEF\xBF\xBF\"",
>> -            "\xEF\xBF\xBF",
>> -            "\"\\uFFFF\"",
>> +            "\"\xEF\xBF\xBD\"",
>> +            "\xEF\xBF\xBD",
>> +            "\"\\uFFFD\"",
>>          },
>
> This is under "2.2  Last possible sequence of a certain length". I guess

Which is in turn under "2  Boundary condition test cases".

> this is where you say "last possible sequence of a certain length,
> encoding a character (= non-noncharacter)". OK, p#2.

Yes.

The test's purpose is testing the upper bound of 3-byte sequences is
decoded correctly.

The upper bound is U+FFFF.  Since that's a noncharacter, the parser
should reject it (or maybe replace), the formatter should replace it.
Trouble is it could be misdecoded and then rejected / replaced.

Besides, U+FFFF already gets tested along with the other noncharacters
under "5.3  Other illegal code positions".

Next in line is U+FFFE, also a noncharacter, also under 5.3.

Next in line is U+FFFD, which I picked.

But that gets tested under "2.3  Other boundary conditions"!  I guess I
either drop it there, or make this one U+FFFC.

I think testing U+FFFC here makes sense, because U+FFFD could be
misdecoded, then replaced by U+FFFD.

What do you think?

>>          /* 2.2.4  4 bytes U+1FFFFF */
>>          {
>> @@ -303,10 +303,10 @@ static void utf8_string(void)
>>              "\"\\uFFFD\"",
>>          },
>>          {
>> -            /* U+10FFFF */
>> -            "\"\xF4\x8F\xBF\xBF\"",
>> -            "\xF4\x8F\xBF\xBF",
>> -            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
>> +            /* U+10FFFD */
>> +            "\"\xF4\x8F\xBF\xBD\"",
>> +            "\xF4\x8F\xBF\xBD",
>> +            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
>>          },
>>          {
>>              /* U+110000 */
>
> Under "2.3  Other boundary conditions". Not a non-character any longer,
> but also not a boundary condition. At least not the original one. Still
> covered by the ...FFFD part of the commit message, p#2.

The test's purpose is testing the upper bound of the Unicode range gets
decoded correctly.

The upper bound is U+10FFFF.  Since that's a noncharacter... same
argument as above.

Next in line is U+10FFFE, also a noncharacter.

Next in line is U+10FFFD, which I picked.

>> @@ -584,9 +584,9 @@ static void utf8_string(void)
>>              "\"\\u07FF\"",
>>          },
>>          {
>> -            /* \U+FFFF */
>> -            "\"\xF0\x8F\xBF\xBF\"",
>> -            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
>> +            /* \U+FFFD */
>> +            "\"\xF0\x8F\xBF\xBD\"",
>> +            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
>>              "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>>          },
>>          {
>
> Under "4.2  Maximum overlong sequences". What does that even mean? "In
> some sense maximum codepoints, all represented as overlong sequences"? P#2.

The headings are all stolen from the same source as the test cases.
Perhaps I should steal more of the explanatory text there as well.

The one for 4.2 is:

    Below you see the highest Unicode value that is still resulting in
    an overlong sequence if represented with the given number of bytes.
    This is a boundary test for safe UTF-8 decoders.  All five
    characters should be rejected like malformed UTF-8 sequences.

>> @@ -731,6 +731,7 @@ static void utf8_string(void)
>>              "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
>>          },
>>          /* 5.3  Other illegal code positions */
>> +        /* BMP noncharacters */
>>          {
>>              /* \U+FFFE */
>>              "\"\xEF\xBF\xBE\"",
>> @@ -741,7 +742,65 @@ static void utf8_string(void)
>>              /* \U+FFFF */
>>              "\"\xEF\xBF\xBF\"",
>>              "\xEF\xBF\xBF",     /* bug: not corrected */
>> -            "\"\\uFFFF\"",      /* bug: not corrected */
>> +            "\"\\uFFFF\"",
>> +        },
>> +        {
>> +            /* U+FDD0 */
>> +            "\"\xEF\xB7\x90\"",
>> +            "\xEF\xB7\x90",     /* bug: not corrected */
>> +            "\"\\uFDD0\"",      /* bug: not corrected */
>> +        },
>> +        {
>> +            /* U+FDEF */
>> +            "\"\xEF\xB7\xAF\"",
>> +            "\xEF\xB7\xAF",     /* bug: not corrected */
>> +            "\"\\uFDEF\"",      /* bug: not corrected */
>> +        },
>> +        /* Plane 1 .. 16 noncharacters */
>> +        {
>> +            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
>> +            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
>> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
>> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
>> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
>> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
>> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
>> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
>> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
>> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
>> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
>> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
>> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
>> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
>> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
>> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
>> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
>> +            /* bug: not corrected */
>> +            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
>> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
>> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
>> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
>> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
>> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
>> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
>> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
>> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
>> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
>> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
>> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
>> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
>> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
>> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
>> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
>> +            /* bug: not corrected */
>> +            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
>> +            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
>> +            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
>> +            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
>> +            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
>> +            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
>> +            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
>> +            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
>>          },
>>          {}
>>      };
>> 
>
> This is probably p#0 (the title).
>
> Ah. Have you removed the noncharacters from the other tests, but made up
> for them at the end with new noncharacter tests?

I added tests to cover all 66 noncharacters.  Then I noticed some
duplicate test cases elsewhere, and realized that these don't really
fully test what I want to test there.

> Reviewed-by: Laszlo Ersek <lersek@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
  2013-03-22 13:15   ` Laszlo Ersek
@ 2013-03-22 14:51     ` Markus Armbruster
  0 siblings, 0 replies; 20+ messages in thread
From: Markus Armbruster @ 2013-03-22 14:51 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: blauwirbel, aliguori, qemu-devel

Laszlo Ersek <lersek@redhat.com> writes:

> comments below
>
> On 03/14/13 18:49, Markus Armbruster wrote:
>
>> diff --git a/qobject/qjson.c b/qobject/qjson.c
>> index 83a6b4f..19085a1 100644
>> --- a/qobject/qjson.c
>> +++ b/qobject/qjson.c
>> @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
>>      case QTYPE_QSTRING: {
>>          QString *val = qobject_to_qstring(obj);
>>          const char *ptr;
>> +        int cp;
>> +        char buf[16];
>> +        char *end;
>>  
>>          ptr = qstring_get_str(val);
>>          qstring_append(str, "\"");
>> -        while (*ptr) {
>> -            if ((ptr[0] & 0xE0) == 0xE0 &&
>> -                (ptr[1] & 0x80) && (ptr[2] & 0x80)) {
>> -                uint16_t wchar;
>> -                char escape[7];
>> -
>> -                wchar  = (ptr[0] & 0x0F) << 12;
>> -                wchar |= (ptr[1] & 0x3F) << 6;
>> -                wchar |= (ptr[2] & 0x3F);
>> -                ptr += 2;
>> -
>> -                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
>> -                qstring_append(str, escape);
>> -            } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) {
>> -                uint16_t wchar;
>> -                char escape[7];
>> -
>> -                wchar  = (ptr[0] & 0x1F) << 6;
>> -                wchar |= (ptr[1] & 0x3F);
>> -                ptr++;
>> -
>> -                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
>> -                qstring_append(str, escape);
>> -            } else switch (ptr[0]) {
>> -                case '\"':
>> -                    qstring_append(str, "\\\"");
>> -                    break;
>> -                case '\\':
>> -                    qstring_append(str, "\\\\");
>> -                    break;
>> -                case '\b':
>> -                    qstring_append(str, "\\b");
>> -                    break;
>> -                case '\f':
>> -                    qstring_append(str, "\\f");
>> -                    break;
>> -                case '\n':
>> -                    qstring_append(str, "\\n");
>> -                    break;
>> -                case '\r':
>> -                    qstring_append(str, "\\r");
>> -                    break;
>> -                case '\t':
>> -                    qstring_append(str, "\\t");
>> -                    break;
>> -                default: {
>> -                    if (ptr[0] <= 0x1F) {
>> -                        char escape[7];
>> -                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
>> -                        qstring_append(str, escape);
>> -                    } else {
>> -                        char buf[2] = { ptr[0], 0 };
>> -                        qstring_append(str, buf);
>> -                    }
>> -                    break;
>> +
>> +        for (; *ptr; ptr = end) {
>> +            cp = mod_utf8_codepoint(ptr, 6, &end);
>
> This provides more background: you never call mod_utf8_codepoint() with
> '\0' at offset 0. So handling that in mod_utf8_codepoint() may not be
> that important.

Yes, this caller doesn't care.  Doesn't mean we shouldn't try to come up
with a sane function contract.

Note the use of literal 6.  It means "unlimited".  Perfectly safe
because the string is nul-terminated.

> If a '\0' is found at offset >= 1, it will correctly trigger the /*
> continuation byte missing */ branch in mod_utf8_codepoint(). The retval
> is -1, and *end is left pointing to the NUL byte. (This is consistent
> with mod_utf8_codepoint()'s docs.)
>
> The -1 (incomplete sequence) produces the replacement character below,
> and the next time around *ptr is '\0', so we finish the loop. Seems OK.
>
> (
> An alternative interface for mod_utf8_codepoint() might be something like:
>
>     size_t alternative(const char *ptr, int *cp, size_t n);
>
> Resembling read() somewhat:
> - the return value would be the number of bytes consumed (it can't be
> negative (= fatal error), because we guarantee progress). 0 is EOF and
> only possible when "n" is 0.
> - "ptr" is the source,
> - "cp" is the output code point, -1 if invalid,
> - "n" is the bytes available in the source / requested to process at most.
>
> Encountering a \0 in the byte stream would be an error (*cp = -1), but
> would not terminate parsing per se.
>
> Then the loop would look like:
>
>     processed = 0;
>     while (processed < full) {
>       int cp;
>
>       rd = alternative(ptr + processed, &cp, full - processed);
>       g_assert(rd > 0);
>
>       /* look at cp */
>
>       processed += rd;
>   }
>
> But of course I'm not suggesting to rewrite the function!
> )

I'll keep this in mind when deciding how I want to handle '\0'.

>> +            switch (cp) {
>> +            case '\"':
>> +                qstring_append(str, "\\\"");
>> +                break;
>> +            case '\\':
>> +                qstring_append(str, "\\\\");
>> +                break;
>> +            case '\b':
>> +                qstring_append(str, "\\b");
>> +                break;
>> +            case '\f':
>> +                qstring_append(str, "\\f");
>> +                break;
>> +            case '\n':
>> +                qstring_append(str, "\\n");
>> +                break;
>> +            case '\r':
>> +                qstring_append(str, "\\r");
>> +                break;
>> +            case '\t':
>> +                qstring_append(str, "\\t");
>> +                break;
>
> The C standard also names \a (alert) and \v (vertical tab); I'm not sure
> about their JSON notation. (The (cp < 0x20) condition catches them below
> of course.)

JSON RFC 4627 defines only the seven above plus '\/'.  Escaping '/' that
way makes no sense for us, so the old code doesn't, and mine doesn't
either.

>> +            default:
>> +                if (cp < 0) {
>> +                    cp = 0xFFFD; /* replacement character */
>>                  }
>> +                if (cp > 0xFFFF) {
>> +                    /* beyond BMP; need a surrogate pair */
>> +                    snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
>> +                             0xD800 + ((cp - 0x10000) >> 10),
>> +                             0xDC00 + ((cp - 0x10000) & 0x3FF));
>
> Seems like we write 13 bytes into buf, OK. Also cp is never greater than
> 0x10FFFF, hence the difference is at most 0xFFFFF. The RHS surrogate
> half can go up to 0xDFFF, the LHS up to 0xD800+0x3FF == 0xDBFF. Good.

Exactly.

>> +                } else if (cp < 0x20 || cp >= 0x7F) {
>> +                    snprintf(buf, sizeof(buf), "\\u%04X", cp);
>> +                } else {
>> +                    buf[0] = cp;
>> +                    buf[1] = 0;
>>                  }
>> -            ptr++;
>> -        }
>> +                qstring_append(str, buf);
>> +            }
>> +        };
>> +
>>          qstring_append(str, "\"");
>>          break;
>>      }
>
> Seems OK.
>
>
>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>> index efec1b2..595ddc0 100644
>> --- a/tests/check-qjson.c
>> +++ b/tests/check-qjson.c
>
> I'll trust you on that one :)

Waah, you don't want another case of bleeding eyes?!?

> Reviewed-by: Laszlo Ersek <lersek@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
  2013-03-22 14:37     ` Markus Armbruster
@ 2013-03-22 14:52       ` Laszlo Ersek
  0 siblings, 0 replies; 20+ messages in thread
From: Laszlo Ersek @ 2013-03-22 14:52 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: blauwirbel, aliguori, qemu-devel

On 03/22/13 15:37, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> 
>> On 03/14/13 18:49, Markus Armbruster wrote:
>>> These are all broken, too.
>>
>> What are "these"? And how are they broken? And how does the patch fix them?
> 
> "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF.
> 
> I agree that I should better explain how they're broken, and what the
> patch does to fix them.  Will fix on respin.
> 
>>>
>>> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
>>> noncharacters some more instead of what they're supposed to test.  Use
>>> U+FFFD and U+10FFFD instead.
>>>
>>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>>> ---
>>>  tests/check-qjson.c | 85
>>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 72 insertions(+), 13 deletions(-)
>>
>> I'm confused about the commit message. There are three paragraphs in it
>> (the title, the first paragraph, and the 2nd paragraph). This patch
>> modifies different tests:
>>
>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>> index 852124a..efec1b2 100644
>>> --- a/tests/check-qjson.c
>>> +++ b/tests/check-qjson.c
>>> @@ -158,7 +158,7 @@ static void utf8_string(void)
>>>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>>       * UTF-8").
>>>       *
>>> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
>>> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>>>       * capability and stress test at
>>>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>       */
>>> @@ -256,11 +256,11 @@ static void utf8_string(void)
>>>              "\xDF\xBF",
>>>              "\"\\u07FF\"",
>>>          },
>>> -        /* 2.2.3  3 bytes U+FFFF */
>>> +        /* 2.2.3  3 bytes U+FFFD */
>>>          {
>>> -            "\"\xEF\xBF\xBF\"",
>>> -            "\xEF\xBF\xBF",
>>> -            "\"\\uFFFF\"",
>>> +            "\"\xEF\xBF\xBD\"",
>>> +            "\xEF\xBF\xBD",
>>> +            "\"\\uFFFD\"",
>>>          },
>>
>> This is under "2.2  Last possible sequence of a certain length". I guess
> 
> Which is in turn under "2  Boundary condition test cases".
> 
>> this is where you say "last possible sequence of a certain length,
>> encoding a character (= non-noncharacter)". OK, p#2.
> 
> Yes.
> 
> The test's purpose is testing the upper bound of 3-byte sequences is
> decoded correctly.
> 
> The upper bound is U+FFFF.  Since that's a noncharacter, the parser
> should reject it (or maybe replace), the formatter should replace it.
> Trouble is it could be misdecoded and then rejected / replaced.
> 
> Besides, U+FFFF already gets tested along with the other noncharacters
> under "5.3  Other illegal code positions".
> 
> Next in line is U+FFFE, also a noncharacter, also under 5.3.
> 
> Next in line is U+FFFD, which I picked.
> 
> But that gets tested under "2.3  Other boundary conditions"!  I guess I
> either drop it there, or make this one U+FFFC.
> 
> I think testing U+FFFC here makes sense, because U+FFFD could be
> misdecoded, then replaced by U+FFFD.
> 
> What do you think?

I think that we're extending Markus Kuhn's test suite, basically taking
random shots at where one specific parser's/formatter's weak spots might
be :)

That said, with intelligent fuzzing out of scope / capacity, U+FFFC
could be a good pick.

I also think I'm a quite a useless person to ask for thoughts in this
area :)

Thanks,
Laszlo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
  2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
                   ` (4 preceding siblings ...)
  2013-03-17 19:55 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Blue Swirl
@ 2013-03-23 14:44 ` Blue Swirl
  2013-04-11 16:12   ` Markus Armbruster
  5 siblings, 1 reply; 20+ messages in thread
From: Blue Swirl @ 2013-03-23 14:44 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: aliguori, qemu-devel

On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This should unbreak "make check" on machines where char is unsigned.
> Blue, please give it a whirl.

Patches no longer apply, please rebase.

>
> The JSON parser is still as broken as ever.  Left for another day.
>
> Markus Armbruster (4):
>   unicode: New mod_utf8_codepoint()
>   check-qjson: Fix up a few bogus comments
>   check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
>   qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
>
>  include/qemu-common.h |   3 +
>  qobject/qjson.c       | 102 ++++++++----------
>  tests/check-qjson.c   | 280 +++++++++++++++++++++++++++++---------------------
>  util/Makefile.objs    |   1 +
>  util/unicode.c        |  96 +++++++++++++++++
>  5 files changed, 306 insertions(+), 176 deletions(-)
>  create mode 100644 util/unicode.c
>
> --
> 1.7.11.7
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] Fix JSON string formatter
  2013-03-23 14:44 ` Blue Swirl
@ 2013-04-11 16:12   ` Markus Armbruster
  0 siblings, 0 replies; 20+ messages in thread
From: Markus Armbruster @ 2013-04-11 16:12 UTC (permalink / raw)
  To: Blue Swirl; +Cc: aliguori, qemu-devel

Blue Swirl <blauwirbel@gmail.com> writes:

> On Thu, Mar 14, 2013 at 5:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> This should unbreak "make check" on machines where char is unsigned.
>> Blue, please give it a whirl.
>
> Patches no longer apply, please rebase.

Sent.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-04-11 16:12 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
2013-03-21 19:37   ` Laszlo Ersek
2013-03-22  9:23     ` Markus Armbruster
2013-03-22 11:46       ` Laszlo Ersek
2013-03-14 17:49 ` [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments Markus Armbruster
2013-03-21 20:06   ` Laszlo Ersek
2013-03-22 13:27     ` Markus Armbruster
2013-03-14 17:49 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster
2013-03-21 20:22   ` Laszlo Ersek
2013-03-22 14:37     ` Markus Armbruster
2013-03-22 14:52       ` Laszlo Ersek
2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
2013-03-21 20:44   ` Laszlo Ersek
2013-03-22 13:15   ` Laszlo Ersek
2013-03-22 14:51     ` Markus Armbruster
2013-03-17 19:55 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Blue Swirl
2013-03-18  9:58   ` Markus Armbruster
2013-03-23 14:44 ` Blue Swirl
2013-04-11 16:12   ` Markus Armbruster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).