Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Markus Armbruster <armbru@redhat.com>
To: Laszlo Ersek <lersek@redhat.com>
Cc: blauwirbel@gmail.com, aliguori@us.ibm.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite
Date: Fri, 22 Mar 2013 15:51:46 +0100	[thread overview]
Message-ID: <87fvzntrlp.fsf@blackfin.pond.sub.org> (raw)
In-Reply-To: <514C5981.9010703@redhat.com> (Laszlo Ersek's message of "Fri, 22 Mar 2013 14:15:45 +0100")

Laszlo Ersek <lersek@redhat.com> writes:

> comments below
>
> On 03/14/13 18:49, Markus Armbruster wrote:
>
>> diff --git a/qobject/qjson.c b/qobject/qjson.c
>> index 83a6b4f..19085a1 100644
>> --- a/qobject/qjson.c
>> +++ b/qobject/qjson.c
>> @@ -136,68 +136,56 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
>>      case QTYPE_QSTRING: {
>>          QString *val = qobject_to_qstring(obj);
>>          const char *ptr;
>> +        int cp;
>> +        char buf[16];
>> +        char *end;
>>  
>>          ptr = qstring_get_str(val);
>>          qstring_append(str, "\"");
>> -        while (*ptr) {
>> -            if ((ptr[0] & 0xE0) == 0xE0 &&
>> -                (ptr[1] & 0x80) && (ptr[2] & 0x80)) {
>> -                uint16_t wchar;
>> -                char escape[7];
>> -
>> -                wchar  = (ptr[0] & 0x0F) << 12;
>> -                wchar |= (ptr[1] & 0x3F) << 6;
>> -                wchar |= (ptr[2] & 0x3F);
>> -                ptr += 2;
>> -
>> -                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
>> -                qstring_append(str, escape);
>> -            } else if ((ptr[0] & 0xE0) == 0xC0 && (ptr[1] & 0x80)) {
>> -                uint16_t wchar;
>> -                char escape[7];
>> -
>> -                wchar  = (ptr[0] & 0x1F) << 6;
>> -                wchar |= (ptr[1] & 0x3F);
>> -                ptr++;
>> -
>> -                snprintf(escape, sizeof(escape), "\\u%04X", wchar);
>> -                qstring_append(str, escape);
>> -            } else switch (ptr[0]) {
>> -                case '\"':
>> -                    qstring_append(str, "\\\"");
>> -                    break;
>> -                case '\\':
>> -                    qstring_append(str, "\\\\");
>> -                    break;
>> -                case '\b':
>> -                    qstring_append(str, "\\b");
>> -                    break;
>> -                case '\f':
>> -                    qstring_append(str, "\\f");
>> -                    break;
>> -                case '\n':
>> -                    qstring_append(str, "\\n");
>> -                    break;
>> -                case '\r':
>> -                    qstring_append(str, "\\r");
>> -                    break;
>> -                case '\t':
>> -                    qstring_append(str, "\\t");
>> -                    break;
>> -                default: {
>> -                    if (ptr[0] <= 0x1F) {
>> -                        char escape[7];
>> -                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
>> -                        qstring_append(str, escape);
>> -                    } else {
>> -                        char buf[2] = { ptr[0], 0 };
>> -                        qstring_append(str, buf);
>> -                    }
>> -                    break;
>> +
>> +        for (; *ptr; ptr = end) {
>> +            cp = mod_utf8_codepoint(ptr, 6, &end);
>
> This provides more background: you never call mod_utf8_codepoint() with
> '\0' at offset 0. So handling that in mod_utf8_codepoint() may not be
> that important.

Yes, this caller doesn't care.  Doesn't mean we shouldn't try to come up
with a sane function contract.

Note the use of literal 6.  It means "unlimited".  Perfectly safe
because the string is nul-terminated.

> If a '\0' is found at offset >= 1, it will correctly trigger the /*
> continuation byte missing */ branch in mod_utf8_codepoint(). The retval
> is -1, and *end is left pointing to the NUL byte. (This is consistent
> with mod_utf8_codepoint()'s docs.)
>
> The -1 (incomplete sequence) produces the replacement character below,
> and the next time around *ptr is '\0', so we finish the loop. Seems OK.
>
> (
> An alternative interface for mod_utf8_codepoint() might be something like:
>
>     size_t alternative(const char *ptr, int *cp, size_t n);
>
> Resembling read() somewhat:
> - the return value would be the number of bytes consumed (it can't be
> negative (= fatal error), because we guarantee progress). 0 is EOF and
> only possible when "n" is 0.
> - "ptr" is the source,
> - "cp" is the output code point, -1 if invalid,
> - "n" is the bytes available in the source / requested to process at most.
>
> Encountering a \0 in the byte stream would be an error (*cp = -1), but
> would not terminate parsing per se.
>
> Then the loop would look like:
>
>     processed = 0;
>     while (processed < full) {
>       int cp;
>
>       rd = alternative(ptr + processed, &cp, full - processed);
>       g_assert(rd > 0);
>
>       /* look at cp */
>
>       processed += rd;
>   }
>
> But of course I'm not suggesting to rewrite the function!
> )

I'll keep this in mind when deciding how I want to handle '\0'.

>> +            switch (cp) {
>> +            case '\"':
>> +                qstring_append(str, "\\\"");
>> +                break;
>> +            case '\\':
>> +                qstring_append(str, "\\\\");
>> +                break;
>> +            case '\b':
>> +                qstring_append(str, "\\b");
>> +                break;
>> +            case '\f':
>> +                qstring_append(str, "\\f");
>> +                break;
>> +            case '\n':
>> +                qstring_append(str, "\\n");
>> +                break;
>> +            case '\r':
>> +                qstring_append(str, "\\r");
>> +                break;
>> +            case '\t':
>> +                qstring_append(str, "\\t");
>> +                break;
>
> The C standard also names \a (alert) and \v (vertical tab); I'm not sure
> about their JSON notation. (The (cp < 0x20) condition catches them below
> of course.)

JSON RFC 4627 defines only the seven above plus '\/'.  Escaping '/' that
way makes no sense for us, so the old code doesn't, and mine doesn't
either.

>> +            default:
>> +                if (cp < 0) {
>> +                    cp = 0xFFFD; /* replacement character */
>>                  }
>> +                if (cp > 0xFFFF) {
>> +                    /* beyond BMP; need a surrogate pair */
>> +                    snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
>> +                             0xD800 + ((cp - 0x10000) >> 10),
>> +                             0xDC00 + ((cp - 0x10000) & 0x3FF));
>
> Seems like we write 13 bytes into buf, OK. Also cp is never greater than
> 0x10FFFF, hence the difference is at most 0xFFFFF. The RHS surrogate
> half can go up to 0xDFFF, the LHS up to 0xD800+0x3FF == 0xDBFF. Good.

Exactly.

>> +                } else if (cp < 0x20 || cp >= 0x7F) {
>> +                    snprintf(buf, sizeof(buf), "\\u%04X", cp);
>> +                } else {
>> +                    buf[0] = cp;
>> +                    buf[1] = 0;
>>                  }
>> -            ptr++;
>> -        }
>> +                qstring_append(str, buf);
>> +            }
>> +        };
>> +
>>          qstring_append(str, "\"");
>>          break;
>>      }
>
> Seems OK.
>
>
>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>> index efec1b2..595ddc0 100644
>> --- a/tests/check-qjson.c
>> +++ b/tests/check-qjson.c
>
> I'll trust you on that one :)

Waah, you don't want another case of bleeding eyes?!?

> Reviewed-by: Laszlo Ersek <lersek@redhat.com>

Thanks!

next prev parent reply	other threads:[~2013-03-22 14:52 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
2013-03-21 19:37   ` Laszlo Ersek
2013-03-22  9:23     ` Markus Armbruster
2013-03-22 11:46       ` Laszlo Ersek
2013-03-14 17:49 ` [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments Markus Armbruster
2013-03-21 20:06   ` Laszlo Ersek
2013-03-22 13:27     ` Markus Armbruster
2013-03-14 17:49 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster
2013-03-21 20:22   ` Laszlo Ersek
2013-03-22 14:37     ` Markus Armbruster
2013-03-22 14:52       ` Laszlo Ersek
2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
2013-03-21 20:44   ` Laszlo Ersek
2013-03-22 13:15   ` Laszlo Ersek
2013-03-22 14:51     ` Markus Armbruster [this message]
2013-03-17 19:55 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Blue Swirl
2013-03-18  9:58   ` Markus Armbruster
2013-03-23 14:44 ` Blue Swirl
2013-04-11 16:12   ` Markus Armbruster
  -- strict thread matches above, loose matches on Subject: below --
2013-04-11 16:07 Markus Armbruster
2013-04-11 16:07 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87fvzntrlp.fsf@blackfin.pond.sub.org \
    --to=armbru@redhat.com \
    --cc=aliguori@us.ibm.com \
    --cc=blauwirbel@gmail.com \
    --cc=lersek@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.