Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Markus Armbruster <armbru@redhat.com>
To: Laszlo Ersek <lersek@redhat.com>
Cc: blauwirbel@gmail.com, aliguori@us.ibm.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings
Date: Fri, 22 Mar 2013 15:37:04 +0100	[thread overview]
Message-ID: <87txo3tsa7.fsf@blackfin.pond.sub.org> (raw)
In-Reply-To: <514B6C1B.7020400@redhat.com> (Laszlo Ersek's message of "Thu, 21 Mar 2013 21:22:51 +0100")

Laszlo Ersek <lersek@redhat.com> writes:

> On 03/14/13 18:49, Markus Armbruster wrote:
>> These are all broken, too.
>
> What are "these"? And how are they broken? And how does the patch fix them?

"These" refers to the subject: noncharacters other than U+FFFE, U+FFFF.

I agree that I should better explain how they're broken, and what the
patch does to fix them.  Will fix on respin.

>> 
>> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
>> noncharacters some more instead of what they're supposed to test.  Use
>> U+FFFD and U+10FFFD instead.
>> 
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>  tests/check-qjson.c | 85
>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 72 insertions(+), 13 deletions(-)
>
> I'm confused about the commit message. There are three paragraphs in it
> (the title, the first paragraph, and the 2nd paragraph). This patch
> modifies different tests:
>
>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>> index 852124a..efec1b2 100644
>> --- a/tests/check-qjson.c
>> +++ b/tests/check-qjson.c
>> @@ -158,7 +158,7 @@ static void utf8_string(void)
>>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>       * UTF-8").
>>       *
>> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
>> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>>       * capability and stress test at
>>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>       */
>> @@ -256,11 +256,11 @@ static void utf8_string(void)
>>              "\xDF\xBF",
>>              "\"\\u07FF\"",
>>          },
>> -        /* 2.2.3  3 bytes U+FFFF */
>> +        /* 2.2.3  3 bytes U+FFFD */
>>          {
>> -            "\"\xEF\xBF\xBF\"",
>> -            "\xEF\xBF\xBF",
>> -            "\"\\uFFFF\"",
>> +            "\"\xEF\xBF\xBD\"",
>> +            "\xEF\xBF\xBD",
>> +            "\"\\uFFFD\"",
>>          },
>
> This is under "2.2  Last possible sequence of a certain length". I guess

Which is in turn under "2  Boundary condition test cases".

> this is where you say "last possible sequence of a certain length,
> encoding a character (= non-noncharacter)". OK, p#2.

Yes.

The test's purpose is testing the upper bound of 3-byte sequences is
decoded correctly.

The upper bound is U+FFFF.  Since that's a noncharacter, the parser
should reject it (or maybe replace), the formatter should replace it.
Trouble is it could be misdecoded and then rejected / replaced.

Besides, U+FFFF already gets tested along with the other noncharacters
under "5.3  Other illegal code positions".

Next in line is U+FFFE, also a noncharacter, also under 5.3.

Next in line is U+FFFD, which I picked.

But that gets tested under "2.3  Other boundary conditions"!  I guess I
either drop it there, or make this one U+FFFC.

I think testing U+FFFC here makes sense, because U+FFFD could be
misdecoded, then replaced by U+FFFD.

What do you think?

>>          /* 2.2.4  4 bytes U+1FFFFF */
>>          {
>> @@ -303,10 +303,10 @@ static void utf8_string(void)
>>              "\"\\uFFFD\"",
>>          },
>>          {
>> -            /* U+10FFFF */
>> -            "\"\xF4\x8F\xBF\xBF\"",
>> -            "\xF4\x8F\xBF\xBF",
>> -            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
>> +            /* U+10FFFD */
>> +            "\"\xF4\x8F\xBF\xBD\"",
>> +            "\xF4\x8F\xBF\xBD",
>> +            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
>>          },
>>          {
>>              /* U+110000 */
>
> Under "2.3  Other boundary conditions". Not a non-character any longer,
> but also not a boundary condition. At least not the original one. Still
> covered by the ...FFFD part of the commit message, p#2.

The test's purpose is testing the upper bound of the Unicode range gets
decoded correctly.

The upper bound is U+10FFFF.  Since that's a noncharacter... same
argument as above.

Next in line is U+10FFFE, also a noncharacter.

Next in line is U+10FFFD, which I picked.

>> @@ -584,9 +584,9 @@ static void utf8_string(void)
>>              "\"\\u07FF\"",
>>          },
>>          {
>> -            /* \U+FFFF */
>> -            "\"\xF0\x8F\xBF\xBF\"",
>> -            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
>> +            /* \U+FFFD */
>> +            "\"\xF0\x8F\xBF\xBD\"",
>> +            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
>>              "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>>          },
>>          {
>
> Under "4.2  Maximum overlong sequences". What does that even mean? "In
> some sense maximum codepoints, all represented as overlong sequences"? P#2.

The headings are all stolen from the same source as the test cases.
Perhaps I should steal more of the explanatory text there as well.

The one for 4.2 is:

    Below you see the highest Unicode value that is still resulting in
    an overlong sequence if represented with the given number of bytes.
    This is a boundary test for safe UTF-8 decoders.  All five
    characters should be rejected like malformed UTF-8 sequences.

>> @@ -731,6 +731,7 @@ static void utf8_string(void)
>>              "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
>>          },
>>          /* 5.3  Other illegal code positions */
>> +        /* BMP noncharacters */
>>          {
>>              /* \U+FFFE */
>>              "\"\xEF\xBF\xBE\"",
>> @@ -741,7 +742,65 @@ static void utf8_string(void)
>>              /* \U+FFFF */
>>              "\"\xEF\xBF\xBF\"",
>>              "\xEF\xBF\xBF",     /* bug: not corrected */
>> -            "\"\\uFFFF\"",      /* bug: not corrected */
>> +            "\"\\uFFFF\"",
>> +        },
>> +        {
>> +            /* U+FDD0 */
>> +            "\"\xEF\xB7\x90\"",
>> +            "\xEF\xB7\x90",     /* bug: not corrected */
>> +            "\"\\uFDD0\"",      /* bug: not corrected */
>> +        },
>> +        {
>> +            /* U+FDEF */
>> +            "\"\xEF\xB7\xAF\"",
>> +            "\xEF\xB7\xAF",     /* bug: not corrected */
>> +            "\"\\uFDEF\"",      /* bug: not corrected */
>> +        },
>> +        /* Plane 1 .. 16 noncharacters */
>> +        {
>> +            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
>> +            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
>> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
>> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
>> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
>> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
>> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
>> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
>> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
>> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
>> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
>> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
>> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
>> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
>> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
>> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
>> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
>> +            /* bug: not corrected */
>> +            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
>> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
>> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
>> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
>> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
>> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
>> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
>> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
>> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
>> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
>> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
>> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
>> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
>> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
>> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
>> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
>> +            /* bug: not corrected */
>> +            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
>> +            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
>> +            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
>> +            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
>> +            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
>> +            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
>> +            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
>> +            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
>>          },
>>          {}
>>      };
>> 
>
> This is probably p#0 (the title).
>
> Ah. Have you removed the noncharacters from the other tests, but made up
> for them at the end with new noncharacter tests?

I added tests to cover all 66 noncharacters.  Then I noticed some
duplicate test cases elsewhere, and realized that these don't really
fully test what I want to test there.

> Reviewed-by: Laszlo Ersek <lersek@redhat.com>

Thanks!

next prev parent reply	other threads:[~2013-03-22 14:37 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-14 17:49 [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Markus Armbruster
2013-03-14 17:49 ` [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint() Markus Armbruster
2013-03-21 19:37   ` Laszlo Ersek
2013-03-22  9:23     ` Markus Armbruster
2013-03-22 11:46       ` Laszlo Ersek
2013-03-14 17:49 ` [Qemu-devel] [PATCH 2/4] check-qjson: Fix up a few bogus comments Markus Armbruster
2013-03-21 20:06   ` Laszlo Ersek
2013-03-22 13:27     ` Markus Armbruster
2013-03-14 17:49 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster
2013-03-21 20:22   ` Laszlo Ersek
2013-03-22 14:37     ` Markus Armbruster [this message]
2013-03-22 14:52       ` Laszlo Ersek
2013-03-14 17:49 ` [Qemu-devel] [PATCH 4/4] qjson: to_json() case QTYPE_QSTRING is buggy, rewrite Markus Armbruster
2013-03-21 20:44   ` Laszlo Ersek
2013-03-22 13:15   ` Laszlo Ersek
2013-03-22 14:51     ` Markus Armbruster
2013-03-17 19:55 ` [Qemu-devel] [PATCH 0/4] Fix JSON string formatter Blue Swirl
2013-03-18  9:58   ` Markus Armbruster
2013-03-23 14:44 ` Blue Swirl
2013-04-11 16:12   ` Markus Armbruster
  -- strict thread matches above, loose matches on Subject: below --
2013-04-11 16:07 Markus Armbruster
2013-04-11 16:07 ` [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings Markus Armbruster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87txo3tsa7.fsf@blackfin.pond.sub.org \
    --to=armbru@redhat.com \
    --cc=aliguori@us.ibm.com \
    --cc=blauwirbel@gmail.com \
    --cc=lersek@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.