From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:60440)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1UJ45u-0003wz-Mi
	for qemu-devel@nongnu.org; Fri, 22 Mar 2013 11:41:34 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1UJ45q-000381-1o
	for qemu-devel@nongnu.org; Fri, 22 Mar 2013 11:41:26 -0400
Received: from mx1.redhat.com ([209.132.183.28]:4814)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1UJ45p-00037o-Pe
	for qemu-devel@nongnu.org; Fri, 22 Mar 2013 11:41:21 -0400
Message-ID: <514C7019.60800@redhat.com>
Date: Fri, 22 Mar 2013 15:52:09 +0100
From: Laszlo Ersek <lersek@redhat.com>
MIME-Version: 1.0
References: <1363283360-26220-1-git-send-email-armbru@redhat.com>
	<1363283360-26220-4-git-send-email-armbru@redhat.com>
	<514B6C1B.7020400@redhat.com>
	<87txo3tsa7.fsf@blackfin.pond.sub.org>
In-Reply-To: <87txo3tsa7.fsf@blackfin.pond.sub.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other
 than U+FFFE, U+FFFF in strings
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Markus Armbruster <armbru@redhat.com>
Cc: blauwirbel@gmail.com, aliguori@us.ibm.com, qemu-devel@nongnu.org

On 03/22/13 15:37, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> 
>> On 03/14/13 18:49, Markus Armbruster wrote:
>>> These are all broken, too.
>>
>> What are "these"? And how are they broken? And how does the patch fix them?
> 
> "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF.
> 
> I agree that I should better explain how they're broken, and what the
> patch does to fix them.  Will fix on respin.
> 
>>>
>>> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
>>> noncharacters some more instead of what they're supposed to test.  Use
>>> U+FFFD and U+10FFFD instead.
>>>
>>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>>> ---
>>>  tests/check-qjson.c | 85
>>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 72 insertions(+), 13 deletions(-)
>>
>> I'm confused about the commit message. There are three paragraphs in it
>> (the title, the first paragraph, and the 2nd paragraph). This patch
>> modifies different tests:
>>
>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>> index 852124a..efec1b2 100644
>>> --- a/tests/check-qjson.c
>>> +++ b/tests/check-qjson.c
>>> @@ -158,7 +158,7 @@ static void utf8_string(void)
>>>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>>       * UTF-8").
>>>       *
>>> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
>>> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>>>       * capability and stress test at
>>>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>       */
>>> @@ -256,11 +256,11 @@ static void utf8_string(void)
>>>              "\xDF\xBF",
>>>              "\"\\u07FF\"",
>>>          },
>>> -        /* 2.2.3  3 bytes U+FFFF */
>>> +        /* 2.2.3  3 bytes U+FFFD */
>>>          {
>>> -            "\"\xEF\xBF\xBF\"",
>>> -            "\xEF\xBF\xBF",
>>> -            "\"\\uFFFF\"",
>>> +            "\"\xEF\xBF\xBD\"",
>>> +            "\xEF\xBF\xBD",
>>> +            "\"\\uFFFD\"",
>>>          },
>>
>> This is under "2.2  Last possible sequence of a certain length". I guess
> 
> Which is in turn under "2  Boundary condition test cases".
> 
>> this is where you say "last possible sequence of a certain length,
>> encoding a character (= non-noncharacter)". OK, p#2.
> 
> Yes.
> 
> The test's purpose is testing the upper bound of 3-byte sequences is
> decoded correctly.
> 
> The upper bound is U+FFFF.  Since that's a noncharacter, the parser
> should reject it (or maybe replace), the formatter should replace it.
> Trouble is it could be misdecoded and then rejected / replaced.
> 
> Besides, U+FFFF already gets tested along with the other noncharacters
> under "5.3  Other illegal code positions".
> 
> Next in line is U+FFFE, also a noncharacter, also under 5.3.
> 
> Next in line is U+FFFD, which I picked.
> 
> But that gets tested under "2.3  Other boundary conditions"!  I guess I
> either drop it there, or make this one U+FFFC.
> 
> I think testing U+FFFC here makes sense, because U+FFFD could be
> misdecoded, then replaced by U+FFFD.
> 
> What do you think?

I think that we're extending Markus Kuhn's test suite, basically taking
random shots at where one specific parser's/formatter's weak spots might
be :)

That said, with intelligent fuzzing out of scope / capacity, U+FFFC
could be a good pick.

I also think I'm a quite a useless person to ask for thoughts in this
area :)

Thanks,
Laszlo