From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:60440) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UJ45u-0003wz-Mi for qemu-devel@nongnu.org; Fri, 22 Mar 2013 11:41:34 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UJ45q-000381-1o for qemu-devel@nongnu.org; Fri, 22 Mar 2013 11:41:26 -0400 Received: from mx1.redhat.com ([209.132.183.28]:4814) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UJ45p-00037o-Pe for qemu-devel@nongnu.org; Fri, 22 Mar 2013 11:41:21 -0400 Message-ID: <514C7019.60800@redhat.com> Date: Fri, 22 Mar 2013 15:52:09 +0100 From: Laszlo Ersek MIME-Version: 1.0 References: <1363283360-26220-1-git-send-email-armbru@redhat.com> <1363283360-26220-4-git-send-email-armbru@redhat.com> <514B6C1B.7020400@redhat.com> <87txo3tsa7.fsf@blackfin.pond.sub.org> In-Reply-To: <87txo3tsa7.fsf@blackfin.pond.sub.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Markus Armbruster Cc: blauwirbel@gmail.com, aliguori@us.ibm.com, qemu-devel@nongnu.org On 03/22/13 15:37, Markus Armbruster wrote: > Laszlo Ersek writes: > >> On 03/14/13 18:49, Markus Armbruster wrote: >>> These are all broken, too. >> >> What are "these"? And how are they broken? And how does the patch fix them? > > "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF. > > I agree that I should better explain how they're broken, and what the > patch does to fix them. Will fix on respin. > >>> >>> A few test cases use noncharacters U+FFFF and U+10FFFF. Risks testing >>> noncharacters some more instead of what they're supposed to test. Use >>> U+FFFD and U+10FFFD instead. >>> >>> Signed-off-by: Markus Armbruster >>> --- >>> tests/check-qjson.c | 85 >>> +++++++++++++++++++++++++++++++++++++++++++++-------- >>> 1 file changed, 72 insertions(+), 13 deletions(-) >> >> I'm confused about the commit message. There are three paragraphs in it >> (the title, the first paragraph, and the 2nd paragraph). This patch >> modifies different tests: >> >>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >>> index 852124a..efec1b2 100644 >>> --- a/tests/check-qjson.c >>> +++ b/tests/check-qjson.c >>> @@ -158,7 +158,7 @@ static void utf8_string(void) >>> * consider using overlong encoding \xC0\x80 for U+0000 ("modified >>> * UTF-8"). >>> * >>> - * Test cases are scraped from Markus Kuhn's UTF-8 decoder >>> + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder >>> * capability and stress test at >>> * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>> */ >>> @@ -256,11 +256,11 @@ static void utf8_string(void) >>> "\xDF\xBF", >>> "\"\\u07FF\"", >>> }, >>> - /* 2.2.3 3 bytes U+FFFF */ >>> + /* 2.2.3 3 bytes U+FFFD */ >>> { >>> - "\"\xEF\xBF\xBF\"", >>> - "\xEF\xBF\xBF", >>> - "\"\\uFFFF\"", >>> + "\"\xEF\xBF\xBD\"", >>> + "\xEF\xBF\xBD", >>> + "\"\\uFFFD\"", >>> }, >> >> This is under "2.2 Last possible sequence of a certain length". I guess > > Which is in turn under "2 Boundary condition test cases". > >> this is where you say "last possible sequence of a certain length, >> encoding a character (= non-noncharacter)". OK, p#2. > > Yes. > > The test's purpose is testing the upper bound of 3-byte sequences is > decoded correctly. > > The upper bound is U+FFFF. Since that's a noncharacter, the parser > should reject it (or maybe replace), the formatter should replace it. > Trouble is it could be misdecoded and then rejected / replaced. > > Besides, U+FFFF already gets tested along with the other noncharacters > under "5.3 Other illegal code positions". > > Next in line is U+FFFE, also a noncharacter, also under 5.3. > > Next in line is U+FFFD, which I picked. > > But that gets tested under "2.3 Other boundary conditions"! I guess I > either drop it there, or make this one U+FFFC. > > I think testing U+FFFC here makes sense, because U+FFFD could be > misdecoded, then replaced by U+FFFD. > > What do you think? I think that we're extending Markus Kuhn's test suite, basically taking random shots at where one specific parser's/formatter's weak spots might be :) That said, with intelligent fuzzing out of scope / capacity, U+FFFC could be a good pick. I also think I'm a quite a useless person to ask for thoughts in this area :) Thanks, Laszlo