Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
To: "Weber\, Olaf \(HPC Data Management & Storage\)" <olaf.weber@hpe.com>
Cc: "tytso\@mit.edu" <tytso@mit.edu>,
	"david\@fromorbit.com" <david@fromorbit.com>,
	"linux-ext4\@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-fsdevel\@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"kernel\@lists.collabora.co.uk" <kernel@lists.collabora.co.uk>,
	"alvaro.soliverez\@collabora.co.uk"
	<alvaro.soliverez@collabora.co.uk>
Subject: Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library
Date: Tue, 23 Jan 2018 01:33:06 -0200	[thread overview]
Message-ID: <87bmhl8d1p.fsf@collabora.co.uk> (raw)
In-Reply-To: <DF4PR8401MB0812536DFB46D5F762D5D75585EA0@DF4PR8401MB0812.NAMPRD84.PROD.OUTLOOK.COM> (Olaf Weber's message of "Tue, 16 Jan 2018 22:19:24 +0000")

"Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@hpe.com>
writes:


>> One question that I have, on the other hand: Take the version you
>> shared, I want to avoid the -EINVAL for the case when strings s1
>> and s2 should match as equal, but strlen(s1) < strlen (s2).  In this
>> case:
>> 
>> strncmp (s1, s2, strlen (s2)) => Returns 0.  Matches Ok
>> strncmp (s1, s2, strlen (s1)) => Returns -EINVAL
>> 
>> I know -EINVAL doesn't mean they don't match, but this case seems too
>> error prone.
>
> If I understand your question correctly, the case of interest is
>
> 	strncmp(s1, s2, len), where len <= strlen(s1) and len <= strlen(s2)
>
> As far as I can tell the code I sketched above handles that case in the
> way you expect/want, when taking the complications introduced by
> Unicode into account. Using utf8ncursor() ensures we do get an -EINVAL
> if, and only if, we read beyond the end (len) of the source string as part of
> the normalization process. But if we are at an acceptable boundary in the
> source string when we see the end of the string, utf8byte() returns 0,
> indicating a normal/non-error end of the scan.

Hey Olaf,

Sorry for the delay.

It is not quite that scenario.  The version that requires only 1 length
fails when utf8ncursor receives a len that is smaller than one of the
strings, which is a common case when something decomposes to a larger
string:

Take this case, for instance:

s1 = {0xc2, 0xbc, 0x00},   /* 'VULGAR FRACTION ONE QUARTER' decomposes to */
s2 = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00},  /* 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */

If we do strncmp(s1, s2, strlen(s2)), it works fine.  But if we use
strlen(s1) on the third parameter, it fails.  As far as I understand,
the issue happens because utf8lookup will read to the maximum of len
characters, aborting the lookup in the middle of a sequence. Since we
don't hit a leaf for that code-point, it assumes an invalid sequence and
utf8byte aborts.

The easiest way to solve it is by receiving the two lens in strncmp.

> I think it may be worth to write some tests to (hopefully) confirm that
> the code really does what I intended it to do. The most likely case to
> fail would be where you hit the len-imposed end after a codepoint
> with CCC != 0.

The only test I had for this scenario happened to have strlen(s1) ==
strlen(s2).  I added the following, which I think catches this scenario:

/* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK'  decomposes to */
/* 'LETTER A' + 'COMBINING  OGONEK' + 'COMBINING DIAERESIS' */
  s1 = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
  s2 = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},

I tested your version, and it works correctly for this scenario too, as
long as we set the len parameter to use the largest string, s2, instead
of s1.

>> I suppose we just could:
>> 
>>  (1) let the caller deal with it, which is error prone.  Or,
>
> The caller does have to do something when it gets -EINVAL. You have to
> define the desired semantics of that case.
>
> In the original XFS-filesystem code my choice was to treat invalid UTF-8
> sequences as binary blobs for the sake of comparisons.
>
>>  (2) Require two lens on strncmp, one for each string, Or,
>
> As a general rule this is certainly correct: each string has its
> own associated maximum length within which it should have
> been null-terminated.  So whether you need one len per
> string depends on the sources of the strings. In the original
> XFS-based code there are some tricks related to this.

I've applied this solution and it is solving every test case correctly,
including those I mentioned above.  Since it looks like the best
approach, I applied the other things you commented, and modified the
comparison functions code to receive 2 lens.  I should submit a v2
shortly, once I'm done with dealing with some changes to the fs part.

Thanks!

-- 
Gabriel Krisman Bertazi

next prev parent reply	other threads:[~2018-01-23  3:33 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-12  7:12 [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 01/13] charsets: Introduce middle-layer for character encoding Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 02/13] charsets: ascii: Wrap ascii functions to charsets library Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 03/13] charsets: utf8: Add unicode character database files Gabriel Krisman Bertazi
2018-01-12 16:59   ` Darrick J. Wong
2018-01-12 20:29     ` Weber, Olaf (HPC Data Management & Storage)
2018-01-13  0:24   ` Theodore Ts'o
2018-01-13  4:28     ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 04/13] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 05/13] charsets: utf8: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 06/13] charsets: utf8: reduce the size of utf8data[] Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library Gabriel Krisman Bertazi
2018-01-12 10:38   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-16 16:50     ` Gabriel Krisman Bertazi
2018-01-16 22:19       ` Weber, Olaf (HPC Data Management & Storage)
2018-01-23  3:33         ` Gabriel Krisman Bertazi [this message]
2018-01-12  7:12 ` [PATCH RFC 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 09/13] ext4: Add ignorecase mount option Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 10/13] ext4: Include encoding information on the superblock Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 11/13] fscrypt: Introduce charset-based matching functions Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 12/13] ext4: Support charset name matching Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets Gabriel Krisman Bertazi
2018-01-12 10:52   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-12 16:56 ` [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Jeremy Allison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bmhl8d1p.fsf@collabora.co.uk \
    --to=krisman@collabora.co.uk \
    --cc=alvaro.soliverez@collabora.co.uk \
    --cc=david@fromorbit.com \
    --cc=kernel@lists.collabora.co.uk \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=olaf.weber@hpe.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.