All of lore.kernel.org
 help / color / mirror / Atom feed
From: Gabriel Krisman Bertazi <krisman@suse.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	 "hanqi@vivo.com" <hanqi@vivo.com>,
	"Theodore Ts'o" <tytso@mit.edu>
Subject: Re: Unicode conversion issue
Date: Wed, 11 Dec 2024 16:10:58 -0500	[thread overview]
Message-ID: <875xnqudr1.fsf@mailhost.krisman.be> (raw)
In-Reply-To: <CAHk-=wiC3evUXq8QTcOBFTMu1wsUR_dYiS8eGxy0Hh7VbL55yA@mail.gmail.com> (Linus Torvalds's message of "Wed, 11 Dec 2024 12:18:25 -0800")

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> The problem is that all the filesystems basically do some variation of
>>
>>         if (IS_CASEFOLDED(dir) ..) {
>>
>>                 len = utf8_casefold(sb->s_encoding, orig_name,
>>                         new_name, MAXLEN);
>>
>> and then they use that "new_name" for both hashing and for comparisons.
>
> Oh, actually, f2fs does pass in the original name to
> generic_ci_match(), so I think this is solvable.
>
> The solution involves just telling f2fs to ignore the hash if it has
> seen odd characters.
>
> So I think f2fs could actually do something like this:
>
>   --- a/fs/f2fs/dir.c
>   +++ b/fs/f2fs/dir.c
>   @@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
>                         /* fall back to treating name as opaque byte sequence */
>                         return 0;
>                 }
>   +             fname->ignore_hash = utf8_oddname(fname->usr_fname);
>                 fname->cf_name.name = buf;
>                 fname->cf_name.len = len;
>         }
>   @@ -231,7 +232,7 @@ struct f2fs_dir_entry
> *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
>                         continue;
>                 }
>
>   -             if (de->hash_code == fname->hash) {
>   +             if (fname->ignore_hash || de->hash_code == fname->hash) {
>                         res = f2fs_match_name(d->inode, fname,
>                                               d->filename[bit_pos],
>                                               le16_to_cpu(de->name_len));

This solves it for directories with inlined dirents
(FI_INLINE_DENTRY). but for large directories, we use fname->hash to
find the right block to start the search.  So, we'd need to walk through
the entire case-insensitive directory.  In ext4, the issue only exists
on large directories, because we don't care about the hash on small
directories.


>   --- a/fs/f2fs/f2fs.h
>   +++ b/fs/f2fs/f2fs.h
>   @@ -521,6 +521,7 @@ struct f2fs_filename {
>
>         /* The dirhash of this filename */
>         f2fs_hash_t hash;
>   +     bool ignore_hash;
>
>    #ifdef CONFIG_FS_ENCRYPTION
>         /*
>
> where that "utf8_oddname()" is the one that goes "this filename
> contains unhashable characters".
>
> I didn't look very closely at what ext4 does, but it seems to already
> have a pattern for "don't even look at the hash because it's not
> reliable", so I think ext4 can do something similar.

> So then all you actually need is that utf8_oddname() that recognizes
> those ignored code-points.
>
> So I take it all back: option (1) actually doesn't look that bad, and
> would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
> ignorable code points") unnecessary.

I think we really need to revert it. The simplest way to implement
utf8_oddname is having the full database with the Ignorable code points
available. We can then add a flag in the same data structure indicating
this is an Ignorable codepoint that should be dismissed by the
utf8_strncasecmp when doing the casefold, while still using the full
string for the hash.

-- 
Gabriel Krisman Bertazi

  reply	other threads:[~2024-12-11 21:11 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-11 15:46 Unicode conversion issue Jaegeuk Kim
2024-12-11 16:08 ` Gabriel Krisman Bertazi
2024-12-11 17:08   ` Jaegeuk Kim
2024-12-11 19:45     ` Gabriel Krisman Bertazi
2024-12-11 19:58       ` Linus Torvalds
2024-12-11 20:18         ` Linus Torvalds
2024-12-11 21:10           ` Gabriel Krisman Bertazi [this message]
2024-12-11 21:25             ` Linus Torvalds
2024-12-11 21:53               ` Jaegeuk Kim
2024-12-11 21:56                 ` Linus Torvalds
2024-12-11 22:01                   ` Jaegeuk Kim
2024-12-11 22:09                     ` Linus Torvalds
2024-12-11 21:13           ` Jaegeuk Kim
2024-12-11 19:22   ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875xnqudr1.fsf@mailhost.krisman.be \
    --to=krisman@suse.de \
    --cc=hanqi@vivo.com \
    --cc=jaegeuk@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.