Unicode conversion issue

* Unicode conversion issue
From: Jaegeuk Kim @ 2024-12-11 15:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, krisman

Hi Linus/Gabriel,

Once Android applied the patch below [1], some special characters started to be
converted differently, resulting in a different length, so f2fs can no longer
find filenames that were created when the kernel didn't have [1].

There is a bug report in [2] which describes more details. In order to avoid
this, could you please consider reverting [1] asap? Or, is there any other
way to keep the conversion while addressing the CVE? It's very hard for f2fs to
distinguish two valid converted lengths before/after [1].

[1] 5c26d2f1d3f5 ("unicode: Don't special case ignorable code points")
[2] https://bugzilla.kernel.org/show_bug.cgi?id=219586

* Re: Unicode conversion issue
From: Gabriel Krisman Bertazi @ 2024-12-11 16:08 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: Linus Torvalds, Linux Kernel Mailing List, hanqi@vivo.com

Jaegeuk Kim <jaegeuk@kernel.org> writes:

> Hi Linus/Gabriel,
>
> Once Android applied the below patch [1], some special characters started to be
> converted differently resulting in different length, so that f2fs cannot find
> the filename correctly which was created when the kernel didn't have [1].
>
> There is one bug report in [2] where describes more details. In order to avoid
> this, could you please consider reverting [1] asap? Or, is there any other
> way to keep the conversion while addressing CVE? It's very hard for f2fs to
> distinguish two valid converted lengths before/after [1].

I got this report yesterday. I'm looking into it.

It seems commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable
code points") has affected more than ignorable code points, because
U+2764 is not marked as Ignorable in the Unicode database.

I still think the solution to the original issue is eliminating
ignorable code points, and that should be fine.  Let me look at why this
block of characters is mishandled.

>
> [1] 5c26d2f1d3f5 ("unicode: Don't special case ignorable code points")
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=219586

-- 
Gabriel Krisman Bertazi

* Re: Unicode conversion issue
From: Jaegeuk Kim @ 2024-12-11 17:08 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Linus Torvalds, Linux Kernel Mailing List, hanqi@vivo.com

On 12/11, Gabriel Krisman Bertazi wrote:
> Jaegeuk Kim <jaegeuk@kernel.org> writes:
> 
> > Hi Linus/Gabriel,
> >
> > Once Android applied the below patch [1], some special characters started to be
> > converted differently resulting in different length, so that f2fs cannot find
> > the filename correctly which was created when the kernel didn't have [1].
> >
> > There is one bug report in [2] where describes more details. In order to avoid
> > this, could you please consider reverting [1] asap? Or, is there any other
> > way to keep the conversion while addressing CVE? It's very hard for f2fs to
> > distinguish two valid converted lengths before/after [1].
> 
> I got this report yesterday. I'm looking into it.
> 
> It seems commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable
> code points") has affected more than ignorable code points, because that
> U+2764 is not marked as Ignorable in the unicode database.
> 
> I still think the solution to the original issue is eliminating
> ignorable code points, and that should be fine.  Let me look at why this
> block of characters is mishandled.

Thank you so much. If it takes some time to find the root cause, may I
propose the revert first to unblock production? The problem is quite severe
as users cannot access their files.

> 
> >
> > [1] 5c26d2f1d3f5 ("unicode: Don't special case ignorable code points")
> > [2] https://bugzilla.kernel.org/show_bug.cgi?id=219586
> 
> -- 
> Gabriel Krisman Bertazi

* Re: Unicode conversion issue
From: Linus Torvalds @ 2024-12-11 19:22 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jaegeuk Kim, Linux Kernel Mailing List, hanqi@vivo.com

On Wed, 11 Dec 2024 at 08:08, Gabriel Krisman Bertazi <krisman@suse.de> wrote:
>
> It seems commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable
> code points") has affected more than ignorable code points, because that
> U+2764 is not marked as Ignorable in the unicode database.

It's not U+2764 - "Heavy Black Heart".

It's U+2764 _and_ U+FE0F - "Variation Selector-16 (VS16)"

And VS16 asks that the heart be shown as an emoji, which in turn turns
that black heart red.

And presumably that VS16 is one of those idiotic "ignorable" characters.

Christ, I don't understand why some people still think that
casefolding is sane.  It damn well isn't, exactly because it causes
these kinds of insane situations, because the "case folding" of "mark
it as an emoji" is damn well undefined.

> I still think the solution to the original issue is eliminating
> ignorable code points, and that should be fine.  Let me look at why this
> block of characters is mishandled.

I suspect we'll have to revert, and re-examine.

Of course, in the meantime, somebody has probably already created
files with the *new* hashing, so even reverting might not "fix" the
issue.
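To make the hashing problem concrete, here is a minimal sketch. It assumes
a djb2-style toy hash (toy_dirhash is a hypothetical name, and this is NOT
the real f2fs dirhash) and simply hashes the two byte strings that the
folding step can produce for the reported name: one with U+FE0F kept and
one with it dropped. Any reasonable hash gives two different values, so an
entry stored under one of them is not found by a lookup that computes the
other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* djb2-style hash, used purely as a stand-in for the real dirhash. */
static uint32_t toy_dirhash(const unsigned char *name, size_t len)
{
	uint32_t h = 5381;

	assert(name != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
		h = h * 33u + name[i];
	return h;
}

/* U+2764 followed by U+FE0F, encoded as UTF-8 (6 bytes). */
static const unsigned char with_vs16[] = {
	0xE2, 0x9D, 0xA4, 0xEF, 0xB8, 0x8F
};

/* The same name after the variation selector was dropped (3 bytes). */
static const unsigned char without_vs16[] = { 0xE2, 0x9D, 0xA4 };
```

Calling toy_dirhash() on with_vs16 and on without_vs16 yields two
different values, which is all the sketch is meant to show.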

The real fix is to not do casefolding, or at least to never *EVER*
trust the hashing of case-folded crap, because the hash is
fundamentally not reliable.

What a case-folding filesystem should do is

 (a) preserve case and hash with that preserved case (which is
equivalent to NOT DOING CASE FOLDING! The user gave you binary data,
you *treat* it as binary sacred data instead of corrupting it)

 (b) only use case folding for "I didn't find the exact case, let's
do an approximate search".

but decades of history have shown that filesystem people seem to be
unable to understand the whole notion of "you don't screw with people's
data".

That (a) guarantees that you get sane semantics for 1:1 names and that
you can *always* access the file using the preserved case.

And (b) is the "you get the insane case folded semantics for the
insane situation where it's needed, and never anywhere else".
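The (a)/(b) scheme above can be sketched as a two-pass lookup. This is
only an illustration under loud assumptions: the directory is a flat
array of names, lookup_sketch() and names_equal_folded() are hypothetical
helpers, and the "approximate" compare folds ASCII only, where a real
implementation would need a Unicode-aware comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only stand-in for a real case-insensitive Unicode compare. */
static int names_equal_folded(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;	/* equal only if both strings ended */
}

/* Returns the index of the matching entry, or -1 if there is none. */
static int lookup_sketch(const char *const *entries, size_t n,
			 const char *name)
{
	assert(entries != NULL && name != NULL);

	/* (a) exact, byte-for-byte match on the preserved name */
	for (size_t i = 0; i < n; i++)
		if (strcmp(entries[i], name) == 0)
			return (int)i;

	/* (b) approximate search, only reached when (a) found nothing */
	for (size_t i = 0; i < n; i++)
		if (names_equal_folded(entries[i], name))
			return (int)i;

	return -1;
}
```

With entries { "Makefile", "notes.txt" }, looking up "notes.txt" is
resolved by pass (a), while "MAKEFILE" only matches in pass (b).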

Alternatively, case-folding should only fold the really damn obvious
cases. That was the problem with the horrendous "ignorable code
points", where case-folding reacted to non-case characters by simply
ignoring them.

Damn how I hate broken filesystems that "interpret" the data that they
are given. Pure unadulterated garbage.

If I wanted made-up random crap and hallucinations, I'd ask ChatGPT,
not my filesystem.

                Linus

* Re: Unicode conversion issue
From: Gabriel Krisman Bertazi @ 2024-12-11 19:45 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Linus Torvalds, Linux Kernel Mailing List, hanqi@vivo.com,
	Theodore Ts'o

Jaegeuk Kim <jaegeuk@kernel.org> writes:

> On 12/11, Gabriel Krisman Bertazi wrote:
>> Jaegeuk Kim <jaegeuk@kernel.org> writes:
>> 
>> > Hi Linus/Gabriel,
>> >
>> > Once Android applied the below patch [1], some special characters started to be
>> > converted differently resulting in different length, so that f2fs cannot find
>> > the filename correctly which was created when the kernel didn't have [1].
>> >
>> > There is one bug report in [2] where describes more details. In order to avoid
>> > this, could you please consider reverting [1] asap? Or, is there any other
>> > way to keep the conversion while addressing CVE? It's very hard for f2fs to
>> > distinguish two valid converted lengths before/after [1].
>> 
>> I got this report yesterday. I'm looking into it.
>> 
>> It seems commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable
>> code points") has affected more than ignorable code points, because that
>> U+2764 is not marked as Ignorable in the unicode database.
>> 
>> I still think the solution to the original issue is eliminating
>> ignorable code points, and that should be fine.  Let me look at why this
>> block of characters is mishandled.

I was struggling to reproduce it, until I copy-pasted the character
directly from the bugzilla:

The character the user has is ❤️, which is different than just ❤.  This
is a combination of:

U+2764 + U+FE0F  (Heavy Black Heart + Variation Selector-16)

Variation Selector-16 is an ignorable character with zero length,
exactly what we wanted to ignore with that patch.  What I didn't
consider in the original submission was that, differently from other
ignorable code-points, this block might be used intentionally in a filename.
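For illustration, the length change can be reproduced in a few lines.
The sketch below assumes valid UTF-8 input and uses a hypothetical helper,
drop_variation_selectors(), that is not a kernel API: it copies a name
while dropping every code point in U+FE00..U+FE0F, which UTF-8 encodes as
the three bytes EF B8 80 through EF B8 8F.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src to dst, skipping UTF-8 encoded variation selectors
 * (U+FE00..U+FE0F). Returns the number of bytes written.
 * Assumes src is valid UTF-8, so 0xEF can only start a sequence.
 */
static size_t drop_variation_selectors(const unsigned char *src, size_t len,
				       unsigned char *dst, size_t cap)
{
	size_t i = 0, out = 0;

	assert(cap >= len);	/* the output never grows */
	while (i < len) {
		if (i + 2 < len && src[i] == 0xEF && src[i + 1] == 0xB8 &&
		    src[i + 2] >= 0x80 && src[i + 2] <= 0x8F) {
			i += 3;	/* skip one variation selector */
			continue;
		}
		dst[out++] = src[i++];
	}
	return out;
}
```

Feeding it the six bytes E2 9D A4 EF B8 8F (U+2764 + U+FE0F) leaves the
three bytes E2 9D A4, so the same user-visible name has two different
encoded lengths depending on whether the selector is dropped.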

> Thank you so much. If it takes some time to find the root cause, may I
> propose the revert first to unblock production? The problem is quite severe
> as users cannot access their files.

We have 3 ways forward.

1) The first is to revert the patch and fix the original issue in a
different way.  That would be: We would restore the original database
and treat Ignorable codepoints as folding to themselves only when doing
string comparisons, but not when calculating hashes.  This way, the hash
will be the same, but filenames with Ignorable codepoints will be
handled as byte sequences.

2) We keep the original patch and add support in fsck to update the
hashes in volumes like the above.

3) We regenerate the database to Ignore codepoints in the code-block
FE00..FE0F.  That would be the simplest solution, but there might be
more cases that need fixing later.

At this point, I'd be leaning towards 1 or 3.  Both of them can be done
after reverting my original patch, so I'm fine with that.  Thoughts?

> Thank you so much. If it takes some time to find the root cause, may I
> propose the revert first to unblock production? The problem is quite
> severe as users cannot access their files.

I don't oppose this, considering the case at hand.  I'll base the new patch
on top of the revert.

-- 
Gabriel Krisman Bertazi

* Re: Unicode conversion issue
From: Linus Torvalds @ 2024-12-11 19:58 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jaegeuk Kim, Linux Kernel Mailing List, hanqi@vivo.com,
	Theodore Ts'o

On Wed, 11 Dec 2024 at 11:46, Gabriel Krisman Bertazi <krisman@suse.de> wrote:
>
> 1) The first is to revert the patch and fix the original issue in a
> different way.  That would be: We would restore the original database
> and treat Ignorable codepoints as folding to themselves only when doing
> string comparisons, but not when calculating hashes.  This way, the hash
> will be the same, but filenames with Ignorable codepoints will be
> handled as byte sequences.

The problem is that all the filesystems basically do some variation of

        if (IS_CASEFOLDED(dir) ..) {

                len = utf8_casefold(sb->s_encoding, orig_name,
                        new_name, MAXLEN);

and then they use that "new_name" for both hashing and for comparisons.

Which is really really annoying, not just because you hash all these
meaningless conversions (aka "I write buggy crap and call it a
filesystem"), but it also means that now your *comparison* has to deal
with the fact that the name you are comparing against isn't the
original name, so the code in generic_ci_match() that says "let's try
the simple exact match first" will fail too, because the string
comparison isn't comparing the original raw data - that almost
certainly matched.

Because even on a casefolding filesystem, 99.9% of the time you give
the *matching* name (think extremely common things like "readdir ->
stat" that is done by not just "ls -l", but *any* filesystem tree
traversal).

So the whole "let's corrupt the filename and then hash and compare
that corrupted version" is doubly wrong.

Christ. I really really hate case-folding filesystems with a passion.
The incompetence just fills me with red-hot rage.

               Linus

* Re: Unicode conversion issue
From: Linus Torvalds @ 2024-12-11 20:18 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jaegeuk Kim, Linux Kernel Mailing List, hanqi@vivo.com,
	Theodore Ts'o

On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The problem is that all the filesystems basically do some variation of
>
>         if (IS_CASEFOLDED(dir) ..) {
>
>                 len = utf8_casefold(sb->s_encoding, orig_name,
>                         new_name, MAXLEN);
>
> and then they use that "new_name" for both hashing and for comparisons.

Oh, actually, f2fs does pass in the original name to
generic_ci_match(), so I think this is solvable.

The solution involves just telling f2fs to ignore the hash if it has
seen odd characters.

So I think f2fs could actually do something like this:

  --- a/fs/f2fs/dir.c
  +++ b/fs/f2fs/dir.c
  @@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
                        /* fall back to treating name as opaque byte sequence */
                        return 0;
                }
  +             fname->ignore_hash = utf8_oddname(fname->usr_fname);
                fname->cf_name.name = buf;
                fname->cf_name.len = len;
        }
  @@ -231,7 +232,7 @@ struct f2fs_dir_entry *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
                        continue;
                }

  -             if (de->hash_code == fname->hash) {
  +             if (fname->ignore_hash || de->hash_code == fname->hash) {
                        res = f2fs_match_name(d->inode, fname,
                                              d->filename[bit_pos],
                                              le16_to_cpu(de->name_len));
  --- a/fs/f2fs/f2fs.h
  +++ b/fs/f2fs/f2fs.h
  @@ -521,6 +521,7 @@ struct f2fs_filename {

        /* The dirhash of this filename */
        f2fs_hash_t hash;
  +     bool ignore_hash;

   #ifdef CONFIG_FS_ENCRYPTION
        /*

where that "utf8_oddname()" is the one that goes "this filename
contains unhashable characters".

I didn't look very closely at what ext4 does, but it seems to already
have a pattern for "don't even look at the hash because it's not
reliable", so I think ext4 can do something similar.

So then all you actually need is that utf8_oddname() that recognizes
those ignored code-points.
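As a rough idea of what such a check could look like, here is a sketch.
utf8_oddname() does not exist in the kernel; the function below,
oddname_sketch(), is a hypothetical stand-in that decodes UTF-8 and
flags only the variation selectors U+FE00..U+FE0F, whereas the real set
of ignorable code points is larger and would come from the Unicode
database. Treating malformed input as "odd" is also an assumption, made
so that the caller falls back to not trusting the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the name should not be looked up by hash, else 0. */
static int oddname_sketch(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		uint32_t cp;
		size_t n;

		/* Pick the sequence length from the lead byte. */
		if (s[i] < 0x80) {
			cp = s[i]; n = 1;
		} else if ((s[i] & 0xE0) == 0xC0) {
			cp = s[i] & 0x1F; n = 2;
		} else if ((s[i] & 0xF0) == 0xE0) {
			cp = s[i] & 0x0F; n = 3;
		} else if ((s[i] & 0xF8) == 0xF0) {
			cp = s[i] & 0x07; n = 4;
		} else {
			return 1;	/* invalid lead byte */
		}
		if (i + n > len)
			return 1;	/* truncated sequence */

		/* Fold in the continuation bytes. */
		for (size_t k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 1;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp >= 0xFE00 && cp <= 0xFE0F)
			return 1;	/* variation selector found */
		i += n;
	}
	return 0;
}
```

For the name from the bug report (U+2764 followed by U+FE0F) this
returns 1; for U+2764 alone, or for plain ASCII, it returns 0.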

So I take it all back: option (1) actually doesn't look that bad, and
would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
ignorable code points") unnecessary.

                Linus

* Re: Unicode conversion issue
From: Gabriel Krisman Bertazi @ 2024-12-11 21:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jaegeuk Kim, Linux Kernel Mailing List, hanqi@vivo.com,
	Theodore Ts'o

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> The problem is that all the filesystems basically do some variation of
>>
>>         if (IS_CASEFOLDED(dir) ..) {
>>
>>                 len = utf8_casefold(sb->s_encoding, orig_name,
>>                         new_name, MAXLEN);
>>
>> and then they use that "new_name" for both hashing and for comparisons.
>
> Oh, actually, f2fs does pass in the original name to
> generic_ci_match(), so I think this is solvable.
>
> The solution involves just telling f2fs to ignore the hash if it has
> seen odd characters.
>
> So I think f2fs could actually do something like this:
>
>   --- a/fs/f2fs/dir.c
>   +++ b/fs/f2fs/dir.c
>   @@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
>                         /* fall back to treating name as opaque byte sequence */
>                         return 0;
>                 }
>   +             fname->ignore_hash = utf8_oddname(fname->usr_fname);
>                 fname->cf_name.name = buf;
>                 fname->cf_name.len = len;
>         }
>   @@ -231,7 +232,7 @@ struct f2fs_dir_entry
> *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
>                         continue;
>                 }
>
>   -             if (de->hash_code == fname->hash) {
>   +             if (fname->ignore_hash || de->hash_code == fname->hash) {
>                         res = f2fs_match_name(d->inode, fname,
>                                               d->filename[bit_pos],
>                                               le16_to_cpu(de->name_len));

This solves it for directories with inlined dirents
(FI_INLINE_DENTRY), but for large directories, we use fname->hash to
find the right block to start the search.  So, we'd need to walk through
the entire case-insensitive directory.  In ext4, the issue only exists
on large directories, because we don't care about the hash on small
directories.
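The limitation can be shown with a toy model. Everything below is a
hypothetical simplification, not the f2fs on-disk layout: a directory is
a handful of fixed-size buckets, the hash selects the bucket, and names
are compared with strcmp() instead of a case-insensitive match. Skipping
the per-entry hash comparison does not help if the hash already pointed
the search at the wrong bucket; only a scan over every bucket finds the
entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS   4
#define BUCKET_CAP 8

struct dirent_sketch { uint32_t hash; const char *name; };
struct bucket_sketch { struct dirent_sketch slot[BUCKET_CAP]; size_t used; };
struct dir_sketch    { struct bucket_sketch b[NBUCKETS]; };

/* Store an entry in the bucket selected by its hash. */
static void dir_add(struct dir_sketch *d, uint32_t hash, const char *name)
{
	struct bucket_sketch *bk = &d->b[hash % NBUCKETS];

	assert(bk->used < BUCKET_CAP);
	bk->slot[bk->used].hash = hash;
	bk->slot[bk->used].name = name;
	bk->used++;
}

static const char *scan_bucket(const struct bucket_sketch *bk, uint32_t hash,
			       const char *name, int ignore_hash)
{
	for (size_t i = 0; i < bk->used; i++)
		if ((ignore_hash || bk->slot[i].hash == hash) &&
		    strcmp(bk->slot[i].name, name) == 0)
			return bk->slot[i].name;
	return NULL;
}

/* Hash-directed lookup: only the bucket chosen by the hash is searched. */
static const char *dir_lookup(const struct dir_sketch *d, uint32_t hash,
			      const char *name, int ignore_hash)
{
	return scan_bucket(&d->b[hash % NBUCKETS], hash, name, ignore_hash);
}

/* Linear fallback: walk every bucket and ignore hashes entirely. */
static const char *dir_lookup_linear(const struct dir_sketch *d,
				     const char *name)
{
	for (size_t k = 0; k < NBUCKETS; k++) {
		const char *hit = scan_bucket(&d->b[k], 0, name, 1);

		if (hit)
			return hit;
	}
	return NULL;
}
```

If an entry was added with hash 5 and the lookup now computes hash 6,
dir_lookup() misses it even with ignore_hash set, because it searches a
different bucket, while dir_lookup_linear() still finds it.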


>   --- a/fs/f2fs/f2fs.h
>   +++ b/fs/f2fs/f2fs.h
>   @@ -521,6 +521,7 @@ struct f2fs_filename {
>
>         /* The dirhash of this filename */
>         f2fs_hash_t hash;
>   +     bool ignore_hash;
>
>    #ifdef CONFIG_FS_ENCRYPTION
>         /*
>
> where that "utf8_oddname()" is the one that goes "this filename
> contains unhashable characters".
>
> I didn't look very closely at what ext4 does, but it seems to already
> have a pattern for "don't even look at the hash because it's not
> reliable", so I think ext4 can do something similar.

> So then all you actually need is that utf8_oddname() that recognizes
> those ignored code-points.
>
> So I take it all back: option (1) actually doesn't look that bad, and
> would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
> ignorable code points") unnecessary.

I think we really need to revert it. The simplest way to implement
utf8_oddname is having the full database with the Ignorable code points
available. We can then add a flag in the same data structure indicating
this is an Ignorable codepoint that should be dismissed by the
utf8_strncasecmp when doing the casefold, while still using the full
string for the hash.

-- 
Gabriel Krisman Bertazi

* Re: Unicode conversion issue
From: Jaegeuk Kim @ 2024-12-11 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gabriel Krisman Bertazi, Linux Kernel Mailing List,
	hanqi@vivo.com, Theodore Ts'o

On 12/11, Linus Torvalds wrote:
> On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The problem is that all the filesystems basically do some variation of
> >
> >         if (IS_CASEFOLDED(dir) ..) {
> >
> >                 len = utf8_casefold(sb->s_encoding, orig_name,
> >                         new_name, MAXLEN);
> >
> > and then they use that "new_name" for both hashing and for comparisons.
> 
> Oh, actually, f2fs does pass in the original name to
> generic_ci_match(), so I think this is solvable.
> 
> The solution involves just telling f2fs to ignore the hash if it has
> seen odd characters.

But the hash is not just used when matching the dentry; it also gives a block
location within multi-level hash tables for faster lookup. If the
filename length is also changed by the unicode patch, won't
utf8_strncasecmp_folded() also give an error?

> 
> So I think f2fs could actually do something like this:
> 
>   --- a/fs/f2fs/dir.c
>   +++ b/fs/f2fs/dir.c
>   @@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
>                         /* fall back to treating name as opaque byte sequence */
>                         return 0;
>                 }
>   +             fname->ignore_hash = utf8_oddname(fname->usr_fname);
>                 fname->cf_name.name = buf;
>                 fname->cf_name.len = len;
>         }
>   @@ -231,7 +232,7 @@ struct f2fs_dir_entry
> *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
>                         continue;
>                 }
> 
>   -             if (de->hash_code == fname->hash) {
>   +             if (fname->ignore_hash || de->hash_code == fname->hash) {
>                         res = f2fs_match_name(d->inode, fname,
>                                               d->filename[bit_pos],
>                                               le16_to_cpu(de->name_len));
>   --- a/fs/f2fs/f2fs.h
>   +++ b/fs/f2fs/f2fs.h
>   @@ -521,6 +521,7 @@ struct f2fs_filename {
> 
>         /* The dirhash of this filename */
>         f2fs_hash_t hash;
>   +     bool ignore_hash;
> 
>    #ifdef CONFIG_FS_ENCRYPTION
>         /*
> 
> where that "utf8_oddname()" is the one that goes "this filename
> contains unhashable characters".
> 
> I didn't look very closely at what ext4 does, but it seems to already
> have a pattern for "don't even look at the hash because it's not
> reliable", so I think ext4 can do something similar.
> 
> So then all you actually need is that utf8_oddname() that recognizes
> those ignored code-points.
> 
> So I take it all back: option (1) actually doesn't look that bad, and
> would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
> ignorable code points") unnecessary.
> 
>                 Linus

* Re: Unicode conversion issue
From: Linus Torvalds @ 2024-12-11 21:25 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jaegeuk Kim, Linux Kernel Mailing List, hanqi@vivo.com,
	Theodore Ts'o

On Wed, 11 Dec 2024 at 13:11, Gabriel Krisman Bertazi <krisman@suse.de> wrote:
>
> This solves it for directories with inlined dirents
> (FI_INLINE_DENTRY). but for large directories, we use fname->hash to
> find the right block to start the search.

Grr. Dammit, the hash should always have been the original hash of the
original actual case-preserving entry.

Oh well. I'll continue to just absolutely hate case-folding, because
while I suspect that it *could* be done correctly, I have yet to ever
actually see any filesystem that did so.

            Linus

* Re: Unicode conversion issue
From: Jaegeuk Kim @ 2024-12-11 21:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gabriel Krisman Bertazi, Linux Kernel Mailing List,
	hanqi@vivo.com, Theodore Ts'o

On 12/11, Linus Torvalds wrote:
> On Wed, 11 Dec 2024 at 13:11, Gabriel Krisman Bertazi <krisman@suse.de> wrote:
> >
> > This solves it for directories with inlined dirents
> > (FI_INLINE_DENTRY). but for large directories, we use fname->hash to
> > find the right block to start the search.
> 
> Grr. Dammit, the hash should always have been the original hash of the
> original actual case-preserving entry.
> 
> Oh well. I'll continue to just absolutely hate case-folding, because
> while I suspect that it *could* be done correctly, I have yet to ever
> actually see any filesystem that did so.

Casefolding is supported on f2fs and ext4 per Android's request, and only f2fs
constructs a hash-based directory structure. If we used the hash of the
case-preserving entry, we would have no easy solution to distinguish file_A
and file_a.

One possible way might be to search the filenames sequentially through
the entire dentry list if we fail to find the newly encoded entry. But it may
need huge surgery to make it work.

> 
>             Linus

* Re: Unicode conversion issue
From: Linus Torvalds @ 2024-12-11 21:56 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Gabriel Krisman Bertazi, Linux Kernel Mailing List,
	hanqi@vivo.com, Theodore Ts'o

On Wed, 11 Dec 2024 at 13:53, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
>
> Casefolding supports f2fs and ext4 per Android request, and only f2fs
> constructs hash-based directory structure. If we use hash of the
> case-preserving entry, we had no easy solution to distinguish file_A and file_a.

I really wish people had just done case-folding as a slow case, and
not used the hash at all.

Does that mean that you have to walk the directory linearly? Yes it
does. But that's my point: you shouldn't optimize for the idiocy of
case-folding. You should optimize for the sane case, and actively try
to discourage people from doing stupid bad things.

Oh well. Too late now.

             Linus

* Re: Unicode conversion issue
From: Jaegeuk Kim @ 2024-12-11 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gabriel Krisman Bertazi, Linux Kernel Mailing List,
	hanqi@vivo.com, Theodore Ts'o

On 12/11, Linus Torvalds wrote:
> On Wed, 11 Dec 2024 at 13:53, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> >
> > Casefolding supports f2fs and ext4 per Android request, and only f2fs
> > constructs hash-based directory structure. If we use hash of the
> > case-preserving entry, we had no easy solution to distinguish file_A and file_a.
> 
> I really wish people had just done case-folding as a slow case, and
> not used the hash at all.
> 
> Does that mean that you have to walk the directory linearly? Yes it
> does. But that's my point: you shouldn't optimize for the idiocy of
> case-folding. You should optimize for the sane case, and actively try
> to discourage people from doing stupid bad things.
> 
> Oh well. Too late now.

Ok, well understood. I'll work on how we can implement the linear search for
case-folding. Meanwhile, yea, quite late so, may I ask for its revert?

> 
>              Linus

* Re: Unicode conversion issue
From: Linus Torvalds @ 2024-12-11 22:09 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Gabriel Krisman Bertazi, Linux Kernel Mailing List,
	hanqi@vivo.com, Theodore Ts'o

On Wed, 11 Dec 2024 at 14:01, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> Ok, well understood. I'll work on how we can implement the linear search for
> case-folding. Meanwhile, yea, quite late so, may I ask for its revert?

Yes, I'll revert.

Also, given that there are case-folded hashes out there, I guess we're
stuck with them, and might as well use them for lookup.

But think of all the random path-based decisions that applications do,
and that ignorable characters basically invalidate - because people
can insert characters that "don't matter" and bypass all those checks.

It's a security nightmare.

            Linus
