Questions about Unicode Normalization Form

linux-fsdevel.vger.kernel.org archive mirror

* Questions about Unicode Normalization Form
@ 2024-04-06  9:54 HAN Yuwei
  2024-04-06 13:26 ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: HAN Yuwei @ 2024-04-06  9:54 UTC
  To: linux-fsdevel


Hi, all.

Today I ran into someone else's Unicode Normalization Form (NF) problem,
and it made me wonder how Linux processes Unicode filenames.

After some searching, it seems that everybody handles this at the
user-input level, and I found nothing about how the VFS or specific
filesystems treat the problem. ZFS handles it explicitly with a
"normalization" option. Windows (or NTFS?) says "There is no need to
perform any Unicode normalization on path and file name strings".

Unicode has a dedicated FAQ about this:
https://unicode.org/faq/normalization.html

Is there any conclusion or discussion I missed?

HAN Yuwei.


* Re: Questions about Unicode Normalization Form
  2024-04-06  9:54 Questions about Unicode Normalization Form HAN Yuwei
@ 2024-04-06 13:26 ` James Bottomley
  2024-04-06 15:15   ` HAN Yuwei
  0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2024-04-06 13:26 UTC
  To: HAN Yuwei, linux-fsdevel


This question is way too broad to answer.  Why don't you look in

fs/unicode

and see where the helpers are used and then ask a more specific
question.

James



* Re: Questions about Unicode Normalization Form
  2024-04-06 13:26 ` James Bottomley
@ 2024-04-06 15:15   ` HAN Yuwei
  2024-04-08  1:39     ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: HAN Yuwei @ 2024-04-06 15:15 UTC
  To: James Bottomley, linux-fsdevel



On 2024/4/6 21:26, James Bottomley wrote:
> This question is way too broad to answer.  Why don't you look in
>
> fs/unicode

Sorry, I am not very familiar with Unicode or the kernel. Correct me if
I am wrong.

From what I have read, the kernel seems to use NFD when processing all
UTF-8 related strings.
If filesystems use these helper functions, then I can be sure the kernel
applies NFD to every UTF-8 filename.
But I can't find any references to these helper functions on the GitHub
mirror. How are they used by the fs code?

> and see where the helpers are used and then ask a more specific
> question.
>
> James
>


* Re: Questions about Unicode Normalization Form
  2024-04-06 15:15   ` HAN Yuwei
@ 2024-04-08  1:39     ` Theodore Ts'o
  2024-04-08  1:57       ` HAN Yuwei
  2024-04-08  3:30       ` Matthew Wilcox
  0 siblings, 2 replies; 7+ messages in thread
From: Theodore Ts'o @ 2024-04-08  1:39 UTC
  To: HAN Yuwei; +Cc: James Bottomley, linux-fsdevel

On Sat, Apr 06, 2024 at 11:15:36PM +0800, HAN Yuwei wrote:
> 
> Sorry, I am not very familiar with Unicode or the kernel. Correct me if
> I am wrong.
> 
> From what I have read, the kernel seems to use NFD when processing all
> UTF-8 related strings.
> If filesystems use these helper functions, then I can be sure the kernel
> applies NFD to every UTF-8 filename.
> But I can't find any references to these helper functions on the GitHub
> mirror. How are they used by the fs code?

For the most part, the kernel's file system code doesn't do anything
special for Unicode.  The exception is that the ext4 and f2fs file
systems have an optional feature, used mostly by Android systems, to
support case-insensitive lookups.  This is called the "casefold"
feature, and it is not enabled by default on most desktop or server
systems.

The casefold feature was developed because Android has a requirement
to support case-insensitive lookups, and it had to support Unicode
character sets.  (XFS, for example, has had support for
case-insensitive lookups since the Irix days, but it only supports
ASCII.)  The alternative to adding case-folding support in the kernel
was a terrible out-of-tree kernel module which used file system
wrapping and was deadlock-prone (which is why the case-folding wrapfs
would never be accepted upstream; it was a trash fire).  Anyway, I got
tired of being asked to debug file system deadlocks which were not the
VFS's fault, but were rather caused by this terrible wrapfs kludge used
by Android, so I instigated proper case-folding support (a la Windows
and MacOS) for the file system types commonly used by Android, namely
ext4 and f2fs.

So *if* you are using ext4 or f2fs, *and* the file system was created
with the "casefold" file system feature flag, *and* the directory has
the casefold flag set, *then* the file system will support
case-preserving, case-insensitive lookups.  As a side effect of using
utf8_strcasecmp(), lookups will also succeed even if you have not
normalized the file names.  For example, if the filename contains the
NFC form of the Angstrom Sign character (00C5), and you try to look it
up using the NFD form of the character (0041 030A), the lookup will
succeed.  However, this is *only* if case folding is enabled, and in
general, it isn't.
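
As a rough userspace illustration of that side effect (this is not the
kernel's utf8_strcasecmp(); the helper name casefold_equal is
hypothetical, and Python's unicodedata tables may differ in detail from
the ones in fs/unicode), a comparison that decomposes and case-folds
both names before comparing treats the NFC and NFD spellings as equal:

```python
import unicodedata


def casefold_equal(a: str, b: str) -> bool:
    """Compare two names ignoring case and normalization form.

    Hypothetical helper: an approximation of a casefolded lookup,
    not a port of the kernel code.
    """
    def key(s: str) -> str:
        # Decompose, fold case, then decompose again, because case
        # folding can yield sequences that are not normalized.
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)

    return key(a) == key(b)


nfc = "\u00c5ngstr\u00f6m.txt"            # precomposed characters
nfd = unicodedata.normalize("NFD", nfc)   # base letters + combining marks

print(nfc == nfd)                         # plain comparison: False
print(casefold_equal(nfc, nfd))           # True
print(casefold_equal(nfc, nfd.upper()))   # True
```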

Aside from this exception (which, as I said, is in general only enabled
for Android, because most other use cases such as Desktop, Server,
etc. don't really care about MacOS / Windows style case-insensitive
filename lookups), the Linux VFS in general treats UTF-8 file names as
null-terminated byte streams.  So the kernel doesn't validate to make
sure that a file name is composed of valid UTF-8 code points (e.g., so
we don't prohibit the use of Klingon characters which are not
recognized by the Unicode consortium), nor does the kernel do any kind
of Unicode normalization.  So for example, if casefolding is not
enabled, 0041 030A and 00C5 will be considered different, and the
kernel will not force the NFC form (00C5) to the NFD form (0041 030A)
or vice versa.
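
To make the byte-stream point concrete, here is a small sketch (the
file names are made up for illustration) showing that the two
spellings encode to different UTF-8 byte sequences, so a bytewise
comparison sees two different names:

```python
import unicodedata

nfc_name = "\u00c5.txt"    # U+00C5
nfd_name = "A\u030a.txt"   # U+0041 U+030A

nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")

print(nfc_bytes)                # b'\xc3\x85.txt'
print(nfd_bytes)                # b'A\xcc\x8a.txt'
print(nfc_bytes == nfd_bytes)   # False: different names to a byte compare

# Only after normalizing in userspace do the two names agree.
print(unicodedata.normalize("NFC", nfd_name) == nfc_name)   # True
```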

Now, because the kernel tries very hard to be blissfully ignorant
about the nightmare which is I18N, it is up to the userspace Unicode
libraries to normalize strings before passing them to the kernel ---
either as data in text files, or as file names.  I am very glad that I
don't have to worry about whether the standard normalization form used
by the various GNOME, KDE, Unicode, etc., userspace libraries is NFD,
NFC, NFKD, or NFKC.  That's someone else's problem, and if you don't
have casefolding enabled, we will do the filename comparisons using
the strcmp() function.
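
One way userspace might take on that job is sketched below.  The
helper normalized_path is hypothetical, and the choice of NFC as the
default is only an assumption for the example, not a recommendation
from this thread:

```python
import os
import unicodedata


def normalized_path(directory: str, name: str, form: str = "NFC") -> str:
    """Return directory/name with the file name normalized.

    Hypothetical helper: normalize every name the same way before it
    reaches open(), stat(), etc., so that bytewise comparison in the
    kernel sees identical bytes for equivalent spellings.
    """
    return os.path.join(directory, unicodedata.normalize(form, name))


# Both spellings end up as the same path string.
print(normalized_path("/tmp", "A\u030a.txt") ==
      normalized_path("/tmp", "\u00c5.txt"))   # True
```

An application would have to apply the same form everywhere it builds
a path; names created by other programs are not affected.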

Fundamentally, Unicode normalization is a userspace problem, not a
kernel problem, except when we don't have a choice (such as for
case-insensitive lookups).  And there we solve just the smallest part
of the problem, and make it userspace's problem for everything else.

Cheers,

					- Ted


* Re: Questions about Unicode Normalization Form
  2024-04-08  1:39     ` Theodore Ts'o
@ 2024-04-08  1:57       ` HAN Yuwei
  2024-04-08  3:30       ` Matthew Wilcox
  1 sibling, 0 replies; 7+ messages in thread
From: HAN Yuwei @ 2024-04-08  1:57 UTC
  To: Theodore Ts'o; +Cc: James Bottomley, linux-fsdevel




Thanks for your time and patient explanation. I have learned a lot
about this history.

Do you think it would be appropriate to add this to the kernel
documentation? If so, I can write a patch for it.

HAN Yuwei



* Re: Questions about Unicode Normalization Form
  2024-04-08  1:39     ` Theodore Ts'o
  2024-04-08  1:57       ` HAN Yuwei
@ 2024-04-08  3:30       ` Matthew Wilcox
  2024-04-08 14:15         ` Theodore Ts'o
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2024-04-08  3:30 UTC
  To: Theodore Ts'o; +Cc: HAN Yuwei, James Bottomley, linux-fsdevel

On Sun, Apr 07, 2024 at 09:39:28PM -0400, Theodore Ts'o wrote:
> On Sat, Apr 06, 2024 at 11:15:36PM +0800, HAN Yuwei wrote:
> > 
> > Sorry, I am not very familiar with Unicode or the kernel. Correct me if
> > I am wrong.
> > 
> > From what I have read, the kernel seems to use NFD when processing all
> > UTF-8 related strings.
> > If filesystems use these helper functions, then I can be sure the kernel
> > applies NFD to every UTF-8 filename.
> > But I can't find any references to these helper functions on the GitHub
> > mirror. How are they used by the fs code?
> 
> For the most part, the kernel's file system code doesn't do anything
> special for Unicode.  The exception is that the ext4 and f2fs file
> systems can have an optional feature which is mostly only used by
> Android systems to support case insensitive lookups.  This is called
> the "casefold" feature, which is not enabled by default by most
> desktop or server systems.

As I understand it, an important usecase for the casefold feature is
running Windows games under WINE.  I don't do this myself (sgt-puzzles
is more my speed), but there's a pretty important market for this.
Wasn't this why Gabriel was funded to work on it (eg commit b886ee3e778e)?
Or was that the Android usecase?


* Re: Questions about Unicode Normalization Form
  2024-04-08  3:30       ` Matthew Wilcox
@ 2024-04-08 14:15         ` Theodore Ts'o
  0 siblings, 0 replies; 7+ messages in thread
From: Theodore Ts'o @ 2024-04-08 14:15 UTC
  To: Matthew Wilcox; +Cc: HAN Yuwei, James Bottomley, linux-fsdevel

On Mon, Apr 08, 2024 at 04:30:55AM +0100, Matthew Wilcox wrote:
> As I understand it, an important usecase for the casefold feature is
> running Windows games under WINE.  I don't do this myself (sgt-puzzles
> is more my speed), but there's a pretty important market for this.
> Wasn't this why Gabriel was funded to work on it (eg commit b886ee3e778e)?
> Or was that the Android usecase?

Good point.  Your history is correct; the other use case, which Gabriel
was funded to do the work for, was Steam for Linux, which uses a
fork of Wine called Steam Play.

The other potential use case for casefold is that it would accelerate
Samba servers, which first try a lookup on the filename and, if that
returns ENOENT, have to do an O(n) readdir search to see if there is
a case-insensitive match for the given name.  I haven't heard of
anyone who has actually configured their CIFS server to do this, but
it should work.
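
The fallback described above can be sketched in userspace roughly as
follows.  This is not Samba's code; lookup_case_insensitive is a
hypothetical name, and str.casefold() stands in for whatever matching
rules a real server would apply:

```python
import os
from typing import Optional


def lookup_case_insensitive(directory: str, name: str) -> Optional[str]:
    """Find `name` in `directory`, ignoring case if the exact name is absent."""
    # Fast path: the exact name exists, so no directory scan is needed.
    exact = os.path.join(directory, name)
    if os.path.lexists(exact):
        return exact

    # Slow path: linear scan over every directory entry.
    wanted = name.casefold()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.casefold() == wanted:
                return entry.path
    return None
```

On a casefolded directory the first lookup would already succeed for
any case variant, so the linear scan would be skipped.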

						- Ted
