casefold is using unsuitable case mapping table

public inbox for linux-fsdevel@vger.kernel.org

* casefold is using unsuitable case mapping table
@ 2025-04-22 12:31 Björn JACKE
  2025-04-24 19:53 ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 4+ messages in thread
From: Björn JACKE @ 2025-04-22 12:31 UTC (permalink / raw)
  To: linux-fsdevel

Hi,

I started to experiment with the casefold feature of ext4 and some other
filesystems. I was hoping to get some significant performance gains for Samba
server with large directories.

It turns out, though, that the case-insensitive feature is not usable because it
does not match the case mapping tables that other operating systems use. More
specifically, the German letter "ß" is treated as a case equivalent of "ss".

"ß" and "ss" are treated as equivalent in some other contexts; AD LDAP, for
example, treats them as equivalent. Systems that require "lossless" case
conversion, however, should not treat ß and ss as equivalent. This is also why
a filesystem should never ever do that.

Since 2017 there is a well-defined uppercase version of the codepoint (U+00DF)
of the "ß" letter in Unicode: U+1E9E, this could eventually be used but I
haven't seen any filesystem using that so far. This would be a possible and
lossless case equivalent, but well, that's actually another thing to discuss.

The important point is to _not_ use the ß/ss case equivalence. The casefold
feature is mostly useless otherwise.

Can this be changed without causing too much hassle?

Cheers
Björn


* Re: casefold is using unsuitable case mapping table
  2025-04-22 12:31 casefold is using unsuitable case mapping table Björn JACKE
@ 2025-04-24 19:53 ` Gabriel Krisman Bertazi
  2025-04-25 11:40   ` Björn JACKE
  0 siblings, 1 reply; 4+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-04-24 19:53 UTC (permalink / raw)
  To: Björn JACKE; +Cc: linux-fsdevel

Björn JACKE <bjacke@SerNet.DE> writes:

> It turns out, though, that the case-insensitive feature is not usable because it
> does not match the case mapping tables that other operating systems use. More
> specifically, the German letter "ß" is treated as a case equivalent of "ss".
>
> "ß" and "ss" are treated as equivalent in some other contexts; AD LDAP, for
> example, treats them as equivalent. Systems that require "lossless" case
> conversion, however, should not treat ß and ss as equivalent. This is also why
> a filesystem should never ever do that.

Well, filesystems should never ever have filename encoding.  Once
they do, we have to make semantics decisions and they are all apparently
stupid to someone.  And any kind of Casefolding is an inherently lossy
operation in this sense, as multiple byte sequences will map to the
same file.

The big problem is that each of the big OS vendors chose specific
semantics of what to casefold.  APFS does NFD + full casefolding[1],
right?  except for "some code-points". I'm not sure what they do with ß,
tbh. I could never find any documentation on the specific code-points
they add/ignore.

In ext4, we decided to have no exceptions. Just do plain NFD + CF.  That
means we do C+F from the table below:

  https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt

Which includes ß->ss.  We could argue forever whether that doesn't make
sense for language X, such as German.  I'm not a German speaker but
friends said it would be common to see straße uppercased to STRASSE there,
even though the 2017 agreement abolished it in favor of ẞ.  So what is
the right way?
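
As a rough illustration of what "NFD + CF" means for name lookups, here is a
minimal Python sketch. It assumes Python's str.casefold() (full case folding
from whatever Unicode version the interpreter ships, not necessarily 12.1.0)
as a stand-in for the C+F entries. fold_name and same_file are hypothetical
helper names, and this is not the kernel implementation.

```python
import unicodedata


def fold_name(name: str) -> str:
    """Approximate 'NFD + full case folding' of a filename."""
    decomposed = unicodedata.normalize("NFD", name)
    folded = decomposed.casefold()  # full folding, includes the ß -> ss rule
    # Normalize again, since folding can produce non-normalized sequences.
    return unicodedata.normalize("NFD", folded)


def same_file(a: str, b: str) -> bool:
    """True if both names would compare equal under the sketch above."""
    return fold_name(a) == fold_name(b)


if __name__ == "__main__":
    print(same_file("straße", "STRASSE"))
    print(same_file("\u00c9cole", "e\u0301cole"))
```

Under these assumptions "straße" and "STRASSE" compare equal, as do the
precomposed and decomposed spellings of "École".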

My point is we can't rely fully on languages to argue the right
semantics.  There are no right semantics.  And Languages are also alive
and changing. There are many other examples where full casefold will
look stupid; for instance, one would argue we should also translate the
T column (i.e non-Turkish languages).  We can produce all sorts of
stupid examples with combining characters in Portuguese/Spanish too.
Linux is not broken beyond the fact the whole idea is broken.  These are
just the semantics we agreed were slightly less insane back in 2017
(considering we don't want to have locales).  And, apart from the
ignorable code points issue, I still think our implementation is
relatively sane.

> Since 2017 there is a well-defined uppercase version of the codepoint (U+00DF)
> of the "ß" letter in Unicode: U+1E9E, this could eventually be used but I
> haven't seen any filesystem using that so far. This would be a possible and
> lossless case equivalent, but well, that's actually another thing to
> discuss.
>
> The important point is to _not_ use the ß/ss case equivalence. The casefold
> feature is mostly useless otherwise.

It is not useless.  Android and Wine emulators have been using it just
fine for years.  We also cannot break compatibility for them.

> Can this be changed without causing too much hassle?

We attempted to do a much smaller change recently in commit
5c26d2f1d3f5, because we assumed no one would be trying to create files
with silly stuff like ZWSP (U+200B). Turns out there is a reasonable
use-case for that with Variation Selectors, and we had to revert it.  So
we need to be very careful with any changes here, so people don't lose
access to their files on a kernel update.  Even with that, more
casefolding flavors will cause all sorts of compatibility issues when
moving data across volumes, so I'd be very wary of having more than one
flavor.

What are the exact requirements for samba?  Do you only fold the C
column? Do you need stuff like compatibility normalization?

 [1] https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf

-- 
Gabriel Krisman Bertazi


* Re: casefold is using unsuitable case mapping table
  2025-04-24 19:53 ` Gabriel Krisman Bertazi
@ 2025-04-25 11:40   ` Björn JACKE
  2025-06-09 18:12     ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 4+ messages in thread
From: Björn JACKE @ 2025-04-25 11:40 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: linux-fsdevel

On 2025-04-24 at 15:53 -0400 Gabriel Krisman Bertazi sent off:
> The big problem is that each of the big OS vendors chose specific
> semantics of what to casefold.  APFS does NFD + full casefolding[1],
> right?  except for "some code-points". I'm not sure what they do with ß,
> tbh. I could never find any documentation on the specific code-points
> they add/ignore.

Apple basically stores filenames in NFD and does casefolding, but without the
lossy folding rules that make "ß" and "ss" equal. I have written up an overview
of filesystems and their encodings at
https://www.j3e.de/linux/convmv/man/#Filesystem-issues - that might also be
interesting for the discussion.

> In ext4, we decided to have no exceptions. Just do plain NFD + CF.  That
> means we do C+F from the table below:
> 
>   https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt
> 
> Which includes ß->ss.  We could argue forever whether that doesn't make
> sense for language X, such as German.  I'm not a German speaker but
> friends said it would be common to see straße uppercased to STRASSE there,
> even though the 2017 agreement abolished it in favor of ẞ.  So what is
> the right way?

I am a German speaker, so I can shed light on that. "ß" and "ss" are definitely
not equal. If your name is "Groß", this is a different name than "Gross". The
word "Maße" exists and the word "Masse" exists, and they mean completely
different things. The only thing to say here is that people without that letter
on the keyboard often use "ss" as a fallback, just like writing "ae" is a common
fallback for writing "ä". In a filesystem they should not be mapped to the
same file.

The main mistake made when casefolding was introduced in the Linux kernel was
to use all of the cases listed in
https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt
If you grep for all the F-flagged cases there (grep " F;") you will get 104
"casefold" rules, which are essentially bogus for filesystem casefolding. They
mainly reduce the number of valid codepoints for filenames. Apart from the
German "ß" they also contain ligatures and combinations of Greek letters, which
are being "equalized". All of those reduced codepoints can be unique characters
of filenames on case-insensitive Windows or Apple filesystems; they are not
considered for casefolding in any way, except for the "simple" (S-flagged)
casefolding of the corresponding codepoint.

Those F-flagged case foldings make sense for cases like CTRL-F in browsers:
there you want to find places where a "fi" ligature (ﬁ) is used if you search
for "fi". In filenames, however, you need to be able to use both. At least this
is what all operating systems with case-insensitive filesystems do (except for
Linux until now).
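
To make the collision concrete, here is a small Python sketch that groups
names by their fully folded form. It again assumes Python's str.casefold()
as an approximation of the C+F rules; colliding_names is a hypothetical
helper, and normalization is left out to keep the sketch short.

```python
from collections import defaultdict


def colliding_names(names):
    """Group names by full case folding and keep only groups that collide."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    # "\ufb01" is the single-codepoint "fi" ligature.
    names = ["Maße", "Masse", "\ufb01le.txt", "file.txt", "README"]
    for key, group in colliding_names(names).items():
        print(key, group)
```

With full folding, "Maße"/"Masse" and the ligature/plain spellings of
"file.txt" each end up under one key, so only one name of each pair could
exist in such a directory.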

> My point is we can't rely fully on languages to argue the right
> semantics.  There are no right semantics.  And Languages are also alive
> and changing. There are many other examples where full casefold will
> look stupid; for instance, one would argue we should also translate the
> T column (i.e non-Turkish languages).

The Turkish language with the dotted/dotless i/I is a very special and
exceptional case; case-insensitive matching is not done for it in any other
case-insensitive filesystem implementation. The i/I case doesn't really matter
in this discussion.

> It is not useless.  Android and Wine emulators have been using it just
> fine for years.  We also cannot break compatibility for them.

I understand that we can't break compatibility with it but we should try to
find a way to improve the current situation, which is far from being good.

> > Can this be changed without causing too much hassle?
> 
> We attempted to do a much smaller change recently in commit
> 5c26d2f1d3f5, because we assumed no one would be trying to create files
> with silly stuff like ZWSP (U+200B). Turns out there is a reasonable
> use-case for that with Variation Selectors, and we had to revert it.  So
> we need to be very careful with any changes here, so people don't lose
> access to their files on a kernel update.  Even with that, more
> casefolding flavors will cause all sorts of compatibility issues when
> moving data across volumes, so I'd be very wary of having more than one
> flavor.

Especially because files should also be movable from other platforms, we
should stay very close to what other platforms do here. The fact that our
casefolding significantly reduces the number of possible codepoints (the
104 F-flagged ones) causes a major interoperability problem.

> What are the exact requirements for samba?  Do you only fold the C
> column? Do you need stuff like compatibility normalization?

For Samba it's required that we don't have a reduced set of valid Unicode
characters. That means that the F-flagged mappings must not be used. The
Turkish T mapping should also not be used.
Mappings we should use:
- the "C"ommon and
- the "S"imple
flagged mappings from the Unicode mapping table only.
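
The difference between the two selections can be sketched in Python by
parsing a few lines in the CaseFolding.txt format (code; status; mapping)
and building one table from C+F and one from C+S. The SAMPLE excerpt is a
hand-picked subset rather than the whole file, and build_table and fold are
hypothetical helper names.

```python
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
"""


def build_table(text, statuses):
    """Map each source character to its folding for the chosen status flags."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            target = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = target
    return table


def fold(name, table):
    """Fold a name character by character; unmapped characters stay as-is."""
    return "".join(table.get(ch, ch) for ch in name)


FULL = build_table(SAMPLE, {"C", "F"})    # selection described for ext4
SIMPLE = build_table(SAMPLE, {"C", "S"})  # selection requested for Samba

if __name__ == "__main__":
    print(repr(fold("\u00df", FULL)), repr(fold("\u00df", SIMPLE)))
    print(repr(fold("\u1e9e", SIMPLE)), repr(fold("\ufb01", SIMPLE)))
```

With the C+S table, "ß" and the "fi" ligature stay distinct characters,
while U+1E9E folds to "ß", so no valid codepoint is lost.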

I understand that it's difficult to change this as we store hashes of the
current lowercase version of the filenames. I'm not enough of an expert in the
filesystem code to come up with a good idea for how to solve this, though.
Perhaps we can use different versions of the casefolding tables and store in
the filesystem which version to use?


* Re: casefold is using unsuitable case mapping table
  2025-04-25 11:40   ` Björn JACKE
@ 2025-06-09 18:12     ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 4+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-06-09 18:12 UTC (permalink / raw)
  To: Björn JACKE; +Cc: linux-fsdevel

Björn JACKE <bjacke@SerNet.DE> writes:

> I understand that it's difficult to change this as we store hashes of the
> current lowercase version of the filenames. I'm not enough of an expert in the
> filesystem code to come up with a good idea for how to solve this, though.
> Perhaps we can use different versions of the casefolding tables and store in
> the filesystem which version to use?

Regardless of the endless discussion about which code-points to fold or
not fold, which we've been having for years already, we must preserve
existing behavior for existing users, i.e. preserve semantics and disk
names and hashes.  Since the different semantics are a requirement for
SMB, we should envision a way to provide both maps side-by-side.

I suggest we do it as a separate unicode_map that filesystems can opt into
through a flag in utf8_load.  It should be easy to generate the extra
map and this guarantees we won't break existing users.  I can take a
look in the next few weeks.

-- 
Gabriel Krisman Bertazi

