linux-ext4.vger.kernel.org archive mirror
From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk>,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH RESEND v2 00/25] Ext4 Encoding and Case-insensitive support
Date: Fri, 12 Oct 2018 15:24:01 -0400
Message-ID: <20181012192401.GA20322@thunk.org>
In-Reply-To: <20181011222359.GB24824@magnolia>

On Thu, Oct 11, 2018 at 03:23:59PM -0700, Darrick J. Wong wrote:
> 
> Hmmm, I'm curious, why pick NFKD specifically?  AFAICT Linux userspace
> environments (I only tried with GNOME and KDE) use NF[K]C....
>
> Is there a particular reason you picked NFKD?  Ohhh, right, because this
> series is a derivative of the ~2014 XFS case folding patchset.  Hmm, so
> looking at the ext4 changes, I guess what you do is add a custom ->d_hash
> function so that the dentries are hashed by hash(nfkd(fname))?  Which
> makes it easy to have link() look for names that will conflict after
> normalization?

This would be true for NFKC or NFC as well though, right?  So the
tradeoff of NF[K]C vs NF[K]D is that NFC is more efficient from an
encoding perspective.  For e with an acute accent, NFC would encode it
as C3 A9, while NFD would encode it as 65 CC 81.  So from an encoding
perspective there would be a benefit to use 'C' versus 'D'.  But MacOS
X by default canonicalizes to 'D', not 'C'.  I assume that's the
rationale for using NFKD versus NFKC?
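
To make the byte counts concrete, here is a minimal sketch using
Python's standard unicodedata module.  It takes U+00E9 (e with an acute
accent, whose UTF-8 encoding is the C3 A9 quoted above) and prints the
UTF-8 bytes of its NFC and NFD forms.  The helper name utf8_hex is made
up for this illustration and is not part of the patch series.

```python
import unicodedata

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, in its precomposed form.
composed = "\u00e9"

nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)


def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))


print("NFC:", utf8_hex(nfc))  # C3 A9    (one code point, two bytes)
print("NFD:", utf8_hex(nfd))  # 65 CC 81 ('e' plus U+0301, three bytes)
```

So for this character the decomposed form costs one extra byte on disk.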

As far as the 'K' versus "non-K" distinction, I imagine the main issue
is that a user could cut and paste something like "She\uFB03eld" (where
U+FB03 is the "ffi" ligature), which it makes sense to canonicalize to
"Sheffield".  This is *not* a canonicalization which MacOS X does (it
uses NFD, not NFKD) but from a compatibility perspective, it's not a
problem since:

NFD:	Sheffield -> Sheffield
	She\uFB03eld -> She\uFB03eld

NFKD:	Sheffield -> Sheffield
	She\uFB03eld -> Sheffield
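
The table above can be reproduced from userspace.  This is only a
sketch with Python's unicodedata module, not the kernel's utf8n code,
and the variable names are chosen for the illustration:

```python
import unicodedata

plain = "Sheffield"
ligature = "She\ufb03eld"  # U+FB03 LATIN SMALL LIGATURE FFI

# Normalize both spellings under both forms and show the results with
# non-ASCII characters escaped, so the ligature stays visible.
for form in ("NFD", "NFKD"):
    for name in (plain, ligature):
        out = unicodedata.normalize(form, name)
        print(f"{form}:\t{ascii(name)} -> {ascii(out)}")
```

NFD leaves the ligature alone because U+FB03 only has a compatibility
decomposition; NFKD applies it and yields plain "ffi".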

Given it's really painful to type the string She\uFB03eld into a
terminal, it seems to make sense that even if the user tries to create
a file with that string, the actual file name that gets created should
be "Sheffield".
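
As a rough userspace model of that behaviour (emphatically not how the
patch series implements it; ToyDirectory and its methods are
hypothetical names invented here), a directory that stores and looks up
names by their NFKD form could be sketched like this:

```python
import unicodedata


class ToyDirectory:
    """Toy in-memory directory that keys entries by their NFKD form."""

    def __init__(self):
        self.entries = {}  # normalized name -> file contents

    @staticmethod
    def key(name):
        # Assumption for this sketch: the stored name is simply NFKD(name).
        return unicodedata.normalize("NFKD", name)

    def create(self, name, data=b""):
        k = self.key(name)
        if k in self.entries:
            # Two spellings that normalize identically conflict.
            raise FileExistsError(k)
        self.entries[k] = data
        return k  # the name that actually ends up "on disk"

    def lookup(self, name):
        return self.entries[self.key(name)]


d = ToyDirectory()
stored = d.create("She\ufb03eld", b"hello")
print(ascii(stored))           # 'Sheffield'
print(d.lookup("Sheffield"))   # b'hello'
```

A later create("Sheffield") on the same directory raises
FileExistsError, since both spellings map to the same key.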

And hence, that's the argument for why the best on-disk encoding for
Linux file systems should be NFKD.

Does that seem right to everyone?

						- Ted


Thread overview: 35+ messages
2018-09-24 21:56 [PATCH RESEND v2 00/25] Ext4 Encoding and Case-insensitive support Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 01/25] nls: Wrap uni2char/char2uni callers Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 02/25] nls: Wrap charset field access Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 03/25] nls: Wrap charset hooks in ops structure Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 04/25] nls: Split default charset from NLS core Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 05/25] nls: Split struct nls_charset from struct nls_table Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 06/25] nls: Add support for multiple versions of an encoding Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 07/25] nls: Implement NLS_STRICT_MODE flag Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 08/25] nls: Let charsets define the behavior of tolower/toupper Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 09/25] nls: Add new interface for string comparisons Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 10/25] nls: Add optional normalization and casefold hooks Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 11/25] nls: ascii: Support validation and normalization operations Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 12/25] nls: utf8n: Add unicode character database files Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 13/25] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 14/25] nls: utf8: Move nls-utf8{,-core}.c Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 15/25] nls: utf8: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 16/25] nls: utf8n: reduce the size of utf8data[] Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 17/25] nls: utf8: Integrate utf8 normalization code with utf8 charset Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 18/25] nls: utf8: Introduce test module for normalized utf8 implementation Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 19/25] vfs: Handle case-exact lookup in d_add_ci Gabriel Krisman Bertazi
2018-10-07 18:09   ` Theodore Y. Ts'o
2018-10-09 14:40     ` Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 20/25] ext4: Include encoding information in the superblock Gabriel Krisman Bertazi
2018-10-11 22:26   ` Darrick J. Wong
2018-10-12 15:36     ` Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 21/25] ext4: Add encoding mount options Gabriel Krisman Bertazi
2018-10-07 19:22   ` Theodore Y. Ts'o
2018-10-09 14:53     ` Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 22/25] ext4: Support encoding-aware file name lookups Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 23/25] ext4: Implement encoding-aware dcache hooks Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 24/25] ext4: Implement EXT4_CASEFOLD_FL flag Gabriel Krisman Bertazi
2018-09-24 21:56 ` [PATCH RESEND v2 25/25] docs: ext4.rst: Document encoding and case-insensitive lookups Gabriel Krisman Bertazi
2018-10-11 22:23 ` [PATCH RESEND v2 00/25] Ext4 Encoding and Case-insensitive support Darrick J. Wong
2018-10-12 15:29   ` Gabriel Krisman Bertazi
2018-10-12 19:24   ` Theodore Y. Ts'o [this message]
