public inbox for linux-xfs@vger.kernel.org
From: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
To: Dave Chinner <david@fromorbit.com>
Cc: Ben Myers <bpm@sgi.com>, tinguely@sgi.com, olaf@sgi.com, xfs@oss.sgi.com
Subject: Re: [RFC] Unicode/UTF-8 support for XFS
Date: Fri, 12 Sep 2014 13:45:39 -0400
Message-ID: <20140912174538.GD978@meili>
In-Reply-To: <20140912100230.GB4267@dastard>

On Fri, Sep 12, 2014 at 08:02:30PM +1000, Dave Chinner wrote:
> On Thu, Sep 11, 2014 at 03:37:35PM -0500, Ben Myers wrote:
...
> > When comparing unicode strings for equality, normalization comes into play:
> > we must compare the normalized forms of strings, not just the raw sequences
> > of bytes. There are a number of defined normalization forms for unicode.
> > We decided on a variant of NFKD we call NFKDI. NFD was chosen over NFC,
> > because calculating NFC requires calculating NFD first, followed by an
> > additional step. NFKD was chosen over NFD because this makes filenames
> > that ought to be equal compare as equal.
> 
> But are they really equal?
> 
> Choosing *compatibility* decomposition over *canonical*
> decomposition means that compound characters and formatting
> distinctions don't affect the hash. i.e. "of'fi'ce", "o'ffi'ce" and
> "office" all hash and compare as the same name, but then they get
> stored on disk unnormalised. So they are the "same" in memory, but
> very different on disk.
> 
> I note that the unicode spec says this for normalised forms
> (11.1):
> 
> "A normalized string is guaranteed to be stable; that is, once
> normalized, a string is normalized according to all future versions
> of Unicode."
> 
> So if we store normalised strings on disk, they are guaranteed to
> be compatible with all future versions of unicode and anything that
> goes to use them. So why wouldn't we store normalised forms on disk?

I've had a very similar discussion about normalization in ZFS.  Sadly, I
can't find where it happened so I can't point you to it.  One interesting
point that I remember is that storing the original form may be less
surprising to an application: the name it reads back is the same one it
supplied at creation.  (Granted, if a file with an equivalent name already
exists, the application will read back that file's stored form instead.)
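
To make the trade-off concrete, here is a rough sketch in Python.  It is only
an illustration, not the XFS or ZFS code: the Directory class and norm_key()
are hypothetical names, and plain NFKD from Python's unicodedata module stands
in for the NFKDI variant from the RFC.  Lookups compare the normalized key,
while the name is stored exactly as the caller supplied it.

```python
import unicodedata


def norm_key(name: str) -> str:
    """Comparison key: compatibility decomposition (plain NFKD, not NFKDI)."""
    return unicodedata.normalize("NFKD", name)


class Directory:
    """Toy directory: matches names by normalized key, stores them as given."""

    def __init__(self) -> None:
        # normalized key -> name exactly as supplied by the first creator
        self._entries: dict[str, str] = {}

    def create(self, name: str) -> str:
        # If an equivalent name already exists, the existing stored form wins.
        return self._entries.setdefault(norm_key(name), name)

    def listdir(self) -> list[str]:
        return list(self._entries.values())


d = Directory()
ligature = "o\ufb03ce"         # "office" spelled with U+FB03 (ffi ligature)
first = d.create(ligature)     # stored and read back exactly as supplied
second = d.create("office")    # equivalent under NFKD: existing entry is found

# Canonical decomposition (NFD) leaves the ligature alone, so under NFD the
# two spellings would remain distinct names.
nfd_differs = unicodedata.normalize("NFD", ligature) != "office"
```

Here the first creator reads back the ligature spelling it wrote, and the
second caller, which asked for plain "office", gets the ligature spelling
too.  Storing norm_key(name) instead would give every caller the decomposed
form, whatever it supplied.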

Just FWIW.

Jeff.

-- 
Only two things are infinite, the universe and human stupidity, and I'm not
sure about the former.
		- Albert Einstein
