From: Theodore Tso <tytso@MIT.EDU>
To: Kevin Ballard <kevin@sb.org>
Cc: git@vger.kernel.org
Subject: Re: git on MacOSX and files with decomposed utf-8 file names
Date: Tue, 22 Jan 2008 21:06:07 -0500 [thread overview]
Message-ID: <20080123020607.GC1320@mit.edu> (raw)
In-Reply-To: <E6F76F93-24C9-4D10-813C-770A9C3A9828@sb.org>
On Tue, Jan 22, 2008 at 07:38:04PM -0500, Kevin Ballard wrote:
> * Any new characters added to Unicode will only have one form (decomposed),
> so HFS+ will always accept new characters as they will be NFD. The only
> exception is case-sensitivity, as the case-folding tables in HFS+ are
> static, so new characters with case variants will be treated in a
> case-sensitive manner. However, as they are already decomposed, the NFD
> algorithm will not change their encoding. This means that no, there are
> zero problems moving HFS+ drives between versions of OS X.
Except there *are* problems, because this promise doesn't apply to
Unicode 2.1 (Mac OS 10.2 and before) and Unicode 3.2 (Mac OS 10.3 and
above). And there were changes between the normalization algorithm
between Unicode 3.2 and the Unicode version 4.1. So taking a hard
drive between Mac OS X 10.2 and 10.3 *will* cause problems. The
guarantees of Unicode stability didn't come until well past Unicode
2.1.
Also, I know of no guarantee that there will be no more new
compositions. According to Unicode Stnadard Annex #15
(http://unicode.org/reports/tr15/), new characters that can be
decomposed are strongly discouraged, but "It would be possible to add
more compositions in a future version of Unicode". Got a reference to
back up your claim that there will never be any more?
> * At the time HFS+ was developed, there was no one common standard for
> normalization. The HFS+ developers picked NFD because they thought it was
> "a more flexible, future-looking form", but Microsoft ended up picking the
> opposite just a short time later. Interestingly, NFC is a weird hybrid form
> which only has composed forms for pre-existing characters, and decomposed
> forms for all new characters (as they only have one form). So in a sense
> NFD is more sane then NFC.
NFC is better if you care about compatibility with existing legacy
character sets, where you want round-trip conversions to be
idempotent. On the other hand, given that Mac OS has historically
never cared about being compatible with the rest of the world, it
makes sense that it would choose NFD.
> * The core issue here, which is why you think HFS+ is so stupid, is that
> you guys see no problem with having 2 files "Märchen" (NFC) and "Märchen"
> (NFD), whereas the HFS+ developers don't consider it acceptable to have 2
> visually identical names as independent files.
Yep. No problems to do that. You seem to think that supporting
Unicode requires imposing this constraint, but that's simply not true,
except maybe in some kind of religious sense.
> Unfortunately, the only way
> to do this matching is to store the normalized form in the filesystem,
> because it would be a performance nightmare to try and do this matching any
> other way.
Nope. They were just not clever enough. If they use a hashed key for
their b-tree and used a hash which had the property that two strings
that were equivalent in the Unicode sense have the same hash value,
it's quite possible to do Unicode-equivalence lookups quickly. Yeah,
calculating the hash algorithm takes a bit amount of time, but it gets
called no more than the normalization routine, and its performance
overhead is no worse than the normalizing a string.
I know how to do it in a Linux filesystem; it's just an insane thing
to do, and so I choose not to do it. But it is doable; if you must
persue the course of filesystem insanity, it's possible to do it in a
performant way, without normalization; it's the same way that you can
use b-tree lookups in a case insensitive way.
> I must say it is shocking that someone as smart as you is still more
> interested in finding ways to prove me wrong then to actually address the
> problem. It's obvious that the only research you did was intended to find
> ways to call me stupid.
No, I did the research to try to find the HFS-specific filename
mangling algorithm. And given that's based on an back-level, old
version of Unicode, you can't just use NFD algorithm from the latest
Unicode spec. As I did that research, I came across the evidence that
claims you had made (i.e., that HFS had never changed the Unicode
version for its Normalization algorithm), was directly contradicted by
the Apple TechNote.
- Ted
next prev parent reply other threads:[~2008-01-23 2:06 UTC|newest]
Thread overview: 260+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-01-16 15:17 git on MacOSX and files with decomposed utf-8 file names Mark Junker
2008-01-16 15:34 ` Johannes Schindelin
2008-01-16 15:43 ` Kevin Ballard
2008-01-16 16:32 ` Johannes Schindelin
2008-01-16 16:46 ` Jakub Narebski
2008-01-16 20:39 ` Kevin Ballard
2008-01-16 21:51 ` Jakub Narebski
2008-01-16 22:06 ` Kevin Ballard
2008-01-16 22:23 ` Johannes Schindelin
2008-01-16 23:16 ` Kevin Ballard
2008-01-16 22:32 ` Linus Torvalds
2008-01-16 22:52 ` Linus Torvalds
2008-01-16 23:11 ` Kevin Ballard
2008-01-16 23:38 ` Linus Torvalds
2008-01-16 23:57 ` Pedro Melo
2008-01-17 0:16 ` Linus Torvalds
2008-01-17 0:27 ` Pedro Melo
2008-01-17 0:32 ` David Kastrup
2008-01-17 0:40 ` Pedro Melo
2008-01-17 0:54 ` Wincent Colaiuta
2008-01-17 1:08 ` Johannes Schindelin
2008-01-17 1:41 ` Linus Torvalds
2008-01-17 4:07 ` Kevin Ballard
2008-01-17 0:35 ` Johannes Schindelin
2008-01-17 0:45 ` Pedro Melo
2008-01-18 8:29 ` Peter Karlsson
2008-01-18 11:16 ` Jakub Narebski
2008-01-16 23:58 ` David Kastrup
2008-01-17 0:19 ` Linus Torvalds
2008-01-17 0:09 ` Kevin Ballard
2008-01-17 0:25 ` Linus Torvalds
2008-01-17 0:33 ` Johannes Schindelin
2008-01-17 0:43 ` Pedro Melo
2008-01-17 0:57 ` Johannes Schindelin
2008-01-17 1:06 ` Linus Torvalds
2008-01-17 1:16 ` Linus Torvalds
2008-01-17 3:52 ` Kevin Ballard
2008-01-17 4:08 ` Linus Torvalds
2008-01-17 4:30 ` Kevin Ballard
2008-01-17 4:51 ` Martin Langhoff
2008-01-17 5:23 ` Kevin Ballard
2008-01-17 6:13 ` Geert Bosch
2008-01-17 7:11 ` Mitch Tishmack
2008-01-17 10:22 ` Wincent Colaiuta
2008-01-17 13:44 ` Kevin Ballard
2008-01-17 15:57 ` Johannes Schindelin
2008-01-17 16:53 ` Kevin Ballard
2008-01-18 0:44 ` Robin Rosenberg
2008-01-17 14:02 ` Andrew Heybey
2008-01-17 15:04 ` Kevin Ballard
2008-01-19 19:29 ` Kyle Moffett
2008-01-19 19:57 ` Kevin Ballard
2008-01-17 10:08 ` Wincent Colaiuta
2008-01-17 16:43 ` Linus Torvalds
2008-01-17 18:09 ` Mark Junker
2008-01-17 18:12 ` Pedro Melo
2008-01-17 18:18 ` Johannes Schindelin
2008-01-17 18:36 ` Mark Junker
2008-01-17 18:38 ` Pedro Melo
2008-01-17 18:44 ` Linus Torvalds
2008-01-17 19:02 ` Pedro Melo
2008-01-17 18:42 ` Linus Torvalds
2008-01-17 18:50 ` Mark Junker
2008-01-17 18:52 ` Pedro Melo
[not found] ` <alpine.LFD.1.00.0801 171100330.14959@woody.linux-foundation.org>
2008-01-17 19:01 ` Theodore Tso
2008-01-17 19:11 ` Linus Torvalds
2008-01-18 0:18 ` Kevin Ballard
2008-01-18 0:35 ` Linus Torvalds
2008-01-18 1:05 ` Robin Rosenberg
2008-01-18 1:24 ` Linus Torvalds
2008-01-18 4:08 ` Brian Dessent
2008-01-18 8:49 ` Dmitry Potapov
2008-01-18 9:42 ` Robin Rosenberg
2008-01-18 10:30 ` Dmitry Potapov
2008-01-18 15:37 ` Peter Karlsson
2008-01-18 17:24 ` Jakub Narebski
2008-01-18 10:19 ` Peter Karlsson
2008-01-18 10:50 ` Dmitry Potapov
2008-01-18 15:30 ` Peter Karlsson
2008-01-18 17:11 ` Linus Torvalds
2008-01-18 20:24 ` Kevin Ballard
2008-01-19 8:48 ` Dmitry Potapov
2008-01-19 14:55 ` Kevin Ballard
2008-01-19 21:17 ` Dmitry Potapov
2008-01-19 18:58 ` Linus Torvalds
2008-01-19 20:39 ` Mark Junker
2008-01-19 22:58 ` Johannes Schindelin
2008-01-20 6:14 ` Dmitry Potapov
2008-01-20 6:53 ` Linus Torvalds
2008-01-20 13:15 ` Johannes Schindelin
2008-01-20 0:11 ` Wincent Colaiuta
2008-01-20 1:04 ` Linus Torvalds
2008-01-20 5:27 ` Mike Hommey
2008-01-20 5:45 ` Linus Torvalds
2008-01-20 7:00 ` Mike Hommey
2008-01-20 7:26 ` Linus Torvalds
2008-01-20 8:00 ` Dmitry Potapov
2008-01-20 8:12 ` Dmitry Potapov
2008-01-20 9:34 ` Wincent Colaiuta
2008-01-18 20:28 ` Junio C Hamano
2008-01-18 20:50 ` Johannes Schindelin
2008-01-23 2:46 ` Eric W. Biederman
2008-01-23 2:57 ` Junio C Hamano
2008-01-23 14:26 ` Nicolas Pitre
2008-01-23 21:19 ` Junio C Hamano
2008-01-21 14:14 ` Peter Karlsson
2008-01-21 16:43 ` Kevin Ballard
2008-01-21 16:48 ` David Kastrup
2008-01-21 16:59 ` Kevin Ballard
2008-01-21 20:43 ` Dmitry Potapov
2008-01-21 20:53 ` Kevin Ballard
2008-01-21 21:05 ` David Kastrup
2008-01-21 23:01 ` Dmitry Potapov
2008-01-21 16:53 ` Jeff King
2008-01-21 17:08 ` Nicolas Pitre
2008-01-21 17:25 ` Kevin Ballard
2008-01-21 20:35 ` David Kastrup
2008-01-21 20:32 ` David Kastrup
2008-01-21 18:12 ` Linus Torvalds
2008-01-21 19:05 ` Kevin Ballard
2008-01-21 19:41 ` Linus Torvalds
2008-01-21 19:58 ` Kevin Ballard
2008-01-21 20:33 ` Linus Torvalds
2008-01-21 20:53 ` Kevin Ballard
[not found] ` <alpine.LFD.1.0! 0.0801211323120.2957@woody.linux-foundation.org>
2008-01-21 20:58 ` David Kastrup
2008-01-21 21:17 ` Martin Langhoff
2008-01-21 21:28 ` Kevin Ballard
2008-01-21 21:43 ` Martin Langhoff
2008-01-21 21:33 ` Linus Torvalds
2008-01-21 21:49 ` Kevin Ballard
2008-01-21 22:34 ` Linus Torvalds
2008-01-21 22:46 ` Kevin Ballard
2008-01-21 22:56 ` Martin Langhoff
[not found] ` <53C76BEA-2232-4940-8776-9DF1880089A4@sb.org>
2008-01-21 23:05 ` Kevin Ballard
2008-01-21 23:16 ` Martin Langhoff
2008-01-22 0:30 ` Kevin Ballard
2008-01-21 23:00 ` Theodore Tso
2008-01-21 23:09 ` Kevin Ballard
2008-01-21 23:44 ` Linus Torvalds
2008-01-22 0:47 ` Kevin Ballard
2008-01-22 1:01 ` Linus Torvalds
2008-01-22 1:13 ` Linus Torvalds
2008-01-22 2:33 ` Kevin Ballard
2008-01-22 2:50 ` Linus Torvalds
2008-01-22 3:04 ` Kevin Ballard
2008-01-22 3:17 ` Linus Torvalds
2008-01-22 3:21 ` Martin Langhoff
2008-01-22 4:22 ` Kevin Ballard
[not found] ` <20080122133427.GB17804@mit.edu>
2008-01-23 0:08 ` Theodore Tso
2008-01-23 0:38 ` Kevin Ballard
2008-01-23 1:47 ` Martin Langhoff
2008-01-23 2:06 ` Theodore Tso [this message]
2008-01-23 8:45 ` David Kastrup
2008-01-23 0:38 ` Linus Torvalds
2008-01-23 1:14 ` Martin Langhoff
2008-01-23 1:16 ` Kevin Ballard
2008-01-23 1:27 ` Martin Langhoff
2008-01-23 1:33 ` Theodore Tso
2008-01-23 1:56 ` Linus Torvalds
2008-01-23 2:02 ` Kevin Ballard
2008-01-23 6:41 ` Mike Hommey
2008-01-23 8:15 ` Kevin Ballard
2008-01-23 8:43 ` Dmitry Potapov
2008-01-23 9:02 ` Jonathan del Strother
2008-01-23 9:12 ` Dmitry Potapov
2008-01-23 9:19 ` Mike Hommey
2008-01-23 9:32 ` Dmitry Potapov
2008-01-23 9:40 ` Mike Hommey
2008-01-23 13:38 ` Theodore Tso
2008-01-23 16:16 ` Linus Torvalds
2008-01-23 17:12 ` Theodore Tso
2008-01-23 17:19 ` Kevin Ballard
2008-01-23 17:32 ` Linus Torvalds
2008-01-24 21:02 ` On pathnames Junio C Hamano
2008-01-24 22:31 ` Nicolas Pitre
2008-01-25 3:55 ` Martin Langhoff
2008-01-25 4:18 ` Junio C Hamano
2008-01-25 4:12 ` Junio C Hamano
2008-01-25 8:08 ` Pedro Melo
2008-01-25 12:25 ` Johannes Schindelin
2008-01-25 12:50 ` David Kastrup
2008-01-25 12:53 ` Wincent Colaiuta
2008-01-24 23:56 ` Sean
2008-01-25 0:36 ` Johannes Schindelin
2008-01-25 4:00 ` Daniel Barkalow
2008-01-25 4:21 ` Junio C Hamano
2008-01-25 11:36 ` Johannes Schindelin
2008-01-25 16:25 ` Daniel Barkalow
2008-01-25 17:34 ` Johannes Schindelin
2008-01-25 5:59 ` Jeff King
2008-01-23 20:18 ` git on MacOSX and files with decomposed utf-8 file names Jay Soffian
[not found] ` <1DC841ED-634F-412C-9560-F37E4172A4CD@sb.org>
[not found] ` <76718490801231421l7b6552f8sec13f570360198b@mail.gmail.com>
[not found] ` <4F906435-A186-4E98-8865-F185D75F14D4@sb.org>
[not found] ` <76718490801231517h6d57e5bfkc19d394d38ad19db@mail.gmail.com>
2008-01-24 2:05 ` Kevin Ballard
2008-01-24 3:11 ` Junio C Hamano
2008-01-24 4:37 ` Martin Langhoff
2008-01-24 5:30 ` Kevin Ballard
2008-01-24 6:39 ` Steffen Prohaska
2008-01-24 18:17 ` Mitch Tishmack
2008-01-24 18:52 ` Mitch Tishmack
2008-01-24 19:58 ` Kevin Ballard
2008-01-23 23:37 ` Martin Langhoff
2008-01-23 16:58 ` Kevin Ballard
2008-01-23 17:39 ` Dmitry Potapov
2008-01-23 17:47 ` Kevin Ballard
2008-01-21 19:57 ` Theodore Tso
2008-01-21 20:01 ` Kevin Ballard
2008-01-21 20:15 ` Theodore Tso
2008-01-21 20:31 ` Kevin Ballard
2008-01-21 20:46 ` Theodore Tso
2008-01-21 20:59 ` Kevin Ballard
[not found] ` <6E303071-82A4-4D69-AA0C-EC41168B9AFE@sb.org>
2008-01-21 21:18 ` Theodore Tso
2008-01-21 21:43 ` Kevin Ballard
2008-01-21 21:49 ` Martin Langhoff
2008-01-21 21:57 ` Kevin Ballard
2008-01-22 0:36 ` Johannes Schindelin
2008-01-22 0:42 ` Kevin Ballard
2008-01-22 0:48 ` David Kastrup
2008-01-22 1:06 ` Martin Langhoff
2008-01-22 1:34 ` Johannes Schindelin
2008-01-22 1:53 ` Martin Langhoff
2008-01-22 2:03 ` Johannes Schindelin
2008-01-21 22:38 ` David Kastrup
2008-01-22 2:34 ` Kevin Ballard
2008-01-22 7:51 ` David Kastrup
2008-01-21 20:56 ` Dmitry Potapov
2008-01-21 21:07 ` Kevin Ballard
2008-01-21 22:41 ` Dmitry Potapov
2008-01-21 22:53 ` Kevin Ballard
2008-01-21 23:21 ` Dmitry Potapov
2008-01-21 19:44 ` Mike Hommey
2008-01-21 20:36 ` Dmitry Potapov
2008-01-21 21:06 ` Martin Langhoff
2008-01-21 21:09 ` David Kastrup
2008-01-21 21:42 ` Linus Torvalds
2008-01-21 22:45 ` Martin Langhoff
2008-01-21 20:30 ` Dmitry Potapov
2008-01-21 18:16 ` Linus Torvalds
2008-01-17 21:27 ` Dmitry Potapov
2008-01-17 22:01 ` JM Ibanez
2008-01-17 22:09 ` Johannes Schindelin
2008-01-18 1:27 ` Robin Rosenberg
2008-01-17 23:05 ` Linus Torvalds
2008-01-17 23:10 ` Dmitry Potapov
2008-01-16 23:52 ` Dmitry Potapov
2008-01-16 22:37 ` Eyvind Bernhardsen
2008-01-16 23:03 ` Wincent Colaiuta
2008-01-17 7:29 ` Miles Bader
2008-01-17 4:43 ` Jay Soffian
2008-01-17 4:59 ` Jay Soffian
2008-01-17 5:15 ` Junio C Hamano
2008-01-17 10:28 ` Wincent Colaiuta
2008-01-17 11:10 ` Johannes Schindelin
2008-01-17 11:23 ` Pedro Melo
2008-01-17 11:51 ` Wincent Colaiuta
2008-01-17 12:53 ` Johannes Schindelin
2008-01-17 13:40 ` Wincent Colaiuta
2008-01-17 17:58 ` Junio C Hamano
2008-01-17 18:22 ` Johan Herland
2008-01-17 13:05 ` Johannes Schindelin
2008-01-17 11:46 ` Wincent Colaiuta
2008-01-17 5:11 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080123020607.GC1320@mit.edu \
--to=tytso@mit.edu \
--cc=git@vger.kernel.org \
--cc=kevin@sb.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).